Fans of President Barack Obama, or perhaps just those who dislike former President George W. Bush, seem to think there's something notable about the way the new White House Web site is organized to deal with search engines.
The configuration file in question is called robots.txt. It's designed to let Webmasters ask search engine robots not to include certain areas of a Web site in their index. Well-behaved robots will comply.
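As a rough illustration of what "complying" means, here is a minimal sketch of how a well-behaved robot might consult robots.txt rules before fetching a page, using Python's standard urllib.robotparser module. The rules, the "ExampleBot" user agent, and the paths are hypothetical and are not taken from any real site's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /ads/
Disallow: /email-this-story/
"""


def may_crawl(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if the given rules let user_agent fetch path."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in ("/news/story.html", "/ads/banner.html"):
        verdict = "allowed" if may_crawl(EXAMPLE_ROBOTS_TXT, "ExampleBot", path) else "disallowed"
        print(path, verdict)
```

Nothing forces a robot to run a check like this; the file is a request, which is why only well-behaved robots honor it.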
The Obama revamp of Whitehouse.gov included a shorter robots.txt file, which Thenextweb.com called "a sign of greater transparency and change." A BoingBoing poster claimed that now "people can find information that was restricted before." And so on.
There's just one problem with these comments. They're wrong. As of Tuesday morning, the Bush administration's robots.txt file did only two things: first, it pointed search engines to the high-graphics versions of the page, as opposed to the text-only versions, and second, it tried to keep type-in-your-search-query pages from being indexed.
Those are legitimate reasons to list those pages in robots.txt, which is why CNET's own file is relatively long and complicated too. (Sites that have been around for eight years or longer tend to be that way.) We ask search engines not to index an "/Ads" directory, e-mail-this-story pages, and dozens of others. The Democrat-controlled House (http://www.house.gov/robots.txt) and Senate have--gasp!--substantial robots.txt files too.
It's true that in 2007, the Bush White House did block some files they should not have, which they fixed once I brought it to their attention. They also fixed a more serious problem with the Director of National Intelligence's Web site, and an earlier problem in 2003. (A better solution would be for search engines to ignore overly broad robots.txt files on .gov and .mil sites, including Thomas.loc.gov.)
If anything, Obama's robots.txt file is too short. It doesn't currently block search pages, meaning they'll show up on search engines--something that most site operators don't want and which runs afoul of Google's Webmaster guidelines. Those guidelines say: "Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines."
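To make the difference concrete, the sketch below compares a short robots.txt with a version that adds one rule for search results pages, again using Python's standard urllib.robotparser. Both files, the "/includes/" and "/search" paths, and the "ExampleBot" user agent are assumptions made up for the example; they are not the actual contents of any White House file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical "short" file, and the same file with a search rule appended.
SHORT_FILE = "User-agent: *\nDisallow: /includes/\n"
WITH_SEARCH_RULE = SHORT_FILE + "Disallow: /search\n"

# Hypothetical site paths to check against each file.
PATHS = ["/search", "/search/results", "/briefing-room/"]


def blocked_paths(robots_txt, paths, user_agent="ExampleBot"):
    """Return the subset of paths that the rules ask user_agent not to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


if __name__ == "__main__":
    print("short file blocks:", blocked_paths(SHORT_FILE, PATHS))
    print("with search rule blocks:", blocked_paths(WITH_SEARCH_RULE, PATHS))
```

Under these assumptions the short file leaves the search pages crawlable, while the single added Disallow line covers both "/search" and everything beneath it, because the rule is matched as a path prefix.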
And here's something sure to upset Obama-praising geeks: the new White House site doesn't pass the litmus test of good HTML design. Alas, according to the W3C, not all pages successfully validate. Those are your tax dollars at work.
P.S.: The White House seems to be using Akamai's Edge Platform for scalable Web hosting:
sh-2.05b$ host whitehouse.gov
whitehouse.gov has address 96.6.250.135
whitehouse.gov mail is handled by CARDINAL mailhub-wh3.whitehouse.gov.
whitehouse.gov mail is handled by 100 mailhub-wh2.whitehouse.gov.
sh-2.05b$ host www.whitehouse.gov
www.whitehouse.gov is an alias for www.whitehouse.gov.edgekey.net.
www.whitehouse.gov.edgekey.net is an alias for e2561.b.akamaiedge.net.
e2561.b.akamaiedge.net has address 96.16.218.135
sh-2.05b$

Cheers~