- From: ryan <ryan@theryanking.com>
- Date: Tue, 21 Aug 2007 16:04:12 -0700
- To: "Roy T. Fielding" <fielding@gbiv.com>
- Cc: Ian Hickson <ian@hixie.ch>, public-html@w3.org
On Aug 21, 2007, at 3:38 PM, Roy T. Fielding wrote:

> On Aug 21, 2007, at 2:32 PM, Ian Hickson wrote:
>> I entirely agree with the above and in fact with most of you wrote, but
>> just for the record, could you list some of the Web-based functionality
>> you mention that depends on Content-Type to control behaviour? In my
>> experience most non-browser based scripts and the like actually ignore
>> Content-Type headers even more than browsers do, and it would be
>> interesting to study the cases that actually honour them completely (or
>> at least, that honour these headers more than browsers do).
>
> Mostly intermediaries, spiders, and the analysis engines that perform
> operations on the results of both. Spiders typically limit their
> traversals to known hypertext formats (using HEAD to determine the
> content type before retrieval is even attempted), though there are
> well-known exceptions to that (Google slurps everything, IIRC).

It's not just Google. As a spider engineer at Technorati I can say for certain that our spider sniffs content, rather than treating Content-Type as authoritative.

AFAICT from parsing the logs of all of the sites I have access to (about 2M hits over the last 2 years), I don't see a single HEAD request from Google, Yahoo, MSN or Ask, the 4 biggest search engines. This doesn't mean that they don't treat Content-Type headers as authoritative; it just means that they don't, as you claim, use HEAD to determine the type before GETting it.

There are, of course, other bots that make HEAD requests, but I still want to make the point that *not all bots* behave the way you describe. In fact, I'm working to make our bot behave more like a browser.

-ryan
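
P.S. For anyone who wants to repeat the log check on their own sites, here is a minimal sketch of the kind of count I mean. It assumes logs in the common "combined" access log format, and the user-agent substrings in BOTS are illustrative guesses rather than a definitive list; the function name count_methods is made up for this example.

```python
import re
from collections import Counter

# Assumed user-agent substrings for the four engines; adjust for your own logs.
BOTS = {"google": "googlebot", "yahoo": "slurp", "msn": "msnbot", "ask": "teoma"}

# Pulls the request method and the user-agent out of a combined-format log line:
# ... "METHOD /path HTTP/x.y" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) \S+[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def count_methods(lines):
    """Count requests per (engine, HTTP method) for the bots listed in BOTS."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent").lower()
        for engine, token in BOTS.items():
            if token in agent:
                counts[(engine, match.group("method"))] += 1
                break
    return counts
```

If an engine did probe with HEAD before fetching, its (engine, "HEAD") count would be non-zero; a missing key means no HEAD requests from that engine were found in the lines scanned.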
Received on Tuesday, 21 August 2007 23:04:34 UTC