Re: review of content type rules by IETF/HTTP community

On Aug 21, 2007, at 3:38 PM, Roy T. Fielding wrote:
> On Aug 21, 2007, at 2:32 PM, Ian Hickson wrote:
>> I entirely agree with the above and in fact with most of what you
>> wrote, but just for the record, could you list some of the Web-based
>> functionality you mention that depends on Content-Type to control
>> behaviour? In my experience most non-browser based scripts and the
>> like actually ignore Content-Type headers even more than browsers do,
>> and it would be interesting to study the cases that actually honour
>> them completely (or at least, that honour these headers more than
>> browsers do).
>
> Mostly intermediaries, spiders, and the analysis engines that perform
> operations on the results of both.  Spiders typically limit their
> traversals to known hypertext formats (using HEAD to determine the
> content type before retrieval is even attempted), though there are
> well-known exceptions to that (Google slurps everything, IIRC).

It's not just Google. As a spider engineer at Technorati, I can say
for certain that our spider sniffs content rather than treating
Content-Type as authoritative.
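
As a rough illustration of what "sniffing" means here (a hypothetical
sketch, not Technorati's actual code), a spider can inspect the first
bytes of the body and consult the declared Content-Type only when the
body is inconclusive. The function name, the 512-byte window, and the
marker strings below are all assumptions made for the example.

```python
def sniff_content_type(body, declared_type=None):
    """Guess a media type from the body, ignoring the header when possible.

    Hypothetical helper: the markers and the 512-byte window are
    illustrative choices, not taken from any real spider.
    """
    # Look only at the start of the document, case-insensitively.
    head = body[:512].lstrip().lower()

    # Feed formats (RSS, Atom, RDF) are reported as generic XML.
    if b"<rss" in head or b"<feed" in head or b"<rdf:rdf" in head:
        return "application/xml"

    # HTML pages usually announce themselves early.
    if b"<!doctype html" in head or b"<html" in head:
        return "text/html"

    # Sniffing was inconclusive: fall back to the declared header, if any.
    return declared_type or "application/octet-stream"
```

With this sketch, an RSS feed served as text/plain is still recognised
as XML, which is the browser-like behaviour described above.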

AFAICT from parsing the logs of all of the sites I have access to
(about 2M hits over the last 2 years), there is not a single HEAD
request from Google, Yahoo, MSN or Ask, the 4 biggest search engines.
This doesn't mean that they don't treat Content-Type headers as
authoritative; it just means that they don't, as you claim, use HEAD
to determine the type before issuing a GET.
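
A check of this kind can be reproduced with a short script. The sketch
below assumes logs in the Apache "combined" format and identifies the
four crawlers by user-agent substrings (Googlebot, Slurp, msnbot,
Teoma); both the log format and the marker strings are assumptions for
the example, not details of the logs described above.

```python
import re
from collections import Counter

# Assumed user-agent substrings for the four crawlers (illustrative).
BOT_MARKERS = {
    "google": "googlebot",
    "yahoo": "slurp",
    "msn": "msnbot",
    "ask": "teoma",
}

# Request line, status, size, referer, user agent of a combined-format entry.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) \S+ [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def count_methods_by_bot(lines):
    """Return a Counter keyed by (bot, HTTP method)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent").lower()
        for bot, marker in BOT_MARKERS.items():
            if marker in agent:
                counts[(bot, match.group("method"))] += 1
                break
    return counts
```

Calling count_methods_by_bot(open("access.log")) and looking at the
(bot, "HEAD") entries shows whether a given crawler ever issued a HEAD
request; a missing key counts as zero.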

There are, of course, other bots that make HEAD requests, but I still  
want to make the point that *not all bots* behave the way you  
describe and, in fact, I'm actually working to make our bot behave  
more like a browser.

-ryan

Received on Tuesday, 21 August 2007 23:04:34 UTC