Re: several messages about content sniffing in HTML from Julian Reschke on 2008-02-29 (public-html@w3.org from February 2008)

From: Julian Reschke <julian.reschke@gmx.de>
Date: Fri, 29 Feb 2008 11:54:14 +0100
To: Ian Hickson <ian@hixie.ch>
CC: Sam Ruby <rubys@intertwingly.net>, Robert Sayre <sayrer@gmail.com>, Anne van Kesteren <annevk@opera.com>, Sander Tekelenburg <st@isoc.nl>, Geoffrey Sneddon <foolistbar@googlemail.com>, ryan <ryan@theryanking.com>, Hugh Winkler <hughw@wellstorm.com>, Boris Zbarsky <bzbarsky@MIT.EDU>, Maciej Stachowiak <mjs@apple.com>, WHATWG <whatwg@whatwg.org>, "public-html@w3.org" <public-html@w3.org>
Message-ID: <47C7E456.9050706@gmx.de>

Ian Hickson wrote:
> On Mon, 19 Nov 2007, Boris Zbarsky wrote:
>> Julian Reschke wrote:
>>> Multiple media-type values? What would that be good for?
>> Rendering the web?  In particular, it's not uncommon for servers (esp. 
>> when CGIs are involved) to produce things like:
>>
>>   Content-Type: text/html; charset=ISO-8859-1
>>   Content-Type: text/plain
>>
>> which then get normalized to:
>>
>>   Content-Type: text/html; charset=ISO-8859-1, text/plain
>>
>> Not sure where that normalization happens offhand (server end or Gecko 
>> end).
> 
> It seems like the HTTP spec should define how to handle that, but the HTTP 
> working group has indicated a desire to not specify error handling 
> behaviour, so I guess it's up to us.
> 
> IE and Safari use the first one, Firefox and Opera use the last one. I 
> guess we'll use the first one.

Isn't the fact that Firefox and IE disagree here an indication that this 
doesn't need to be specified?
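
To make the disagreement concrete, here is a minimal Python sketch of the two policies applied to the comma-joined value Boris quotes. The function names and the `policy` parameter are hypothetical, invented for illustration. The naive comma split is also an assumption: it would mis-handle a quoted parameter value containing a comma, and it is not taken from any browser's actual code.

```python
def split_media_types(header_value: str) -> list[str]:
    """Split a comma-joined Content-Type value into its individual entries.

    Assumption: no entry contains a comma inside a quoted parameter value.
    """
    return [part.strip() for part in header_value.split(",") if part.strip()]


def pick_media_type(header_value: str, policy: str = "first") -> str:
    """Return the entry a user agent would act on under the given policy.

    'first' models the behaviour reported for IE and Safari,
    'last' the behaviour reported for Firefox and Opera.
    """
    entries = split_media_types(header_value)
    if not entries:
        raise ValueError("empty Content-Type value")
    if policy == "first":
        return entries[0]
    if policy == "last":
        return entries[-1]
    raise ValueError(f"unknown policy: {policy!r}")


combined = "text/html; charset=ISO-8859-1, text/plain"
print(pick_media_type(combined, "first"))  # text/html; charset=ISO-8859-1
print(pick_media_type(combined, "last"))   # text/plain
```

The same response is treated as HTML under one policy and as plain text under the other, which is the interoperability gap being debated.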

>>> Content sniffing is a bug, and IMO we shouldn't mandate that these 
>>> bugs needn't be fixed.
> 
> Content sniffing is required to browse the Web. Interoperability is worth 
> far more than blind adherence to standards. In fact, interoperability is 
> exactly what adherence to standards is all about.

I note that it has been made optional in at least one case, which is good.

> On Fri, 25 Jan 2008, Boris Zbarsky wrote:
>> Those are sent with "Content-Encoding: gzip".  Due to an internal 
>> limitation, Gecko does not sniff such content at the moment (basically, 
>> because sniffing would involve undoing the content encoding first, since 
>> sniffing the gzipped data is pointless).  If the test sent the data 
>> without Content-Encoding (which is the usual situation for the cases the 
>> sniffing is designed to address), those tests would get sniffed as 
>> binary.
>>
>> Oh, and we really do plan to address the gzip limitation at some point, 
>> just so things are consistent and people don't get confused as you did 
>> here...
> 
> Actually the spec right now requires that there be no content sniffing if 
> the Content-Encoding header is set... are you running into cases where 
> that is a problem?

Yes, by all means, do not specify sniffing when there's no proof it's 
needed.

> On Fri, 25 Jan 2008, Boris Zbarsky wrote:
>> One more thought I had about this today.  Is the real reason the 
>> sniffing in the spec is a MUST because UAs must not do any sniffing other than 
>> what's specified?  If so, it might make more sense to say that as a MUST 
>> and say the existing sniffing stuff as a MAY.
> 
> That's what the spec says, as far as I can tell. (It allows several 
> aspects of the various sniffing requirements to be bypassed, but requires 
> that if it isn't, it be implemented as per the spec.)

Clarifying: this is what it says *now*, but it didn't say that back then.

> On Fri, 25 Jan 2008, Boris Zbarsky wrote:
>> Oh, one more note.  Gecko's sniffing behavior actually had to be changed 
>> recently.  Unfortunately, the more recent Apache installs changed from 
>> ISO-8859-1 to UTF-8 as the default encoding, without changing the 
>> default content type behavior.  So at this point, in Gecko, data flagged 
>> as "text/plain; charset=UTF-8" is also sniffed to see whether it might 
>> be binary. Since all of the byte values that trigger the "binary" 
>> determination are illegal in UTF-8, as far as I can tell, this shouldn't 
>> affect any actual UTF-8 text.  It might be a good idea to update the 
>> tests and the spec if people agree, though.
> 
> Uppercase only?

Roy pointed out (I think) that Apache's defaults did not change; so it 
must be some distributor/vendor causing this.

Where does it stop? Are you planning to add new special cases any time 
some Linux distro screws things up in a new way?

BR, Julian

Received on Friday, 29 February 2008 10:54:46 UTC