Re: HTML5 vs content type sniffing

Thanks for the feedback so far. Below is an attempt to summarize what's 
already been said, along with some more feedback of my own:

Related to the test cases themselves:

1) Content-Encoding vs sniffing

The tests at 
<http://www.hixie.ch/tests/adhoc/http/content-type/sniffing/> are 
somewhat broken; cases 8 through 10 are supposed to trigger content 
sniffing (as per HTML5, 
<http://www.w3.org/TR/2008/WD-html5-20080122/#content-type-sniffing>), 
but don't, as the server sends the response with Content-Encoding: gzip 
(see <http://lists.w3.org/Archives/Public/public-html/2008Jan/0235.html>).

FF2 and FF3 beta currently do not implement sniffing in this case, 
matching what the spec says. Others apparently do. The fact that FF does 
not sniff here could be taken as an argument that sniffing in this case 
is not needed to "not break existing content".
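
To make the condition concrete, here is a minimal Python sketch of the 
check as I read the discussion above: a response that carries any 
Content-Encoding header is excluded from text/plain sniffing. The 
function name and the dict-based header representation are assumptions 
made for illustration; they are not taken from the spec or from any 
browser.

```python
def content_encoding_blocks_sniffing(headers):
    """Return True if a Content-Encoding header is present.

    Hypothetical helper: `headers` is assumed to be a plain dict
    mapping header names to values.
    """
    # HTTP header names are case-insensitive, so normalize before comparing.
    names = {name.lower() for name in headers}
    return "content-encoding" in names
```

With this reading, the gzip-encoded responses of cases 8 through 10 
would never reach the sniffing step, whatever their Content-Type says.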

2) Character sets vs sniffing

The spec currently requires sniffing for "text/plain; 
charset=iso-8859-1" and "text/plain; charset=ISO-8859-1", assuming that 
those servers that do send an incorrect default content type always send 
it with a very specific character set name. It appears that some servers 
sometimes ship with other defaults, thus more character sets would need 
to be considered 
(<http://lists.w3.org/Archives/Public/public-html/2008Jan/0239.html>). 
Where do you draw the line?
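
A rough Python sketch of why the line is hard to draw, assuming the 
trigger is an exact string comparison against the header value (which 
the presence of two case variants suggests, but which I have not 
verified here). The set below holds only the two values named above; 
the name SNIFF_TRIGGERS and the function are hypothetical.

```python
# Only the two header values mentioned in this message; the list in the
# spec itself may differ.
SNIFF_TRIGGERS = {
    "text/plain; charset=iso-8859-1",
    "text/plain; charset=ISO-8859-1",
}


def triggers_sniffing(content_type_value):
    """Return True if the raw Content-Type value is on the trigger list."""
    # Exact, case-sensitive match on the whole value: parameters are not
    # parsed, so any variation in spelling or spacing falls through.
    return content_type_value in SNIFF_TRIGGERS
```

Under that assumption, a server whose default is "text/plain; 
charset=UTF-8", or one that sends "text/plain;charset=iso-8859-1" 
without the space, would not trigger sniffing, and every such variant 
would need its own entry.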

3) "illegal characters"

Some test cases, such as 16, claim the content contains "invalid 
text/plain characters". At least case 16 doesn't. 
(<http://lists.w3.org/Archives/Public/ietf-http-wg/2008JanMar/0122.html>)

4) other types of sniffing

HTML5 defines other types of sniffing (such as unknown -> PDF) that 
aren't covered by these tests, and haven't been discussed within this 
thread.


Related to the topic of content sniffing in general:

5) content-type default

It seems in general Apache httpd is blamed for having caused the 
original problem (content being served with wrong default content-type 
instead of no content-type at all). In the meantime, httpd supports a 
default type of "none" 
(<http://lists.w3.org/Archives/Public/public-html/2008Jan/0258.html>), 
so at least the right steps have been taken to get rid of the problem in 
the future.

6) conflict with WebArch and TAG finding

The current text in HTML5 contradicts WebArch 
(<http://www.w3.org/TR/webarch/#error-handling>) and the TAG finding 
"mime respect", in particular "avoid silent recovery" 
(<http://www.w3.org/2001/tag/doc/mime-respect.html#silent-recovery>).

There seems to be broad agreement that it's good to document what widely 
deployed user agents actually do with respect to content sniffing. 
However, there was *no* agreement that it's HTML5's task to make that a 
"MUST" level requirement 
(<http://lists.w3.org/Archives/Public/public-html/2008Jan/0214.html>).

Also, if it's still the goal to reduce the amount of content where 
content sniffing takes place, then it would be useful to make it easier 
for an author to actually find out that content sniffing took place. 
Thus, user agents that do content sniffing SHOULD offer a way to (1) 
turn it off and/or (2) notify the user when the UA decided to override 
the specified content type 
(<http://lists.w3.org/Archives/Public/public-html/2008Jan/0260.html>).
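
As an illustration of what (1) and (2) could look like inside a user 
agent, here is a small Python sketch. Everything in it is hypothetical: 
the function name, the sniffing_enabled switch and the use of a log 
message as the notification channel are assumptions made for the 
example, not a description of any existing browser.

```python
import logging

logger = logging.getLogger("sniffing")


def effective_type(declared, sniffed, sniffing_enabled=True):
    """Pick the media type to use and report any override.

    `declared` is the type from the Content-Type header, `sniffed` is
    the result of content sniffing (None if sniffing found nothing).
    """
    # (1) Sniffing can be switched off, in which case the declared
    # type always wins.
    if not sniffing_enabled or sniffed is None or sniffed == declared:
        return declared
    # (2) The override is not silent: it is reported so that an author
    # can find out that it happened.
    logger.warning(
        "Content-Type %r overridden by sniffed type %r", declared, sniffed
    )
    return sniffed
```

For example, effective_type("text/plain", "image/png") returns 
"image/png" and logs a warning, while the same call with 
sniffing_enabled=False returns "text/plain".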

More feedback appreciated,

Julian

Received on Saturday, 2 February 2008 11:46:35 UTC