- From: Julian Reschke <julian.reschke@gmx.de>
- Date: Sun, 22 Nov 2009 14:27:14 +0100
- To: Adam Barth <w3c@adambarth.com>
- CC: Boris Zbarsky <bzbarsky@mit.edu>, Gavin Carothers <gavin@carothers.name>, Maciej Stachowiak <mjs@apple.com>, HTMLwg <public-html@w3.org>
Adam Barth wrote:
> ...
> That is, in fact, how the sniffing algorithm works:
>
> http://tools.ietf.org/html/draft-abarth-mime-sniff

Thanks for the reminder.

> The algorithm tolerates leading white space, but not leading BOMs.

Is there a particular reason why the BOM is not tolerated, given
<http://www.w3.org/TR/REC-xml/#sec-guessing>?

Gavin Carothers wrote:
> I think I may have failed to make my point. The HTML standard can just
> as easily say "An HTTP server MUST serve HTML documents as text/html."
> Accepting malformed documents is great and all, but how far is too
> far?

That's a pointless requirement. In general, there may be cases where a server doesn't know the content type, and forcing it to label content of unknown type is actually harmful: it frequently leads to mislabeled content, which in turn is historically the *cause* of UAs sniffing content and overriding the MIME type.

> Let's consider this Microsoft page as a whole. It's served with no media
> type. The only browser I've found that can (inconsistently) render it
> is IE7... but the page demands that it should be rendered as IE8 does
> (white page, no content). Only it doesn't really say IE8; rather, it
> uses an undocumented setting that, uh, doesn't seem to do anything at
> all. The top of the document claims it's an XHTML 1.0 document, and as
> such the html element declares its namespace to be
> http://www.w3.org/1999/xhtml. About half of the script tags are
> clearly designed for XML, with CDATA sections wrapping their content;
> the other half have no CDATA sections. The default namespace is redeclared
> in the middle of the document a number of times, luckily to the same
> thing each time. And then there's the main bug that causes the page not
> to render correctly in just about anything: a BOM marker in UTF-8,
> an encoding which has no need for an endianness marker. Halfway down
> the document it has a new XML declaration, this time in UTF-16.
There's no question that that particular page is broken. On the other hand, the UTF-8 BOM serves a very useful purpose (auto-detection of the character set when no out-of-band encoding information is available), and therefore I would expect a content sniffing algorithm to take it into account.

> Validating the page fails with all XHTML validators, XML validators,
> and HTML4 validators, and it does not render correctly (is there such a
> thing?) in any user agent I'm aware of.

Let's ignore "correctly" for a second -- it *does* render in IE8, Opera and Safari on Win7. So the two latter UAs seem to differ from Firefox in the way they do content sniffing.

> ...

Best regards, Julian
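P.S.: For illustration, here is a minimal sketch (in Python, names hypothetical) of the kind of BOM-based encoding detection the XML guessing appendix describes -- this is not the draft-abarth-mime-sniff algorithm itself, just the leading-BOM check it currently omits:

```python
# Hypothetical helper: sniff an encoding from a leading byte-order mark,
# in the spirit of http://www.w3.org/TR/REC-xml/#sec-guessing.
# The UTF-32 BOMs must be tested before the UTF-16 ones, because the
# UTF-16 LE BOM (FF FE) is a prefix of the UTF-32 LE BOM (FF FE 00 00).
BOMS = [
    (b'\xff\xfe\x00\x00', 'utf-32-le'),
    (b'\x00\x00\xfe\xff', 'utf-32-be'),
    (b'\xef\xbb\xbf', 'utf-8'),
    (b'\xff\xfe', 'utf-16-le'),
    (b'\xfe\xff', 'utf-16-be'),
]

def sniff_bom_encoding(data):
    """Return the encoding implied by a leading BOM, or None if absent."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding
    return None
```

A sniffer that ran a check like this first (and then skipped the BOM bytes) could still tolerate leading white space afterwards, which is all my question above amounts to.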
Received on Sunday, 22 November 2009 13:28:00 UTC