Re: several messages about content sniffing in HTML

Executive summary:

 * Made the feed sniffer skip BOMs.

 * Defined how to handle duplicate Content-Type headers, since HTTP doesn't 
   define this and someone has to.

I didn't make any other normative changes, mostly because I'm not sure 
exactly what needs changing. I encourage the people on the To: line of 
this e-mail to read this e-mail in particular and respond to any questions 
I may have asked in response to your feedback. Thanks!


On Tue, 5 Dec 2006, Sam Ruby wrote:
> 
> I have a request.  It would be nice if the sniffing algorithm made an 
> exception for "text/plain".  Use case:
> 
> http://svn.smedbergs.us/wordpress-atom10/tags/0.6/wp-atom10-comments.php

It does (except if the content is non-conforming text/plain, in which 
case it gets treated as application/octet-stream).
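
For illustration, that text-or-binary check amounts to something like the 
following sketch (the set of "binary" bytes here approximates the spec's 
table rather than copying it exactly):

    # If any "binary" byte appears near the start of the body, the
    # text/plain resource is treated as application/octet-stream.
    BINARY_BYTES = (set(range(0x00, 0x09)) | {0x0B} |
                    set(range(0x0E, 0x1B)) | set(range(0x1C, 0x20)))

    def sniff_text_plain(body):
        head = body[:512]
        if any(b in BINARY_BYTES for b in head):
            return "application/octet-stream"
        return "text/plain"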


On Wed, 6 Dec 2006, Robert Sayre wrote:
> 
> It was fixed in a way that covers mis-sniffed feed content
> specifically. That is, content that was sniffed as a feed but isn't
> one, like that Atom template or some FOAF files, are displayed
> correctly. These are edge cases.
> 
> > Both so that Ian's eventual text can be consistent with the fix, and 
> > for my edification as I would love to be able to directly view my test 
> > cases again:
> 
> Your test cases are a different bug: "correctly" sniffed feeds that you 
> don't want sniffed. Unfortunately, I can't agree that the MIME type 
> "text/plain" carries as strong a message as it used to.
> 
> It would be possible to turn off sniffing for some 'text/plain' values 
> if there were a better indicator available. For example, by using a new 
> Content-Disposition value (web compatible because unknown values are 
> treated as 'inline').

Given that Atom feeds can run script, it seems like going from text/plain 
to an Atom feed is a privilege escalation bug. While I would find that 
acceptable if someone were to use a <link rel=feed> link (or equivalent), 
it seems highly undesirable to do this for browsing context navigation.


On Tue, 22 May 2007, Anne van Kesteren wrote:
>
> Style sheet loading and parsing (over HTTP). For compatibility with the 
> web it seems important to simply ignore Content-Type in all modes. 
> Firefox has some hack where they "respect" Content-Type in standards 
> mode except when the response Content-Type doesn't contain a "/" or "\". 
> For instance
> 
>   Content-Type: "null"
> 
> would be applied.

Such a Content-Type is invalid, so it'd be like no Content-Type, which 
would be sniffed as text/css... that seems conforming to me. I just made 
the spec more explicit about this, but it was already defined that way.
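
As a sketch of how that falls out for style sheet loads (the helper name 
and the deliberately crude validity test are mine, not the spec's):

    # A missing or syntactically invalid Content-Type is treated as if
    # there were no Content-Type at all, so a style sheet load ends up
    # handled as text/css.
    def effective_stylesheet_type(content_type):
        if content_type is None or "/" not in content_type:
            return "text/css"
        return content_type.split(";", 1)[0].strip()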


> Internet Explorer doesn't respect Content-Type at all either. However, 
> it does respect HTTP status codes. So redirects are followed and 
> responses with status codes that indicate some type of error (404, 410, 
> 501, etc.) are not parsed as style sheets. Anything that ends up with a 
> status code of 200 that is fetched from a "style sheet loader" (<link 
> rel=stylesheet>, @import) is parsed and applied.
> 
> It would be nice if the specification said something along those lines.

Doesn't HTTP already say that...? I've added a paragraph saying that HTTP 
must be followed, for what it's worth.
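
In sketch form, the gate being described is just the following (whether 
it should be exactly 200 or any 2xx status is one of the details the spec 
text would pin down; redirects are assumed to have been followed already):

    # Only parse the final response as a style sheet if HTTP says the
    # fetch actually succeeded.
    def should_parse_stylesheet(final_status):
        return 200 <= final_status < 300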


On Wed, 23 May 2007, Sander Tekelenburg wrote:
>
> Anne, you seem to mean to refer to Style Sheets's Content-Types only, 
> but given some of the responses, and some other discussions about 
> Content-Type, I take the liberty to interpret this as a more general 
> argument against Content-Type.

I think he just meant rel=stylesheet.


> At 10:44 +0200 UTC, on 2007-05-22, Anne van Kesteren wrote:
> 
> > For compatibility with the web it seems important to simply ignore 
> > Content-Type in all modes.
> 
> I'm confused about "in all modes" in this context. I thought the idea was 
> to do away with modes altogether?

No, HTML5 has the three modes legacy has presented us with. (That's one 
reason I'm so scared of Microsoft's move with IE8 -- I don't want to have 
to specify a dozen more undocumented bug sets).


> With Content-Type, one can serve HTML, CSS, PHP, etc. as text/plain. 
> Useful to provide example code. I'm sure there are more use cases where 
> there is no single correct interpretation other than the one the author 
> specifies. Should we really make that impossible?

No. And indeed the spec does not make it impossible.


> With content-sniffing, users need to fetch images even when they cannot see 
> them, audio even when they cannot hear it. ["fetch" == wait for the data 
> transfer, pay for the traffic, and wait for the content-sniffing parser 
> to do its dance.]

I'm not sure why content-sniffing affects this. It's not like UAs do HEAD 
round trips on every resource.


> What about new types of content? It seems to me that relying on 
> content-sniffing would mean that a new file format would have to be 
> registered by browsers before they can do anything useful with it. With 
> Content-Type OTOH, a browser can always be configured to do something 
> useful (pass it on to an appropriate helper app) with a particular new 
> file type.

Indeed.


On Fri, 17 Aug 2007, Geoffrey Sneddon wrote:
>
> Step 10 of Feed/HTML sniffing (part of detailed review of "Determining 
> the type of a new resource in a browsing context"), as of writing, is an 
> unresolved issue: "If, before the next ">", you find two xmlns* 
> attributes with http://www.w3.org/1999/02/22-rdf-syntax-ns# and 
> http://purl.org/rss/1.0/ as the namespaces, then the sniffed type of the 
> resource is "application/rss+xml", abort these steps. (maybe we only 
> need to check for http://purl.org/rss/1.0/ actually)"
> 
> The first, and most obvious issue, is that RSS 0.90's namespace 
> (<http://my.netscape.com/rdf/simple/0.9/>) should be an alternative for 
> RSS 1.0's here.

Why? I'm trying to keep this at the absolute minimum required, and as far 
as I know browsers don't use that namespace in their sniffing algorithms.
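
For reference, the step in question amounts to something like this loose 
sketch (it just looks for both namespace URLs in the bytes before the 
next ">" rather than really parsing xmlns attributes):

    RDF_NS = b"http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    RSS10_NS = b"http://purl.org/rss/1.0/"

    def sniffs_as_rss10(data, pos):
        # Examine only the bytes up to the next ">", as the step says.
        end = data.find(b">", pos)
        tag = data[pos:end] if end != -1 else data[pos:]
        return RDF_NS in tag and RSS10_NS in tag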


> Secondly, testing current browsers, I've found that Fx2, Op9.22, and IE7 
> require both the RDF and RSS namespaces to be present. Saf3/Mac requires 
> just the |rdf:RDF| element.

Indeed; that's what the spec's text is based on. The question is whether 
there are any pages that don't have both that would be hurt, rather than 
helped, if we only looked for one of the two.


> I'm yet to come across any site that breaks in Saf3 (if Saf2 used the 
> same feed/HTML heuristics, I haven't had a single site break since May 
> 2005) due to this sniffing, so I'm in favour of just putting |rdf:RDF| 
> within the above table with a sniffed type of "application/rss+xml" 
> (which is the simplest solution to this issue).

This unfortunately would break any RDF page. Not that there are many, I 
understand, but hypothetically, at least. I mean, I know I've not been a 
big fan of the Semantic Web, but I'd rather it failed on its own merits 
instead of us forcing its demise by making all mislabelled RDF pages get 
sniffed as RSS...


On Fri, 17 Aug 2007, Geoffrey Sneddon wrote:
>
> "Determining the type of a new resource in a browsing context"
> 
> The only issue I've so far seen in implementing it (and don't expect 
> this to show up in php-html-5-direct any time soon — I'm being paid to 
> implement this, and therefore I'm not going through the stages of 
> writing a 1:1 implementation before optimising it, and writing test 
> cases) is the duplication of "Otherwise" in feed or HTML step 6. Perhaps 
> step 4 should just be merged into step 3 (with something like 
> "Otherwise, increase pos by 1 and return to step 2 in these substeps.")?

Fixed.


On Mon, 20 Aug 2007, ryan wrote:
> 
> Section 4.7.4, which deals with sniffing for different content types, 
> has no mention of BOMs.[1]

Fixed.
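
The fix amounts to skipping any leading byte order mark before the 
sniffing proper begins; roughly:

    # Skip a UTF-8 or UTF-16 BOM, if present, and sniff the bytes that
    # follow it instead.
    BOMS = (b"\xef\xbb\xbf",  # UTF-8
            b"\xfe\xff",      # UTF-16BE
            b"\xff\xfe")      # UTF-16LE

    def skip_bom(data):
        for bom in BOMS:
            if data.startswith(bom):
                return data[len(bom):]
        return data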


On Sat, 8 Sep 2007, Geoffrey Sneddon wrote:
> 
> Currently we only sniff text/plain (under certain conditions: there are 
> no Content-Encoding headers and the Content-Type is equal to one of 
> "text/plain", "text/plain; charset=ISO-8859-1", or "text/plain; 
> charset=iso-8859-1") to see whether it is binary content or not. 
> However, this poses issues for a large number of feeds that are served 
> as text/plain: a notable example of this is 
> <http://youtube.com/rss/global/top_favorites.rss>.

Yes.
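
For reference, the precondition quoted above is roughly the following 
(the header access is illustrative):

    SNIFFABLE_TEXT_PLAIN = {
        "text/plain",
        "text/plain; charset=ISO-8859-1",
        "text/plain; charset=iso-8859-1",
    }

    def may_sniff_text_plain(headers):
        # Any Content-Encoding header disables the sniff entirely.
        if headers.get("Content-Encoding") is not None:
            return False
        return headers.get("Content-Type") in SNIFFABLE_TEXT_PLAIN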


On Sun, 9 Sep 2007, ryan wrote:
> 
> As someone who works on a spider that uses feeds heavily, the only way 
> we've found to make it work is to always assume that if it looks like a 
> feed, it should be treated as such. Interactive user agents may have 
> different constraints that lead to different solutions.

That kinda makes you part of the problem. :-P


On Fri, 16 Nov 2007, Hugh Winkler wrote:
>
> In section 4.9 [1]
> 
> "It is imperative that the rules in this section be followed exactly. 
> When two user agents use different heuristics for content type 
> detection, security problems can occur. For example, ..."
> 
> I'm expecting an example of a security problem arising due to two user 
> agents using different heuristics. But what follows isn't very focused:
> 
> "...if a server believes a contributed file to be an image (and thus 
> benign), but a Web browser believes the content to be HTML (and thus 
> capable of executing script), the end user can be exposed to malicious 
> content, "
> 
> Malicious content.... that's bad...
> 
> "...making the user vulnerable to cookie theft attacks and other 
> cross-site scripting attacks."
> 
> I guess so.

Actually the example described (vaguely) above is a real one. Alice 
uploads a PNG to a site, her browser treats it as a PNG, everything is 
fine. Bob comes along, using IE, and IE decides that it's not really a 
PNG, it's an HTML file, and Bob finds that PNG suddenly runs scripts on 
the very same domain as the site. Alice just stole Bob's mojo (or whatever 
it is the site is protecting on Bob's behalf). (This can happen even with 
valid PNGs.)

The point is just that the sniffing, which is sadly required for 
compatibility with the Web, has to be predictable (the same for every 
browser), so that the server doesn't find things unexpectedly turning 
into security vulnerabilities on clients other than the ones it tested.


> The bit about the two user agents never materializes: We have just a 
> server and a user agent.

Hm, valid. I've changed this to talk about client and server, not two 
clients.


On Mon, 19 Nov 2007, Boris Zbarsky wrote:
> Julian Reschke wrote:
> > Multiple media-type values? What would that be good for?
> 
> Rendering the web?  In particular, it's not uncommon for servers (esp. 
> when CGIs are involved) to produce things like:
> 
>   Content-Type: text/html; charset=ISO-8859-1
>   Content-Type: text/plain
> 
> which then get normalized to:
> 
>   Content-Type: text/html; charset=ISO-8859-1, text/plain
> 
> Not sure where that normalization happens offhand (server end or Gecko 
> end).

It seems like the HTTP spec should define how to handle that, but the HTTP 
working group has indicated a desire to not specify error handling 
behaviour, so I guess it's up to us.

IE and Safari use the first one, Firefox and Opera use the last one. I 
guess we'll use the first one.
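
That is, when duplicates have been folded into a single comma-separated 
value, take the first value. As a sketch (naive about commas inside 
quoted parameter values):

    def pick_content_type(folded_value):
        # "text/html; charset=ISO-8859-1, text/plain"
        #     -> "text/html; charset=ISO-8859-1"
        return folded_value.split(",", 1)[0].strip()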


On Tue, 20 Nov 2007, Julian Reschke wrote:
> 
> Having multiple Content-Type headers makes the HTTP message invalid 
> (<http://tools.ietf.org/html/rfc2616#section-4.2>). And yes, I 
> understand the choices for the UA at this point aren't pretty:
> 
> 1) report an error and abort
> 2) ignore the Content-Type headers
> 3) pick one of them (how)
> 4) ...more?

Well, ideally there wouldn't be a choice: HTTP would, like any decent 
spec, cover requirements for error cases as well as valid cases.


On Fri, 25 Jan 2008, Julian Reschke wrote:
> Mark Baker wrote:
> > 
> > A lot of work has gone into sec 4.9, and it's useful for everybody to 
> > know what is currently common practice so I'm all for keeping it.  
> > But what is accomplished by making it normative exactly?

As the spec says, it is imperative that all browsers converge on a common 
(and hopefully minimal) set of heuristics to avoid security problems.


> > Content sniffing is a bug, and IMO we shouldn't mandate that these 
> > bugs needn't be fixed.

Content sniffing is required to browse the Web. Interoperability is worth 
far more than blind adherence to standards. In fact, interoperability is 
exactly what adherence to standards is all about.


> In particular, it seems that neither FF2 nor FF3 follow these rules with 
> respect to ignoring text/plain in certain situations (test cases 008, 
> 009, 010 in 
> <http://www.hixie.ch/tests/adhoc/http/content-type/sniffing/>). So I'd 
> really like to understand why this can be considered a "MUST" level 
> requirement when we have proof that popular browsers can get away with 
> *not* ignoring the Content-Type header here.

The spec goes much further: it actually requires _no_ content sniffing for 
any of those tests right now. (Those tests are very out of date.)


On Fri, 25 Jan 2008, Boris Zbarsky wrote:
> 
> Those are sent with "Content-Encoding: gzip".  Due to an internal 
> limitation, Gecko does not sniff such content at the moment (basically, 
> because sniffing would involve undoing the content encoding first, since 
> sniffing the gzipped data is pointless).  If the test sent the data 
> without Content-Encoding (which is the usual situation for the cases the 
> sniffing is designed to address), those tests would get sniffed as 
> binary.
> 
> Oh, and we really do plan to address the gzip limitation at some point, 
> just so things are consistent and people don't get confused as you did 
> here...

Actually the spec right now requires that there be no content sniffing if 
the Content-Encoding header is set... are you running into cases where 
that is a problem?
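
In sketch form, the current rule is simply the following (the sniff() 
stand-in is whatever the rest of the sniffing algorithm would otherwise 
do):

    def effective_type(headers, body, sniff):
        declared = headers.get("Content-Type")
        # Any Content-Encoding header suppresses sniffing entirely.
        if headers.get("Content-Encoding") is not None:
            return declared
        return sniff(declared, body)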


On Fri, 25 Jan 2008, Julian Reschke wrote:
> 
> OK, but that means the test cases are broken, right?

Yes, those test cases were written long ago, before the spec.


On Fri, 25 Jan 2008, Boris Zbarsky wrote:
> 
> There are a number of sites that do get broken by the Content-Encoding 
> thing. Not as many as by not sniffing at all, but enough.

So should we remove this exception from the spec and require sniffing even 
with Content-Encoding present?


On Fri, 25 Jan 2008, Boris Zbarsky wrote:
> 
> One more thought I had about this today.  Is the real reason the 
> sniffing in the spec is a MUST that UAs must not do any sniffing other 
> than what's specified?  If so, it might make more sense to say that as 
> a MUST and make the existing sniffing stuff a MAY.

That's what the spec says, as far as I can tell. (It allows several 
aspects of the various sniffing requirements to be bypassed, but requires 
that any sniffing that is done be done as per the spec.)


On Fri, 25 Jan 2008, Maciej Stachowiak wrote:
> 
> In other cases (like <img> entirely ignoring the server-reported 
> Content-Type for binary image formats), I think the requirement should 
> remain a MUST for interoperability. Binary image formats already have 
> unique in-stream identifiers and it doesn't ever make sense to treat a 
> GIF as a JPEG so this doesn't have the same issues as sniffing for 
> binaries or RSS feeds where there could indeed be ambiguity.

The "skipping" allowed for sniffing doesn't apply to <img> sniffing.


On Fri, 25 Jan 2008, Boris Zbarsky wrote:
> 
> Oh, one more note.  Gecko's sniffing behavior actually had to be changed 
> recently.  Unfortunately, the more recent Apache installs changed from 
> ISO-8859-1 to UTF-8 as the default encoding, without changing the 
> default content type behavior.  So at this point, in Gecko, data flagged 
> as "text/plain; charset=UTF-8" is also sniffed to see whether it might 
> be binary. Since all of the byte values that trigger the "binary" 
> determination are illegal in UTF-8, as far as I can tell, this shouldn't 
> affect any actual UTF-8 text.  It might be a good idea to update the 
> tests and the spec if people agree, though.

Uppercase only?


On Fri, 25 Jan 2008, Boris Zbarsky wrote:
> Roy wrote:
> > If you start sniffing content with a charset, then you had better 
> > remove support for the charsets that are only used for XSS attacks.

The spec already requires that those not be supported.


> The text content-type sniffing performed by Gecko never results in the 
> browser handling the content as anything other than "text" (per the 
> headers sent by the server) or "binary" (puts up a dialog asking what to 
> do).

The spec's sniffing cannot elevate privileges.

However, the above isn't true as I understand it. Firefox will also, as I 
understand it, sniff text/plain as RSS or Atom feeds.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
