stability of content type sniffing algorithm? contentTypeOverride-24 / issue-24 from Dan Connolly on 2009-05-28 (www-tag@w3.org from May 2009)

From: Dan Connolly <connolly@w3.org>
Date: Thu, 28 May 2009 11:34:27 -0500
To: www-tag@w3.org
Message-Id: <1243528467.5864.7928.camel@pav.lan>

I recently gave the mime-sniff draft a somewhat closer look,
including these two paragraphs, which looked familiar:

[[
   This document describes a mime sniffing algorithm that carefully
   balances the compatibility needs of browser vendors with the security
   constraints.  The algorithm has been constructed with reference to
   mime sniffing algorithms present in popular Web browsers, an
   extensive database of Web content, and metrics collected from
   implementations deployed to a sizable number of Web users.

   Warning!  It is imperative that the algorithm in this document be
   followed exactly.  When a user agent uses different heuristics for
   content type detection than the server expects, security problems can
   occur.  For example, if a server believes that the client will treat
   a contributed file as an image (and thus treat it as benign), but a
   Web browser believes the content to be HTML (and thus execute any
   scripts contained therein), the end user can be exposed to malicious
   content, making the user vulnerable to cookie theft attacks and other
   cross-site scripting attacks.
]]
 -- http://ietfreport.isoc.org/idref/draft-abarth-mime-sniff/

I had an uneasiness about them that I wasn't sure how to articulate,
but then I just read this:

-------- Forwarded Message --------
http://lists.w3.org/Archives/Public/public-html/2009May/0524.html
> From: Sam Ruby <rubys@intertwingly.net>
> To: Anne van Kesteren <annevk@opera.com>
> Cc: Maciej Stachowiak <mjs@apple.com>, Roy T. Fielding
> <fielding@gbiv.com>, Larry Masinter <masinter@adobe.com>, HTML WG
> <public-html@w3.org>
> Subject: Re: HTML interpreter vs. HTML user agent
> Date: Thu, 28 May 2009 09:41:36 -0400
[...]
> The actual observed behavior of user agents designed to (primarily) 
> process content of a certain media type (either in general, or in the 
> specific context) is to make every effort to parse the content according 
> to those rules, and only if such rules fail to produce meaningful 
> results will they investigate alternatives.
> 
> Browsers will first attempt to process content as HTML.
> FeedReaders will first attempt to process content as a feed.
> Media players will first attempt to process content as media.
> 
> Browsers, when chasing an image tag, will make different assumptions 
> than when presented with a raw uri from the chrome.
> 
> All are equally "right" or "wrong".
> 
> None of this is meant to imply that the behavior that is being settled 
> upon by browser manufacturers isn't worth specifying or standardizing.
> 
> - Sam Ruby
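
To make the quoted observation concrete, here is a minimal sketch of
"try the primary type first, then investigate alternatives". The parser
functions, their byte-string checks, and the interpret() helper are all
hypothetical; this is not the algorithm in draft-abarth-mime-sniff, only
an illustration of how the same bytes can come out differently depending
on what the agent presumes it is fetching.

```python
from typing import Optional

# Hypothetical parsers: each returns a label when the content makes
# sense under that type's rules, and None otherwise.
def parse_html(body: bytes) -> Optional[str]:
    return "html" if b"<html" in body.lower() else None

def parse_feed(body: bytes) -> Optional[str]:
    lowered = body.lower()
    return "feed" if b"<rss" in lowered or b"<feed" in lowered else None

PARSERS = {"html": parse_html, "feed": parse_feed}

def interpret(body: bytes, primary: str) -> Optional[str]:
    """Try the agent's primary type first, then fall back to the others."""
    order = [primary] + [name for name in PARSERS if name != primary]
    for name in order:
        result = PARSERS[name](body)
        if result is not None:
            return result
    return None

# The same bytes, fetched by two agents with different presumptions.
ambiguous = b"<html><body><rss version='2.0'></rss></body></html>"
print(interpret(ambiguous, primary="html"))  # browser-like agent
print(interpret(ambiguous, primary="feed"))  # feed-reader-like agent
```

With these toy checks the two calls disagree about the same response
body, which is the situation the draft's warning is concerned with.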

Is there any reason to believe that the next sort of content
to hit the web won't disrupt things much like Java .jar files
and RSS/Atom feeds and mp3/wma media did?

I think it's worthwhile to update our finding on authoritative
metadata* to acknowledge draft-abarth-mime-sniff and the practice
it represents... but I'm struggling to figure out exactly
what to say.

 * http://www.w3.org/2001/tag/doc/mime-respect-20060412

It's pretty clear to me that people will take the shortest path
to their target, and that usually doesn't involve editing
the .htaccess file when they test their RSS file with their
RSS readers. It's not until the RSS reader gets integrated
into the web browser that the HTTP client's presumption
that it's getting a feed goes away (and even then,
not completely).


-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
gpg D3C2 887B 0F92 6005 C541  0875 0F91 96DE 6E52 C29E
Received on Thursday, 28 May 2009 16:34:33 UTC