W3C home > Mailing lists > Public > ietf-http-wg@w3.org > January to March 2008

Re: Unknown text/* subtypes [i20]

From: Roy T. Fielding <fielding@gbiv.com>
Date: Tue, 12 Feb 2008 13:12:44 -0800
Message-Id: <55862273-E0CB-4B80-B353-2FA63883E6AF@gbiv.com>
Cc: Julian Reschke <julian.reschke@gmx.de>, Geoffrey Sneddon <foolistbar@googlemail.com>, Mark Nottingham <mnot@mnot.net>, Robert Sayre <rsayre@mozilla.com>
To: HTTP Working Group <ietf-http-wg@w3.org>

On Feb 12, 2008, at 10:01 AM, Robert Sayre wrote:
> On Feb 12, 2008, at 12:49 PM, Julian Reschke wrote:
>> Roy T. Fielding wrote:
>>>
>>> First, that's not true.  Four popular browser implementations do not
>>> obey the standard.  They still don't obey the change you made.  They
>>> do not
>>
>> I'm aware of that, but at least this change would mean that they  
>> are violating one specification less than before.
>
> I agree that browsers do not amount to the measure of all user  
> agents. However, their HTTP behavior will appear in many  
> applications, not just browsers. MSHTML, WebKit, and Gecko have  
> been embedded in many types of apps. Some browsers are relying on  
> OS networking libraries as well--Safari relies on CFNetwork and  
> some Linux browsers use libcurl.

Of course, but they aren't all user agents.  In order to make a normative
change to how messages are interpreted (i.e., the entire point of this spec)
without changing the HTTP version number, all implementations that were
compliant before must remain compliant after the change.

The charset issue came about originally because of bugs in one important
browser, Mosaic, that was unable to parse parameters on the Content-Type.
This was not such a big deal at the time because the entire Web required
iso-8859-1 and Mosaic only supported that charset anyway.  So, the solution
imposed (by interoperability with the real world) was that HTTP/1.x would
default the charset to iso-8859-1 instead of US-ASCII.  That is one of
the many reasons why HTTP is not MIME-compliant and never has been.
HTTP IS NOT EMAIL.  We have different requirements, different performance
limitations, and different implementors.

Here we are, 14 years later, and we have a new set of interoperability
requirements.  Browsers fixed their original charset limitations, a new
somewhat-universal charset is finally supported by somewhat-universal
font availability, and the Web consists of dozens of different charsets,
most of which are left unlabeled because there is no commonly accepted
way of indicating charsets in filename metadata (and no real need to,
since user agents will either sniff the content anyway or just
assume everything is in the fixed local charset known by the tool).
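What that sniffing typically amounts to can be illustrated with a minimal
sketch (my illustration, not something from the original discussion): check
the leading octets of an unlabeled body for a Unicode byte-order mark, and
otherwise fall back to a fixed local default.  Real user agents do far more
heuristic guessing than this.

```python
# Minimal charset sniff for unlabeled text content: look for a BOM in
# the first few octets, else fall back to a default.  The BOM table and
# the iso-8859-1 default are illustrative assumptions.
BOMS = [
    (b'\xef\xbb\xbf', 'utf-8'),
    (b'\xff\xfe', 'utf-16-le'),
    (b'\xfe\xff', 'utf-16-be'),
]

def sniff_charset(body, default='iso-8859-1'):
    """Guess a charset from the leading octets of `body` (a bytes value)."""
    for bom, charset in BOMS:
        if body.startswith(bom):
            return charset
    return default
```

For example, `sniff_charset(b'\xef\xbb\xbf<html>')` reports utf-8, while a
body with no BOM falls back to the iso-8859-1 default.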

Servers, OTOH, send text/* content with the assumption that it will be
treated as iso-8859-1 (or at least some safe superset of US-ASCII).
Servers don't sniff content because they can't -- it is impossible to
look at every byte of a page while handling 7,000 reqs/sec, let alone
the 20,000 reqs/sec that a decently tuned server can handle.  In addition,
some servers (particularly when serving dynamic content) will add a
charset parameter to unlabeled text/html content based upon how they have
been configured to scan for cross-site scripting.  They do so specifically
because of known bugs in browsers that sniff the content for bizarre
charsets that bypass the resource's security assumptions and
cause the browser's user to fall victim to stupid XSS attacks.

None of these implementations assume that a missing charset means
US-ASCII.  We cannot "pass the buck" to MIME because we are still
not MIME-compliant and never will be (see Content-Encoding).

So, what is interoperable now?

The answer is that iso-8859-1 is still the most interoperable default
*with* the addition of safe sniffing only when the charset is left
unlabeled or when charset="iso-8859-1".  By safe sniffing, I mean
specifically excluding any charset-switching in mid-content
for which the text media type's delimiter set (e.g., <"':> in HTML)
would be mapped to different octets than they are in US-ASCII.
In other words, it is safe to sniff for charsets in the first ten
or so characters, and also to switch to other US-ASCII supersets
after reading something like the <meta http-equiv="content-type" ...>,
but it is definitely unsafe to continue sniffing for charset changes
after that point unless they are restricted to US-ASCII supersets.
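The "US-ASCII superset" test that makes a mid-content switch safe can be
sketched as follows (my sketch, not part of the original message, and
stricter than strictly necessary: it requires every ASCII octet to be
preserved, not just the media type's delimiter set):

```python
# A charset switch is safe only if the candidate charset maps the
# US-ASCII octets -- and hence the media type's delimiters such as
# < " ' : in HTML -- to the same characters US-ASCII does.
def is_ascii_superset(charset):
    """True if `charset` decodes all 128 US-ASCII octets identically
    to US-ASCII; False for unknown or non-ASCII-compatible charsets."""
    ascii_octets = bytes(range(0x80))
    try:
        return ascii_octets.decode(charset) == ascii_octets.decode('ascii')
    except (LookupError, UnicodeDecodeError):
        return False
```

Under this test iso-8859-1 and utf-8 qualify as safe supersets, while
charsets like utf-16, which remap the delimiter octets, do not.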

That allows HTTP/1.1-compliant serving today to remain compliant
after the change, and addresses all of the interoperability issues
in regard to mislabeled content without ignoring the fact that the
main reason they are mislabeled today is to work around existing
bugs.  For all other cases, the charset can and should be labeled
correctly.

....Roy
Received on Tuesday, 12 February 2008 21:12:35 GMT
