- From: Roy T. Fielding <fielding@gbiv.com>
- Date: Tue, 12 Feb 2008 13:12:44 -0800
- To: HTTP Working Group <ietf-http-wg@w3.org>
- Cc: Julian Reschke <julian.reschke@gmx.de>, Geoffrey Sneddon <foolistbar@googlemail.com>, Mark Nottingham <mnot@mnot.net>, Robert Sayre <rsayre@mozilla.com>
On Feb 12, 2008, at 10:01 AM, Robert Sayre wrote:

> On Feb 12, 2008, at 12:49 PM, Julian Reschke wrote:
>> Roy T. Fielding wrote:
>>> First, that's not true. Four popular browser implementations do not
>>> obey the standard. They still don't obey the change you made. They
>>> do not
>>
>> I'm aware of that, but at least this change would mean that they are
>> violating one specification less than before.
>
> I agree that browsers do not amount to the measure of all user agents.
> However, their HTTP behavior will appear in many applications, not just
> browsers. MSHTML, WebKit, and Gecko have been embedded in many types of
> apps. Some browsers are relying on OS networking libraries as well --
> Safari relies on CFNetwork and some Linux browsers use libcurl.

Of course, but they aren't all user agents. In order to make a normative change to how messages are interpreted (i.e., the entire point of this spec) without changing the HTTP version number, all implementations that were compliant before must remain compliant after the change.

The charset issue came about originally because of bugs in one important browser, Mosaic, which was unable to parse parameters on the Content-Type. This was not such a big deal at the time because the entire Web required iso-8859-1 and Mosaic only supported that charset anyway. So, the solution imposed (by interoperability with the real world) was that HTTP/1.x would default the charset to iso-8859-1 instead of US-ASCII. That is one of the many reasons why HTTP is not MIME-compliant and never has been. HTTP IS NOT EMAIL. We have different requirements, different performance limitations, and different implementors.

Here we are, 14 years later, and we have a new set of interoperability requirements. Browsers fixed their original charset limitations, a new somewhat-universal charset is finally supported by somewhat-universal font availability, and the Web consists of dozens of different charsets, most of which are left unlabeled because there is no commonly accepted way of indicating charsets in filename metadata (and no real need anyway, since user agents will either sniff the content or just assume everything is in the fixed local charset known by the tool).

Servers, OTOH, send text/* content with the assumption that it will be treated as iso-8859-1 (or at least some safe superset of US-ASCII). Servers don't sniff content because they can't -- it is impossible to look at every byte of a page while handling 7,000 reqs/sec, let alone the 20,000 reqs/sec that a decently tuned server can handle.

In addition, some servers (particularly when serving dynamic content) will add a charset parameter to unlabeled text/html content based upon how they have been configured to scan for cross-site scripting. They do so specifically because of known bugs in browsers that sniff the content for bizarre charsets, bypassing the resource's security assumptions and causing the browser's user to fall victim to stupid XSS attacks. None of these implementations assume that a missing charset means US-ASCII.

We cannot "pass the buck" to MIME because we are still not MIME-compliant and never will be (see Content-Encoding).

So, what is interoperable now? The answer is that iso-8859-1 is still the most interoperable default, *with* the addition of safe sniffing only when the charset is left unlabeled or when charset="iso-8859-1".
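As a minimal sketch of that rule (my illustration, not anything from the spec; the helper name and the (charset, may_sniff) return shape are invented), a recipient's charset handling might look like:

```python
# Sketch of the HTTP/1.x text/* charset rule described above: default to
# iso-8859-1 when no charset parameter is present, and permit sniffing
# only for unlabeled content or content labeled "iso-8859-1".
from email.message import Message  # stdlib parser for MIME-style header params

def effective_charset(content_type: str) -> tuple[str, bool]:
    """Return (charset, may_sniff) for a text/* Content-Type field value."""
    msg = Message()
    msg["Content-Type"] = content_type
    labeled = msg.get_param("charset")       # None when no charset parameter
    if labeled is None:
        return "iso-8859-1", True            # HTTP/1.x default; sniffing allowed
    labeled = labeled.lower()
    # An explicit iso-8859-1 label is the one labeled case where safe
    # sniffing remains allowed, since that label is so often a server default.
    return labeled, labeled == "iso-8859-1"

# effective_charset("text/html")                     -> ("iso-8859-1", True)
# effective_charset("text/html; charset=UTF-8")      -> ("utf-8", False)
# effective_charset("text/html; charset=iso-8859-1") -> ("iso-8859-1", True)
```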
By safe sniffing, I mean specifically excluding any charset-switching in mid-content for which the text media type's delimiter set (e.g., <"':> in HTML) would be mapped to different octets than they are in US-ASCII. In other words, it is safe to sniff for charsets in the first ten or so characters, and also to switch to other US-ASCII supersets after reading something like the <meta http-equiv="content-type" ...>, but it is definitely unsafe to continue sniffing for charset changes after that point unless they are restricted to US-ASCII supersets.

That allows HTTP/1.1-compliant serving today to remain compliant after the change, and it addresses all of the interoperability issues in regard to mislabeled content without ignoring the fact that the main reason content is mislabeled today is to work around existing bugs. For all other cases, the charset can and should be labeled correctly.
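A minimal sketch of that delimiter test, assuming HTML's delimiter set and a hypothetical helper name (again, not code from this message): a sniffed charset may be honored only when it encodes those delimiters to the very same octets US-ASCII does.

```python
# Sketch of the "safe superset" test described above: a charset switch is
# safe only if the text media type's delimiters (HTML's < > " ' : = here)
# map to the same octets as in US-ASCII, so a switch cannot move markup
# boundaries out from under the parser.
import codecs

HTML_DELIMITERS = '<>"\':='

def is_safe_superset(charset: str) -> bool:
    """True if `charset` encodes the HTML delimiters exactly as US-ASCII does."""
    try:
        codec = codecs.lookup(charset)
    except LookupError:
        return False                         # unknown charset: refuse the switch
    for ch in HTML_DELIMITERS:
        try:
            if codec.encode(ch)[0] != ch.encode("ascii"):
                return False                 # a delimiter moves; unsafe to switch
        except (UnicodeError, ValueError):
            return False
    return True

# is_safe_superset("utf-8")  -> True   (US-ASCII superset)
# is_safe_superset("cp1252") -> True
# is_safe_superset("utf-16") -> False  (delimiters become multi-octet, BOM-prefixed)
```

....Roy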