- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Wed, 26 Mar 2008 19:37:30 +0900
- To: Mark Nottingham <mnot@mnot.net>, HTTP Working Group <ietf-http-wg@w3.org>
Hello Mark,

Thanks for laying out the issues. I think there is some relationship,
but it's still good to untangle things.

At 22:15 08/03/25, Mark Nottingham wrote:

>Trying to summarise, I think there are two separable issues here;
>
>* 1. Should HTTPBIS continue to accommodate historical clients that
>assume that an unlabeled text/* type is iso-8859-1, rather than MIME's
>default ASCII?
>
>Roy has argued that this is an important distinction that we should
>continue to make, as otherwise existing implementations will become
>non-conformant. Frank points out that no such implementations are in
>common use today, and that those implementations which did make this
>assumption have greater problems (e.g., lacking Host headers).

I think Roy is right in that there was indeed a distinction between HTTP
and email here. But Frank is right in that, in particular for HTML and
XML (not the smallest 'customers' of HTTP, to say the least), such
implementations are indeed not in common use as far as I know. The
reason for this is that, in contrast to e.g. text/plain, HTML and XML
come with their own internal character encoding indication mechanism;
see below.

>It would be good to hear if anyone else has an opinion, especially if
>they have experience with / information about such clients, or content
>which relies upon this default.
>
>The conservative thing to do seems to be to keep the status quo. If we
>do that, rather than just close the issue as WONTFIX, we could modify
>the current text to clarify the defaulting (the original question was
>one of precedence between HTTP defaulting and that defined by the
>media type in question), and perhaps give a bit of the history.
>
>* 2. Should HTTPBIS countenance sniffing for character set on text/*
>types?
> a. ...when the charset parameter is not present?
> b. ...when the charset parameter is iso-8859-1
> c. ...at other times?
>
>A few people have noted a security issue in a widely-used browser that
>requires (b).
>However, I haven't seen a reference to a vulnerability
>report, etc. yet; is anyone aware of one?

The vulnerability I know about is that there are quite a few charset
combinations where a wrong charset label, interpreted at face value,
can lead to problems. This can happen in the context of cross-site
scripting. I created an (essentially harmless) example somewhere on the
W3C Web site (security by obscurity, but I can dig it up) when I was
still at W3C.

[At some point in time, the Apache standard distribution came with
AddDefaultCharset iso-8859-1 and a comment claiming that this would
always be a good idea, but this was fixed; see
http://mail-archives.apache.org/mod_mbox/httpd-cvs/200502.mbox/%3C20050204000827.51572.qmail@minotaur.apache.org%3E]

>Some people have spoken in favour of (a),

It depends a lot on what is meant by "sniffing". For HTML and XML,
there are quite well-established methods to look inside the document
and find some information about the charset there; we have to make
sure we write the HTTP spec so that this is allowed. A proposal
circulated earlier on this list went somewhat in this direction, but
it mentioned something like "the first 16 bytes" or so, which doesn't
take into account the actual length up to an "encoding"
pseudo-attribute in the XML declaration, much less the length up to a
<meta> element in HTML. While this mechanism relies on bootstrapping
the encoding, and therefore cannot handle any completely new encoding
methods (if such methods were ever invented), it works very reliably
for practical cases.

So in my view, (a) clearly has to be allowed insofar as it refers to
content types (e.g. HTML, XML) with a well-defined way to indicate the
character encoding internally, and to the extraction and use of this
information.

Overall, in a case such as HTML, the following is a list of the
priorities of charset information as I think they are mostly being
used in browsers, or, if not, should be used:

1. Explicit, per-document override by the user (after the document has
   been received and looked at; always needed as a last resort,
   because sometimes the label is wrong, wherever it may come from)
   [no need to talk about this in the HTTP spec]

2. *Explicit* external information: the charset parameter on
   Content-Type.

3. *Explicit* information internal to the document, for media types
   where this is well-defined.

[4. Potentially, information from a link that was followed, although I
   don't think this is widely implemented or used.]

5. A 'default' setting on the browser side for unlabeled documents. In
   many cases, the purpose of this is simply to indicate the charset
   that is expected in the documents the user is going to view most
   frequently. This may be iso-8859-1 (or actually windows-1252) in
   Western Europe and much of the Americas, but by virtue of that
   charset being used widely in these areas, not by virtue of anything
   in the HTTP spec. It may be iso-8859-2 or so in some parts of
   Eastern Europe. It is typically a "guess Japanese encoding" setting
   here in Japan, because when limited to the charsets customarily
   used with Japanese, detecting the actual encoding works very well
   on a reasonably-sized document. It may in some cases also include a
   "guessing" option that tries to guess among any and all charsets
   available to the browser (full-fledged "sniffing").

In order to be in line with current practice, we have to make sure
that we don't write the spec to disallow 3. [or 4.] or 5.

>but I note with interest this text in p3, 3.1.1;
>> Some HTTP/1.0 software has interpreted a Content-Type header without
>> charset parameter incorrectly to mean "recipient should guess."

This was and is not in line with current practice, and should be
replaced.

>> Senders wishing to defeat this behavior MAY include a charset
>> parameter even when the charset is ISO-8859-1 ([ISO-8859-1]) and
>> SHOULD do so when it is known that it will not confuse the recipient.
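[Editor's illustration: the internal-label "bootstrapping" detection and the priority order described above can be sketched roughly as follows. This is illustrative Python, not from the thread; the function names are hypothetical, and a real implementation would also handle byte-order marks and non-ASCII-compatible encoding families (cf. Appendix F of the XML 1.0 recommendation).]

```python
import re
from typing import Optional

def sniff_internal_charset(head: bytes) -> Optional[str]:
    """Look for an internal charset label in the first chunk of a document.

    This works because the XML declaration and the HTML <meta> element
    are ASCII-compatible in practically all encodings in common use, so
    the label can be read before the encoding is fully known.
    """
    text = head.decode("ascii", errors="replace")
    # XML declaration, e.g. <?xml version="1.0" encoding="iso-8859-2"?>
    m = re.search(r'<\?xml[^>]*\bencoding=["\']([A-Za-z0-9._-]+)["\']', text)
    if m:
        return m.group(1).lower()
    # HTML, e.g. <meta charset=utf-8> or content="text/html; charset=utf-8"
    m = re.search(r'<meta[^>]*\bcharset=["\']?([A-Za-z0-9._-]+)', text,
                  re.IGNORECASE)
    if m:
        return m.group(1).lower()
    return None

def resolve_charset(user_override: Optional[str] = None,
                    content_type_charset: Optional[str] = None,
                    internal_charset: Optional[str] = None,
                    browser_default: str = "windows-1252") -> str:
    """Apply the priority order: 1. user override, 2. charset parameter
    on Content-Type, 3. internal label, then 5. the browser-side default
    (item 4, link metadata, is omitted here)."""
    for candidate in (user_override, content_type_charset, internal_charset):
        if candidate:
            return candidate.lower()
    return browser_default
```

For instance, `resolve_charset(content_type_charset="UTF-8", internal_charset="iso-8859-2")` yields "utf-8": the explicit charset parameter outranks the internal label, while an unlabeled document falls through to the browser default.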
The fact that this is only a SHOULD is due to the fact, mentioned in my
other mail, that older (VERY old, indeed) clients fell over in the face
of a charset parameter.

>If we allow either (a) or (b), this will have to be re-worked.

Yes indeed.

Regards,    Martin.

>Also, it's notable that allowing (a) may make (1) easier to resolve in
>favour of dropping the HTTP-specific default; i.e., the default would
>shift from iso-8859-1 to "sniff".
>
>Does anyone think that these are so intertwined that (2) should not be
>a separate issue? If we can resolve it, (1) should follow.
>
>Cheers,
>
>--
>Mark Nottingham     http://www.mnot.net/

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
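[Editor's illustration of issue 1 above: how the HTTP/1.1 default for an unlabeled text/* entity differs from plain MIME's. Illustrative Python; `effective_charset` and its `http` flag are hypothetical names, not from the thread.]

```python
from email.message import Message

def effective_charset(content_type: str, http: bool = True) -> str:
    """Return the charset a recipient should assume for a text/* entity."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")  # None when no charset parameter
    if charset:
        return str(charset).lower()
    # HTTP/1.1 (RFC 2616) defaults unlabeled text/* to ISO-8859-1;
    # plain MIME (RFC 2046) defaults to US-ASCII.
    return "iso-8859-1" if http else "us-ascii"
```

So `effective_charset("text/plain")` gives "iso-8859-1" under the HTTP rule but "us-ascii" with `http=False`, which is exactly the divergence the working group is debating whether to preserve.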
Received on Wednesday, 26 March 2008 10:52:36 UTC