Re: Default charsets for text media types [i20] from Martin Duerst on 2008-03-26 (ietf-http-wg@w3.org from January to March 2008)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Wed, 26 Mar 2008 19:37:30 +0900
To: Mark Nottingham <mnot@mnot.net>, HTTP Working Group <ietf-http-wg@w3.org>
Message-Id: <6.0.0.20.2.20080326182835.04c7bec0@localhost>
Hello Mark,

Thanks for laying out the issues. I think there is some
relationship, but it's still good to untangle things.

At 22:15 08/03/25, Mark Nottingham wrote:
>
>
>Trying to summarise, I think there are two separable issues here;
>
>* 1. Should HTTPBIS continue to accommodate historical clients that  
>assume that an unlabeled text/* type is iso-8859-1, rather than MIME's  
>default ASCII?
>
>Roy has argued that this is an important distinction that we should  
>continue to make, as otherwise existing implementations will become  
>non-conformant. Frank points out that no such implementations are in  
>common use today, and that those implementations which did make this  
>assumption have greater problems (e.g., lacking Host headers).

I think Roy is right in that there was indeed a distinction between
HTTP and email here. But Frank is right in that in particular for
HTML and XML (not the smallest 'customers' of HTTP to say the least),
such implementations are indeed not in common use as far as I know.

The reason for this is that in contrast to e.g. text/plain,
HTML and XML come with their internal character encoding
indication mechanism, see below.


>It would be good to hear if anyone else has an opinion, especially if  
>they have experience with / information about such clients, or content  
>which relies upon this default.
>
>The conservative thing to do seems to be to keep the status quo. If we  
>do that, rather than just close the issue as WONTFIX, we could modify  
>the current text to clarify the defaulting (the original question was  
>one of precedence between HTTP defaulting and that defined by the  
>media type in question), and perhaps give a bit of the history.
>
>
>* 2. Should HTTPBIS countenance sniffing for character set on text/*  
>types?
>    a. ...when the charset parameter is not present?
>    b. ...when the charset parameter is iso-8859-1
>    c. ...at other times?
>
>A few people have noted a security issue in a widely-used browser that  
>requires (b). However, I haven't seen a reference to a vulnerability  
>report, etc. yet; is anyone aware of one?

The vulnerability I know about is that there are quite a few cases
of charset combinations where a wrong charset label, interpreted at
face value, can lead to problems. This can happen in the context
of cross-site scripting. I created an (essentially harmless)
example somewhere on the W3C Web site (security by obscurity,
but I can dig it up) when I was still at W3C.
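
As a rough illustration of this class of problem (an assumed example
using UTF-7 versus iso-8859-1, not the one on the W3C site), the
following sketch shows bytes that look harmless under one charset
and turn into markup under another:

```python
# Illustrative payload (hypothetical): it contains no literal "<" or ">" bytes.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# A filter that reads the bytes as ISO-8859-1 finds no angle brackets.
as_latin1 = payload.decode("iso-8859-1")
looks_safe = "<" not in as_latin1 and ">" not in as_latin1

# A recipient that takes a UTF-7 label at face value sees a script element.
as_utf7 = payload.decode("utf-7")
```

Which side holds the wrong label varies from case to case; the point
is only that the same bytes carry different meanings under different
labels.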

[At some point in time, the Apache standard distribution came with
AddDefaultCharset iso-8859-1
and a comment claiming that this would always be a good idea, but
this was fixed, see
http://mail-archives.apache.org/mod_mbox/httpd-cvs/200502.mbox/%3C20050204000827.51572.qmail@minotaur.apache.org%3E]

>Some people have spoken in favour of (a),

It depends a lot on what is meant by "sniffing". For HTML and XML,
there are quite well established methods to look inside the document
and find some information about the charset in there; we have to make
sure we write the HTTP spec so that this is allowed. A proposal
circulated earlier on this list somehow went in this direction, but
it mentioned something like "the first 16 bytes" or so, which doesn't
take into account the actual length up to an "encoding" pseudo-attribute
in the XML declaration, much less the length up to a <meta> element
in HTML. While this mechanism relies on bootstrapping the encoding
and therefore cannot handle any completely new encoding methods
(if such encoding methods were to be invented), it works very reliably
for practical cases.

So in my view, (a) clearly has to be allowed in as far as it refers to
content types (e.g. HTML, XML) with a well-defined way to indicate
the character encoding internally, and the extraction and usage of
this information.
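
A minimal sketch of such extraction, assuming an ASCII-compatible
encoding (the bootstrapping limitation mentioned above). The function
name, the regular expressions and the 1024-byte scan limit are
hypothetical choices for illustration; they are not taken from the
XML or HTML specifications:

```python
import re

# Hypothetical scan limit; a fixed 16 bytes is too short in practice.
SCAN_LIMIT = 1024

# "encoding" pseudo-attribute in an XML declaration.
XML_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)
# charset inside an HTML <meta> element (either attribute form).
META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?([A-Za-z][A-Za-z0-9._-]*)',
    re.IGNORECASE,
)

def declared_charset(body: bytes, limit: int = SCAN_LIMIT):
    """Return the charset named inside the document, or None if none is found."""
    head = body[:limit]
    for pattern in (XML_DECL, META_CHARSET):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return None

xml_doc = b'<?xml version="1.0" encoding="Shift_JIS"?><doc/>'
html_doc = (b'<html><head><meta http-equiv="Content-Type" '
            b'content="text/html; charset=EUC-JP"></head></html>')
```

With `limit=16` the XML example above yields no result, because the
first 16 bytes end inside the version pseudo-attribute.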

Overall, in a case such as HTML, the following is a list of the
priorities of charset information as I think they are mostly being
used in browsers, or if not, should be used:

1. Explicit, per-document override by user
   (after the document has been received and looked at,
    always needed as a last resort because sometimes the label
    is wrong, wherever it may come from)
   [no need to talk about this in the HTTP spec]

2. *explicit* external information, the charset parameter on Content-Type

3. *explicit* information internal to the document for media types
   where this is well-defined

[4. potentially information from a link that was followed, although
    I don't think this is widely implemented or used]

5. A 'default' setting on the browser side for unlabeled documents.
   In many cases, the purpose of this is to simply indicate the
   charset that is expected in documents that the user is going to
   view most frequently. This may be iso-8859-1 (or actually
   windows-1252) in Western Europe and much of the Americas, but
   by virtue of that charset being used widely in these areas,
   not by virtue of anything in the HTTP spec. It may be
   iso-8859-2 or so in some parts of Eastern Europe. It is
   typically a "guess Japanese encoding" here in Japan, because
   when limited to charsets customarily used with Japanese,
   detecting the actual encoding works very well on
   a reasonably-sized document. It may in some cases also
   include a "guessing" option that tries to guess among any
   and all charsets available to the browser (full-fledged
   "sniffing").

In order to be in line with current practice, we have to make
sure that we don't write the spec to disallow 3. [or 4.] or 5.
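
The priority order above can be sketched as a simple fallback chain.
The function and parameter names are hypothetical, and the
windows-1252 default is only an assumed example of a locale-dependent
setting:

```python
from typing import Optional

def choose_charset(
    user_override: Optional[str] = None,     # 1. explicit per-document override
    header_charset: Optional[str] = None,    # 2. charset parameter on Content-Type
    internal_charset: Optional[str] = None,  # 3. declaration inside the document
    link_charset: Optional[str] = None,      # 4. hint from the link that was followed
    browser_default: str = "windows-1252",   # 5. browser-side default (assumed value)
) -> str:
    """Return the first available charset in priority order 1 to 5."""
    for candidate in (user_override, header_charset, internal_charset, link_charset):
        if candidate:
            return candidate.lower()
    return browser_default
```

For example, `choose_charset(header_charset="UTF-8", internal_charset="euc-jp")`
picks the Content-Type label, while a call with no arguments falls
through to the browser default.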
  

>but I note with interest this text in p3, 3.1.1;
>> Some HTTP/1.0 software has interpreted a Content-Type header without  
>> charset parameter incorrectly to mean "recipient should guess."

This was and is not in line with current practice, and should be
replaced.

>> Senders wishing to defeat this behavior MAY include a charset  
>> parameter even when the charset is ISO-8859-1 ([ISO-8859-1]) and  
>> SHOULD do so when it is known that it will not confuse the recipient.

This is only a SHOULD because, as mentioned in my other mail,
older (indeed VERY old) clients fell over in the face of a
charset parameter.

>If we allow either (a) or (b), this will have to be re-worked.

Yes indeed.

Regards,    Martin.

>Also, it's notable that allowing (a) may make (1) easier to resolve in  
>favour of dropping the HTTP-specific default; i.e., the default would  
>shift from iso-8859-1 to "sniff".
>
>Does anyone think that these are so intertwined that (2) should not be  
>a separate issue? If we can resolve it, (1) should follow.
>
>Cheers,
>
>--
>Mark Nottingham     http://www.mnot.net/
>
>


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Received on Wednesday, 26 March 2008 10:52:36 UTC