Re: Default charsets for text media types [i20]

As for digging on this issue, here are a few more quotes from RFC 2070
(Internationalization of the Hypertext Markup Language).
[I was a co-author, but that was a long time ago.]

 From the Introduction:

   The specific issues addressed are the SGML document character set to
   be used for HTML, the proper treatment of the charset parameter
   associated with the "text/html" content type and the specification of
   some additional elements and entities.

1.2.2. User agents

   In addition to the requirements of RFC 1866, the following
   requirements are placed on HTML user agents.

      To ensure interoperability and proper support for at least ISO-
      8859-1 in an environment where character encoding schemes other
      than ISO-8859-1 are present, user agents MUST correctly interpret
      the charset parameter accompanying an HTML document received from
      the network.

      ...

[At the time, there were some user agents that fell over when there
was any charset parameter in an HTTP Content-Type: header at all;
the above was written to make sure that user agents that followed
RFC 2070 would avoid just falling over. My understanding is that
this problem was corrected in version 3 or so of Netscape and IE,
or anyway in a timeframe that makes it irrelevant for our current
spec.]


2.1. Reference processing model

   ...

   For the HTTP protocol [RFC2068], the external character encoding is
   indicated by the "charset" parameter of the "Content-Type" field of
   the header of an HTTP response. For example, to indicate that the
   transmitted document is encoded in the "JUNET" encoding of Japanese
   [RFC1468], the header will contain the following line:

   Content-Type: text/html; charset=ISO-2022-JP

   ...

   Similarly, if HTML documents are transferred by electronic mail, the
   external character encoding is defined by the "charset" parameter of
   the "Content-Type" MIME header field [RFC2045], and defaults to US-
   ASCII in its absence.

[The US-ASCII default is specifically for email; there is no default
 in RFC 2070 for HTTP, for various good reasons. Please see separate mail.]
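
To make the distinction concrete, here is a minimal sketch of reading the
"charset" parameter from a Content-Type value and applying the US-ASCII
default only for mail. The function name external_charset and the
"transport" argument are hypothetical, chosen for illustration; the parsing
relies on Python's standard email.message.Message.get_param.

```python
from email.message import Message


def external_charset(content_type, transport):
    """Return the charset label from a Content-Type value.

    `transport` is a hypothetical switch: "mail" or "http".
    Returns None when no charset is given and no default applies.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")  # unquoted value, or None if absent
    if isinstance(charset, str) and charset:
        return charset
    # RFC 2070 states a US-ASCII default for mail only; for HTTP it
    # states no default, so nothing is assumed here.
    return "US-ASCII" if transport == "mail" else None


print(external_charset("text/html; charset=ISO-2022-JP", "http"))
print(external_charset("text/html", "mail"))
print(external_charset("text/html", "http"))
```

Returning None for HTTP leaves the decision to whatever other information
is available, rather than building a default into the parser.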

   ...

6. External character encoding issues

   Proper interpretation of a text document requires that the character
   encoding scheme be known.  Current HTTP servers, however, do not
   generally include an appropriate charset parameter with the Content-
   Type header.  This is bad behaviour, which is even encouraged by the
   continued existence of browsers that declare an unrecognized media
   type when they receive a charset parameter.  User agent
   implementators are strongly encouraged to make their software
   tolerant of this parameter, even if they cannot take advantage of it.

[the (hopefully historical) issue mentioned above]

   ...

   In the case where a document is accessed from a hyperlink in an
   origin HTML document, a CHARSET attribute is added to the attribute
   list of elements with link semantics (A and LINK), specifically by
   adding it to the linkExtraAttributes entity.  The value of that
   attribute is to be considered a hint to the User Agent as to the
   character encoding scheme used by the resource pointed to by the
   hyperlink; it should be the appropriate value of the MIME charset
   parameter for that resource.

[not sure how much this is implemented or in use; it's not directly
an HTTP issue]

   In any document, it is possible to include an indication of the
   encoding scheme like the following, as early as possible within the
   HEAD of the document:

    <META HTTP-EQUIV="Content-Type"
     CONTENT="text/html; charset=ISO-2022-JP">

   This is not foolproof, but will work if the encoding scheme is such
   that ASCII-valued octets stand for ASCII characters only at least
   until the META element is parsed.

[This is very, very widely used. As far as it's HTML, it's nothing
HTTP should be concerned with, but it is highly relevant for HTTP
because it runs directly against any default for the charset parameter
in HTTP.]
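
As an illustration of why this only works for ASCII-compatible encodings,
here is a rough sketch of scanning the first octets of a document for such
a META element. It is an assumption-laden simplification, not what any
particular browser does: the name sniff_meta_charset and the 1024-octet
limit are made up for the example, and the pattern only handles the
attribute order shown in the quote above.

```python
import re

# Matches <META HTTP-EQUIV="Content-Type" CONTENT="...; charset=XYZ">
# on raw octets, which presumes ASCII-valued octets mean ASCII characters.
META_CHARSET = re.compile(
    rb"""<meta\s+http-equiv\s*=\s*["']?content-type["']?\s+"""
    rb"""content\s*=\s*["'][^"']*charset\s*=\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(body, limit=1024):
    """Return the charset named in an early META element, or None."""
    match = META_CHARSET.search(body[:limit])
    return match.group(1).decode("ascii") if match else None


sample = (b'<HEAD>\n<META HTTP-EQUIV="Content-Type"\n'
          b' CONTENT="text/html; charset=ISO-2022-JP">\n</HEAD>')
print(sniff_meta_charset(sample))
```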

                                      Note that there are better ways
   for a server to obtain character encoding information, instead of the
   unreliable META above; see [NICOL2] for some details and a proposal.

[Yes indeed, servers never tried to get the charset info out of the
document itself, although (as far as I have been told) that was the
purpose for which this variant of the <meta> syntax had been designed.]


   For definiteness, the "charset" parameter received from the source of
   the document should be considered the most authoritative, followed in
   order of preference by the contents of a META element such as the
   above, and finally the CHARSET parameter of the anchor that was
   followed (if any).

Not everybody likes this, but it has been around for a long time.
As far as I understand, it's respected by most browsers, and
changing it won't make things better (it's just that the other
half of the people involved will be unhappy).
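
The order of preference quoted above can be sketched as follows. The
function and argument names are hypothetical; each argument stands for a
charset label already extracted from the corresponding source, or None if
that source gave nothing.

```python
def pick_charset(http_charset=None, meta_charset=None, link_charset=None):
    """Return (source, charset) following the RFC 2070 section 6 order.

    Order: charset parameter from the document's source, then the META
    element, then the CHARSET attribute of the link that was followed.
    """
    candidates = (
        ("http", http_charset),
        ("meta", meta_charset),
        ("link", link_charset),
    )
    for source, value in candidates:
        if value:
            return source, value
    return None, None  # nothing known; RFC 2070 gives no HTTP default


print(pick_charset(None, "ISO-2022-JP", "Shift_JIS"))
```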

Regards,    Martin.


At 03:41 08/03/26, Frank Ellermann wrote:
>
>Simon Perreault wrote:
> 
>> I investigated and came to the conclusion that MIME doesn't
>> specify that text/* has a default character set of ASCII.
>> I may very well be wrong, and this email is also a way to
>> ask you for clarifications.
>
>RFC 2045 chapter 5:
>| For example, the "charset" parameter is applicable to any
>| subtype of "text"
>
>RFC 2045 chapter 5.2:
>|    Content-type: text/plain; charset=us-ascii
>|
>| This default is assumed if no Content-Type header field is
>| specified.
>
>RFC 2046 chapter 4.1.2:
>| Note that the character set used, if anything other than
>| US-ASCII, must always be explicitly specified in the 
>| Content-Type field.
>
>There are additional remarks about treating unknown text/*
>subtypes like text/plain, and for text/plain the default
>US-ASCII is very clear (various places including RFC 2049,
>minimal MIME conformance).
>
>You can argue that text/html is not "unknown" for a typical
>HTTP application.  *Historic* info:
>
>RFC 2070, chapter 2.1:
>| HTML, as an application of SGML, does not directly address
>| the question of the external character encoding. This is
>| deferred to mechanisms external to HTML, such as MIME as
>| used by the HTTP protocol or by electronic mail.
>[...]
>| Similarly, if HTML documents are transferred by electronic
>| mail, the external character encoding is defined by the 
>| "charset" parameter of the "Content-Type" MIME header field
>| [RFC2045], and defaults to US-ASCII in its absence.
>
>It's fun to see how RFC 2070 avoids to mention the Latin-1
>default in RFC 2068 :-)
>
>> It is also mentioned that other text/* types may not even
>> have the charset parameter. So a default value would be 
>> meaningless.
>
>Yes, I'm too lazy to dig in the IANA registry, are you aware
>of a text/* subtype with this property ?
>
>> What I understand from this section is that there is no 
>> default character set for text/*.
>
>Not the only possible interpretation, but without doubt the 
>dubious text/* MIME default is *NOT* Latin-1, for text/html
>in RFC 2070 it is also *NOT* Latin-1.  RFC 2854 is clearer:
>
>RFC 2854 chapter 6 (about text/html):
>| The use of an explicit charset parameter is strongly 
>| recommended.  While [MIME] specifies "The default character
>| set, which must be assumed in the absence of a charset
>| parameter, is US-ASCII."  [HTTP] Section 3.7.1, defines
>| that "media subtypes of the 'text' type are defined to have
>| a default charset value of 'ISO-8859-1'".  Section 19.3 of
>| [HTTP] gives additional guidelines.  Using an explicit
>| charset parameter will help avoid confusion.
>
> Frank


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp     

Received on Wednesday, 26 March 2008 10:46:18 UTC