- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Wed, 26 Mar 2008 18:19:16 +0900
- To: "Frank Ellermann" <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>, ietf-http-wg@w3.org
As for digging on this issue, here are a few more quotes from RFC 2070
(Internationalization of the Hypertext Markup Language). [I was a
co-author, but that was a long time ago.]

From the Introduction:

   The specific issues addressed are the SGML document character set to
   be used for HTML, the proper treatment of the charset parameter
   associated with the "text/html" content type and the specification
   of some additional elements and entities.

1.2.2. User agents

   In addition to the requirements of RFC 1866, the following
   requirements are placed on HTML user agents.

   To ensure interoperability and proper support for at least
   ISO-8859-1 in an environment where character encoding schemes other
   than ISO-8859-1 are present, user agents MUST correctly interpret
   the charset parameter accompanying an HTML document received from
   the network.

   ...

[At the time, there were some user agents that fell over when there was
any charset parameter in an HTTP Content-Type: header at all; the above
was written to make sure that user agents that followed RFC 2070 would
avoid just falling over. My understanding is that this problem was
corrected in version 3 or so of Netscape and IE, or anyway in a
timeframe that makes it irrelevant for our current spec.]

2.1. Reference processing model

   ...

   For the HTTP protocol [RFC2068], the external character encoding is
   indicated by the "charset" parameter of the "Content-Type" field of
   the header of an HTTP response. For example, to indicate that the
   transmitted document is encoded in the "JUNET" encoding of Japanese
   [RFC1468], the header will contain the following line:

      Content-Type: text/html; charset=ISO-2022-JP

   ...

   Similarly, if HTML documents are transferred by electronic mail, the
   external character encoding is defined by the "charset" parameter of
   the "Content-Type" MIME header field [RFC2045], and defaults to
   US-ASCII in its absence.

[The US-ASCII default is specifically for email; there is no default in
RFC 2070 for HTTP, for various good reasons, please see separate mail.]

   ...

6. External character encoding issues

   Proper interpretation of a text document requires that the character
   encoding scheme be known. Current HTTP servers, however, do not
   generally include an appropriate charset parameter with the
   Content-Type header. This is bad behaviour, which is even encouraged
   by the continued existence of browsers that declare an unrecognized
   media type when they receive a charset parameter. User agent
   implementators are strongly encouraged to make their software
   tolerant of this parameter, even if they cannot take advantage of
   it.

[the (hopefully historical) issue mentioned above]

   ...

   In the case where a document is accessed from a hyperlink in an
   origin HTML document, a CHARSET attribute is added to the attribute
   list of elements with link semantics (A and LINK), specifically by
   adding it to the linkExtraAttributes entity. The value of that
   attribute is to be considered a hint to the User Agent as to the
   character encoding scheme used by the resource pointed to by the
   hyperlink; it should be the appropriate value of the MIME charset
   parameter for that resource.

[not sure how much this is implemented or in use; it's not directly an
HTTP issue]

   In any document, it is possible to include an indication of the
   encoding scheme like the following, as early as possible within the
   HEAD of the document:

      <META HTTP-EQUIV="Content-Type"
            CONTENT="text/html; charset=ISO-2022-JP">

   This is not foolproof, but will work if the encoding scheme is such
   that ASCII-valued octets stand for ASCII characters only at least
   until the META element is parsed.

[This is very, very widely used. As far as it's HTML, it's nothing HTTP
should be concerned with, but it is highly relevant for HTTP because it
runs directly against any default for the charset parameter in HTTP.]
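
As an illustration of the two places a charset label can appear in the quotes above (the HTTP Content-Type header and the META element), here is a minimal Python sketch. It is not taken from RFC 2070 or from any browser. The names `charset_from_content_type`, `charset_from_meta` and `MetaCharsetFinder` are made up for this example, and it assumes the HTML has already been turned into a string, whereas a real user agent has to scan the raw octets. When no charset is given it returns None rather than assuming a default.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value,
    or None when the parameter is absent (no default is assumed)."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


class MetaCharsetFinder(HTMLParser):
    """Hypothetical helper: remembers the charset of the first
    <META HTTP-EQUIV="Content-Type" CONTENT="..."> element it sees."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        content = attr.get("content")
        if (attr.get("http-equiv") or "").lower() == "content-type" and content:
            self.charset = charset_from_content_type(content)


def charset_from_meta(html_text):
    """Return the charset declared in a META element, or None."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset
```

For the header line quoted from section 2.1, `charset_from_content_type("text/html; charset=ISO-2022-JP")` gives `"iso-2022-jp"`, and a bare `"text/html"` gives None, leaving the choice of fallback to the caller.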

   Note that there are better ways for a server to obtain character
   encoding information, instead of the unreliable META above; see
   [NICOL2] for some details and a proposal.

[Yes indeed, servers never tried to get the charset info out of the
document itself, although (as far as I have been told) that was the
purpose for which this variant of the <meta> syntax had been designed.]

   For definiteness, the "charset" parameter received from the source
   of the document should be considered the most authoritative,
   followed in order of preference by the contents of a META element
   such as the above, and finally the CHARSET parameter of the anchor
   that was followed (if any).

Not everybody likes this, but it has been around for a long time. As
far as I understand, it's respected by most browsers, and changing it
won't make things better (it's just that the other half of the people
involved will be unhappy).

Regards,    Martin.

At 03:41 08/03/26, Frank Ellermann wrote:
>
>Simon Perreault wrote:
>
>> I investigated and came to the conclusion that MIME doesn't
>> specify that text/* has a default character set of ASCII.
>> I may very well be wrong, and this email is also a way to
>> ask you for clarifications.
>
>RFC 2045 chapter 5:
>| For example, the "charset" parameter is applicable to any
>| subtype of "text"
>
>RFC 2045 chapter 5.2:
>| Content-type: text/plain; charset=us-ascii
>|
>| This default is assumed if no Content-Type header field is
>| specified.
>
>RFC 2046 chapter 4.1.2:
>| Note that the character set used, if anything other than
>| US-ASCII, must always be explicitly specified in the
>| Content-Type field.
>
>There are additional remarks about treating unknown text/*
>subtypes like text/plain, and for text/plain the default
>US-ASCII is very clear (various places including RFC 2049,
>minimal MIME conformance).
>
>You can argue that text/html is not "unknown" for a typical
>HTTP application.
>*Historic* info:
>
>RFC 2070, chapter 2.1:
>| HTML, as an application of SGML, does not directly address
>| the question of the external character encoding. This is
>| deferred to mechanisms external to HTML, such as MIME as
>| used by the HTTP protocol or by electronic mail.
>[...]
>| Similarly, if HTML documents are transferred by electronic
>| mail, the external character encoding is defined by the
>| "charset" parameter of the "Content-Type" MIME header field
>| [RFC2045], and defaults to US-ASCII in its absence.
>
>It's fun to see how RFC 2070 avoids to mention the Latin-1
>default in RFC 2068 :-)
>
>> It is also mentioned that other text/* types may not even
>> have the charset parameter. So a default value would be
>> meaningless.
>
>Yes, I'm too lazy to dig in the IANA registry, are you aware
>of a text/* subtype with this property ?
>
>> What I understand from this section is that there is no
>> default character set for text/*.
>
>Not the only possible interpretation, but without doubt the
>dubious text/* MIME default is *NOT* Latin-1, for text/html
>in RFC 2070 it is also *NOT* Latin-1. RFC 2854 is clearer:
>
>RFC 2854 chapter 6 (about text/html):
>| The use of an explicit charset parameter is strongly
>| recommended. While [MIME] specifies "The default character
>| set, which must be assumed in the absence of a charset
>| parameter, is US-ASCII." [HTTP] Section 3.7.1, defines
>| that "media subtypes of the 'text' type are defined to have
>| a default charset value of 'ISO-8859-1'". Section 19.3 of
>| [HTTP] gives additional guidelines. Using an explicit
>| charset parameter will help avoid confusion.
>
> Frank

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Received on Wednesday, 26 March 2008 10:46:18 UTC