Re: Default charsets for text media types [i20]

Simon Perreault wrote:
 
> I investigated and came to the conclusion that MIME doesn't
> specify that text/* has a default character set of ASCII.
> I may very well be wrong, and this email is also a way to
> ask you for clarifications.

RFC 2045 chapter 5:
| For example, the "charset" parameter is applicable to any
| subtype of "text"

RFC 2045 chapter 5.2:
|    Content-type: text/plain; charset=us-ascii
|
| This default is assumed if no Content-Type header field is
| specified.

RFC 2046 chapter 4.1.2:
| Note that the character set used, if anything other than
| US-ASCII, must always be explicitly specified in the 
| Content-Type field.

There are additional remarks about treating unknown text/*
subtypes like text/plain, and for text/plain the default
US-ASCII is very clear (various places including RFC 2049,
minimal MIME conformance).

You can argue that text/html is not "unknown" for a typical
HTTP application.  *Historic* info:

RFC 2070, chapter 2.1:
| HTML, as an application of SGML, does not directly address
| the question of the external character encoding. This is
| deferred to mechanisms external to HTML, such as MIME as
| used by the HTTP protocol or by electronic mail.
[...]
| Similarly, if HTML documents are transferred by electronic
| mail, the external character encoding is defined by the 
| "charset" parameter of the "Content-Type" MIME header field
| [RFC2045], and defaults to US-ASCII in its absence.

It's fun to see how RFC 2070 avoids mentioning the Latin-1
default in RFC 2068 :-)

> It is also mentioned that other text/* types may not even
> have the charset parameter. So a default value would be 
> meaningless.

Yes.  I'm too lazy to dig through the IANA registry; are you
aware of a text/* subtype with this property?

> What I understand from this section is that there is no 
> default character set for text/*.

That is not the only possible interpretation, but without doubt
the dubious text/* MIME default is *NOT* Latin-1, and for
text/html in RFC 2070 it is also *NOT* Latin-1.  RFC 2854 is
clearer:

RFC 2854 chapter 6 (about text/html):
| The use of an explicit charset parameter is strongly 
| recommended.  While [MIME] specifies "The default character
| set, which must be assumed in the absence of a charset
| parameter, is US-ASCII."  [HTTP] Section 3.7.1, defines
| that "media subtypes of the 'text' type are defined to have
| a default charset value of 'ISO-8859-1'".  Section 19.3 of
| [HTTP] gives additional guidelines.  Using an explicit
| charset parameter will help avoid confusion.
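
To make the conflict concrete, here is a minimal Python sketch of
how a recipient might pick the effective charset for a text/* body.
The function name effective_charset and the transport labels "mime"
and "http" are my own illustration, not anything the RFCs define;
only the two default values come from the texts quoted above.

```python
from email.message import Message

# Defaults from the quoted specs: MIME (RFC 2045/2046) says US-ASCII,
# HTTP (RFC 2068 section 3.7.1) says ISO-8859-1.  The dictionary keys
# are hypothetical labels chosen for this sketch.
DEFAULT_CHARSET = {"mime": "us-ascii", "http": "iso-8859-1"}


def effective_charset(content_type, transport):
    """Return the charset a recipient would assume, or None.

    content_type is the raw Content-Type header value, or None if the
    header is absent.  transport is "mime" or "http".
    """
    msg = Message()
    if content_type is None:
        # Only MIME defines what a missing header means
        # (text/plain; charset=us-ascii, RFC 2045 section 5.2).
        if transport != "mime":
            return None
    else:
        msg["Content-Type"] = content_type

    # The defaults discussed here apply to text/* only.
    if msg.get_content_maintype() != "text":
        return None

    # An explicit charset parameter always wins (returned lowercased).
    explicit = msg.get_content_charset()
    if explicit is not None:
        return explicit

    return DEFAULT_CHARSET[transport]


print(effective_charset("text/html", "mime"))
print(effective_charset("text/html", "http"))
print(effective_charset("text/html; charset=UTF-8", "http"))
```

The same text/html header without a charset parameter yields two
different answers depending on the transport, which is exactly why
RFC 2854 recommends always sending the parameter explicitly.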

 Frank

Received on Tuesday, 25 March 2008 18:39:57 UTC