- From: Masayasu Ishikawa <mimasa@w3.org>
- Date: Thu, 05 Oct 2000 06:41:13 +0900
- To: bertilow@hem.passagen.se
- Cc: www-validator@w3.org, XHTML-L@egroups.com
"Bertilo Wennergren" <bertilow@hem.passagen.se> wrote: > What about XHTML (and other XML document types)? As far as XHTML 1.0 served as text/html is concerned, see RFC 2854, "6. Charset default rules": The use of an explicit charset parameter is strongly recommended. While [MIME] specifies "The default character set, which must be assumed in the absence of a charset parameter, is US-ASCII." [HTTP] Section 3.7.1, defines that "media subtypes of the 'text' type are defined to have a default charset value of 'ISO-8859-1'". Section 19.3 of [HTTP] gives additional guidelines. Using an explicit charset parameter will help avoid confusion. Using an explicit charset parameter also takes into account that the overwhelming majority of deployed browsers are set to use something else than 'ISO-8859-1' as the default; the actual default is either a corporate character encoding or character encodings widely deployed in a certain national or regional community. For further considerations, please also see Section 5.2 of [HTML40]. cf. http://www.ietf.org/rfc/rfc2854.txt > According to XML rules > such a doc, without an explicit encoding declaration, should be taken > as UTF-8 or UTF-16 (automatically detected). If an XML entity is transmitted via HTTP as text/xml, the charset parameter is authoritative over the encoding declaration. RFC 2376, "3.1 Text/xml Registration" says: Optional parameters: charset Although listed as an optional parameter, the use of the charset parameter is STRONGLY RECOMMENDED, since this information can be used by XML processors to determine authoritatively the character encoding of the XML entity. The charset parameter can also be used to provide protocol-specific operations, such as charset-based content negotiation in HTTP. "UTF-8" [RFC-2279] is the recommended value, representing the UTF-8 charset. UTF-8 is supported by all conforming XML processors [REC-XML]. 
    If the XML entity is transmitted via HTTP, which uses a MIME-like mechanism that is exempt from the restrictions on the text top-level type (see section 19.4.1 of HTTP 1.1 [RFC-2068]), "UTF-16" (Appendix C.3 of [UNICODE] and Amendment 1 of [ISO-10646]) is also recommended. UTF-16 is supported by all conforming XML processors [REC-XML]. Since the handling of CR, LF and NUL for text types in most MIME applications would cause undesired transformations of individual octets in UTF-16 multi-octet characters, gateways from HTTP to these MIME applications MUST transform the XML entity from a text/xml; charset="utf-16" to application/xml; charset="utf-16".

    Conformant with [RFC-2046], if a text/xml entity is received with the charset parameter omitted, MIME processors and XML processors MUST use the default charset value of "us-ascii". In cases where the XML entity is transmitted via HTTP, the default charset value is still "us-ascii".

cf. http://www.ietf.org/rfc/rfc2376.txt

So unless an explicit charset parameter is provided, it MUST be treated as US-ASCII. Note that RFC 2376 is now under revision, but this part is basically the same.

cf. http://www.ietf.org/internet-drafts/draft-murata-xml-09.txt

> Do we have a clash between
> two different rule sets here? Does it matter if XHTML is served as "text/xml"
> or "text/html"?

In any case, an explicit (and of course correct) charset parameter avoids confusion.

> Would the rules for encodings, http versus in-doc declarations,
> be different? If the http charset parameter says one thing, and the in-doc
> declaration says another thing, which one should take precedence?

Both the HTML 4 spec and RFC 2376 say that the charset parameter is authoritative. That's the same in the case of XHTML.

> According
> to the XHTML spec encoding info in an XML declaration takes precedence over
> meta-element charset info, but does it win over true http charset info as
> well?

No.
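
As a rough illustration of the rules quoted in this message, here is a minimal Python sketch that works out which charset the Content-Type header alone dictates. The function name charset_from_content_type and the DEFAULT_CHARSETS table are hypothetical names made up for this example, not part of any specification or library. It assumes only three media types matter (text/html, text/xml, application/xml) and that an explicit charset parameter always wins, as RFC 2376 and the HTML 4 spec say. The application/xml entry follows section 3.2 of RFC 2376, where the header supplies no default at all.

```python
from email.message import Message
from typing import Optional

# Defaults that apply when the charset parameter is absent (hypothetical table).
# None means "the Content-Type header supplies no charset information".
DEFAULT_CHARSETS = {
    # HTTP 1.1 section 3.7.1 default for text/*; RFC 2854 warns that
    # deployed browsers often assume something else in practice.
    "text/html": "iso-8859-1",
    # RFC 2376 section 3.1: MUST be treated as us-ascii, even over HTTP.
    "text/xml": "us-ascii",
    # RFC 2376 section 3.2: XML processors fall back to XML 1.0 section 4.3.3.
    "application/xml": None,
}


def charset_from_content_type(content_type: str) -> Optional[str]:
    """Return the charset implied by a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # An explicit charset parameter is authoritative over any in-document
    # encoding declaration or meta element.
    explicit = msg.get_content_charset()
    if explicit is not None:
        return explicit
    # No parameter: use the per-media-type default (None for unknown types).
    return DEFAULT_CHARSETS.get(msg.get_content_type())


if __name__ == "__main__":
    for header in ("text/html", "text/xml", "application/xml",
                   'text/xml; charset="utf-16"'):
        print(header, "->", charset_from_content_type(header))
```

A return value of None does not mean "no encoding"; it only means the header says nothing, so an XML processor would then look at the byte order mark and the encoding declaration in the entity itself.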
> The current practice is to let meta charset info win over true http
> charset info, which might be in violation of the rules. This is confusing
> already. Bringing in XML declarations (and the default encoding when there
> is no XML declaration, or when there is no encoding attribute in the
> XML declaration) makes this even more confusing.

Note that if an XML entity is transmitted via HTTP as application/xml without an explicit charset parameter, another default rule applies. "3.2 Application/xml Registration" of RFC 2376 specifies:

    Optional parameters: charset

    Although listed as an optional parameter, the use of the charset parameter is STRONGLY RECOMMENDED, since this information can be used by XML processors to determine authoritatively the charset of the XML entity. The charset parameter can also be used to provide protocol-specific operations, such as charset-based content negotiation in HTTP.

    "UTF-8" [RFC-2279] and "UTF-16" (Appendix C.3 of [UNICODE] and Amendment 1 of [ISO-10646]) are the recommended values, representing the UTF-8 and UTF-16 charsets, respectively. These charsets are preferred since they are supported by all conforming XML processors [REC-XML].

    If an application/xml entity is received where the charset parameter is omitted, no information is being provided about the charset by the MIME Content-Type header. Conforming XML processors MUST follow the requirements in section 4.3.3 of [REC-XML] which directly address this contingency. However, MIME processors which are not XML processors should not assume a default charset if the charset parameter is omitted from an application/xml entity.

> I've been wondering about this for a long time. I'd like to find clear
> rules based on understandable logic, but I haven't found that yet.
> Any hope?

I hope this helps, though I understand that this is quite complicated.

Regards,
--
Masayasu Ishikawa / mimasa@w3.org
W3C - World Wide Web Consortium
Received on Wednesday, 4 October 2000 17:41:31 UTC