- From: Masayasu Ishikawa <mimasa@w3.org>
- Date: Thu, 05 Oct 2000 06:41:13 +0900
- To: bertilow@hem.passagen.se
- Cc: www-validator@w3.org, XHTML-L@egroups.com
"Bertilo Wennergren" <bertilow@hem.passagen.se> wrote:
> What about XHTML (and other XML document types)?
As far as XHTML 1.0 served as text/html is concerned, see RFC 2854,
"6. Charset default rules":
The use of an explicit charset parameter is strongly recommended.
While [MIME] specifies "The default character set, which must be
assumed in the absence of a charset parameter, is US-ASCII." [HTTP]
Section 3.7.1, defines that "media subtypes of the 'text' type are
defined to have a default charset value of 'ISO-8859-1'". Section
19.3 of [HTTP] gives additional guidelines. Using an explicit
charset parameter will help avoid confusion.
Using an explicit charset parameter also takes into account that the
overwhelming majority of deployed browsers are set to use something
else than 'ISO-8859-1' as the default; the actual default is either a
corporate character encoding or character encodings widely deployed
in a certain national or regional community. For further
considerations, please also see Section 5.2 of [HTML40].
cf. http://www.ietf.org/rfc/rfc2854.txt
> According to XML rules
> such a doc, without an explicit encoding declaration, should be taken
> as UTF-8 or UTF-16 (automatically detected).
If an XML entity is transmitted via HTTP as text/xml, the charset
parameter is authoritative over the encoding declaration. RFC 2376,
"3.1 Text/xml Registration" says:
Optional parameters: charset
Although listed as an optional parameter, the use of the charset
parameter is STRONGLY RECOMMENDED, since this information can be
used by XML processors to determine authoritatively the character
encoding of the XML entity. The charset parameter can also be used
to provide protocol-specific operations, such as charset-based
content negotiation in HTTP. "UTF-8" [RFC-2279] is the
recommended value, representing the UTF-8 charset. UTF-8 is
supported by all conforming XML processors [REC-XML].
If the XML entity is transmitted via HTTP, which uses a MIME-like
mechanism that is exempt from the restrictions on the text top-
level type (see section 19.4.1 of HTTP 1.1 [RFC-2068]), "UTF-16"
(Appendix C.3 of [UNICODE] and Amendment 1 of [ISO-10646]) is also
recommended. UTF-16 is supported by all conforming XML processors
[REC-XML]. Since the handling of CR, LF and NUL for text types in
most MIME applications would cause undesired transformations of
individual octets in UTF-16 multi-octet characters, gateways from
HTTP to these MIME applications MUST transform the XML entity from
a text/xml; charset="utf-16" to application/xml; charset="utf-16".
Conformant with [RFC-2046], if a text/xml entity is received with
the charset parameter omitted, MIME processors and XML processors
MUST use the default charset value of "us-ascii". In cases where
the XML entity is transmitted via HTTP, the default charset value
is still "us-ascii".
cf. http://www.ietf.org/rfc/rfc2376.txt
So unless an explicit charset parameter is provided, a text/xml entity
MUST be treated as US-ASCII, even if its XML declaration names another
encoding. Note that RFC 2376 is now under revision, but this part
is basically the same.
cf. http://www.ietf.org/internet-drafts/draft-murata-xml-09.txt
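
As a rough illustration of the text/xml rule quoted above, here is a
minimal Python sketch. The function name charset_for_text_xml is made up
for this example, and the standard library's email.message.Message is
used only as a convenient Content-Type parser; this is not taken from
any validator's actual code.

```python
from email.message import Message


def charset_for_text_xml(content_type: str) -> str:
    """Return the charset to assume for a text/xml entity.

    Hypothetical helper: follows the RFC 2376 section 3.1 rule quoted
    above, where the charset parameter is authoritative and a missing
    parameter means "us-ascii".
    """
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_content_type() != "text/xml":
        raise ValueError("this sketch only covers text/xml")
    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        # An explicit charset parameter wins over any in-document
        # encoding declaration.
        return charset.lower()
    # No charset parameter: the default is us-ascii, also over HTTP.
    return "us-ascii"
```

Calling it with a bare "text/xml" header value gives "us-ascii", whatever
the XML declaration inside the entity says.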
> Do we have a clash between
> two different rule sets here? Does it matter if XHTML is served as "text/xml"
> or "text/html"?
In any case, an explicit (and of course correct) charset parameter
avoids confusion.
> Would the rules for encodings, http versus in-doc declarations,
> be different? If the http charset parameter says one thing, and the in-doc
> declaration says another thing, which one should take precedence?
Both the HTML 4 spec and RFC 2376 say that the charset parameter is
authoritative. That's the same in the case of XHTML.
> According
> to the XHTML spec encoding info in an XML declaration takes precedence over
> meta-element charset info, but does it win over true http charset info as
> well?
No. The HTTP charset parameter takes precedence over the XML
declaration as well.
> The current practice is to let meta charset info win over true http
> charset info, which might be in violation of the rules. This is confusing
> already. Bringing in XML declarations (and the default encoding when there
> is no XML declaration, or when there is no encoding attribute in the
> XML declaration) makes this even more confusing.
Note that if an XML entity is transmitted via HTTP as application/xml
without an explicit charset parameter, another default rule applies.
"3.2 Application/xml Registration" of RFC 2376 specifies:
Optional parameters: charset
Although listed as an optional parameter, the use of the charset
parameter is STRONGLY RECOMMENDED, since this information can be
used by XML processors to determine authoritatively the charset of
the XML entity. The charset parameter can also be used to provide
protocol-specific operations, such as charset-based content
negotiation in HTTP.
"UTF-8" [RFC-2279] and "UTF-16" (Appendix C.3 of [UNICODE] and
Amendment 1 of [ISO-10646]) are the recommended values,
representing the UTF-8 and UTF-16 charsets, respectively. These
charsets are preferred since they are supported by all conforming
XML processors [REC-XML].
If an application/xml entity is received where the charset
parameter is omitted, no information is being provided about the
charset by the MIME Content-Type header. Conforming XML processors
MUST follow the requirements in section 4.3.3 of [REC-XML] which
directly address this contingency. However, MIME processors which
are not XML processors should not assume a default charset if the
charset parameter is omitted from an application/xml entity.
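
The application/xml rule can be sketched in the same way. This is a
simplified illustration only: the function name and the regular
expression are assumptions of this example, and the fallback covers just
the common cases of section 4.3.3 of [REC-XML] (a byte order mark, an
encoding declaration readable as ASCII, otherwise UTF-8), not the full
autodetection an XML processor performs.

```python
import re
from typing import Optional

# Simplified pattern for an encoding declaration at the start of an entity.
_ENCODING_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def charset_for_application_xml(charset_param: Optional[str], body: bytes) -> str:
    """Return the charset to assume for an application/xml entity.

    Hypothetical helper: charset_param is the value of the Content-Type
    charset parameter, or None when it was omitted.
    """
    if charset_param:
        # An explicit charset parameter is authoritative.
        return charset_param.lower()
    # Parameter omitted: fall back to what the entity itself says.
    if body.startswith((b"\xfe\xff", b"\xff\xfe")):
        return "utf-16"  # UTF-16 byte order mark
    if body.startswith(b"\xef\xbb\xbf"):
        body = body[3:]  # skip a UTF-8 byte order mark
    match = _ENCODING_DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    # Neither a byte order mark nor an encoding declaration: UTF-8.
    return "utf-8"
```

Unlike the text/xml case, a missing charset parameter here does not force
US-ASCII; the encoding declaration, if any, is consulted.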
> I've been wondering about this for a long time. I'd like to find clear
> rules based on understandable logic, but I haven't found that yet.
> Any hope?
Hope this helps, though I understand that this is quite complicated.
Regards,
--
Masayasu Ishikawa / mimasa@w3.org
W3C - World Wide Web Consortium
Received on Wednesday, 4 October 2000 17:41:31 UTC