encoding in XHTML

I'm curious about this extract from an appendix to the XHTML spec:

<quote>
C.9. Character Encoding
Historically, the character encoding of an HTML document is either 
specified by a web server via the charset parameter of the HTTP 
Content-Type header, or via a meta element in the document itself. In an 
XML document, the character encoding of the document is specified on the 
XML declaration (e.g., <?xml version="1.0" encoding="EUC-JP"?>). In order 
to portably present documents with specific character encodings, the best 
approach is to ensure that the web server provides the correct headers. If 
this is not possible, a document that wants to set its character encoding 
explicitly must include both the XML declaration an encoding declaration 
and a meta http-equiv statement (e.g., 
<meta http-equiv="Content-type" content="text/html; charset=EUC-JP" />). 
In XHTML-conforming user agents, the value of the encoding declaration of 
the XML declaration takes precedence.
</quote>

This appendix is said to be informative, yet the quoted text says, "...a 
document that wants to set its character encoding explicitly *must* 
include both the XML declaration an encoding declaration and a meta 
http-equiv statement..." (emphasis added). How can an informative portion 
of the document say that something *must* be done?

The bigger question is what really should happen, and what actually does. 
This issue was brought to my attention when I discovered that IE 6 would 
not decode a certain XHTML document as UTF-8 unless we added the 
http-equiv statement, even though UTF-8 was explicitly declared as the 
encoding in the XML declaration. (It was assuming either ISO 8859-1 or 
cp1252, I forget which.) It seems to me that this was a bug on the part of 
IE -- if it's interpreting an XML document, it should pay attention to the 
encoding declared in the XML declaration.
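
To make the expectation concrete, here is a minimal sketch in Python of the 
precedence rule described in C.9: the encoding in the XML declaration wins, 
and the meta http-equiv value is only a fallback. The function name, the 
regular expressions, and the sample document are my own illustration; this 
is not IE's algorithm or any real user agent's code, it ignores the HTTP 
header and byte order marks, and the UTF-8 default is a simplification.

```python
import re

# Hypothetical patterns for illustration; a real parser is far stricter.
XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z][A-Za-z0-9._-]*)',
    re.IGNORECASE,
)


def declared_encoding(doc: bytes):
    """Return (encoding, source), preferring the XML declaration."""
    m = XML_DECL.match(doc)
    if m:
        return m.group(1).decode("ascii"), "xml-declaration"
    # Fall back to the meta http-equiv statement only if there is
    # no encoding declaration at the very start of the document.
    m = META_CHARSET.search(doc)
    if m:
        return m.group(1).decode("ascii"), "meta"
    # Assumption: no BOM handling, so simply default to UTF-8.
    return "UTF-8", "default"


# A document whose two declarations deliberately disagree.
sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<html><head>\n'
    b'<meta http-equiv="Content-type" '
    b'content="text/html; charset=ISO-8859-1" />\n'
    b'</head><body></body></html>'
)

print(declared_encoding(sample))
```

With both declarations present and in conflict, this sketch reports the 
value from the XML declaration, which is the behaviour I expected from IE.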

In general, it seems to me that stronger statements should be made in the 
spec: XHTML is an XML application, and thus user agents must conform to 
the XML spec, implying that an encoding specified in the XML declaration 
*must* be observed -- and that this statement can be made normatively 
rather than just informatively. Am I missing something? Or is this being 
worked on further in the draft for version 2?
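
For what it's worth, a generic XML processor already behaves the way I am 
asking for. The sketch below uses Python's standard xml.etree.ElementTree 
module as a stand-in for an XML-aware user agent (it is not a browser); the 
sample documents and helper names are made up for illustration. The last 
helper only imitates the symptom I saw, by decoding UTF-8 bytes as 
ISO 8859-1.

```python
import xml.etree.ElementTree as ET

# The same text serialized in two encodings, each declared correctly.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'
).encode("iso-8859-1")
utf8_doc = (
    '<?xml version="1.0" encoding="UTF-8"?><p>caf\u00e9</p>'
).encode("utf-8")


def text_of(doc: bytes) -> str:
    """Parse the bytes; the parser reads the encoding declaration itself."""
    return ET.fromstring(doc).text


def misread_as_latin1(text: str) -> str:
    """Imitate an agent that ignores a UTF-8 declaration."""
    return text.encode("utf-8").decode("iso-8859-1")


print(text_of(latin1_doc) == text_of(utf8_doc))  # both decode to the same text
print(misread_as_latin1("caf\u00e9"))            # garbled accented character
```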



- Peter


---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>

Received on Sunday, 3 November 2002 07:36:02 UTC