Re: hex FFFE at the start of UTF-16 stream

Masayasu Ishikawa (in <www-validator@w3.org>):

> While the validator's script needs to be updated to handle UTF-16
> correctly (nsgmls can handle UTF-16 if it is configured appropriately,
> and indeed the above page is validated if I run nsgmls locally), you
> have to configure your Web server to add a correct charset parameter
> to the Content-Type HTTP response header, i.e.
> 
>     Content-Type: text/html; charset=UTF-16
> 
> Your server only sends
> 
>     Content-Type: text/html
> 
> and it cannot be handled correctly even if the validator can handle
> UTF-16.  RFC 2616, "3.7.1 Canonicalization and Text Defaults" says:
> 
>     The "charset" parameter is used with some media types to define the
>     character set (section 3.4) of the data. When no explicit charset
>     parameter is provided by the sender, media subtypes of the "text"
>     type are defined to have a default charset value of "ISO-8859-1" when
>     received via HTTP. Data in character sets other than "ISO-8859-1" or
>     its subsets MUST be labeled with an appropriate charset value. See
>     section 3.4.1 for compatibility problems.
> 
>     cf. http://www.ietf.org/rfc/rfc2616.txt

What about XHTML (and other XML document types)? According to the XML
rules, such a document, without an explicit encoding declaration, should
be taken as UTF-8 or UTF-16 (automatically detected). Do we have a clash
between two different rule sets here? Does it matter whether XHTML is
served as "text/xml" or "text/html"? Would the rules for encodings (HTTP
versus in-document declarations) be different? If the HTTP charset
parameter says one thing and the in-document declaration says another,
which one should take precedence? According to the XHTML spec, encoding
info in an XML declaration takes precedence over meta-element charset
info, but does it win over true HTTP charset info as well?
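
To make the XML side of the question concrete, here is a rough Python
sketch of the autodetection for a document with no encoding declaration.
It assumes that only UTF-8 and UTF-16 are in play and that only the byte
order mark is inspected; the function name is made up for illustration
and is not taken from the validator or from nsgmls.

```python
import codecs

def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document that declares none.

    Hypothetical helper: it only distinguishes UTF-16 from UTF-8.
    """
    # A UTF-16 stream starts with the byte order mark U+FEFF, which shows
    # up on the wire as FE FF (big-endian) or FF FE (little-endian).
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "UTF-16"
    # No UTF-16 byte order mark: fall back to UTF-8.
    return "UTF-8"
```

Nothing in this check looks at the HTTP headers, which is exactly where
the two rule sets can disagree.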

The current practice is to let meta charset info win over true HTTP
charset info, which might be in violation of the rules. This is confusing
already. Bringing in XML declarations (and the default encoding when there
is no XML declaration, or when the XML declaration carries no encoding
declaration) makes this even more confusing.
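
The conflict can be shown with a small Python sketch that resolves the
same three inputs under two candidate policies. Everything here is
hypothetical: the function, its parameters, and the fall-through order
are assumptions for illustration, not rules taken from any spec. In
particular, a strict reading of the RFC 2616 text quoted above would
apply the ISO-8859-1 default as soon as the HTTP charset parameter is
missing, without consulting the document at all.

```python
from typing import Optional

# Default for "text" media subtypes per the RFC 2616 passage quoted above.
HTTP_TEXT_DEFAULT = "ISO-8859-1"

def pick_encoding(http_charset: Optional[str],
                  xml_decl: Optional[str],
                  meta_charset: Optional[str],
                  http_wins: bool) -> str:
    """Resolve an encoding under one of two hypothetical precedence policies."""
    # Within the document, the XML declaration beats the meta element,
    # as the XHTML spec says.
    in_doc = xml_decl or meta_charset
    if http_wins:
        # Policy A (assumption): the HTTP charset parameter has the last word.
        return http_charset or in_doc or HTTP_TEXT_DEFAULT
    # Policy B (assumption): in-document information has the last word.
    return in_doc or http_charset or HTTP_TEXT_DEFAULT
```

With an HTTP charset of "ISO-8859-1" and an XML declaration saying
"UTF-8", policy A yields "ISO-8859-1" and policy B yields "UTF-8", so
the two policies give different results for the same document.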

I've been wondering about this for a long time. I'd like to find clear
rules based on understandable logic, but I haven't found any yet.
Any hope?

#####################################################################
                         Bertilo Wennergren
                 <http://purl.oclc.org/net/bertilo>
                     <bertilow@hem.passagen.se>
#####################################################################

Received on Wednesday, 4 October 2000 16:44:06 UTC