Re: hex FFFE at the start of UTF-16 stream

"Bertilo Wennergren" <bertilow@hem.passagen.se> wrote:

> What about XHTML (and other XML document types)?

As far as XHTML 1.0 served as text/html is concerned, see RFC 2854,
"6. Charset default rules":

    The use of an explicit charset parameter is strongly recommended.
    While [MIME] specifies "The default character set, which must be
    assumed in the absence of a charset parameter, is US-ASCII."  [HTTP]
    Section 3.7.1, defines that "media subtypes of the 'text' type are
    defined to have a default charset value of 'ISO-8859-1'".  Section
    19.3 of [HTTP] gives additional guidelines.  Using an explicit
    charset parameter will help avoid confusion.

    Using an explicit charset parameter also takes into account that the
    overwhelming majority of deployed browsers are set to use something
    else than 'ISO-8859-1' as the default; the actual default is either a
    corporate character encoding or character encodings widely deployed
    in a certain national or regional community. For further
    considerations, please also see Section 5.2 of [HTML40].

    cf. http://www.ietf.org/rfc/rfc2854.txt

> According to XML rules
> such a doc, without an explicit encoding declaration, should be taken
> as UTF-8 or UTF-16 (automatically detected).

If an XML entity is transmitted via HTTP as text/xml, the charset
parameter is authoritative over the encoding declaration.  RFC 2376,
"3.1 Text/xml Registration" says:

   Optional parameters: charset

      Although listed as an optional parameter, the use of the charset
      parameter is STRONGLY RECOMMENDED, since this information can be
      used by XML processors to determine authoritatively the character
      encoding of the XML entity. The charset parameter can also be used
      to provide protocol-specific operations, such as charset-based
      content negotiation in HTTP.  "UTF-8" [RFC-2279] is the
      recommended value, representing the UTF-8 charset. UTF-8 is
      supported by all conforming XML processors [REC-XML].

      If the XML entity is transmitted via HTTP, which uses a MIME-like
      mechanism that is exempt from the restrictions on the text top-
      level type (see section 19.4.1 of HTTP 1.1 [RFC-2068]), "UTF-16"
      (Appendix C.3 of [UNICODE] and Amendment 1 of [ISO-10646]) is also
      recommended.  UTF-16 is supported by all conforming XML processors
      [REC-XML].  Since the handling of CR, LF and NUL for text types in
      most MIME applications would cause undesired transformations of
      individual octets in UTF-16 multi-octet characters, gateways from
      HTTP to these MIME applications MUST transform the XML entity from
      a text/xml; charset="utf-16" to application/xml; charset="utf-16".

      Conformant with [RFC-2046], if a text/xml entity is received with
      the charset parameter omitted, MIME processors and XML processors
      MUST use the default charset value of "us-ascii".  In cases where
      the XML entity is transmitted via HTTP, the default charset value
      is still "us-ascii".

    cf. http://www.ietf.org/rfc/rfc2376.txt

So unless an explicit charset parameter is provided, a text/xml entity
MUST be treated as US-ASCII.  Note that RFC 2376 is now under revision,
but this part is basically the same in the current draft.

    cf. http://www.ietf.org/internet-drafts/draft-murata-xml-09.txt
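
To make the text/xml default concrete, here is a minimal Python sketch
of the rule quoted above.  The function name text_xml_charset is my own
invention for illustration, and the sketch only covers the charset
decision, not the decoding itself.

```python
from email.message import Message


def text_xml_charset(content_type: str) -> str:
    """Return the charset for a text/xml entity per RFC 2376, section 3.1.

    Hypothetical helper: the charset parameter wins when present,
    otherwise the default is us-ascii (also when sent over HTTP).
    """
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_content_type() != "text/xml":
        raise ValueError("this sketch only handles text/xml")
    # get_content_charset() strips quotes and lowercases the value;
    # the argument is returned when no charset parameter is present.
    return msg.get_content_charset("us-ascii")
```

For example, text_xml_charset("text/xml") falls back to "us-ascii",
while text_xml_charset('text/xml; charset="UTF-16"') gives "utf-16".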

> Do we have a clash between
> two different rule sets here? Does it matter if XHTML is served as "text/xml"
> or "text/html"?

In any case, an explicit (and of course correct) charset parameter
avoids confusion.

> Would the rules for encodings, http versus in-doc declarations,
> be different? If the http charset parameter says one thing, and the in-doc 
> declaration says another thing, which one should take precedence?

Both the HTML 4 spec and RFC 2376 say that the charset parameter is
authoritative.  The same holds for XHTML.

> According
> to the XHTML spec encoding info in an XML declaration takes precedence over
> meta-element charset info, but does it win over true http charset info as
> well?

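No.  The HTTP charset parameter is authoritative over the encoding
declaration in the XML declaration as well.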
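
The resulting order of precedence can be sketched in a few lines of
Python.  This is only an illustration under the assumption that the
three candidate values have already been extracted; pick_encoding and
its parameter names are hypothetical.

```python
from typing import Optional


def pick_encoding(http_charset: Optional[str],
                  xml_decl_encoding: Optional[str],
                  meta_charset: Optional[str]) -> Optional[str]:
    """Return the first available encoding label in precedence order.

    Hypothetical helper.  Order: HTTP charset parameter, then the
    encoding declaration in the XML declaration, then meta charset.
    None means the default rule of the media type applies instead.
    """
    for candidate in (http_charset, xml_decl_encoding, meta_charset):
        if candidate:
            return candidate.lower()
    return None
```

With all three sources present, the HTTP value is returned; the XML
declaration only comes into play when the charset parameter is absent.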

> The current practice is to let meta charset info win over true http
> charset info, which might be in violation of the rules. This is confusing
> already. Bringing in XML declarations (and the default encoding when there
> is no XML declaration, or when there is no encoding attribute in the
> XML declaration) makes this even more confusing.

Note that if an XML entity is transmitted via HTTP as application/xml
without an explicit charset parameter, another default rule applies.
"3.2 Application/xml Registration" of RFC 2376 specifies:

   Optional parameters: charset

      Although listed as an optional parameter, the use of the charset
      parameter is STRONGLY RECOMMENDED, since this information can be
      used by XML processors to determine authoritatively the charset of
      the XML entity. The charset parameter can also be used to provide
      protocol-specific operations, such as charset-based content
      negotiation in HTTP.

      "UTF-8" [RFC-2279] and "UTF-16" (Appendix C.3 of [UNICODE] and
      Amendment 1 of [ISO-10646]) are the recommended values,
      representing the UTF-8 and UTF-16 charsets, respectively. These
      charsets are  preferred since they are supported by all conforming
      XML processors [REC-XML].

      If an application/xml entity is received where the charset
      parameter is omitted, no information is being provided about the
      charset by the MIME Content-Type header. Conforming XML processors
      MUST follow the requirements in section 4.3.3 of [REC-XML] which
      directly address this contingency. However, MIME processors which
      are not XML processors should not assume a default charset if the
      charset parameter is omitted from an application/xml entity.
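
For the application/xml case with no charset parameter, the detection
required by section 4.3.3 of [REC-XML] might look roughly like the
Python sketch below.  It is deliberately simplified: it assumes the
entity is either UTF-16 with a byte order mark or uses an
ASCII-compatible encoding, so UTF-16 without a mark and EBCDIC are not
handled.  The name sniff_xml_encoding is hypothetical.

```python
import re

# Matches an encoding declaration inside a leading XML declaration.
_ENCODING_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity that has no external charset.

    Hypothetical, simplified helper: byte order mark first, then the
    encoding declaration, then the UTF-8 default.
    """
    # FE FF (big-endian) or FF FE (little-endian) marks UTF-16.
    if data.startswith(b"\xfe\xff") or data.startswith(b"\xff\xfe"):
        return "utf-16"
    # Skip a UTF-8 byte order mark before looking for the declaration.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    match = _ENCODING_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # Neither a mark nor an encoding declaration: UTF-8.
    return "utf-8"
```

This is why the FF FE or FE FF octets at the start of a UTF-16 stream
matter: without a charset parameter, they are what lets an XML
processor tell UTF-16 from UTF-8.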

> I've been wondering about this for a long time. I'd like to find clear
> rules based on understandable logic, but I haven't found that yet.
> Any hope?

I hope this helps, though I understand that this is quite complicated.

Regards,
-- 
Masayasu Ishikawa / mimasa@w3.org
W3C - World Wide Web Consortium

Received on Wednesday, 4 October 2000 17:41:31 UTC