Encoding determination when parse="text"

This is a comment on the 10 November 2003 Last Call WD at 
http://www.w3.org/TR/2003/WD-xinclude-20031110.

In section 4.3 Included Items when parse="text", the encoding of the 
included resource is said to be determined as follows:

===============
The encoding of such a resource is determined by:

     *  external encoding information, if available, otherwise

     * if the media type of the resource is text/xml, application/xml, 
or matches the conventions text/*+xml or application/*+xml as described 
in XML Media Types [IETF RFC 3023], the encoding is recognized as 
specified in XML 1.0, otherwise

     * the value of the encoding attribute if one exists, otherwise

     * UTF-8.
===============

This suffers from three weaknesses:

1) "external encoding information" is insufficiently specified.  The XML 
spec says "external character encoding information (such as MIME 
headers)" in section 4.3.3.  This would be an improvement, or better yet 
"(such as MIME or HTTP headers)" since HTTP is not, strictly speaking, 
MIME-compliant.

2) Formats other than XML allow determination of encoding from embedded 
information (e.g. @charset in CSS).  XInclude processors should be 
allowed to take advantage of such information.  The current list of 
sources for encoding info above excludes that.

3) For encodings of Unicode, the BOM (the character U+FEFF) provides a 
highly generic (independent of media type, as long as it is text) 
mechanism for determining the encoding.  This mechanism should be 
mandated for the two Unicode encodings (UTF-8 and UTF-16) that XML 
processors must support.  It should be allowed for other Unicode encodings.
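
To illustrate point 3, here is a minimal Python sketch of BOM sniffing. 
It is an illustration only, not text proposed for the spec.  The function 
name sniff_bom and the returned encoding labels are assumptions of this 
sketch.  UTF-32 stands in for the optional "other Unicode encodings" and 
has to be tested before UTF-16, because the UTF-32LE pattern FF FE 00 00 
begins with the UTF-16LE pattern FF FE.

```python
import codecs

# Byte patterns for U+FEFF. UTF-8 and UTF-16 are the two encodings whose
# signatures the comment asks to be mandatory; UTF-32 is included here only
# as an example of an optional "other Unicode encoding".
# Order matters: the UTF-32LE BOM (FF FE 00 00) starts with the UTF-16LE
# BOM (FF FE), so the longer patterns are tried first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else None.

    Known limitation of this sketch: a UTF-16LE resource whose first
    character after the BOM is U+0000 would be reported as UTF-32LE.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None
```

A processor using something like this would presumably skip bom_length 
bytes before decoding, so that U+FEFF does not end up in the included text.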

Here is a suggested rewrite of the spec fragment quoted above:

===============
The encoding of such a resource is determined by:

     * external encoding information (such as MIME or HTTP headers), if 
available, otherwise

     * if the media type of the resource is text/xml, application/xml, 
or matches the conventions text/*+xml or application/*+xml as described 
in XML Media Types [IETF RFC 3023], the encoding is recognized as 
specified in XML 1.0, otherwise

     * for other media types, by the detection of an initial UCS 
signature (a.k.a. BOM, the character U+FEFF), if present.  The byte 
patterns for U+FEFF in the UTF-8 and UTF-16 encodings must be recognized 
and used for encoding determination; byte patterns for other Unicode 
encodings may be recognized and used.  Otherwise

     * the value of the encoding attribute if one exists, otherwise

     * internal encoding information (such as @charset in CSS) for media 
types that the processor recognizes, otherwise

     * UTF-8.
===============
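
The order of precedence in the suggested rewrite can be sketched in 
Python as follows.  This is a rough illustration under stated assumptions, 
not a reference implementation: determine_encoding and its parameters are 
hypothetical names, the XML branch is a simplified stand-in for the full 
detection procedure of XML 1.0 (BOM, then encoding declaration, then 
UTF-8), and text/css is the only media type for which internal encoding 
information is recognized.

```python
import codecs
import re

# Simplified patterns; both assume an ASCII-compatible byte stream.
_XML_DECL = re.compile(
    rb'\A<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)
_CSS_CHARSET = re.compile(rb'\A@charset "([!#-~]+)";')

# Only the two mandatory signatures (UTF-8 and UTF-16) are handled here.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def _base_type(media_type):
    """Strip parameters such as '; charset=...' and normalize case."""
    return (media_type or "").split(";")[0].strip().lower()


def _is_xml(media_type):
    mt = _base_type(media_type)
    if mt in ("text/xml", "application/xml"):
        return True
    return mt.startswith(("text/", "application/")) and mt.endswith("+xml")


def _bom(data):
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def determine_encoding(data, media_type=None, external=None, encoding_attr=None):
    # 1. External encoding information (such as MIME or HTTP headers).
    if external:
        return external
    # 2. XML media types: simplified stand-in for XML 1.0 detection.
    if _is_xml(media_type):
        found = _bom(data)
        if found:
            return found
        m = _XML_DECL.match(data)
        return m.group(1).decode("ascii") if m else "utf-8"
    # 3. Other media types: initial UCS signature (BOM), if present.
    found = _bom(data)
    if found:
        return found
    # 4. The value of the encoding attribute, if one exists.
    if encoding_attr:
        return encoding_attr
    # 5. Internal encoding information for recognized media types.
    if _base_type(media_type) == "text/css":
        m = _CSS_CHARSET.match(data)
        if m:
            return m.group(1).decode("ascii")
    # 6. Default.
    return "utf-8"
```

For example, with no external information, a text/plain resource that 
starts with FF FE would be read as UTF-16 (little-endian) even if the 
encoding attribute says otherwise, because the BOM step comes before the 
attribute step.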

-- 
François Yergeau

Received on Friday, 5 December 2003 14:50:46 UTC