- From: François Yergeau <francois@yergeau.com>
- Date: Fri, 05 Dec 2003 14:39:16 -0500
- To: www-xml-xinclude-comments@w3.org
This is a comment on the 10 November 2003 Last Call WD at
http://www.w3.org/TR/2003/WD-xinclude-20031110.
In section 4.3 Included Items when parse="text", the encoding of the
included resource is said to be determined as follows:
===============
The encoding of such a resource is determined by:
* external encoding information, if available, otherwise
* if the media type of the resource is text/xml, application/xml,
or matches the conventions text/*+xml or application/*+xml as described
in XML Media Types [IETF RFC 3023], the encoding is recognized as
specified in XML 1.0, otherwise
* the value of the encoding attribute if one exists, otherwise
* UTF-8.
===============
This suffers from three weaknesses:
1) "external encoding information" is insufficiently specified. The XML
spec says "external character encoding information (such as MIME
headers)" in section 4.3.3. This would be an improvement, or better yet
"(such as MIME or HTTP headers)" since HTTP is not, strictly speaking,
MIME-compliant.
2) Formats other than XML allow determination of encoding from embedded
information (e.g. @charset in CSS). XInclude processors should be
allowed to take advantage of such information. The current list of
sources for encoding info above excludes that.
3) For encodings of Unicode, the BOM (the character U+FEFF) provides a
highly generic (independent of media type, as long as it is text)
mechanism for determining the encoding. This mechanism should be
mandated for the two Unicode encodings (UTF-8 and UTF-16) that XML
processors must support. It should be allowed for other Unicode encodings.
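To illustrate point 3, here is a minimal sketch in Python of detection by initial signature. The function name sniff_bom and the returned labels are hypothetical, chosen only for this example. UTF-8 and UTF-16 are the cases that would be mandated; the UTF-32 patterns stand in for the "other Unicode encodings" that would merely be allowed.

```python
import codecs
from typing import Optional

# Byte patterns for U+FEFF in each encoding. UTF-32 is listed before UTF-16
# because the UTF-32 LE signature (FF FE 00 00) begins with the UTF-16 LE
# signature (FF FE); checking UTF-16 first would hide it.
_SIGNATURES = [
    (codecs.BOM_UTF8, "UTF-8"),          # EF BB BF    (would be mandatory)
    (codecs.BOM_UTF32_LE, "UTF-32LE"),   # FF FE 00 00 (would be optional)
    (codecs.BOM_UTF32_BE, "UTF-32BE"),   # 00 00 FE FF (would be optional)
    (codecs.BOM_UTF16_LE, "UTF-16LE"),   # FF FE       (would be mandatory)
    (codecs.BOM_UTF16_BE, "UTF-16BE"),   # FE FF       (would be mandatory)
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return an encoding label if data starts with a known U+FEFF pattern.

    Hypothetical helper: returns None when no signature is present, so the
    caller can fall through to the next source of encoding information.
    """
    for signature, label in _SIGNATURES:
        if data.startswith(signature):
            return label
    return None
```

One caveat with this ordering: a UTF-16 LE resource whose first character after the BOM is U+0000 would be taken for UTF-32 LE. A processor that recognizes only the two mandatory encodings avoids the ambiguity by dropping the UTF-32 entries.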
Here is a suggested rewrite of the spec fragment quoted above:
===============
The encoding of such a resource is determined by:
* external encoding information (such as MIME or HTTP headers), if
available, otherwise
* if the media type of the resource is text/xml, application/xml,
or matches the conventions text/*+xml or application/*+xml as described
in XML Media Types [IETF RFC 3023], the encoding is recognized as
specified in XML 1.0, otherwise
* for other media types, by the detection of an initial UCS
signature (a.k.a. BOM, the character U+FEFF), if present. The byte
patterns for U+FEFF in the UTF-8 and UTF-16 encodings must be recognized
and used for encoding determination; byte patterns for other Unicode
encodings may be recognized and used. Otherwise
* the value of the encoding attribute if one exists, otherwise
* internal encoding information (such as @charset in CSS) for media
types that the processor recognizes, otherwise
* UTF-8.
===============
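The order of precedence in the suggested rewrite can be sketched in Python as follows. This is only an illustration of the fallback chain, not a proposed implementation. The function and parameter names are hypothetical, and each source of encoding information is assumed to have been extracted by the caller and passed in as a string, or None when unavailable. The media type is assumed to be given without parameters.

```python
from typing import Optional


def is_xml_media_type(media_type: str) -> bool:
    """True for text/xml, application/xml, text/*+xml and application/*+xml."""
    mt = media_type.strip().lower()
    if mt in ("text/xml", "application/xml"):
        return True
    top, _, sub = mt.partition("/")
    return top in ("text", "application") and sub.endswith("+xml")


def choose_encoding(
    media_type: str,
    external: Optional[str] = None,       # e.g. charset from MIME or HTTP headers
    xml_detected: Optional[str] = None,   # result of XML 1.0 detection, if run
    bom: Optional[str] = None,            # encoding implied by an initial U+FEFF
    encoding_attr: Optional[str] = None,  # the encoding attribute of xi:include
    internal: Optional[str] = None,       # e.g. @charset in CSS
) -> str:
    """Apply the fallback chain of the suggested rewrite, top to bottom."""
    if external:
        return external
    if is_xml_media_type(media_type):
        # Assumption: XML 1.0 detection falls back to UTF-8 when it finds
        # neither a BOM nor an encoding declaration.
        return xml_detected or "UTF-8"
    if bom:
        return bom
    if encoding_attr:
        return encoding_attr
    if internal:
        return internal
    return "UTF-8"
```

For example, choose_encoding("text/css", bom="UTF-16LE", encoding_attr="ISO-8859-1") yields "UTF-16LE", since the signature is consulted before the encoding attribute, while an external charset, when present, overrides both.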
--
François Yergeau
Received on Friday, 5 December 2003 14:50:46 UTC