- From: François Yergeau <francois@yergeau.com>
- Date: Fri, 05 Dec 2003 14:39:16 -0500
- To: www-xml-xinclude-comments@w3.org
This is a comment on the 10 November 2003 Last Call WD at http://www.w3.org/TR/2003/WD-xinclude-20031110.

In section 4.3 Included Items when parse="text", the encoding of the included resource is said to be determined as follows:

===============
The encoding of such a resource is determined by:

* external encoding information, if available, otherwise
* if the media type of the resource is text/xml, application/xml, or matches the conventions text/*+xml or application/*+xml as described in XML Media Types [IETF RFC 3023], the encoding is recognized as specified in XML 1.0, otherwise
* the value of the encoding attribute if one exists, otherwise
* UTF-8.
===============

This suffers from a few weaknesses:

1) "external encoding information" is insufficiently specified. The XML spec says "external character encoding information (such as MIME headers)" in section 4.3.3. This would be an improvement, or better yet "(such as MIME or HTTP headers)", since HTTP is not, strictly speaking, MIME-compliant.

2) Formats other than XML allow the encoding to be determined from embedded information (e.g. @charset in CSS). XInclude processors should be allowed to take advantage of such information. The list of sources of encoding information quoted above excludes it.

3) For encodings of Unicode, the BOM (the character U+FEFF) provides a highly generic mechanism for determining the encoding, independent of media type as long as the resource is text. This mechanism should be mandated for the two Unicode encodings (UTF-8 and UTF-16) that XML processors must support. It should be allowed for other Unicode encodings.
Here is a suggested rewrite of the spec fragment quoted above:

===============
The encoding of such a resource is determined by:

* external encoding information (such as MIME or HTTP headers), if available, otherwise
* if the media type of the resource is text/xml, application/xml, or matches the conventions text/*+xml or application/*+xml as described in XML Media Types [IETF RFC 3023], the encoding is recognized as specified in XML 1.0, otherwise
* for other media types, by the detection of an initial UCS signature (a.k.a. BOM, the character U+FEFF), if present. The byte patterns for U+FEFF in the UTF-8 and UTF-16 encodings must be recognized and used for encoding determination; byte patterns for other Unicode encodings may be recognized and used. Otherwise
* the value of the encoding attribute if one exists, otherwise
* internal encoding information (such as @charset in CSS) for media types that the processor recognizes, otherwise
* UTF-8.
===============

-- 
François Yergeau
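For illustration only, the order of precedence in the suggested rewrite could be sketched in Python roughly as follows. This is not part of the proposed spec text. Every name here (determine_encoding, sniff_bom, css_charset, xml_placeholder and their parameters) is hypothetical, and the XML branch is a stand-in: a real processor would apply the full XML 1.0 detection rules at that step.

```python
import codecs
import re
from typing import Callable, Dict, Optional

# Matches text/xml, application/xml, text/*+xml and application/*+xml.
_XML_TYPE = re.compile(r"^(?:text|application)/(?:xml|[^/]+\+xml)$")


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding signalled by an initial U+FEFF, if any."""
    # UTF-32 is an optional extra; it is tested first because its
    # little-endian BOM (FF FE 00 00) begins with the UTF-16 one (FF FE).
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "UTF-32"
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16"
    return None


def css_charset(data: bytes) -> Optional[str]:
    """Simplified reading of a leading @charset rule (assumption: ASCII-compatible bytes)."""
    m = re.match(rb'@charset "([^"]*)";', data)
    return m.group(1).decode("ascii") if m else None


def xml_placeholder(data: bytes) -> str:
    """Placeholder only: NOT the XML 1.0 rules, which also read the encoding declaration."""
    return sniff_bom(data) or "UTF-8"


# Internal-information detectors for media types this sketch "recognizes".
INTERNAL: Dict[str, Callable[[bytes], Optional[str]]] = {"text/css": css_charset}


def determine_encoding(data: bytes, media_type: str,
                       external: Optional[str] = None,
                       encoding_attr: Optional[str] = None) -> str:
    base = media_type.split(";")[0].strip().lower()
    if external:                        # 1. MIME or HTTP headers
        return external
    if _XML_TYPE.match(base):           # 2. XML media types
        return xml_placeholder(data)
    bom = sniff_bom(data)               # 3. UCS signature, other media types
    if bom:
        return bom
    if encoding_attr:                   # 4. encoding attribute on xi:include
        return encoding_attr
    detector = INTERNAL.get(base)       # 5. internal information, e.g. @charset
    if detector:
        found = detector(data)
        if found:
            return found
    return "UTF-8"                      # 6. default
```

Note that in this ordering a BOM overrides the encoding attribute, while the encoding attribute overrides internal information such as @charset.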
Received on Friday, 5 December 2003 14:50:46 UTC