- From: MURATA Makoto <muraw3c@attglobal.net>
- Date: Wed, 05 Apr 2000 10:56:37 +0900
- To: xml-editor@w3.org, w3c-i18n-ig@w3.org
- Cc: w3c-xml-core-wg@w3.org
In message "Re: I18N issues with the XML Specification",
Rick Jelliffe wrote...
>
>Why is it true that external parsed entities in UTF-16 may begin with any
>character? That is a bug which should be fixed up. In the absense of
>overriding higher-level out-of-band signalling, an XML entity must be
>required to identify its encoding unambiguously. The wrong thing to do
>would be to say "Autodetection is unreliable"--it must be reliable, and
>the rest of XML 1.0 must not have anything that prevents it from being
>reliable.
>
>To put it another way, if a character encoding cannot reliably be
>autodetected, it should be banned from being used with XML. But I have
>still yet to find any encodings that fit into this category.

In RFC 2781 (UTF-16, an encoding of ISO 10646), we have three dialects
of UTF-16. Their charset names are "utf-16", "utf-16le" (BOM-less
little endian), and "utf-16be" (BOM-less big endian).

"3.3 Choosing a label for UTF-16 text

   Any labelling application that uses UTF-16 character encoding, and
   explicitly labels the text, and knows the serialization order of the
   characters in text, SHOULD label the text as either "UTF-16BE" or
   "UTF-16LE", whichever is appropriate based on the endianness of the
   text. This allows applications processing the text, but unable to
   look inside the text, to know the serialization definitively.

   Text in the "UTF-16BE" charset MUST be serialized with the octets
   which make up a single 16-bit UTF-16 value in big-endian order.
   Systems labelling UTF-16BE text MUST NOT prepend a BOM to the text.

   Text in the "UTF-16LE" charset MUST be serialized with the octets
   which make up a single 16-bit UTF-16 value in little-endian order.
   Systems labelling UTF-16LE text MUST NOT prepend a BOM to the text.
   Any labelling application that uses UTF-16 character encoding, and
   puts an explicit charset label on the text, and does not know the
   serialization order of the characters in text, MUST label the text
   as "UTF-16", and SHOULD make sure the text starts with 0xFEFF.

   An exception to the "SHOULD" rule of using "UTF-16BE" or "UTF-16LE"
   would occur with document formats that mandate a BOM in UTF-16 text,
   thereby requiring the use of the "UTF-16" tag only."

----- http://www.ietf.org/rfc/rfc2781.txt ----

Some people strongly believe that UTF-16LE and UTF-16BE should be
allowed in XML. In fact, this was the consensus at the latest F2F of
the I18N WG, as below:

"Charsets UTF-16BE and UTF-16LE

   We agreed to facilitate the use of these charsets with XML."

----- http://www.w3.org/International/Group/issues/xml/#utf16.be.le ----

Others believe that the BOM must be mandatory for XML in UTF-16; that
is, UTF-16LE and UTF-16BE (dialects of UTF-16 without the BOM) cannot
be used for XML. In my understanding, this is the position of XML 1.0.

In 4.3.3 of the XML 1.0 recommendation, we have the following:

"Entities encoded in UTF-16 must begin with the Byte Order Mark
described by ISO/IEC 10646 Annex E and Unicode Appendix B (the ZERO
WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding
signature, not part of either the markup or the character data of the
XML document. XML processors must be able to use this character to
differentiate between UTF-8 and UTF-16 encoded documents."

--- http://www.w3.org/TR/1998/REC-xml-19980210#charencoding --

When this text was written, the charset names "utf-16le" and
"utf-16be" did not exist. Thus, "in UTF-16" was meant to refer to
UTF-16 in general.

RFC 2376 (XML media types) clearly mandates the BOM, as below:

"5 The Byte Order Mark (BOM) and Conversions to/from UTF-16

   The XML Recommendation, in section 4.3.3, specifies that UTF-16 XML
   entities must begin with a byte order mark (BOM), which is the ZERO
   WIDTH NO-BREAK SPACE character, hexadecimal sequence 0xFEFF (or
   0xFFFE, depending on endian). The XML Recommendation further states
   that the BOM is an encoding signature, and is not part of either the
   markup or the character data of the XML document.

   Due to the BOM, applications which convert XML from the UTF-16
   encoding to another encoding SHOULD strip the BOM before conversion.
   Similarly, when converting from another encoding into UTF-16, the
   BOM SHOULD be added after conversion is complete."

----- http://www.ietf.org/rfc/rfc2376.txt ----

There has been some discussion on the IETF-XML-MIME ML recently. The
thread begins with Tim Bray's message, as below:

"Thus in my view the RFC is correct, and thus 16BE and 16LE are not
useful for XML. It is good practice, whenever you store anything in
UTF-16, to put a BOM in, and XML makes that good practice compulsory,
which is pretty painless since it seems that virtually all software
that writes UTF-16 does so anyhow. The cost of a BOM is zilch. The
benefit in data survival in the face of stupid byte order tricks (yes,
they still happen), is immense."

----- http://www.imc.org/ietf-xml-mime/mail-archive/msg00513.html ---

There have also been a number of discussions in the XML Syntax WG. The
thread can be traced from the following messages:

http://lists.w3.org/Archives/Member/w3c-xml-syntax-wg/1999Feb/0126.html
http://lists.w3.org/Archives/Member/w3c-xml-syntax-wg/1999Mar/0001.html

Now, let us suppose that we allow UTF-16LE/BE for XML. Then, what will
happen?

XML document entities, external parameter entities, and external DTD
subsets begin with predictable character sequences such as "<?xml".
However, external parsed entities are allowed to begin with *any*
character.
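
The point can be illustrated with a rough Python sketch (not taken
from the XML recommendation or from any RFC). The helper name
guess_utf16_order and the two sample entities are made up for
illustration, and the check on "<" is a simplification of the
autodetection described in Appendix F of XML 1.0.

```python
import codecs


def guess_utf16_order(data: bytes) -> str:
    """Guess the UTF-16 byte order of an XML entity from its first octets.

    Hypothetical helper: returns 'utf-16-be', 'utf-16-le', or 'unknown'.
    """
    # A BOM settles the question immediately.
    if data.startswith(codecs.BOM_UTF16_BE):  # octets FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # octets FF FE
        return "utf-16-le"
    # Without a BOM, the only clue is a predictable first character "<".
    if data[:2] == b"\x00<":
        return "utf-16-be"
    if data[:2] == b"<\x00":
        return "utf-16-le"
    return "unknown"


# An entity that starts with a text declaration (assumed sample).
with_decl = '<?xml encoding="utf-16le"?>'.encode("utf-16-le")

# An external parsed entity that starts with an arbitrary character:
# three CJK characters, written as escapes (assumed sample).
without_decl = "\u65e5\u672c\u8a9e".encode("utf-16-le")

print(guess_utf16_order(with_decl))
print(guess_utf16_order(without_decl))
print(guess_utf16_order(codecs.BOM_UTF16_LE + without_decl))
```

The second sample decodes without error in both byte orders, giving
two different strings, so nothing in the octets themselves reveals
which order was intended. Prepending the BOM removes the ambiguity.
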
Therefore, if the BOM is absent, we cannot reliably detect UTF-16
external parsed entities. One way to solve this problem is to mandate
encoding declarations for UTF-16LE/BE XML. I think that this is a
substantial change to XML 1.0, and thus requires a new version number.

Let's go back to the sentence in question, which is in E44:

"Note: Since external parsed entities in UTF-16 may begin with any
character, this autodetection does not always work."

If we decide to allow UTF-16LE/BE for XML, we have to publish a new
RFC that supersedes RFC 2376, and publish a new version of XML. Then,
the sentence should be deleted and the autodetection algorithm should
be significantly revised so as to handle encoding declarations in
UTF-16LE/BE correctly.

If we decide to disallow UTF-16LE/BE for XML, we can simply delete the
sentence, or we may want to revise it as below:

   When external parsed entities are encoded in UTF-16LE/BE (and thus,
   strictly speaking, in error), this autodetection does not work.

Now, my two cents. I personally would like to mandate the BOM and to
disallow UTF-16LE/BE for XML. I have never seen UTF-16LE/BE XML. I do
not believe users will care to put <?xml encoding="utf-16le"?> or
<?xml encoding="utf-16be"?>.

Hope this helps.

Cheers,

----
MURATA Makoto  muraw3c@attglobal.net
Received on Tuesday, 4 April 2000 21:56:35 UTC