Re: I18N issues with the XML Specification from MURATA Makoto on 2000-04-05 (xml-editor@w3.org from April to June 2000)

From: MURATA Makoto <muraw3c@attglobal.net>
Date: Wed, 05 Apr 2000 10:56:37 +0900
To: xml-editor@w3.org, w3c-i18n-ig@w3.org
Cc: w3c-xml-core-wg@w3.org
Message-Id: <200004050156.AA02197@t3knz.attglobal.net>

In message "Re: I18N issues with the XML Specification",
Rick Jelliffe wrote...
 >
 >Why is it true that external parsed entities in UTF-16 may begin with any
 >character? That is a bug which should be fixed up. In the absense of
 >overriding higher-level out-of-band signalling, an XML entity must be
 >required to identify its encoding unambiguously.  The wrong thing to do
 >would be to say "Autodetection is unreliable"--it must be reliable, and
 >the rest of XML 1.0 must not have anything that prevents it from being
 >reliable. 
 >
 >To put it another way, if a character encoding cannot reliably be
 >autodetected, it should be banned from being used with XML. But I have
 >still yet to find any encodings that fit into this category. 

In RFC 2781 (UTF-16, an encoding of ISO 10646), we have three dialects 
of UTF-16.  Their charset names are "utf-16", "utf-16le" (BOM-less 
little endian), and "utf-16be" (BOM-less big endian).

"3.3 Choosing a label for UTF-16 text

   Any labelling application that uses UTF-16 character encoding, and
   explicitly labels the text, and knows the serialization order of the
   characters in text, SHOULD label the text as either "UTF-16BE" or
   "UTF-16LE", whichever is appropriate based on the endianness of the
   text. This allows applications processing the text, but unable to
   look inside the text, to know the serialization definitively.

   Text in the "UTF-16BE" charset MUST be serialized with the octets
   which make up a single 16-bit UTF-16 value in big-endian order.
   Systems labelling UTF-16BE text MUST NOT prepend a BOM to the text.

   Text in the "UTF-16LE" charset MUST be serialized with the octets
   which make up a single 16-bit UTF-16 value in little-endian order.
   Systems labelling UTF-16LE text MUST NOT prepend a BOM to the text.

   Any labelling application that uses UTF-16 character encoding, and
   puts an explicit charset label on the text, and does not know the
   serialization order of the characters in text, MUST label the text as
   "UTF-16", and SHOULD make sure the text starts with 0xFEFF.

   An exception to the "SHOULD" rule of using "UTF-16BE" or "UTF-16LE"
   would occur with document formats that mandate a BOM in UTF-16 text,
   thereby requiring the use of the "UTF-16" tag only."

  ----- http://www.ietf.org/rfc/rfc2781.txt ----
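
As a rough illustration of the three labels (a sketch added here, not text from 
the RFC), the Python below shows the octets each label yields for the same two 
characters.  The helper name serialize is hypothetical, and choosing big endian 
for plain "utf-16" is an assumption of this sketch; a writer may pick either 
order as long as the BOM says which.

```python
import codecs

def serialize(text, label):
    """Serialize text under one of the three RFC 2781 labels (hypothetical helper)."""
    label = label.lower()
    if label == "utf-16be":
        return text.encode("utf-16-be")   # big endian, no BOM
    if label == "utf-16le":
        return text.encode("utf-16-le")   # little endian, no BOM
    if label == "utf-16":
        # Assumption of this sketch: write big endian and prepend the BOM
        # (U+FEFF serialized as the octets FE FF).
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    raise ValueError("not a UTF-16 label: " + label)

for label in ("utf-16", "utf-16be", "utf-16le"):
    print(label, serialize("<a", label).hex(" "))
```

Only the "utf-16" form carries its byte order inside the octet stream; the 
other two depend entirely on the external label.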


Some people strongly believe that UTF-16LE and UTF-16BE should be 
allowed in XML.  In fact, this was the consensus at the latest F2F 
meeting of the I18N WG, as below:

"Charsets UTF-16BE and UTF-16LE

We agreed to facilitate the use of these charsets with XML."

 ----- http://www.w3.org/International/Group/issues/xml/#utf16.be.le  ----

Others believe that the BOM must be mandatory for XML in UTF-16; that 
is, UTF-16le and UTF-16be (dialects of UTF-16 without the BOM) cannot 
be used for XML.  

In my understanding, this is the position of XML 1.0.  In section 4.3.3 
of the XML 1.0 recommendation, we have the following:

"Entities encoded in UTF-16 must begin with the Byte Order Mark described by 
ISO/IEC 10646 Annex E and Unicode Appendix B (the ZERO WIDTH NO-BREAK SPACE 
character, #xFEFF). This is an encoding signature, not part of either the 
markup or the character data of the XML document. XML processors must be able 
to use this character to differentiate between UTF-8 and UTF-16 encoded 
documents."

--- http://www.w3.org/TR/1998/REC-xml-19980210#charencoding --

When this text was written, the charset names "utf-16le" and "utf-16be" 
did not exist.  Thus, "in UTF-16" was meant to refer to UTF-16 in 
general.

RFC 2376 (XML media types) clearly mandates the BOM as below:

"5  The Byte Order Mark (BOM) and Conversions to/from UTF-16

   The XML Recommendation, in section 4.3.3, specifies that UTF-16 XML
   entities must begin with a byte order mark (BOM), which is the ZERO
   WIDTH NO-BREAK SPACE character, hexadecimal sequence 0xFEFF (or
   0xFFFE, depending on endian). The XML Recommendation further states
   that the BOM is an encoding signature, and is not part of either the
   markup or the character data of the XML document.

   Due to the BOM, applications which convert XML from the UTF-16
   encoding to another encoding SHOULD strip the BOM before conversion.
   Similarly, when converting from another encoding into UTF-16, the BOM
   SHOULD be added after conversion is complete."

  ----- http://www.ietf.org/rfc/rfc2376.txt ----
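
The conversion rule quoted above might be sketched in Python as follows.  This 
is only an illustration: the function names are hypothetical, the non-UTF-16 
side is assumed to be UTF-8, and the sketch does not rewrite any encoding 
declaration inside the entity, which a real converter would also have to do.

```python
import codecs

def utf16_xml_to_utf8(data):
    """Decode BOM-prefixed UTF-16 octets and re-encode as UTF-8 without the BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):
        text = data[2:].decode("utf-16-be")   # strip the BOM, then decode
    elif data.startswith(codecs.BOM_UTF16_LE):
        text = data[2:].decode("utf-16-le")
    else:
        # Under the reading of XML 1.0 discussed here, a BOM is required.
        raise ValueError("UTF-16 XML entity does not begin with a BOM")
    return text.encode("utf-8")

def utf8_xml_to_utf16(data, big_endian=True):
    """Convert UTF-8 octets to UTF-16 and add the BOM after conversion."""
    text = data.decode("utf-8")
    if big_endian:
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

original = "<doc>\u65e5\u672c</doc>".encode("utf-8")
round_trip = utf16_xml_to_utf8(utf8_xml_to_utf16(original))
print(round_trip == original)
```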

There has been some discussion on the IETF-XML-MIME mailing list recently.  
The thread begins with Tim Bray's message, quoted below:

"Thus in my view the RFC is correct, and thus 16BE and 16LE are not useful
for XML.  It is good practice, whenever you store anything in UTF-16, to 
put a BOM in, and XML makes that good practice compulsory, which is pretty 
painless since it seems that virtually all software that writes UTF-16 does 
so anyhow. The cost of a BOM is zilch.  The benefit in data survival in the 
face of stupid byte order tricks (yes, they still happen), is immense."

  ----- http://www.imc.org/ietf-xml-mime/mail-archive/msg00513.html ---

There have been a number of discussions in the XML Syntax WG.  The thread 
can be traced from the following messages:

http://lists.w3.org/Archives/Member/w3c-xml-syntax-wg/1999Feb/0126.html
http://lists.w3.org/Archives/Member/w3c-xml-syntax-wg/1999Mar/0001.html

Now, let us suppose that we allow UTF-16LE/BE for XML.  Then, what 
will happen?  XML document entities, external parameter entities, and 
external DTD subsets begin with predictable character sequences 
such as "<?xml".  However, external parsed entities are allowed to 
begin with *any* character.  Therefore, if the BOM is absent, we 
cannot reliably detect UTF-16 external parsed entities.
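
To make the problem concrete, here is a small Python sketch of detection based 
on the first octets.  It is my own simplification, not the algorithm from the 
XML specification: the function name guess_utf16 is hypothetical, and UCS-4 and 
all non-UTF-16 encodings are ignored.

```python
def guess_utf16(data):
    """Return 'utf-16be', 'utf-16le', or None when the first octets are inconclusive."""
    if data[:2] == b"\xfe\xff":
        return "utf-16be"   # BOM, big endian
    if data[:2] == b"\xff\xfe":
        return "utf-16le"   # BOM, little endian (UCS-4 is ignored in this sketch)
    if data[:4] == b"\x00<\x00?":
        return "utf-16be"   # "<?" without a BOM, big endian
    if data[:4] == b"<\x00?\x00":
        return "utf-16le"   # "<?" without a BOM, little endian
    return None             # no BOM and no predictable leading characters

# A BOM-less document entity still starts with "<?xml", so it is detectable.
document = '<?xml version="1.0"?><a/>'.encode("utf-16-le")
# A BOM-less external parsed entity may start with arbitrary text.
entity = "Hello, world".encode("utf-16-le")

print(guess_utf16(document))
print(guess_utf16(entity))
```

The first call can rely on the leading "<?", while the second has nothing to 
go on and returns None.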

One way to solve this problem is to mandate encoding declarations 
for UTF-16LE/BE XML.  I think that this is a substantial change 
to XML 1.0, and thus requires a new version number.

Let's go back to the sentence in question in E44:  "Note: Since 
external parsed entities in UTF-16 may begin with any character, 
this autodetection does not always work."

If we decide to allow UTF-16LE/BE for XML, we have to publish 
a new RFC that supersedes RFC 2376, and to publish a new version 
of XML.  Then, the sentence should be deleted and the autodetection 
algorithm should be significantly revised so as to handle 
encoding declarations in UTF-16LE/BE correctly.

If we decide to disallow UTF-16LE/BE for XML, we can simply 
delete the sentence, or we may want to revise it as below:

	When external parsed entities are encoded in UTF-16LE/BE (and thus, 
	strictly speaking, in error), this autodetection does not work.

Now, my two cents.  I personally would like to mandate the BOM and 
to disallow UTF-16LE/BE for XML.  I have never seen UTF-16LE/BE XML.  
I do not believe users will care to put <?xml encoding="utf-16le"?> 
or <?xml encoding="utf-16be"?>.  

Hope this helps.

Cheers,


----
MURATA Makoto  muraw3c@attglobal.net
Received on Tuesday, 4 April 2000 21:56:35 UTC