Re: UTF-16BE/LE,... (was: Re: I18N issues with the XML Specification)

>For the record, and this will come as no surprise, I totally oppose this 
>change, because I do *not* think 16LE and 16BE are appropriate for use with
>XML, as they fly in the face of XML's orientation towards interoperability
>across heterogeneous systems.  I think XML entities encoded in any flavor
>of UTF-16 should always have a BOM; exactly what the current spec [correctly
>IMHO] says.

Should the absence of a BOM be considered an error?  IMHO, in the
absence of a BOM, UTF-16BE should be assumed.  If the charset declaration
and the BOM disagree, it is a fatal error.  With a UTF-16 declaration, the
BOM determines whether the entity is UTF-16LE or UTF-16BE.  Allowing these
names (UTF-16LE & UTF-16BE) as charset names just adds more wrinkles,
probably more confusion, and definitely more errors.  I thoroughly
disapprove of the LE and BE suffixes.
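
To make the rules proposed above concrete, here is a minimal Python
sketch.  The names resolve_utf16, decode_utf16 and FatalEncodingError are
hypothetical and come from no XML parser or spec.  It also assumes that a
UTF-16LE or UTF-16BE label with no BOM simply uses the labelled order, and
that a BOM agreeing with such a label is accepted; the message above does
not settle either case.

```python
import codecs


class FatalEncodingError(Exception):
    """Hypothetical stand-in for an XML fatal error."""


# Byte order implied by each label; None means "let the BOM decide".
_DECLARED_ORDER = {"UTF-16": None, "UTF-16BE": "BE", "UTF-16LE": "LE"}


def resolve_utf16(data, declared="UTF-16"):
    """Return (byte_order, payload_without_bom) for a UTF-16 entity."""
    label = declared.upper()
    if label not in _DECLARED_ORDER:
        raise ValueError(f"not a UTF-16 label: {declared!r}")
    declared_order = _DECLARED_ORDER[label]

    # Look for a BOM in the first two octets.
    if data[:2] == codecs.BOM_UTF16_BE:    # FE FF
        bom_order = "BE"
    elif data[:2] == codecs.BOM_UTF16_LE:  # FF FE
        bom_order = "LE"
    else:
        bom_order = None

    if bom_order is None:
        # No BOM: use the labelled order if any (assumption), else big-endian.
        return (declared_order or "BE"), data

    if declared_order is not None and declared_order != bom_order:
        # Declaration and BOM disagree: treated as fatal.
        raise FatalEncodingError(
            f"declared {label} but BOM indicates UTF-16{bom_order}"
        )

    # The BOM decides (or agrees with the label); strip it from the payload.
    return bom_order, data[2:]


def decode_utf16(data, declared="UTF-16"):
    """Decode the entity using the byte order chosen by resolve_utf16."""
    order, payload = resolve_utf16(data, declared)
    return payload.decode("utf-16-be" if order == "BE" else "utf-16-le")
```

For example, decode_utf16(b"\x00A") falls back to big-endian, while
decode_utf16(b"\xff\xfeA\x00", "UTF-16BE") raises FatalEncodingError.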

RFC 2781 makes it an error to have a BOM when the charset declaration is
UTF-16LE or UTF-16BE.  Why should that be an error, especially if there is
no contradiction?  RFC 2781 also says:

"Text labelled "UTF-16LE" can always be interpreted as being
   little-endian. The detection of an initial BOM does not affect
   de-serialization of text labelled as UTF-16LE. Finding 0xFE followed
   by 0xFF is an error since there is no Unicode character 0xFFFE, which
   would be the interpretation of those octets under little-endian
   order."

Well, FEFF is being interpreted not as a character but as a mark, which is
very different.  Interestingly enough, though, FEFF is allowed when UTF-16
is the declared charset.
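
For reference, a small Python sketch of the octet arithmetic behind the
quoted passage: the same two octets yield a different 16-bit value
depending on the byte order assumed.  The helper name first_code_unit is
hypothetical and used only for illustration.

```python
def first_code_unit(data, order):
    """Read the first two octets as one 16-bit code unit.

    order is "LE" (little-endian) or "BE" (big-endian).
    """
    return int.from_bytes(data[:2], "little" if order == "LE" else "big")


# FE FF read as little-endian gives 0xFFFE, which is not a Unicode character.
print(hex(first_code_unit(b"\xfe\xff", "LE")))
# The same octets read as big-endian give 0xFEFF, the byte order mark.
print(hex(first_code_unit(b"\xfe\xff", "BE")))
# FF FE read as little-endian also gives 0xFEFF.
print(hex(first_code_unit(b"\xff\xfe", "LE")))
```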


Sujatha.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sujatha N. Marsden
Chief Scientist
eTranslate, Inc.

Received on Wednesday, 12 April 2000 17:52:35 UTC