
Re: UTF-16BL/LE,... (was: Re: I18N issues with the XML Specification

From: Sujatha N. Marsden <smarsden@etranslate.com>
Date: Wed, 12 Apr 2000 14:55:45 -0700
Message-Id: <4.1.20000412142154.00c29220@mail.etranslate.com>
To: w3c-i18n-ig@w3.org
Cc: xml-editor@w3.org, w3c-xml-core-wg@w3.org
>For the record, and this will come as no surprise, I totally oppose this 
>change, because I do *not* think 16LE and 16BE are appropriate for use with
>XML, as they fly in the face of XML's orientation towards interoperability
>across heterogeneous systems.  I think XML entities encoded in any flavor
>of UTF-16 should always have a BOM; exactly what the current spec [correctly
>IMHO] says.

Should it be considered an error if the entity doesn't contain a BOM?  IMHO,
in the absence of a BOM, UTF-16BE should be assumed.  If the charset
declaration and the BOM disagree, it is a fatal error.  In the case of a
UTF-16 declaration, the BOM determines whether the entity is UTF-16LE or
UTF-16BE.  Allowing these names (UTF-16LE & UTF-16BE) as charset names just
adds more wrinkles, probably more confusion, and definitely more errors.  I
thoroughly disapprove of the LE and BE suffixes.
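
To make the rule I am proposing concrete, here is a rough sketch in Python.
It is only an illustration of the policy above, not text from any spec.  The
names resolve_utf16 and EncodingConflict are made up for this example, and
the handling of a BOM-less entity explicitly labelled UTF-16LE (trust the
label) is an assumption of the sketch, since the rule above does not cover
that case.

```python
import codecs
from typing import Optional, Tuple


class EncodingConflict(ValueError):
    """Declared charset and BOM disagree (treated as fatal in this sketch)."""


def resolve_utf16(declared: str, data: bytes) -> Tuple[str, int]:
    """Return (python_codec_name, bom_length) for a UTF-16 entity."""
    label = declared.strip().upper()
    if label not in ("UTF-16", "UTF-16BE", "UTF-16LE"):
        raise ValueError("not a UTF-16 label: %r" % declared)

    # Sniff the byte order from the first two octets, if a BOM is present.
    from_bom: Optional[str] = None
    if data.startswith(codecs.BOM_UTF16_BE):      # octets FE FF
        from_bom = "UTF-16BE"
    elif data.startswith(codecs.BOM_UTF16_LE):    # octets FF FE
        from_bom = "UTF-16LE"

    if from_bom is None:
        # No BOM: assume big-endian. Trusting an explicit "UTF-16LE" label
        # here is an assumption of this sketch, not part of the proposal.
        order = "UTF-16LE" if label == "UTF-16LE" else "UTF-16BE"
        bom_length = 0
    elif label in ("UTF-16", from_bom):
        # Plain "UTF-16": the BOM decides. BE/LE label: the BOM agrees.
        order = from_bom
        bom_length = 2
    else:
        raise EncodingConflict(
            "declared %s but BOM indicates %s" % (label, from_bom)
        )

    codec = "utf-16-le" if order == "UTF-16LE" else "utf-16-be"
    return codec, bom_length
```

A caller would then strip the BOM and decode with something like
codec, skip = resolve_utf16("UTF-16", raw) followed by
raw[skip:].decode(codec).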

RFC 2781 makes it an error to have a BOM when the charset declaration is
UTF-16LE or UTF-16BE.  Why should that be so, especially when there is no
contradiction?  RFC 2781 also says:

"Text labelled "UTF-16LE" can always be interpreted as being
   little-endian. The detection of an initial BOM does not affect
   de-serialization of text labelled as UTF-16LE. Finding 0xFE followed by
   0xFF is an error since there is no Unicode character 0xFFFE, which
   would be the interpretation of those octets under little-endian
   order."

Well, FEFF is not being interpreted as a character but as a mark, which is
a very different thing.  Interestingly enough, though, FEFF is allowed when
UTF-16 is the declared charset.
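
For anyone who wants to see the octet arithmetic behind that RFC passage,
here is a small Python sketch.  The helper name code_unit is made up for the
example.  The last two lines show how Python's own codecs happen to treat a
leading BOM, which is just one implementation's behaviour and not something
the RFC prescribes.

```python
def code_unit(octets: bytes, order: str) -> int:
    """Interpret two octets as one 16-bit code unit ('big' or 'little')."""
    if len(octets) != 2:
        raise ValueError("expected exactly two octets")
    return int.from_bytes(octets, order)


# FE FF read big-endian is the byte order mark U+FEFF ...
bom_as_big = code_unit(b"\xfe\xff", "big")
# ... but the same octets read little-endian give 0xFFFE, which is not a
# Unicode character. This is the case the RFC calls an error for UTF-16LE.
bom_as_little = code_unit(b"\xfe\xff", "little")

# Python's generic "utf-16" codec consumes a leading BOM as a mark,
# while "utf-16-le" keeps U+FEFF in the decoded text as a character.
marked = b"\xff\xfe<\x00".decode("utf-16")
kept = b"\xff\xfe<\x00".decode("utf-16-le")
```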


Sujatha.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sujatha N. Marsden
Chief Scientist
eTranslate, Inc.
Received on Wednesday, 12 April 2000 17:52:35 GMT
