Re: UTF-16BE/LE,... (was: Re: I18N issues with the XML Specification) from Martin J. Duerst on 2000-04-13 (xml-editor@w3.org from April to June 2000)

From: Martin J. Duerst <duerst@w3.org>
Date: Thu, 13 Apr 2000 11:47:05 +0900
To: "Sujatha N. Marsden" <smarsden@etranslate.com>, w3c-i18n-ig@w3.org
Cc: xml-editor@w3.org, w3c-xml-core-wg@w3.org
Message-Id: <4.2.0.58.J.20000413113316.03429840@sh.w3.mag.keio.ac.jp>

At 00/04/12 14:55 -0700, Sujatha N. Marsden wrote:
> >For the record, and this will come as no surprise, I totally oppose this
> >change, because I do *not* think 16LE and 16BE are appropriate for use with
> >XML, as they fly in the face of XML's orientation towards interoperability
> >across heterogeneous systems.  I think XML entities encoded in any flavor
> >of UTF-16 should always have a BOM; exactly what the current spec [correctly
> >IMHO] says.
>
>Should it be considered an error if it doesn't contain a BOM? IMHO, in the
>absence of a BOM, UTF-16BE should be assumed.

XML makes special promises in the case of UTF-16:
1) All XML processors have to be able to deal with it
2) If there is no charset information, UTF-8 or UTF-16
    has to be assumed. To distinguish them, the BOM is used.
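
As a rough illustration of point 2), here is a minimal sketch of how a
processor might tell the two apart when there is no external charset
information. The function name guess_encoding is hypothetical, and the
sketch looks only at the UTF-16 BOM; it deliberately ignores the other
autodetection cases (for example, an encoding declaration or other
encodings) that a real processor would also have to handle.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Guess UTF-8 vs. UTF-16 for an entity without charset information.

    Hypothetical helper: only the UTF-16 BOM is examined.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes 0xFE 0xFF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes 0xFF 0xFE
        return "utf-16-le"
    # No UTF-16 BOM found: fall back to UTF-8.
    return "utf-8"
```

The returned name only says how to read the remaining octets; the BOM
itself is not part of the document content in this case.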

It would be possible to say that XML tagged as UTF-16 but
not starting with a BOM is legal. This might be desirable
in the sense that it would reduce the interdependency
between 'charset' definitions and the XML Rec. However,
the XML spec clearly says that if it's UTF-16, it has to
have a BOM, and whichever way you interpret 'UTF-16',
changing it to not require a BOM when it is tagged as
UTF-16 would be a clear change to the spec that I think
is undesirable, as opposed to the clarifications that
we are working on now.

>"Text labelled "UTF-16LE" can always be interpreted as being little-
>    endian. The detection of an initial BOM does not affect de-
>    serialization of text labelled as UTF-16LE. Finding 0xFE followed by
>    0xFF is an error since there is no Unicode character 0xFFFE, which
>    would be the interpretation of those octets under little-endian
>    order."
>
>Well, FEFF is not being interpreted as a character but as a mark which is
>very different.  But interestingly enough, FEFF is allowed in case UTF-16
>is the charset declaration.

Finding U+FFFE in any flavor of UTF-16 is an error, because
it's not a character. In LE, this code point would be encoded as
0xFE 0xFF, so the above text is absolutely correct.

You may find 0xFF 0xFE in LE text, even at the start, but in this
case, it's not a BOM, it's just a ZWNBSP (U+FEFF, zero width
no-break space). The BOM was forbidden at the start of UTF-16BE/LE,
among other reasons, to make this case unambiguous.
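
To make the two octet orders concrete, here is a small sketch of reading
text labelled UTF-16LE. The helper names code_units_le and check_utf16le
are hypothetical, and rejecting U+FFFE explicitly is an assumption of
this sketch, added because Python's own "utf-16-le" codec decodes the
octets 0xFE 0xFF to U+FFFE without complaint.

```python
def code_units_le(data: bytes) -> list:
    """Split octets into 16-bit code units, least significant octet first."""
    if len(data) % 2:
        raise ValueError("odd number of octets")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def check_utf16le(data: bytes) -> str:
    """Decode text labelled UTF-16LE, treating U+FFFE as an error."""
    if 0xFFFE in code_units_le(data):
        # Octets 0xFE 0xFF read little-endian give U+FFFE, a noncharacter.
        raise ValueError("U+FFFE is not a character")
    # A leading 0xFF 0xFE is kept as U+FEFF; it is not stripped as a BOM.
    return data.decode("utf-16-le")
```

With this, check_utf16le(b"\xff\xfe<\x00") returns a string that starts
with U+FEFF followed by "<", while check_utf16le(b"\xfe\xff<\x00")
raises ValueError.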

Regards,   Martin.

Received on Wednesday, 12 April 2000 22:51:34 UTC