Re: UTF-16 and Byte Order Mark

Our apologies for the long delay in responding to your message.
The content of this message has been approved by the XML Core WG.

You wrote at
<http://lists.w3.org/Archives/Public/xml-editor/2006JulSep/0007.html>:

> Appendix F.1 of the XML specs presents examples about how to
> automatically detect the encoding of an entity from the first
> characters of an XML encoding declaration without a byte order mark.
> These examples include UTF-16BE and UTF-16LE. However, section 4.3.3
> says that entities encoded in UTF-16 MUST begin with a byte order mark.

That requirement is strictly limited to the UTF-16 encoding, and excludes
the related UTF-16LE and UTF-16BE encodings, in which BOMs are not present.
Note that "UTF-16LE" does not mean "UTF-16 encoding whose BOM shows it
to be little-endian" but rather "UTF-16-like encoding in little-endian
order without a BOM."  If U+FEFF appears at the beginning of a UTF-16LE or
UTF-16BE document, it is not a BOM but a ZWNBSP character (and therefore
the document cannot be well-formed XML).
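
As an illustration only, the distinction can be seen in Python's standard
codecs, which happen to treat the three labels the same way.  This is a
sketch of one library's behaviour, not part of the XML specification; the
sample document and variable names are made up for the example.

```python
import codecs

# Hypothetical sample document, used only for this illustration.
text = "<?xml version='1.0'?><a/>"

# The same bytes, with U+FEFF encoded in little-endian order in front.
with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Labelled "utf-16": the leading U+FEFF is a BOM and is consumed.
as_utf16 = with_bom.decode("utf-16")

# Labelled "utf-16-le": the leading U+FEFF is kept as an ordinary
# character (ZWNBSP), so the text no longer begins with "<?xml".
as_utf16le = with_bom.decode("utf-16-le")

print(as_utf16 == text)                # True
print(as_utf16le.startswith("\ufeff"))  # True
```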

> In the light of the examples it seems that the intention of the specs is
> to demand a UTF-16 byte order mark only when no XML declaration is used.
> Is this interpretation of the specs correct?

No.  If the encoding is UTF-16, a BOM is mandatory, whether or not an
XML declaration is present.

> If the answer is "no", I would suggest to remove the two incriminated
> examples from Appendix F.1 and to add an appropriate warning.

The examples are not in error, because they refer to the UTF-16LE and
UTF-16BE encodings rather than the UTF-16 encoding.
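
A minimal sketch of the relevant Appendix F.1 cases follows, assuming only
the first four bytes of the entity are inspected.  The function name and
the returned labels are hypothetical, and the UCS-4 and 8-bit cases of the
appendix are deliberately left out.

```python
import codecs


def sniff_utf16_family(data: bytes) -> str:
    """Classify the first bytes of an entity (hypothetical helper)."""
    head = data[:4]
    # A BOM followed by anything other than 00 00 indicates UTF-16 proper.
    if head[:2] == codecs.BOM_UTF16_BE and head[2:4] != b"\x00\x00":
        return "UTF-16, big-endian (BOM present)"
    if head[:2] == codecs.BOM_UTF16_LE and head[2:4] != b"\x00\x00":
        return "UTF-16, little-endian (BOM present)"
    # No BOM: "<?" in 16-bit code units.  The encoding declaration must
    # still be read to learn which 16-bit encoding is actually in use.
    if head == b"\x00<\x00?":
        return "UTF-16BE or similar big-endian 16-bit encoding (no BOM)"
    if head == b"<\x00?\x00":
        return "UTF-16LE or similar little-endian 16-bit encoding (no BOM)"
    return "not recognised by this sketch"


print(sniff_utf16_family("<?xml".encode("utf-16-be")))
print(sniff_utf16_family(codecs.BOM_UTF16_LE + "<?xml".encode("utf-16-le")))
```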

The Core WG will be adding language to 4.3.3 stating that UTF-16BE and
UTF-16LE are specifically not UTF-16.

-- 
I marvel at the creature: so secret and         John Cowan
so sly as he is, to come sporting in the pool   cowan@ccil.org
before our very window.  Does he think that     http://www.ccil.org/~cowan
Men sleep without watch all night?

Received on Wednesday, 20 December 2006 20:52:18 UTC