FW: UTF-16 and Byte Order Mark

From: Grosso, Paul <pgrosso@ptc.com>
Date: Wed, 20 Dec 2006 18:14:13 -0500
Message-ID: <CF83BAA719FD2C439D25CBB1C9D1D30205BDE78E@HQ-MAIL4.ptcnet.ptc.com>
To: <xml-editor@w3.org>
Cc: <d.k@philo.de>

Forwarding to the public comments list.


-----Original Message-----
From: public-xml-core-wg-request@w3.org
[mailto:public-xml-core-wg-request@w3.org] On Behalf Of John Cowan
Sent: Wednesday, 2006 December 20 14:52
To: d.k@philo.de
Cc: public-xml-core-wg@w3.org
Subject: Re: UTF-16 and Byte Order Mark

Our apologies for the long delay in responding to your message.
The content of this message has been approved by the XML Core WG.

You wrote at

> Appendix F.1 of the XML specs presents examples about how to
> automatically detect the encoding of an entity from the first
> characters of an XML encoding declaration without a byte order mark.
> These examples include UTF-16BE and UTF-16LE. However, section 4.3.3
> says that entities encoded in UTF-16 MUST begin with a byte order mark.

That is strictly limited to the UTF-16 encoding, and excludes the
related UTF-16LE and UTF-16BE encodings, in which BOMs are not present.
Note that "UTF-16LE" does not mean "UTF-16 encoding whose BOM shows it
to be little-endian" but rather "UTF-16-like encoding in little-endian
order without a BOM."  If U+FEFF appears at the beginning of a UTF-16LE
or UTF-16BE document, it is not a BOM but a ZWNBSP character (and
therefore the document cannot be well-formed XML).
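
The distinction can be seen in how a decoder treats the same leading
bytes.  The following is only an illustrative sketch using Python's
standard codecs (the sample text and variable names are made up for
this example): the "utf-16" codec consumes the BOM, while the
"utf-16-le" codec keeps U+FEFF as an ordinary character.

```python
import codecs

text = "<a/>"  # made-up sample content

# The bytes FF FE followed by the text in little-endian 16-bit units.
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Labelled UTF-16: the leading FF FE is a BOM. The codec uses it to
# pick the byte order and then discards it.
decoded_utf16 = data.decode("utf-16")

# Labelled UTF-16LE: no BOM is expected, so the same two bytes decode
# to the character U+FEFF (ZWNBSP), which stays in the text.
decoded_utf16le = data.decode("utf-16-le")

print(repr(decoded_utf16))    # '<a/>'
print(repr(decoded_utf16le))  # '\ufeff<a/>'
```

In the second case the document would begin with U+FEFF rather than
with "<", which is why it cannot be well-formed XML.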

> In the light of the examples it seems that the intention of the specs
> is to demand a UTF-16 byte order mark only when no XML declaration is
> present.  Is this interpretation of the specs correct?

No.  If the encoding is UTF-16, a BOM is mandatory, whether or not an
XML declaration is present.
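
As a rough sketch of that rule, a check for an entity labelled UTF-16
might look like the following.  The helper name starts_with_utf16_bom
is hypothetical, and the check only looks at the first two bytes; it
does not try to rule out other encodings.

```python
# The two byte sequences that U+FEFF can take in 16-bit code units.
UTF16_BOMS = (b"\xfe\xff", b"\xff\xfe")  # big-endian, little-endian


def starts_with_utf16_bom(data: bytes) -> bool:
    """Hypothetical helper: True if the entity begins with a UTF-16 BOM."""
    return data[:2] in UTF16_BOMS


declaration = '<?xml version="1.0" encoding="UTF-16"?><a/>'

# An XML declaration is present in both cases; only the first has a BOM.
with_bom = b"\xfe\xff" + declaration.encode("utf-16-be")
without_bom = declaration.encode("utf-16-be")

print(starts_with_utf16_bom(with_bom))     # True
print(starts_with_utf16_bom(without_bom))  # False
```

The second entity carries an XML declaration but still lacks the BOM
that the UTF-16 encoding requires.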

> If the answer is "no", I would suggest to remove the two incriminated
> examples from Appendix F.1 and to add an appropriate warning.

The examples are not in error, because they refer to the UTF-16LE and
UTF-16BE encodings rather than the UTF-16 encoding.
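
For illustration, the two BOM-less cases in Appendix F.1 amount to
looking at the byte pattern of "<?" in 16-bit units.  This is a
minimal sketch, not a full detector: the function name is
hypothetical, and a real processor would go on to read the encoding
declaration, since these patterns only establish the code unit size
and byte order.

```python
from typing import Optional


def guess_bomless_16bit_order(data: bytes) -> Optional[str]:
    """Hypothetical helper: guess byte order from the first four bytes.

    Covers only the two BOM-less 16-bit patterns for '<?'.
    """
    head = data[:4]
    if head == b"\x00\x3c\x00\x3f":
        return "UTF-16BE"  # or another big-endian 16-bit encoding
    if head == b"\x3c\x00\x3f\x00":
        return "UTF-16LE"  # or another little-endian 16-bit encoding
    return None  # not one of the two cases sketched here


le_doc = '<?xml version="1.0" encoding="UTF-16LE"?><a/>'.encode("utf-16-le")
be_doc = '<?xml version="1.0" encoding="UTF-16BE"?><a/>'.encode("utf-16-be")

print(guess_bomless_16bit_order(le_doc))  # UTF-16LE
print(guess_bomless_16bit_order(be_doc))  # UTF-16BE
```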

The Core WG will be adding language to 4.3.3 stating that UTF-16BE and
UTF-16LE are specifically not UTF-16.

I marvel at the creature: so secret and         John Cowan
so sly as he is, to come sporting in the pool   cowan@ccil.org
before our very window.  Does he think that
Men sleep without watch all night?
Received on Wednesday, 20 December 2006 23:14:24 UTC
