Re: write and UTF-16[BE|LE] from John Cowan on 2004-02-05 (www-dom@w3.org from January to March 2004)

From: John Cowan <cowan@ccil.org>
Date: Wed, 4 Feb 2004 22:46:53 -0500
To: Philippe Le Hegaret <plh@w3.org>
Cc: www-dom@w3.org
Message-ID: <20040205034651.GC25073@ccil.org>

Philippe Le Hegaret scripsit:

> Looking again at XML 1.0 3rd, it says that UTF-16 encoded entities MUST
> being with a BOM. Unless I'm misinterpreting the meaning of "UTF-16
> encoded entities", I would say that it does include UTF16-BE and
> UTF16-LE as well.

The sentence in the previous paragraph:

# The terms "UTF-8" and "UTF-16" in this specification do not apply to
# character encodings with any other labels, even if the encodings or
# labels are very similar to UTF-8 or UTF-16.

is intended to cover UTF-16BE and UTF-16LE, as well as CESU-8.  Only the
encodings actually named "UTF-16" and "UTF-8" are allowed without an
encoding declaration, and the "UTF-16" encoding actually requires a BOM
in XML, though not in plain text generally.

After all, the whole point of the encodings "UTF-16BE" and "UTF-16LE"
is that they MUST NOT begin with a BOM: if the first two bytes are 0xFE
0xFF or 0xFF 0xFE, that represents a U+FEFF character, and if document
is not well formed.

-- 

Trust me on this one.  I was there.  :-)

-- 
One Word to write them all,             John Cowan <jcowan@reutershealth.com>
  One Access to find them,              http://www.reutershealth.com
One Excel to count them all,            http://www.ccil.org/~cowan
  And thus to Windows bind them.                --Mike Champion

Received on Wednesday, 4 February 2004 22:52:24 UTC