Re: UTF-16, UTF-16BE and UTF-16LE in HTML5 from John Cowan on 2010-07-26 (www-international@w3.org from July to September 2010)

From: John Cowan <cowan@ccil.org>
Date: Mon, 26 Jul 2010 17:14:58 -0400
To: Richard Ishida <ishida@w3.org>, francois@yergeau.com
Cc: public-html@w3.org, www-international@w3.org
Message-ID: <AANLkTim1T4_5081vsALXAk7iH724zyKC1CWK6BGwDkAd@mail.gmail.com>

On Mon, Jul 26, 2010 at 2:52 PM, Richard Ishida <ishida@w3.org> wrote:

> I have summarised in simplified and graphic form my understanding of the
> algorithm in html5 for detecting character encodings. See
> http://www.w3.org/International/2010/07/html5-encoding-detection.png

Part of the problem, I think, is the confusion between "UTF-16LE" as
the formal name of an encoding, which does not permit a BOM and is
always little-endian, and the informal use of "utf16le" in your
diagram to mean a document which in fact uses little-endian encoding.
(And analogously for "UTF-16BE".)

> Please see the explanation from François Yergeau below about use of BOM and
> UTF-16, UTF-16BE and UTF-16LE (forwarded with permission). As I understand
> it, you should use a BOM if you have identified or labelled the content as
> 'UTF-16', i.e. with no indication of the endianness. The Unicode Standard
> also says that if you have labelled or identified your text as 'UTF-16BE' or
> 'UTF-16LE', you should not use a BOM (since it should be interpreted as a
> word joiner at the start of the text).

Correct.
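
A small illustration of the labelling rule, using Python's standard
codecs as a stand-in (this is only a sketch; the variable names are
mine, and the codec names "utf-16" and "utf-16-le" are Python's
spellings of the UTF-16 and UTF-16LE labels):

```python
text = "hi"

# Labelled "UTF-16": the encoder writes a BOM first, in the platform's
# native byte order, so the output is two bytes longer than the text.
labelled_utf16 = text.encode("utf-16")

# Labelled "UTF-16LE": the byte order is fixed by the name, no BOM is written.
labelled_le = text.encode("utf-16-le")

# The same bytes read under the two labels give different character content.
data = b"\xff\xfeh\x00i\x00"
as_utf16 = data.decode("utf-16")       # FF FE is consumed as the BOM
as_utf16le = data.decode("utf-16-le")  # FF FE is kept as the character U+FEFF

print(repr(as_utf16), repr(as_utf16le))
```

Under the "utf-16-le" label the leading U+FEFF survives as part of the
text, which is exactly why a BOM should not be used with that label.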

> This rules out the use of UTF-16BE and UTF-16LE character encodings, since
> they should not start with a BOM.

That's true, but those encodings aren't really very useful: all you
save is the two bytes of the BOM.  The main use for UTF-16BE and
UTF-16LE is when you can't afford a BOM, because you have zillions of
short strings to deal with, as in a database.

> A little later, the spec says
> "If an HTML document contains a meta element with a charset attribute or a
> meta element with an http-equiv attribute in the Encoding declaration
> state, then the character encoding used must be an ASCII-compatible
> character encoding."
>
> This rules out the use of a character encoding declaration with the value
> UTF-16, even in content that is encoded in that encoding.

That *is* a Bad Thing, and should be fixed.  The simplest operational
approach is just to make sure that 0x00 bytes are always ignored when
parsing encoding declarations.  See my posting "Hello! I am an XML
encoding sniffer!" at
http://recycledknowledge.blogspot.com/2005/07/hello-i-am-xml-encoding-sniffer.html
.
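
A minimal sketch of the "ignore 0x00 bytes" approach, in Python.  The
function name, the 1024-byte limit and the regular expression are all
assumptions made for illustration; this is not the HTML5 prescan
algorithm, just the idea of dropping NULs before looking for the
declaration:

```python
import re

# Hypothetical pattern: matches charset=... with or without quotes.
_CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9._:-]+)',
                         re.IGNORECASE)

def sniff_declared_charset(head, limit=1024):
    """Return the declared charset label in lowercase, or None if absent."""
    # Dropping every 0x00 byte makes ASCII-range text encoded as UTF-16
    # (either byte order) look like plain ASCII to the pattern above.
    stripped = head[:limit].replace(b"\x00", b"")
    match = _CHARSET_RE.search(stripped)
    return match.group(1).decode("ascii").lower() if match else None

# The same declaration is found whether or not the bytes are UTF-16.
doc = '<meta charset="UTF-16">'
print(sniff_declared_charset(doc.encode("utf-16-le")))
print(sniff_declared_charset(doc.encode("ascii")))
```

The sniffer only reports the label; deciding the byte order still has
to come from the BOM or from where the 0x00 bytes fall.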

On 15 July 2010 22:43, François Yergeau wrote:

> It depends on what you mean by "UTF-16 encoded documents".  In the XML
> spec, a "document in the UTF-16 encoding" means (somewhat strangely, I
> would agree) that the document is actually in UTF-16 (OK so far) and
> that the encoding has been identified as "UTF-16".  Not "UTF-16BE" or
> "UTF-16LE", these are different beasts, even though the actual encoding
> is of course the same.

The reason for that is XML-specific.  An XML document entity cannot
begin with a ZWNBSP, so if it begins with the bytes 0xFF 0xFE, it must
be a UTF-16 entity body with a BOM.  But if the entity is an external
parsed entity (document fragment) or external parameter entity (DTD
fragment), then it may begin with any XML-legal Unicode character,
including ZWNBSP, which would also be 0xFF 0xFE in UTF-16LE encoding.
The result is a nasty ambiguity: does the document's character content
start with ZWNBSP or not?  (And analogously for 0xFE 0xFF and
UTF-16BE.)

The Core WG decided to resolve the ambiguity in favor of the UTF-16
encoding.  An external entity that appears to begin with a BOM does
begin with a BOM.  If you need to create an external entity beginning
with a ZWNBSP, you use UTF-8 or UTF-16, or else you use an explicit
encoding declaration.
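
To make the consequence concrete, here is a rough Python sketch of a
reader that follows that resolution.  The function name is
hypothetical, and the sketch deliberately ignores encoding declarations
and UTF-32, so it is an illustration of the rule rather than a
conforming XML entity reader:

```python
import codecs

def decode_external_entity(data):
    """Decode entity bytes, treating anything that looks like a BOM as a BOM."""
    # Assumption: no encoding declaration is consulted; only the first
    # bytes decide. FF FE and FE FF are always taken as a signature.
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF8):
        return data[3:].decode("utf-8")
    return data.decode("utf-8")

# Bytes meant as "ZWNBSP, x" in BOM-less little-endian UTF-16:
# the leading FF FE is taken as a BOM, so the ZWNBSP is lost.
lost = decode_external_entity("\ufeffx".encode("utf-16-le"))

# Writing a real BOM first keeps the ZWNBSP as character content.
kept = decode_external_entity(codecs.BOM_UTF16_LE + "\ufeffx".encode("utf-16-le"))

print(repr(lost), repr(kept))
```

As I read it, that is the UTF-16 option mentioned above: the first
U+FEFF is spent as the signature and the second one is the content.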

> So XML parsers are not strictly required to grok UTF-16 documents
> labelled as UTF-16BE/LE.

Correct.  The only encodings a parser is absolutely required to
support are UTF-8 (with or without BOM) and UTF-16 (with BOM).

Received on Monday, 26 July 2010 21:15:53 UTC