Re: UTF-16, UTF-16BE and UTF-16LE in HTML5

Hello John, François (long time no see!) and others,

On 2010/07/27 6:14, John Cowan wrote:

> On 15 July 2010 22:43, François Yergeau wrote:
>
>> It depends on what you mean by "UTF-16 encoded documents".  In the XML
>> spec, a "document in the UTF-16 encoding" means (somewhat strangely, I
>> would agree) that the document is actually in UTF-16 (OK so far) and
>> that the encoding has been identified as "UTF-16".  Not "UTF-16BE" or
>> "UTF-16LE", these are different beasts, even though the actual encoding
>> is of course the same.

Yes. For us humans, UTF-16, UTF-16BE, and UTF-16LE look extremely close 
(because they are). However, on the spec level, and on the 
implementation level, these are just different encodings with different 
'charset' labels. Different encodings and different labels mean that 
UTF-16BE and UTF-16LE are treated just the same way as 
'foo-unknown-bar', a label that I just made up (and won't bother to 
define an actual encoding for).
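
To make this concrete, here is a quick Python sketch (Python's codec 
names, 'utf-16' etc., differ slightly from the IANA labels, but the 
behavior matches): only the 'UTF-16' codec consumes a leading BOM, the 
BE/LE codecs treat U+FEFF as an ordinary character, and an unregistered 
label is simply an error.

data = '\ufeffA'.encode('utf-16-be')   # bytes FE FF 00 41
print(data.decode('utf-16'))           # 'A'       (BOM consumed)
print(data.decode('utf-16-be'))        # '\ufeffA' (ZWNBSP kept)
try:
    data.decode('foo-unknown-bar')
except LookupError as e:
    print(e)                # unknown encoding: foo-unknown-bar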

If some XML parser sees
<?xml version='1.0' encoding='foo-unknown-bar' ?>
it will just report "unknown encoding" or some such. For other values 
(ones that are actually defined and registered), it may instead use 
that character encoding for decoding and parsing.
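
With Python's expat-based parser, for example (the exact error wording 
will differ between parsers):

import xml.etree.ElementTree as ET
try:
    ET.fromstring(b"<?xml version='1.0' encoding='foo-unknown-bar'?><a/>")
except ET.ParseError as e:
    print(e)    # something like: unknown encoding: line 1, column 30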

The same is supposed to happen if you have something like
<?xml version='1.0'
encoding='Extended_UNIX_Code_Fixed_Width_for_Japanese' ?>
(with a null byte before each of the characters you can see above; see 
Extended_UNIX_Code_Fixed_Width_for_Japanese at 
http://www.iana.org/assignments/character-sets): either the parser 
knows this encoding, or it doesn't.

The same also has to apply if you happen to have:
<?xml version='1.0' encoding='UTF-16BE' ?>
or
<?xml version='1.0' encoding='UTF-16LE' ?>
(again with null bytes sprinkled in, omitted here for your mail 
software's convenience). Why? Because everybody who implemented an XML 
parser before UTF-16BE and UTF-16LE were defined was just checking for 
"UTF-16", and there is no way for a parser to suddenly say "oh, 
'UTF-16BE' seems to start with 'UTF-16', so maybe these are related, 
so...". Encoding labels are matched as opaque strings, not by prefix.
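
For illustration, here is a minimal sketch of the byte-level detection 
step, loosely following Appendix F of the XML 1.0 spec (the function 
name and the return strings are mine):

def sniff(first4):
    # Detection looks only at raw bytes; the declared label is then
    # compared as an opaque string against the names the parser knows.
    if first4.startswith(b'\xfe\xff'):
        return 'UTF-16, big-endian, BOM present'
    if first4.startswith(b'\xff\xfe'):
        return 'UTF-16, little-endian, BOM present'
    if first4 == b'\x00<\x00?':   # '<?', 16-bit units, big-endian, no BOM
        return 'UTF-16BE or similar: the declaration must say which'
    if first4 == b'<\x00?\x00':   # '<?', 16-bit units, little-endian, no BOM
        return 'UTF-16LE or similar: the declaration must say which'
    return 'assume UTF-8 (or read the encoding declaration)'

A parser written against this kind of logic before the BE/LE labels 
existed simply has no branch that could map 'UTF-16BE' or 'UTF-16LE' 
onto its existing 'UTF-16' support.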

The rule "'UTF-16' in XML MUST have a BOM" (which is not true of 
'UTF-16' in general) makes sense and can be justified because 'UTF-16' 
has the privilege of being grokked by all parsers, even without an 
explicit label. However, creating special rules for UTF-16BE or 
UTF-16LE would not have made sense, because we don't really know for 
sure what other encodings similar to UTF-16 might turn up (I very much 
hope none!).

> The reason for that is XML-specific.  An XML document entity cannot
> begin with a ZWNBSP, so if it begins with the bytes 0xFF 0xFE, it must
> be a UTF-16 entity body with a BOM.  But if the entity is an external
> parsed entity (document fragment) or external parameter entity (DTD
> fragment), then it may begin with any XML-legal Unicode character,
> including ZWNBSP, which would also be 0xFF 0xFE in UTF-16LE encoding.
> The result is a nasty ambiguity: does the document's character content
> start with ZWNBSP or not?  (And analogously for 0xFE 0xFF and
> UTF-16BE.)
>
> The Core WG decided to resolve the ambiguity in favor of the UTF-16
> encoding.  An external entity that appears to begin with a BOM does
> begin with a BOM.  If you need to create an external entity beginning
> with a ZWNBSP, you use UTF-8 or UTF-16, or else you use an explicit
> encoding declaration.

Or use a numeric character reference (&#xFEFF;) &-).

>> So XML parsers are not strictly required to grok UTF-16 documents
>> labelled as UTF-16BE/LE.
>
> Correct.  The only encodings a parser is absolutely required to
> support are UTF-8 (with or without BOM) and UTF-16 (with BOM).

Yes, correct. In a strict view, there are no "UTF-16 documents labeled 
as UTF-16BE (or ..LE)". A document labeled as UTF-16BE is a document 
encoded in UTF-16BE. A document labeled as UTF-16LE is a document 
encoded in UTF-16LE. That these (with the exception that documents 
labeled 'UTF-16' have a leading BOM) are essentially byte-for-byte 
identical is, on the spec and implementation level, just an accident. 
Compare that to documents labeled as UTF-8 (or unlabeled), as US-ASCII, 
or as ISO-8859-1/2/.., that contain only characters in the ASCII range. 
These are all documents labeled differently and therefore in different 
encodings, although they are byte-for-byte identical.
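
A quick sketch in Python (codec names as Python spells them):

data = b'<greeting>Hello</greeting>'
for label in ('utf-8', 'us-ascii', 'iso-8859-1', 'iso-8859-2'):
    # same bytes, same characters, but four distinct encodings
    assert data.decode(label) == '<greeting>Hello</greeting>'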

I remember that I once spent an hour or so on the phone convincing Tim 
Bray that this was the only sound way (from a spec/implementation point 
of view, not necessarily from a human, higher-level understanding point 
of view) to look at and deal with these things. That must have been 
something like 10 years ago.

Regards,    Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp

Received on Friday, 30 July 2010 04:15:29 UTC