[whatwg] [encoding] utf-16 from Leif Halvard Silli on 2011-12-28 (public-whatwg-archive@w3.org from December 2011)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Wed, 28 Dec 2011 03:20:26 +0100
Message-ID: <20111228032026825420.5a921fff@xn--mlform-iua.no>
Hi Anne. Overall, your findings correspond with mine, which are based 
on <http://malform.no/testing/utf/>. I also agree with the direction of 
the conclusions, but I would like the encodings document to make some 
distinctions that it currently doesn't - and which you have not 
proposed either. See below.

Anne van Kesteren wrote:
 [ snip ]
> I think http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html should 
> follow Trident/WebKit. Specifically: utf-16 defaults to utf-16le in
> absence of a BOM.

By "default" you presumably mean "default, before error 
handling/heuristic detection". Relevance: On the "real" Web, no browser 
fails to display utf-16 as often as WebKit does, its defaulting behavior 
notwithstanding. Replicating those failures can't be a goal.

> utf-16le becomes a label for utf-16.

* Logically, utf-16be should become a label for utf-16 then, as well. 
Is that what you suggest? Because, if the BOM can change the meaning of 
utf-16be, then it makes sense to treat the utf-16be label as well as 
the utf-16le label as synonymous with utf-16. (Thus, effectively, 
utf-16le and utf-16be become defunct/unreliable on the Web.)

* W.r.t. making utf-16le and utf-16be into 'label[s] for utf-16': 
OK, when it comes to how UAs should *treat* them. But I suppose it 
should not be considered conforming to *send* the UTF-16LE/UTF-16BE 
labels with text/html, due to their ambiguous status on the Web. Rather, 
it should only be conforming to send 'utf-16'.
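
The treat/send distinction in the two points above can be sketched as 
follows. This is only an illustration: the table and both function names 
are hypothetical, and it assumes UAs accept all three labels while 
authors may only send 'utf-16'.

```python
from typing import Optional

# Hypothetical UA-side table: every UTF-16 label selects the same
# BOM-sensitive "utf-16" decoder.
UA_LABELS = {
    "utf-16": "utf-16",
    "utf-16le": "utf-16",
    "utf-16be": "utf-16",
}


def resolve_label(label: str) -> Optional[str]:
    """Return the encoding a UA would use for a label, or None if the
    label is not in this (UTF-16 only) table."""
    return UA_LABELS.get(label.strip().lower())


def is_conforming_utf16_label(label: str) -> bool:
    """Author-side check: of the UTF-16 labels, only 'utf-16' itself
    would be conforming to send with text/html."""
    return label.strip().lower() == "utf-16"
```

Under these assumptions resolve_label("UTF-16BE") yields "utf-16", while 
is_conforming_utf16_label("UTF-16BE") is False: UAs tolerate the label, 
but a conformance checker would flag it.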

> A BOM overrides the direction (of utf-16 / utf-16be) and is removed from 
> the output.

FIRSTLY: Another way to see this is to say: IE and WebKit do not 
support 'utf-16le' or 'utf-16be' - they only support 'utf-16', but 
default to little endian rather than big endian when the BOM is omitted. 
When the BOM is "omitted" for utf-16be, they default to big endian, 
making 'utf-16le' an alias of Microsoft's private 'unicode' label and 
'utf-16be' an alias of Microsoft's private 'unicodeFFFE' label (each 
of which uses the BOM). In other words: On the Web, 'utf-16le' 
and 'utf-16be' become de-facto synonyms for MS 'unicode' and MS 
'unicodeFFFE'. 
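
A minimal sketch of the IE/WebKit behavior described above, assuming the 
label only chooses the byte order used when no BOM is present. 
DEFAULT_ORDER and decode_utf16 are names made up for this illustration, 
not anything taken from either browser.

```python
import codecs

# Hypothetical table: byte order assumed for each label when the input
# carries no BOM.
DEFAULT_ORDER = {
    "utf-16": "utf-16-le",
    "utf-16le": "utf-16-le",
    "utf-16be": "utf-16-be",
}


def decode_utf16(data: bytes, label: str = "utf-16") -> str:
    """Decode UTF-16 bytes. A BOM, if present, overrides the label's
    byte order and is removed from the output."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # No BOM: fall back to the byte order implied by the label.
    return data.decode(DEFAULT_ORDER[label.strip().lower()])
```

With this sketch, little-endian input that starts with FF FE decodes 
correctly even when it is labelled 'utf-16be', which is why the two 
labels stop being reliable statements about byte order.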

SECONDLY: You effectively say that, for UTF-16, the BOM 
should override the HTTP level charset info. OK. But then you should go 
all the way and give the BOM the same overriding authority when it 
comes to the UTF-8 BOM. For instance, if the HTTP server's Content-Type 
header specifies ISO-8859-1 (or 'utf-8' or 'utf-16'), but the file 
itself contains a BOM (that contradicts the HTTP info), then the BOM 
"wins" - in IE and WebKit. (And, btw, w.r.t. IE, the 
X-Content-Type-Options: header has no effect w.r.t. treating the HTTP 
charset info as authoritative - the BOM wins even then.)

Summary: It makes no sense to treat the BOM as winning over the HTTP 
level charset parameter *only* for UTF-16 - the UTF-8 BOM must have 
the same overriding effect as well. Thus: Any encoding info from the 
header would be overridden by the BOM. Of course, a document where the 
BOM contradicts the HTTP charset should not be considered conforming. 
But the UA treatment of such documents should still be uniform.
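
The uniform rule summarized above could look roughly like this. It is a 
sketch under stated assumptions: decode_document, sniff_bom, the BOMS 
table and the windows-1252 fallback are all hypothetical, and 
"contradicts" is simplified to "the HTTP charset is not in the same 
encoding family as the BOM".

```python
import codecs
from typing import Optional, Tuple

# (BOM bytes, Python codec to decode with, encoding family of the BOM)
BOMS = (
    (codecs.BOM_UTF8, "utf-8", "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le", "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16-be", "utf-16"),
)


def sniff_bom(data: bytes) -> Tuple[Optional[str], Optional[str], int]:
    """Return (codec, family, bom_length), or (None, None, 0) if no BOM."""
    for bom, codec, family in BOMS:
        if data.startswith(bom):
            return codec, family, len(bom)
    return None, None, 0


def decode_document(
    data: bytes,
    http_charset: Optional[str] = None,
    fallback: str = "windows-1252",  # assumed default, for illustration only
) -> Tuple[str, bool]:
    """Return (text, conforming). Any BOM wins over the HTTP charset;
    the document is flagged non-conforming when the two disagree."""
    codec, family, bom_len = sniff_bom(data)
    if codec is not None:
        declared = (http_charset or "").strip().lower()
        conforming = not declared or declared.startswith(family)
        return data[bom_len:].decode(codec), conforming
    # No BOM: the HTTP charset (or the assumed fallback) decides.
    return data.decode(http_charset or fallback), True
```

For example, decode_document(b"\xef\xbb\xbfhi", "ISO-8859-1") decodes as 
UTF-8 and reports the document as non-conforming, which is the same 
outcome a UTF-16 BOM would get against a 'utf-8' header.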

(PS: If you insert the BOM as &#xfeff; before <!DOCTYPE html>, then IE 
will use UTF-8 when it loads the page from cache. Just sayin'.)
--
Leif H Silli
Received on Tuesday, 27 December 2011 18:20:26 UTC