- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Wed, 28 Dec 2011 03:20:26 +0100
Hi Anne.

Overall, your findings correspond with mine, which are based on <http://malform.no/testing/utf/>. I also agree with the direction of the conclusions, but I would like the encodings document to make some distinctions that it currently doesn't, and which you have not proposed either. See below.

Anne van Kesteren wrote:

[ snip ]

> I think http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html should follow Trident/WebKit. Specifically: utf-16 defaults to utf-16le in absence of a BOM.

By "default" you presumably mean "default, before error handling/heuristic detection". Relevance: On the "real" Web, no browser fails to display utf-16 as often as WebKit, its defaulting behavior notwithstanding. It can't be a goal to replicate that, for instance.

> utf-16le becomes a label for utf-16.

* Logically, utf-16be should then become a label for utf-16 as well. Is that what you suggest? Because if the BOM can change the meaning of utf-16be, then it makes sense to treat the utf-16be label, as well as the utf-16le label, as synonymous with utf-16. (Thus, effectively, utf-16le and utf-16be become defunct/unreliable on the Web.)

* W.r.t. making utf-16le and utf-16be into 'label[s] for utf-16': OK, when it comes to how UAs should *treat* them. But I suppose it should not be considered conforming to *send* the UTF-16LE/UTF-16BE labels with text/html, due to their ambiguous status on the Web. Rather, it should only be conforming to send 'utf-16'.

> A BOM overrides the direction (of utf-16 / utf-16be) and is removed from the output.

FIRSTLY: Another way to see this is to say: IE and WebKit do not support 'utf-16le' or 'utf-16be'. They only support 'utf-16', but default to little endian rather than big endian when the BOM is omitted. When the BOM is "omitted" for utf-16be, they default to big endian, making 'utf-16le' an alias of Microsoft's private 'unicode' label and 'utf-16be' an alias of Microsoft's private 'unicodeFFFE' label (each of which uses the BOM).
In other words: On the Web, 'utf-16le' and 'utf-16be' become de facto synonyms for MS 'unicode' and MS 'unicodeFFFE'.

SECONDLY: You effectively say that, for UTF-16, the BOM should override the HTTP-level charset info. OK. But then you should go all the way and give the UTF-8 BOM the same overriding authority. For instance, if the HTTP server's Content-Type header specifies ISO-8859-1 (or 'utf-8' or 'utf-16'), but the file itself contains a BOM that contradicts the HTTP info, then the BOM "wins" in IE and WebKit. (And, btw, w.r.t. IE, the X-Content-Type: header has no effect w.r.t. treating the HTTP charset info as authoritative: the BOM wins even then.)

Summary: It makes no sense to treat the BOM as winning over the HTTP-level charset parameter *only* for UTF-16; the UTF-8 BOM must have the same overriding effect. Thus: Any encoding info from the header would be overridden by the BOM. Of course, a document where the BOM contradicts the HTTP charset should not be considered conforming. But the UA treatment of such documents should still be uniform.

(PS: If you insert the BOM as  before <!DOCTYPE html>, then IE will use UTF-8 when it loads the page from cache. Just sayin'.)

--
Leif H Silli
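
The "BOM wins" precedence argued above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not proposed spec text: the names `sniff_bom` and `decode_document` are hypothetical, the cp1252 fallback for a missing charset is my own assumption, and real UAs also apply error handling and heuristics that are left out here.

```python
import codecs


def sniff_bom(data: bytes):
    """Return (python_codec_name, bom_length) if data starts with a BOM, else None."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", len(codecs.BOM_UTF8)
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", len(codecs.BOM_UTF16_LE)
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", len(codecs.BOM_UTF16_BE)
    return None


def decode_document(data: bytes, http_charset=None) -> str:
    """Decode a document, letting a BOM override the HTTP charset uniformly."""
    bom = sniff_bom(data)
    if bom is not None:
        # The BOM wins over any HTTP-level label (UTF-8 and UTF-16 alike)
        # and is removed from the output.
        encoding, bom_length = bom
        return data[bom_length:].decode(encoding, errors="replace")

    label = (http_charset or "").strip().lower()
    if label in ("utf-16", "utf-16le"):
        # No BOM: 'utf-16' defaults to little endian, as in Trident/WebKit.
        encoding = "utf-16-le"
    elif label == "utf-16be":
        # No BOM: 'utf-16be' keeps its big-endian default.
        encoding = "utf-16-be"
    else:
        # Assumption for illustration only: fall back to cp1252 when no
        # charset is given. Unknown labels raise LookupError in this sketch.
        encoding = label or "cp1252"
    return data.decode(encoding, errors="replace")
```

For example, `decode_document(b"\xef\xbb\xbfabc", "iso-8859-1")` decodes as UTF-8 despite the header, and `decode_document(b"\xff\xfea\x00", "utf-16be")` decodes as little endian because the BOM contradicts the label.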
Received on Tuesday, 27 December 2011 18:20:26 UTC