
[whatwg] [encoding] utf-16

From: Anne van Kesteren <annevk@opera.com>
Date: Wed, 28 Dec 2011 10:05:48 +0100
Message-ID: <op.v66zjyyw64w2qv@annevk-macbookpro.local>
On Wed, 28 Dec 2011 03:20:26 +0100, Leif Halvard Silli  
<xn--mlform-iua at målform.no> wrote:
> By "default" you supposedly mean "default, before error
> handling/heuristic detection". Relevance: On the "real" Web, no browser
> fails to display utf-16 as often as WebKit - its defaulting behavior
> notwithstanding - it can't be a goal to replicate that, for instance.

Do you mean heuristics when it comes to the decoding layer? Or before  
that? I do think any heuristics ought to be defined.


>> utf-16le becomes a label for utf-16.
>
> * Logically, utf-16be should become a label for utf-16 then, as well.

That's not logical.


> Is that what you suggest? Because, if the BOM can change the meaning of
> utf-16be, then it makes sense to treat the utf-16be label as well as
> the utf-16le label as synonymous with utf-16. (Thus, effectively
> utf-16le and utf-16be become defunct/unreliable on the Web.)

No, because utf-16be actually has different behavior in absence of a BOM.  
It does mean they can share some common algorithm(s), but they have to  
stay different encodings.
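
As a rough sketch of that point (not what any draft specifies): both labels
can go through one shared routine in which a leading BOM picks the byte
order and the label only supplies the fallback when no BOM is present. The
function name `decode_utf16` and its `fallback` parameter are made up for
illustration.

```python
import codecs


def decode_utf16(data: bytes, fallback: str) -> str:
    """Decode UTF-16 bytes: a BOM wins, otherwise the label's order applies.

    `fallback` is "le" or "be" and stands in for the declared label
    (hypothetical parameter, not taken from the thread or any spec).
    """
    if fallback not in ("le", "be"):
        raise ValueError("fallback must be 'le' or 'be'")
    # Shared step: a two-byte BOM overrides the declared byte order.
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le", errors="replace")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be", errors="replace")
    # No BOM: this is where the two encodings genuinely differ.
    return data.decode("utf-16-" + fallback, errors="replace")
```

Without a BOM, the same bytes decode differently under the two fallbacks,
which is why the labels cannot simply collapse into one encoding.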


> SECONDLY: You effectively say that, for the UTF-16 BOM, then the BOM
> should override the HTTP level charset info. OK. But then you should go
> the full way, and give the BOM the same, overriding authority when it
> comes to the UTF-8 BOM. For instance, if the HTTP server's Content-Type
> header specifies ISO-8859-1 (or 'utf-8' or 'utf-16'), but the file
> itself contains a BOM (that contradicts the HTTP info), then the BOM
> "wins" - in IE and WebKit. (And, btw, w.r.t. IE, then the
> X-Content-Type: header has no effect w.r.t. treating the HTTP's charset
> info as authoritative - the BOM wins even then.)

No, I don't see why we have to go there at all. All this suggests is that  
within the two utf-16 encodings the first four bytes have special meaning.  
That does not at all suggest we should do the same for numerous other  
encodings unrelated to utf-16.
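
To illustrate the scoping, a hedged sketch: BOM sniffing runs only when the
declared label is already in the utf-16 family, and every other label is
decoded exactly as declared. The names `UTF16_LABELS` and
`decode_with_label` are hypothetical, the label table is an assumption for
illustration, and the sketch checks only the two-byte UTF-16 BOM.

```python
# Assumed label table for illustration only: label -> codec used without a BOM.
UTF16_LABELS = {
    "utf-16": "utf-16-le",
    "utf-16le": "utf-16-le",
    "utf-16be": "utf-16-be",
}


def decode_with_label(data: bytes, label: str) -> str:
    """Decode `data` per `label`; only utf-16 labels look at a BOM."""
    label = label.strip().lower()
    default_codec = UTF16_LABELS.get(label)
    if default_codec is None:
        # Not a utf-16 label: leading bytes get no special treatment.
        return data.decode(label, errors="replace")
    if data[:2] == b"\xff\xfe":
        return data[2:].decode("utf-16-le", errors="replace")
    if data[:2] == b"\xfe\xff":
        return data[2:].decode("utf-16-be", errors="replace")
    return data.decode(default_codec, errors="replace")
```

With this shape, bytes that happen to start with FF FE under an iso-8859-1
label are decoded as two ordinary Latin-1 characters rather than as a BOM.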


-- 
Anne van Kesteren
http://annevankesteren.nl/
Received on Wednesday, 28 December 2011 01:05:48 UTC
