[whatwg] [encoding] utf-16

On Thu, 29 Dec 2011 11:37:25 +0100, Leif Halvard Silli  
<xn--mlform-iua at målform.no> wrote:
> Anne van Kesteren Wed Dec 28 08:11:01 PST 2011:
>> On Wed, 28 Dec 2011 12:31:12 +0100, Leif Halvard Silli wrote:
>>> As for Mozilla, if HTTP content-type says 'utf-16', then it is prepared
>>> to handle BOM-less little-endian as well as BOM-less big-endian.
>>> Whereas if you send 'utf-16le' via HTTP, then it only accepts
>>> 'utf-16le'. The same also goes for Opera. But not for Webkit and IE.
>>
>> Right. I think we should do it like Trident.
>
> To behave like Trident is quite difficult unless one applies the logic
> that Trident does. First and foremost, the BOM must be treated the same
> way that Trident and Webkit treat them. Secondly: It might not be
> desirable to behave exactly like Trident because Trident doesn't really
> handle UTF-16 *at all* unless the file starts with the BOM - [...]

Yeah I noticed the weird thing with caching too. Anyway, I meant  
WebKit/Trident.


>> I personally think everything but UTF-8 should be non-conforming,  
>> because of the large number of gotchas embedded in the platform if you  
>> don't use UTF-8. Anyway, it's not logical because I suggested to follow
>> Trident, which has different behavior for utf-16 and utf-16be.
>
> We simplify - remove a gotcha - if we say that BOM-less UTF-16 should
> be non-conforming. From every angle, BOM-less UTF-16, as well as
> "BOM-full" UTF-16LE and UTF-16BE, makes no sense.

That's only one gotcha. Form submission will use UTF-8 if you use UTF-16,
XMLHttpRequest is heavily tied to UTF-8, URLs are tied to UTF-8. Various  
new formats such as Workers, cache manifests, WebVTT, are tied to UTF-8.  
Using anything but UTF-8 is going to hurt and will end up confusing you  
unless you know a shitload about encodings and the overall platform, which  
most people don't.


> You perhaps would like to see this bug, which focuses on how many
> implementations, including XML-implementations, give precedence to the
> BOM over other encoding declarations:
> https://www.w3.org/Bugs/Public/show_bug.cgi?id=12897
>
> *Before* paying attention to the actual encoding, you say. More
> correct: Before deciding whether to pay attention to the 'actual'
> encoding, they look for a BOM.

Yeah, I'm going to file a new bug so we can reconsider. Although the octet
sequences the various BOMs represent can have legitimate meanings in
certain encodings, it seems in practice people use them for Unicode.
(Helped by the fact that Trident/WebKit behave this way, of course.)
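
To make the "BOM first" idea concrete, here is a rough sketch in Python of
what giving the BOM precedence over a declared encoding could look like. It
is only an illustration, not what Trident or WebKit actually implement: the
names sniff_encoding and decode_with_bom, and the windows-1252 fallback, are
assumptions made up for this example.

```python
import codecs
from typing import Optional, Tuple

# BOM octet sequences and the encoding each one is taken to signal.
# (Illustrative table; the labels are Python codec names.)
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16le"),  # FF FE
]


def sniff_encoding(data: bytes, declared: Optional[str] = None,
                   fallback: str = "windows-1252") -> Tuple[str, int]:
    """Return (encoding, bom_length). A BOM wins over `declared`."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # No BOM: only now is the declared (e.g. HTTP Content-Type) label used.
    return (declared or fallback), 0


def decode_with_bom(data: bytes, declared: Optional[str] = None) -> str:
    """Decode `data`, stripping the BOM if one was found."""
    encoding, skip = sniff_encoding(data, declared)
    return data[skip:].decode(encoding)
```

With this, decode_with_bom(b"\xff\xfeh\x00i\x00", "utf-8") ignores the
declared utf-8 label and decodes the bytes as little-endian UTF-16, which is
the precedence described above.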


-- 
Anne van Kesteren
http://annevankesteren.nl/

Received on Thursday, 29 December 2011 04:07:14 UTC