[whatwg] [encoding] utf-16

Anne van Kesteren Wed Dec 28 01:05:48 PST 2011:
> On Wed, 28 Dec 2011 03:20:26 +0100, Leif Halvard Silli wrote:

>> By "default" you supposedly mean "default, before error
>> handling/heuristic detection". Relevance: On the "real" Web, no browser
>> fails to display utf-16 as often as Webkit - its defaulting behavior
>> notwithstanding - it can't be a goal to replicate that, for instance.
> 
> Do you mean heuristics when it comes to the decoding layer? Or before  
> that? I do think any heuristics ought to be defined.

Meant: While UAs may assume little-endian when seeing the 'utf-16' 
label, they should also be prepared to detect it as big-endian.

As for Mozilla, if the HTTP Content-Type says 'utf-16', then it is 
prepared to handle BOM-less little-endian as well as BOM-less 
big-endian. Whereas if you send 'utf-16le' via HTTP, then it only 
accepts 'utf-16le'. The same goes for Opera, but not for Webkit and IE.
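
To illustrate, here is a minimal sketch in Python of that kind of 
tolerance. (The zero-byte heuristic below is just one possible 
detection of my own devising, not any browser's actual algorithm.)

    def decode_utf16_label(data: bytes) -> str:
        """Decode bytes served with the 'utf-16' label, tolerating
        BOM-less content of either endianness."""
        # Honour a BOM if one is present.
        if data.startswith(b'\xfe\xff'):
            return data[2:].decode('utf-16-be')
        if data.startswith(b'\xff\xfe'):
            return data[2:].decode('utf-16-le')
        # BOM-less: guess the byte order. ASCII-heavy text has NULs in
        # the high byte - at even offsets if big-endian, at odd offsets
        # if little-endian.
        if data[0::2].count(0) > data[1::2].count(0):
            return data.decode('utf-16-be')
        return data.decode('utf-16-le')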

>>> utf-16le becomes a label for utf-16.
>>
>> * Logically, utf-16be should become a label for utf-16 then, as well.
> 
> That's not logical.

Care to elaborate?

Not making 'utf-16be' a de-facto label for 'utf-16' only makes sense 
if you plan to make it non-conforming to send files with the 'utf-16' 
label unless they are little-endian encoded.

Note that in 'utf-16be' and 'utf-16le', per the UTF-16 specification, 
the BOM is not a BOM. Citing Wikipedia: "UTF-16BE or UTF-16LE as the 
encoding type. When the byte order is specified explicitly this way, 
a BOM is specifically not supposed to be prepended to the text, and a 
U+FEFF at the beginning should be handled as a ZWNBSP character." 
(Which, in turn, should trigger quirks mode.)

Meaning: The "BOM" should not, for UTF-16be/le, be removed. Thus, if 
the ZWNBSP character at the beginning of a 'utf-16be' labelled file is 
treated as the BOM, then we do not speak about the 'utf-16be' encoding, 
but about a mislabelled 'utf-16' file.
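
As it happens, Python's codecs follow the UTF-16 specification on 
this point, which makes the distinction easy to demonstrate:

    data = b'\xfe\xff\x00A'   # FE FF followed by 'A' in big-endian order

    # Under the bare 'utf-16' label, the leading FE FF is a BOM: it
    # selects big-endian and is removed from the decoded text.
    assert data.decode('utf-16') == 'A'

    # Under 'utf-16-be' the byte order is already fixed, so U+FEFF is
    # not a BOM but an ordinary ZWNBSP at the start of the content.
    assert data.decode('utf-16-be') == '\ufeffA'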

>> Is that what you suggest? Because, if the BOM can change the meaning of
>> utf-16be, then it makes sense to treat the utf-16be label as well as
>> the utf-16le label as synonymous with utf-16. (Thus, effectively
>> utf-16le and utf-16be becomes defunct/unreliable on the Web.)
> 
> No, because utf-16be actually has different behavior in absence of a BOM.  
> It does mean they can share some common algorithm(s), but they have to  
> stay different encodings.

Per the UTF-16 specification, the 'utf-16' label covers both big-endian 
and little-endian. Thus it covers - in a way - two encodings. Hence, 
the fact that we have to treat BOM-less little-endian UTF-16 
differently from BOM-less big-endian UTF-16 does not need to mean that 
they are different encodings.

>> SECONDLY: You effectively say that, for the UTF-16 BOM, then the BOM
>> should override the HTTP level charset info. OK. But then you should go
>> the full way, and give the BOM the same, overriding authority when it
>> comes to the UTF-8 BOM. For instance, if the HTTP server's Content-Type
>> header specifies ISO-8859-1 (or 'utf-8' or 'utf-16'), but the file
>> itself contains a BOM (that contradicts the HTTP info), then the BOM
>> "wins" - in IE and WEbkit. (And, btw, w.r.t. IE, then the
>> X-Content-Type: header has no effect w.r.t. treating the HTTP's charset
>> info as authoritative - the BOM wins even then.)
> 
> No, I don't see why we have to go there at all. All this suggests is that  
> within the two utf-16 encodings

What are 'the two utf-16 encodings'? Per the UTF-16 spec there are 
three UTF-16 encodings - UTF-16, UTF-16BE and UTF-16LE. There are 2 
endian options but 3 encodings.

> the first four bytes have special meaning.  
> That does not all suggest we should do the same for numerous other  
> encodings unrelated to utf-16.

Why not? I see absolutely no difference here. When would you like to 
render a page with a BOM as anything other than what the BOM specifies? 
Use cases? Not treating it as a BOM would render the page in 
quirks mode - when does one want that?

The only way it could make some sense not to treat the UTF-8 BOM 
that way would be if we see both 'utf-16le' and 'utf-16be' as - on the 
Web - de-facto synonyms for 'utf-16'. (Because then UAs would have 
indirect permission from the UTF-16 spec to 'sniff' the UTF-16 flavour 
of the BOM even if HTTP says 'utf-16le' or 'utf-16be'.)

Note as well that this is not only related to 'numerous other 
encodings' but directly related to UTF-16 itself: If HTTP says 'utf-16' 
but the BOM is a UTF-8 BOM (or the opposite, if HTTP says 'utf-8' but 
the BOM is a UTF-16 BOM), then Webkit and IE both use the encoding that 
the BOM specifies.
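
A minimal sketch of that precedence - BOM first, HTTP charset as 
fallback. (The function name and shape are my own illustration, not 
either engine's actual code.)

    def sniff_encoding(first_bytes: bytes, http_charset: str) -> str:
        """Let a BOM override the charset from HTTP, as Webkit and IE
        do per the observations above."""
        if first_bytes.startswith(b'\xef\xbb\xbf'):
            return 'utf-8'
        if first_bytes.startswith(b'\xfe\xff'):
            return 'utf-16-be'
        if first_bytes.startswith(b'\xff\xfe'):
            return 'utf-16-le'
        return http_charset

    # HTTP says 'utf-16', but the bytes carry a UTF-8 BOM: the BOM wins.
    assert sniff_encoding(b'\xef\xbb\xbf<!doctype html>', 'utf-16') == 'utf-8'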

If it is Trident/Webkit which is supposed to set the standard here, 
then please do. You are glossing over how Trident/Webkit behave if you 
fail to recognize that the issue here is them giving preference to the 
BOM over HTTP. (There is even long-standing precedent in the XML world 
for giving preference to the BOM.)
-- 
Leif Halvard Silli

Received on Wednesday, 28 December 2011 03:31:12 UTC