[whatwg] [encoding] utf-16 from Leif Halvard Silli on 2011-12-29 (public-whatwg-archive@w3.org from December 2011)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Thu, 29 Dec 2011 11:37:25 +0100
Message-ID: <20111229113725176012.33b3e943@xn--mlform-iua.no>
Anne van Kesteren Wed Dec 28 08:11:01 PST 2011:
> On Wed, 28 Dec 2011 12:31:12 +0100, Leif Halvard Silli wrote:
>> Anne van Kesteren Wed Dec 28 01:05:48 PST 2011:
>>> On Wed, 28 Dec 2011 03:20:26 +0100, Leif Halvard Silli wrote:
>>>> By "default" you supposedly mean "default, before error
>>>> handling/heuristic detection". Relevance: On the "real" Web, no browser
>>>> fails to display utf-16 as often as Webkit - its defaulting behavior
>>>> not withstanding - it can't be a goal to replicate that, for instance.
>>>
>>> Do you mean heuristics when it comes to the decoding layer? Or before
>>> that? I do think any heuristics ought to be defined.
>>
>> Meant: While UAs may prepare for little-endian when seeing the 'utf-16'
>> label, they should also be prepared for detecting it as big-endian.
>>
>> As for Mozilla, if HTTP content-type says 'utf-16', then it is prepared
>> to handle BOM-less little-endian as well as bom-less big-endian.
>> Whereas if you send 'utf-16le' via HTTP, then it only accepts
>> 'utf-16le'. The same also goes for Opera. But not for Webkit and IE.
> 
> Right. I think we should do it like Trident.

To behave like Trident is quite difficult unless one applies the logic 
that Trident does. First and foremost, the BOM must be treated the same 
way that Trident and Webkit treat it. Secondly: It might not be 
desirable to behave exactly like Trident, because Trident doesn't really 
handle UTF-16 *at all* unless the file starts with the BOM - just run 
this test to verify:

1)  visit this test suite with IE: 
    <http://malform.no/testing/utf/caching/>
2)  Click through the 7 pages in the test, until the 
    last, 'UTF-16' labelled, big-endian, BOM-less page
    (which causes mojibake in IE).
3)  Now, use the Back (or Forward) button to go backward
    (or forward) page by page. (You will even be able
    to see the last, mojibake-ish page, if you use the 
    Forward button to visit it.)

RESULT: 4 of the 7 files in the test - namely: the UTF-16 files without 
a BOM - fail when IE pulls them from cache. When loaded from cache, the 
non-ASCII letters become corrupted. Note especially that it doesn't 
matter whether the file is big-endian or little-endian!

Surely, this is not something that we would like UAs to replicate.

Conclusions: a) BOM-less UTF-16 should simply be considered 
non-conforming on the Web, if Trident is the standard. b) There is no 
need to consider what Trident does with BOM-less files as conforming, 
irrespective of whether the page is big-endian or little-endian. (That 
it handles little-endian BOM-less files a little better than big-endian 
BOM-less files is just a marginal advantage.)
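
To illustrate why BOM-less UTF-16 is a gotcha, here is a small Python 
sketch. It is not taken from any browser; the sample word is just an 
assumption picked for the demonstration. It shows that the same bytes 
decode "successfully" under the wrong byte order, only to the wrong 
characters, whereas a BOM settles the matter:

```python
import codecs

# Hypothetical sample text with non-ASCII letters (chosen for illustration).
text = "blåbær"

be_bytes = text.encode("utf-16-be")  # big-endian, no BOM
le_bytes = text.encode("utf-16-le")  # little-endian, no BOM

# Guessing the wrong byte order raises no error for this sample;
# it silently produces different characters (mojibake).
wrong = be_bytes.decode("utf-16-le")

# With a BOM in front, the generic 'utf-16' decoder picks the byte order
# from the BOM and removes it from the decoded text.
from_be = (codecs.BOM_UTF16_BE + be_bytes).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + le_bytes).decode("utf-16")
```

Here `from_be` and `from_le` both equal the original text, while `wrong` 
does not.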

>>>>> utf-16le becomes a label for utf-16.
>>>>
>>>> * Logically, utf-16be should become a label for utf-16 then, as well.
>>>
>>> That's not logical.
>>
>> Care to elaborate?
>>
>> To not make 'utf-16be' a de-facto label for 'utf-16', only makes sense
>> if you plan to make it non-conforming to send files with the 'utf-16'
>> label unless they are little-endian encoded.
> 
> I personally think everything but UTF-8 should be non-conforming, because  
> of the large number of gotchas embedded in the platform if you don't use  
> UTF-8. Anyway, it's not logical because I suggested to follow Trident  
> which has different behavior for utf-16 and utf-16be.

We simplify - remove a gotcha - if we say that BOM-less UTF-16 should 
be non-conforming. From every angle, BOM-less UTF-16, as well as 
"BOM-full" UTF-16LE and UTF-16BE, make no sense.

>> Meaning: The "BOM" should not, for UTF-16be/le, be removed. Thus, if
>> the ZWNBSP character at the beginning of a 'utf-16be' labelled file is
>> treated as the BOM, then we do not speak about the 'utf-16be' encoding,
>> but about a mislabelled 'utf-16' file.
> 
> I never spoke of any existing standard. The Unicode standard is wrong here  
> for all implementations.

Here, at least, you do speak about an existing standard ...  It is 
exactly my point that the browsers don't interpret UTF-16be/le as 
UTF-16be/le but more like UTF-16. But in which way, exactly, do you 
mean that UTF-16 is not specified correctly?
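
The distinction can be seen in Python's codecs, which I use here only 
as a convenient illustration of treating the two labels differently (it 
says nothing about what any browser does). Under 'utf-16-be' a leading 
U+FEFF stays in the text as a character, whereas under 'utf-16' it is 
consumed as the BOM:

```python
import codecs

# The bytes FE FF followed by the letter "a" in big-endian UTF-16.
data = codecs.BOM_UTF16_BE + "a".encode("utf-16-be")

# Labelled 'utf-16-be': U+FEFF is kept as an ordinary character (ZWNBSP).
as_utf16be = data.decode("utf-16-be")

# Labelled 'utf-16': the same two bytes are read as a BOM and dropped.
as_utf16 = data.decode("utf-16")
```

So `as_utf16be` is two characters long and starts with U+FEFF, while 
`as_utf16` is just "a".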

>>> the first four bytes have special meaning.
>>> That does not all suggest we should do the same for numerous other
>>> encodings unrelated to utf-16.
>>
>> Why not? I see absolutely no difference here. When would you like to
>> render a page with a BOM as anything other than what the BOM specifies?
> 
> Interesting, it does seem like Trident/WebKit look at the specific byte  
> sequences the BOM has in utf-8 and utf-16 before paying attention to the  
> "actual" encoding.

You might want to look at this bug, which focuses on how many 
implementations, including XML implementations, give precedence to the 
BOM over other encoding declarations: 
https://www.w3.org/Bugs/Public/show_bug.cgi?id=12897

*Before* paying attention to the actual encoding, you say. More 
correctly: Before deciding whether to pay attention to the 'actual' 
encoding, they look for a BOM.
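
A minimal Python sketch of that order of operations, under my own 
assumptions: the function names are hypothetical, only the UTF-8 and 
UTF-16 BOMs are checked, and this is not claimed to be how Trident or 
Webkit actually implement it.

```python
import codecs

# Byte signatures to look for, and the encoding each one implies.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data):
    """Return (encoding, bom_length) if data starts with a known BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def decode_with_bom_precedence(data, declared):
    """Let a BOM override the declared encoding; otherwise use the label."""
    name, skip = sniff_bom(data)
    if name is not None:
        # The BOM wins: strip it and decode with the encoding it implies.
        return data[skip:].decode(name)
    # No BOM found: only now does the declared ("actual") encoding matter.
    return data.decode(declared)

# A big-endian UTF-16 file with a BOM, mislabelled as windows-1252.
page = codecs.BOM_UTF16_BE + "æ".encode("utf-16-be")
result = decode_with_bom_precedence(page, "cp1252")
```

With the BOM present, `result` is "æ" even though the label says 
windows-1252; without a BOM, the label alone decides.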
-- 
Leif Halvard Silli
Received on Thursday, 29 December 2011 02:37:25 UTC