[whatwg] [encoding] utf-16 from Anne van Kesteren on 2011-12-28 (public-whatwg-archive@w3.org from December 2011)

From: Anne van Kesteren <annevk@opera.com>
Date: Wed, 28 Dec 2011 16:13:56 +0100
Message-ID: <op.v67glidc64w2qv@annevk-macbookpro.local>

On Wed, 28 Dec 2011 12:30:49 +0100, Leif Halvard Silli  
<xn--mlform-iua at målform.no> wrote:
> I spotted a shortcoming in your testing:
>
>> I ran some utf-16 tests using 007A as input data, optionally preceded by
>> FFFE or FEFF, and with utf-16, utf-16le, and utf-16be declared in the
>> Content-Type header. For WebKit I tested both Safari 5.1.2 and Chrome
>> 17.0.963.12. Trident is Internet Explorer 9 on Windows 7. Presto is Opera
>> 11.60. Gecko is Nightly 12.0a1 (2011-12-26).
>>
>> HTTP      BOM   Trident  WebKit  Gecko  Presto
>> utf-16    -     7A00     7A00    007A   007A
>> utf-16le  -     7A00     7A00    7A00   7A00
>> utf-16be  -     007A     007A    007A   007A
>
> The above test row is not complete. You should also run a BOM-less test
> using the UTF-16 label but where the 007A is represented in the
> big-endian way - a bit like I did here:
> <http://malform.no/testing/utf/#html-table-7>. Then you get as a result
> that Opera and Firefox do not take it as a given that files sent as
> 'utf-16' are big-endian:
>
>   utf-16    -     gibb*    gibb*   007A   007A
>
> *gibb = gibberish/mojibake.

I get U+7A00 as I indicated above. I would not qualify that as gibberish  
personally. (My table is somewhat confusing as input 007A was meant to  
describe octets, but the table describes code points.)
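
To make the octets-versus-code-points distinction concrete, here is a minimal
sketch in Python. The helper name decode_as is made up for illustration, and
the sketch uses Python's own codecs rather than any browser's decoder. It
decodes the two octets 00 7A under each byte order and reports the resulting
code point.

```python
def decode_as(octets: bytes, codec: str) -> str:
    """Decode two octets as a single UTF-16 code unit and format it as U+XXXX.

    Hypothetical helper for illustration; it relies on Python's built-in
    codecs, not on any browser's decoder.
    """
    char = octets.decode(codec)
    return f"U+{ord(char):04X}"


# The test input, written as octets in the order they appear on the wire.
octets = bytes([0x00, 0x7A])

print(decode_as(octets, "utf-16-le"))  # U+7A00
print(decode_as(octets, "utf-16-be"))  # U+007A
```

Read as little-endian, the octets give U+7A00, a CJK ideograph rather than
mojibake; read as big-endian, they give U+007A ("z").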

Anyway, per  
http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-July/021102.html  
Presto and Gecko do have some magic, but it seems better if they were the  
same as Trident (and WebKit).
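
As a rough sketch of what "the same as Trident (and WebKit)" would mean for
BOM-less input, the Python below maps each declared label to a fixed byte
order. The mapping is only my reading of the Trident and WebKit columns in
the table above; the names BOMLESS_CODEC and decode_bomless are hypothetical
and do not describe how either engine is implemented.

```python
# Assumed label-to-codec mapping, read off the Trident/WebKit columns of the
# table for input without a BOM. Illustrative only.
BOMLESS_CODEC = {
    "utf-16": "utf-16-le",    # bare "utf-16" treated as little-endian
    "utf-16le": "utf-16-le",
    "utf-16be": "utf-16-be",
}


def decode_bomless(octets: bytes, label: str) -> str:
    """Decode BOM-less octets using the byte order implied by the label."""
    return octets.decode(BOMLESS_CODEC[label.lower()])


for label in ("utf-16", "utf-16le", "utf-16be"):
    text = decode_bomless(b"\x00\x7a", label)
    # Prints 7A00, 7A00, 007A, matching the Trident/WebKit columns.
    print(f"{label:<9} {ord(text):04X}")
```

Under this reading there is no content sniffing at all: the label alone
decides the byte order when no BOM is present.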


> That the BOM is removed from the output for utf-16be labelled files
> means that the 'utf-16be' labelled file nevertheless is treated as
> UTF-16 (per UTF-16's specification). (Otherwise, if it had not been
> removed, the BOM character should have caused quirks mode.)
>
> Taking what you did not test for into account, it would make sense if
> 'utf-16' continues to be treated as a label under which both big-endian
> and little-endian can be expected, and thus that WebKit and IE start to
> detect when UTF-16 is big-endian but without a BOM.

I am not sure what you are trying to say here.


-- 
Anne van Kesteren
http://annevankesteren.nl/

Received on Wednesday, 28 December 2011 07:13:56 UTC