[whatwg] [encoding] utf-16

I ran some utf-16 tests using the bytes 00 7A as input data, optionally
preceded by a BOM (the bytes FF FE or FE FF), and with utf-16, utf-16le,
and utf-16be declared in the Content-Type header. For WebKit I tested both
Safari 5.1.2 and Chrome 17.0.963.12. Trident is Internet Explorer 9 on
Windows 7. Presto is Opera 11.60. Gecko is Nightly 12.0a1 (2011-12-26).

HTTP      BOM   Trident  WebKit  Gecko  Presto
utf-16    -     7A00     7A00    007A   007A
utf-16le  -     7A00     7A00    7A00   7A00
utf-16be  -     007A     007A    007A   007A

utf-16    FFFE  7A00     7A00    7A00   7A00
utf-16le  FFFE  7A00     7A00    7A00   7A00
utf-16be  FFFE  7A00     7A00    FFFD*  FFFD*

utf-16    FEFF  007A     007A    007A   007A
utf-16le  FEFF  007A     007A    FFFD** FFFD**
utf-16be  FEFF  007A     007A    007A   007A

* Gecko decodes FFFE 007A as FFFD followed by FE00, presumably dropping the
7A. Opera decodes it as FFFD 007A.
** Gecko decodes FEFF 007A as FFFD followed by 00FF, presumably dropping the
7A. Opera decodes it as FFFD 7A00.


It seems that in Trident/WebKit utf-16 and utf-16le are labels for the same
encoding and a BOM takes precedence over the declared encoding. Gecko and
Presto match existing specifications around utf-16 with different error
handling (afaict).

I think http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html should
follow Trident/WebKit. Specifically: utf-16 defaults to utf-16le in the
absence of a BOM. utf-16le becomes a label for utf-16. A BOM overrides the
byte order (of utf-16 / utf-16be) and is removed from the output.
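
To make the proposal concrete, here is a rough Python sketch of the
decoding rules above. It is not spec text or any browser's actual code;
the function name decode_utf16_proposed is made up for illustration, and
replacing malformed input with U+FFFD is my assumption, since error
handling is not covered by the rules above.

```python
import codecs


def decode_utf16_proposed(data: bytes, label: str) -> str:
    """Hypothetical helper sketching the proposed Trident/WebKit-like rules."""
    label = label.strip().lower()

    # utf-16 and utf-16le are treated as the same encoding (little-endian
    # by default); utf-16be is big-endian by default.
    if label in ("utf-16", "utf-16le"):
        codec = "utf-16-le"
    elif label == "utf-16be":
        codec = "utf-16-be"
    else:
        raise ValueError("not a utf-16 label: %r" % label)

    # A BOM overrides the declared byte order and is removed from the output.
    if data.startswith(codecs.BOM_UTF16_LE):    # bytes FF FE
        codec, data = "utf-16-le", data[2:]
    elif data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        codec, data = "utf-16-be", data[2:]

    # Assumption: malformed sequences become U+FFFD instead of raising.
    return data.decode(codec, errors="replace")
```

With this, decode_utf16_proposed(b"\x00\x7a", "utf-16") gives U+7A00 and
decode_utf16_proposed(b"\xfe\xff\x00\x7a", "utf-16le") gives U+007A, which
matches the Trident/WebKit columns in the table.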


-- 
Anne van Kesteren
http://annevankesteren.nl/

Received on Tuesday, 27 December 2011 06:52:01 UTC