- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Wed, 28 Dec 2011 12:30:49 +0100
Anne van Kesteren Tue Dec 27 06:52:01 PST 2011:

I spotted a shortcoming in your testing:

> I ran some utf-16 tests using 007A as input data, optionally preceded by
> FFFE or FEFF, and with utf-16, utf-16le, and utf-16be declared in the
> Content-Type header. For WebKit I tested both Safari 5.1.2 and Chrome
> 17.0.963.12. Trident is Internet Explorer 9 on Windows 7. Presto is Opera
> 11.60. Gecko is Nightly 12.0a1 (2011-12-26).
>
> HTTP      BOM   Trident  WebKit  Gecko   Presto
> utf-16    -     7A00     7A00    007A    007A
> utf-16le  -     7A00     7A00    7A00    7A00
> utf-16be  -     007A     007A    007A    007A

The above test row is not complete. You should also run a BOM-less test
using the UTF-16 label but where the 007A is represented in the
big-endian way - a bit like I did here:
<http://malform.no/testing/utf/#html-table-7>. Then you get the result
that Opera and Firefox do not take it as a given that files sent as
'utf-16' are big-endian:

  utf-16    -     gibb*    gibb*   007A    007A

*gibb = gibberish/mojibake.

> utf-16    FFFE  7A00     7A00    7A00    7A00
> utf-16le  FFFE  7A00     7A00    7A00    7A00
> utf-16be  FFFE  7A00     7A00    FFFD*   FFFD*
>
> utf-16    FEFF  007A     007A    007A    007A
> utf-16le  FEFF  007A     007A    FFFD**  FFFD**
> utf-16be  FEFF  007A     007A    007A    007A
>
> * Gecko decodes FFFE 007A as FFFD followed by FE00 presumably dropping the
> 7A. Opera decodes it as FFFD 007A.
> ** Gecko decodes FEFF 007A as FFFD followed by 00FF presumably dropping the
> 7A. Opera decodes it as FFFD 7A00.
>
> It seems in Trident/WebKit utf-16 and utf-16le are labels for the same
> encoding and the BOM is more important than the encoding. Gecko and Presto
> match existing specifications around utf-16 with different error handling
> (afaict).
>
> I think http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html should
> follow Trident/WebKit. Specifically: utf-16 defaults to utf-16le in
> absence of a BOM. utf-16le becomes a label for utf-16. A BOM overrides the
> direction (of utf-16 / utf-16be) and is removed from the output.
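To make the quoted proposal concrete, here is a minimal Python sketch of the Trident/WebKit behaviour as I read it from the table: a BOM wins over the label and is stripped, and without a BOM both 'utf-16' and 'utf-16le' mean little-endian. The function name decode_utf16_proposed is hypothetical, and the use of U+FFFD replacement for malformed input is my assumption; neither comes from the draft.

```python
import codecs


def decode_utf16_proposed(data: bytes, label: str) -> str:
    """Decode bytes roughly as the quoted proposal describes (a sketch only)."""
    label = label.lower()
    if label not in ("utf-16", "utf-16le", "utf-16be"):
        raise ValueError(f"not a utf-16 label: {label}")

    # A BOM overrides the declared direction and is removed from the output.
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return data[2:].decode("utf-16-le", errors="replace")
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return data[2:].decode("utf-16-be", errors="replace")

    # No BOM: 'utf-16' and 'utf-16le' are treated as the same little-endian
    # encoding; only 'utf-16be' is read as big-endian.
    codec = "utf-16-be" if label == "utf-16be" else "utf-16-le"
    return data.decode(codec, errors="replace")


if __name__ == "__main__":
    sample = b"\x00\x7a"  # the 007A input used in the tests above
    for label in ("utf-16", "utf-16le", "utf-16be"):
        for bom in (b"", b"\xff\xfe", b"\xfe\xff"):
            text = decode_utf16_proposed(bom + sample, label)
            print(label, bom.hex() or "-", [f"{ord(c):04X}" for c in text])
```

Note that this sketch never inspects BOM-less content, so a big-endian file sent as plain 'utf-16' is read as little-endian, which is the case I am objecting to.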
That the BOM is removed from the output for utf-16be labelled files
means that the 'utf-16be' labelled file is nevertheless treated as
UTF-16 (per UTF-16's specification). (Otherwise, if it had not been
removed, the BOM character should have caused quirks mode.)

Taking what you did not test for into account, it would make sense if
'utf-16' continues to be treated as a label under which both big-endian
and little-endian can be expected, and thus that WebKit and IE start to
detect when UTF-16 is big-endian but has no BOM.
--
Leif H Silli
Received on Wednesday, 28 December 2011 03:30:49 UTC