[whatwg] UTF-16 encoding default from Ian Hickson on 2009-07-15 (public-whatwg-archive@w3.org from July 2009)

From: Ian Hickson <ian@hixie.ch>
Date: Wed, 15 Jul 2009 04:24:25 +0000 (UTC)
Message-ID: <Pine.LNX.4.62.0907150401570.23663@hixie.dreamhostps.com>

On Wed, 24 Jun 2009, Kartikaya Gupta wrote:
>
> There's a page 
> (http://www.microsoft.com/windowsmobile/mobile/en-us/totalaccess/software/software/eula-sw-netflix.mspx 
> specifically) that has a Content-Type header of "text/html; 
> charset=utf-16" and has no BOM. The references I've seen (RFC2781, as 
> well as http://unicode.org/faq/utf_bom.html#gen7) say that this means 
> the content should be assumed to be UTF-16BE. The page, however, is 
> actually in UTF-16LE.
> 
> All browsers seem to do some sort of unspecified magic and figure out 
> that the page is in LE. I was wondering if that magic could be described 
> and added to the HTML5 spec so that it covers rendering the above page 
> as expected. According to the draft spec as it stands, I believe that 
> page should be rendered as garbage.

IE and Safari assume UTF-16LE if the content is labeled as UTF-16.

Firefox checks to see if the first two bytes are null/not-null or 
not-null/null and acts accordingly; if they're both not null it uses BE 
and if they're both null it does something I don't recognise (and checks 
both the UTF-8 and UTF-16 character encodings in the menu...).
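
As a rough sketch of the two-byte check described above, not Firefox's actual code: the function name is hypothetical, and the mapping of null/not-null to BE and not-null/null to LE is an assumption based on how ASCII-range text looks in each byte order. The both-null case is left undecided because the behaviour there was not identified.

```python
from typing import Optional


def guess_from_first_two_bytes(data: bytes) -> Optional[str]:
    """Guess a UTF-16 byte order from the first two bytes of BOM-less content."""
    if len(data) < 2:
        return None  # assumption: too little data to decide
    first_null = data[0] == 0
    second_null = data[1] == 0
    if first_null and not second_null:
        return "utf-16-be"  # null/not-null: high byte comes first
    if not first_null and second_null:
        return "utf-16-le"  # not-null/null: low byte comes first
    if not first_null and not second_null:
        return "utf-16-be"  # both not null: BE, as observed
    return None  # both null: behaviour not identified


print(guess_from_first_two_bytes("<html>".encode("utf-16-le")))
```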

Opera appears to assume UTF-16BE unless the second, fourth, sixth, eighth, 
and tenth bytes are null and the first, third, fifth, seventh, and ninth 
bytes are not, in which case it assumes LE.
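
The ten-byte pattern can be sketched the same way. This is only an illustration of the observed behaviour, not Opera's code; the function name is hypothetical, and it reads the pattern (odd-numbered bytes set, even-numbered bytes null) as little-endian, since that is what ASCII-range text looks like in UTF-16LE. Falling back to BE when fewer than ten bytes are available is also an assumption.

```python
def guess_from_first_ten_bytes(data: bytes) -> str:
    """Default to UTF-16BE unless the first ten bytes look like ASCII in UTF-16LE."""
    head = data[:10]
    if len(head) == 10:
        odd_numbered = head[0::2]   # 1st, 3rd, 5th, 7th, 9th bytes
        even_numbered = head[1::2]  # 2nd, 4th, 6th, 8th, 10th bytes
        if all(b != 0 for b in odd_numbered) and all(b == 0 for b in even_numbered):
            return "utf-16-le"
    return "utf-16-be"  # assumption: also used when there are fewer than ten bytes


print(guess_from_first_ten_bytes("<html>".encode("utf-16-le")))
```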

I've added a requirement in the spec that UTF-16 with no BOM be treated as 
LE rather than BE.
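
In code, that rule amounts to roughly the following. This is a minimal sketch rather than spec text: the function name is hypothetical, and replacing malformed sequences instead of raising an error is an assumption made for the example.

```python
import codecs


def decode_labelled_utf16(data: bytes) -> str:
    """Decode content labelled "utf-16": honour a BOM if present, else assume LE."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le", errors="replace")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be", errors="replace")
    # No BOM: treat as little-endian rather than big-endian.
    return data.decode("utf-16-le", errors="replace")


print(decode_labelled_utf16("EULA".encode("utf-16-le")))
```

A BOM-less little-endian page like the one reported decodes as intended under this rule, whereas a strict RFC 2781 reading would decode it as big-endian and produce garbage.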

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Tuesday, 14 July 2009 21:24:25 UTC