Re: UTF-16, UTF-16BE and UTF-16LE in HTML5 from Philip Taylor on 2010-07-27 (www-international@w3.org from July to September 2010)

From: Philip Taylor <pjt47@cam.ac.uk>
Date: Tue, 27 Jul 2010 14:14:11 +0100
To: Richard Ishida <ishida@w3.org>
CC: 'Henri Sivonen' <hsivonen@iki.fi>, public-html@w3.org, www-international@w3.org
Message-ID: <4C4EDBA3.70201@cam.ac.uk>

Richard Ishida wrote:
> Well any encoding declaration may be wrong - participation in the
> encoding detection doesn't mean that the encoding of the document
> will actually be what the declaration says. So I don't think it makes
> much difference.  On the other hand, since actually getting your
> document into a utf-16 encoding is a little more complicated than
> using other encodings, it may be more often right - in which case it
> is extremely useful for people who visually inspect the document,
> given that they can't see the BOM and may otherwise assume that the
> encoding is not utf-16.

http://philip.html5.org/data/charsets-2.html#charset-utf-16 lists the 
pages that declared themselves as UTF-16, out of a sample of about 425K 
pages collected about a year ago. None declared UTF-16 in HTTP headers; 
all did so via <meta ... content="...">, as found after parsing with the 
validator.nu parser's default decoding behaviour. As far as I can tell, 
no pages in the data set had a BOM, so none would have been decoded as 
UTF-16.
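
To make the distinction above concrete, here is a rough sketch of the 
kind of check involved: does a page *declare* UTF-16 in a <meta> tag, and 
does it actually start with a UTF-16 BOM? This is not the validator.nu 
algorithm or the script used to produce the data; the function names and 
the regular expression are hypothetical and deliberately loose, and the 
1024-byte prescan window is an assumption made for illustration.

```python
import re

# Byte order marks for UTF-16 (little-endian, big-endian).
# Note: FF FE is also the start of the UTF-32LE BOM; this sketch ignores that.
UTF16_BOMS = (b"\xff\xfe", b"\xfe\xff")

# Hypothetical, loose pattern for a charset named inside a <meta> tag.
# It only matches ASCII-compatible bytes, so it cannot match real UTF-16.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def has_utf16_bom(data: bytes) -> bool:
    """Return True if the byte stream starts with a UTF-16 BOM."""
    return data.startswith(UTF16_BOMS)


def declared_charset(data: bytes):
    """Return the lower-cased charset named in a <meta> tag, or None."""
    # Assumption: only look at the first 1024 bytes of the page.
    match = META_CHARSET.search(data[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def claims_utf16_but_is_not(data: bytes) -> bool:
    """True if the page declares UTF-16 via <meta> but has no UTF-16 BOM."""
    declared = declared_charset(data)
    return (
        declared is not None
        and declared.startswith("utf-16")
        and not has_utf16_bom(data)
    )
```

For example, calling claims_utf16_but_is_not() on the bytes 
b'<meta http-equiv="Content-Type" content="text/html; charset=utf-16">' 
returns True: the declaration is readable as ASCII, and there is no BOM. 
A page that really is UTF-16 has a \0 byte next to every ASCII character, 
so a byte-level search like this one would not find its <meta> 
declaration at all.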

(It's possible this data set is biased against certain encodings: I 
don't know the details of how it was collected, and it was provided in a 
format that doesn't allow \0 bytes in pages, which UTF-16-encoded markup 
necessarily contains. But an older, smaller, independent set of pages 
gives http://philip.html5.org/data/charsets.html#charset-utf-16 which 
seems to follow the same pattern. Better data for analysis (and/or 
better analysis) would be welcome.)

Given the number of pages that claim they are UTF-16, and the apparent 
lack of pages that really are UTF-16, it seems untrue that it is "more 
often right" than declarations of other encodings. Someone who visually 
inspects (e.g. with 'view source') a random web page and sees <meta ... 
content="text/html; charset=utf-16"> would most likely be correct to 
assume that the encoding is *not* UTF-16, and would be misled if they 
believed the declaration. So it's still a very bad idea to check the 
encoding by reading this string.

-- 
Philip Taylor
pjt47@cam.ac.uk

Received on Tuesday, 27 July 2010 13:14:41 UTC