- From: Philip Taylor <pjt47@cam.ac.uk>
- Date: Tue, 27 Jul 2010 14:14:11 +0100
- To: Richard Ishida <ishida@w3.org>
- CC: 'Henri Sivonen' <hsivonen@iki.fi>, public-html@w3.org, www-international@w3.org
Richard Ishida wrote:
> Well any encoding declaration may be wrong - participation in the
> encoding detection doesn't mean that the encoding of the document
> will actually be what the declaration says. So I don't think it makes
> much difference. On the other hand, since actually getting your
> document into a utf-16 encoding is a little more complicated than
> using other encodings, it may be more often right - in which case it
> is extremely useful for people who visually inspect the document,
> given that they can't see the BOM and may otherwise assume that the
> encoding is not utf-16.

http://philip.html5.org/data/charsets-2.html#charset-utf-16 lists the
pages that declared themselves as UTF-16, out of about 425K pages (from
about a year ago). None specified UTF-16 in HTTP headers - all were via
<meta ... content="...">, after parsing with the validator.nu parser's
default decoding behaviour. As far as I can tell, no pages in the data
set had a BOM, so none would have been decoded as UTF-16.

(It's possible this data set is biased against certain encodings - I
don't know the details of how it was collected, and it was provided in a
format that doesn't allow \0 bytes in pages. But an (older, smaller)
independent set of pages gives
http://philip.html5.org/data/charsets.html#charset-utf-16 which seems to
follow the same pattern. Better data for analysis (and/or better
analysis) would be welcome.)

Given the number of pages that claim they are UTF-16, and the apparent
lack of pages that really are UTF-16, it seems untrue that it is "more
often right" than declarations of other encodings.

Someone who visually inspects (e.g. with 'view source') a random web
page and sees <meta ... content="text/html; charset=utf-16"> would most
likely be correct to assume that the encoding is *not* UTF-16, and would
be misled if they believed the declaration. So it's still a very bad
idea to check the encoding by reading this string.

-- 
Philip Taylor
pjt47@cam.ac.uk
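
For anyone who wants to repeat this kind of check on their own pages, here is a rough sketch (not the tooling used for the data above) of comparing a <meta> charset declaration against the BOM. The function names are made up for illustration, the regex is a crude stand-in for a real HTML parser's meta prescan, and it assumes a page without a BOM is ASCII-compatible enough to read as Latin-1. UTF-32 BOMs are ignored.

```python
import codecs
import re
from typing import Optional

# Hypothetical helpers for illustration; a crude approximation of a meta prescan.
META_CHARSET = re.compile(r'<meta[^>]*charset\s*=\s*["\']?\s*([\w.:-]+)', re.IGNORECASE)


def bom_encoding(raw: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None


def declared_charset(raw: bytes) -> Optional[str]:
    """Return the charset named in the first <meta ... charset=...>, lowercased."""
    # Assumption: with no BOM, treat the bytes as Latin-1 so that
    # ASCII-compatible markup is readable regardless of the real encoding.
    text = raw[:4096].decode(bom_encoding(raw) or "latin-1", errors="replace")
    match = META_CHARSET.search(text)
    return match.group(1).lower() if match else None


def utf16_declaration_matches_bom(raw: bytes) -> Optional[bool]:
    """None if the page does not declare UTF-16; otherwise whether a UTF-16 BOM backs it up."""
    declared = declared_charset(raw)
    if declared is None or not declared.startswith("utf-16"):
        return None
    return bom_encoding(raw) == "utf-16"
```

A page whose bytes contain a plainly readable charset=utf-16 declaration and no BOM comes out as False here, which is the situation described above: the declaration is visible in 'view source' precisely because the page is not UTF-16.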
Received on Tuesday, 27 July 2010 13:14:41 UTC