- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Fri, 23 May 2008 11:45:33 +0300
- To: Ian Hickson <ian@hixie.ch>
- Cc: HTML WG <public-html@w3.org>
On May 23, 2008, at 01:38, Ian Hickson wrote:

> On Fri, 21 Mar 2008, Henri Sivonen wrote:
>>>
>>> The user agent may attempt to autodetect the character encoding from
>>> applying frequency analysis or other algorithms to the data stream. If
>>> autodetection succeeds in determining a character encoding, then
>>> return that encoding, with the confidence tentative, and abort these
>>> steps.
>>
>> I think only US-ASCII superset encodings should be allowed as outcomes
>> of heuristic encoding detection. If a page is misdetected as UTF-16,
>> there's no later meta recourse.
>
> If the page is detected as UTF-16, the odds of it being anything else
> are extremely low,

I found the situation with a real off-the-shelf detector to be different.

> probably low enough that the likely benefit of detecting the page as
> UTF-16 is greater than the likely benefit of being able to recover from
> mis-detecting a page as UTF-16.

I highly doubt this, considering how relatively rare UTF-16 is on the Web.

>> Consider this case that I just programmed around: A Russian page is
>> encoded as Windows-1251. The page fails the meta prescan. A heuristic
>> detector misdetects the page as UTF-16 Chinese. A later meta gets
>> garbled and the parser output is garbage.
>
> Fix the heuristic. :-) Windows-1251 _really_ shouldn't ever get detected
> as UTF-16.
>
> Also, if we do what you suggested, then the reverse situation is as
> likely: a page that is UTF-16, misdetected as Windows-1251, resulting in
> the document being garbled.

Perhaps I'm being irrationally emotional here, but I think it's more
forgivable to make authoring mistakes with 8-bit legacy encodings than
with UTF-16, so I have far less sympathy for making bogus UTF-16 work.

I don't have real stats, but I'd expect undeclared Cyrillic 8-bit content
to be much more common than BOMless UTF-16 content.

>> I don't have statistics to back this up, but my educated guess based on
>> anecdotal evidence is that HTTP-unlabeled UTF-16BE and UTF-16LE (i.e.
>> BOMless) is very rare if not non-existent on the Web. On the other
>> hand, Russian pages that CJK-biased detector software can misdetect as
>> UTF-16 are a more likely occurrence on the Web.
>
> Well, the spec as it stands allows you to limit it to ASCII-superset-only
> if you want. However, I've heard from at least one vendor that they
> needed to detect UTF-16 (by looking for 00 3C 00 ?? and 3C 00 ?? 00 as
> the first four bytes; ?? != 00) to support some pages. I can't really
> see that heuristic being triggered by Windows-1251 pages.

That's sad, but if the Web requires it, perhaps the spec should mandate
that exact heuristic (a sketch of that four-byte check follows below).

--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
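
For concreteness, the vendor heuristic quoted above (first four bytes
00 3C 00 ?? or 3C 00 ?? 00, with ?? != 00) amounts to something like the
following sketch. The Python, the function name sniff_bomless_utf16, and
the sample inputs are illustrative assumptions, not code from this thread
or from any shipping detector; only the byte patterns come from the
message.

    # Sketch of the four-byte UTF-16 sniff described in the message above.
    def sniff_bomless_utf16(data: bytes):
        """Return "UTF-16BE" or "UTF-16LE" if the first four bytes look
        like BOMless UTF-16 starting with "<" (0x3C), otherwise None."""
        if len(data) < 4:
            return None
        b0, b1, b2, b3 = data[:4]
        # 00 3C 00 ?? with ?? != 00: "<" followed by a code unit whose
        # high byte is 00 and whose low byte is nonzero, i.e. big-endian.
        if b0 == 0x00 and b1 == 0x3C and b2 == 0x00 and b3 != 0x00:
            return "UTF-16BE"
        # 3C 00 ?? 00 with ?? != 00: the little-endian mirror image.
        if b0 == 0x3C and b1 == 0x00 and b2 != 0x00 and b3 == 0x00:
            return "UTF-16LE"
        return None

    # Illustrative inputs:
    #   sniff_bomless_utf16(b"\x00<\x00h\x00t")  -> "UTF-16BE"
    #   sniff_bomless_utf16(b"<\x00h\x00t\x00")  -> "UTF-16LE"
    #   A Windows-1251 page starting with "<html" is plain ASCII in its
    #   first four bytes, so it returns None; that is why this narrow
    #   check is hard to trigger with Cyrillic 8-bit content.

This narrow check only fires when the document starts with "<" encoded as
a 16-bit code unit, which is what makes it far less prone to the
Windows-1251 misdetection scenario than a general frequency-based detector.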
Received on Friday, 23 May 2008 08:46:22 UTC