- From: Ian Hickson <ian@hixie.ch>
- Date: Thu, 22 May 2008 22:38:51 +0000 (UTC)
- To: Henri Sivonen <hsivonen@iki.fi>
- Cc: HTML WG <public-html@w3.org>
On Fri, 21 Mar 2008, Henri Sivonen wrote:
> >
> > The user agent may attempt to autodetect the character encoding from
> > applying frequency analysis or other algorithms to the data stream. If
> > autodetection succeeds in determining a character encoding, then
> > return that encoding, with the confidence tentative, and abort these
> > steps.
>
> I think only US-ASCII superset encodings should be allowed as outcomes
> of heuristic encoding detection. If a page is misdetected as UTF-16,
> there's no later meta recourse.

If the page is detected as UTF-16, the odds of it being anything else are
extremely low, probably low enough that the likely benefit of detecting
the page as UTF-16 is greater than the likely benefit of being able to
recover from mis-detecting a page as UTF-16.

> Consider this case that I just programmed around:
> A Russian page is encoded as Windows-1251. The page fails the meta prescan. A
> heuristic detector misdetects the page as UTF-16 Chinese. A later meta gets
> garbled and the parser output is garbage.

Fix the heuristic. :-) Windows-1251 _really_ shouldn't ever get detected
as UTF-16.

Also, if we do what you suggested, then the reverse situation is as
likely: a page that is UTF-16, misdetected as Windows-1251, resulting in
the document being garbled.

> I don't have statistics to back this up, but my educated guess based on
> anecdotal evidence is that HTTP-unlabeled UTF-16BE and UTF-16LE (i.e.
> BOMless) is very rare if not non-existent on the Web. On the other hand,
> Russian pages that CJK-biased detector software can misdetect as UTF-16
> are a more likely occurrence on the Web.

Well, the spec as it stands allows you to limit it to ASCII-superset-only
if you want. However, I've heard from at least one vendor that they needed
to detect UTF-16 (by looking for 00 3C 00 ?? and 3C 00 ?? 00 as the first
four bytes; ?? != 00) to support some pages. I can't really see that
heuristic being triggered by Windows-1251 pages.
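For illustration only, the four-byte check mentioned above (00 3C 00 ?? and 3C 00 ?? 00, with ?? != 00) could be sketched roughly as follows. The function name and the mapping of the two patterns to "utf-16be" and "utf-16le" are assumptions made for this sketch (00 3C is "<" in big-endian order, 3C 00 in little-endian order); this is not the vendor's actual code or spec text.

```python
from typing import Optional


def sniff_bomless_utf16(data: bytes) -> Optional[str]:
    """Hypothetical helper: guess BOM-less UTF-16 from the first four bytes.

    Returns "utf-16be", "utf-16le", or None if neither pattern matches.
    A caller would presumably treat a match as a tentative result.
    """
    if len(data) < 4:
        return None
    b0, b1, b2, b3 = data[0], data[1], data[2], data[3]

    # 00 3C 00 ??  (?? != 00): "<" then another low code point, big-endian.
    if b0 == 0x00 and b1 == 0x3C and b2 == 0x00 and b3 != 0x00:
        return "utf-16be"

    # 3C 00 ?? 00  (?? != 00): the same two characters, little-endian.
    if b0 == 0x3C and b1 == 0x00 and b2 != 0x00 and b3 == 0x00:
        return "utf-16le"

    return None


if __name__ == "__main__":
    print(sniff_bomless_utf16("<html>".encode("utf-16-be")))
    print(sniff_bomless_utf16("<html>".encode("utf-16-le")))
    print(sniff_bomless_utf16("<html>Привет".encode("cp1251")))
```

A Windows-1251 page that starts with "<" has a non-zero second byte, so neither pattern can match it, which is consistent with the remark that this heuristic should not be triggered by Windows-1251 pages.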
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Thursday, 22 May 2008 22:39:31 UTC