Heuristic detection and non-ASCII superset encodings

> The user agent may attempt to autodetect the character encoding from  
> applying frequency analysis or other algorithms to the data stream.  
> If autodetection succeeds in determining a character encoding, then  
> return that encoding, with the confidence tentative, and abort these  
> steps.

I think only US-ASCII superset encodings should be allowed as outcomes  
of heuristic encoding detection. If a page is misdetected as UTF-16,  
a later meta element offers no recourse: its ASCII bytes are no longer  
recognizable to the parser.

Consider this case that I just programmed around:
A Russian page is encoded as Windows-1251. The page fails the meta  
prescan. A heuristic detector misdetects the page as UTF-16 Chinese. A  
later meta gets garbled and the parser output is garbage.

When only US-ASCII supersets can be detected, a later meta will set  
things right even if the heuristic detector fails.
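A sketch of what I mean, assuming a detector that returns an encoding name or nothing (the allowlist here is illustrative, not exhaustive):

```python
# Gate the outcome of a heuristic detector so that only US-ASCII
# superset encodings are ever returned. If the detector guesses anything
# else (e.g. UTF-16), treat detection as having failed, so the parser
# falls back and a later meta can still set things right.

# Illustrative allowlist: encodings in which bytes 0x00-0x7F decode to
# the corresponding ASCII characters.
ASCII_SUPERSETS = {
    "utf-8",
    "windows-1251",
    "windows-1252",
    "koi8-r",
    "iso-8859-1",
    "iso-8859-2",
}

def accept_detection(detected):
    """Return the detected encoding (confidence tentative) only if it
    is a US-ASCII superset; otherwise report detection failure."""
    if detected is None:
        return None
    name = detected.lower()
    return name if name in ASCII_SUPERSETS else None

print(accept_detection("windows-1251"))  # windows-1251 (tentative)
print(accept_detection("UTF-16LE"))      # None: detection failed
```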

I don't have statistics to back this up, but my educated guess based  
on anecdotal evidence is that HTTP-unlabeled (i.e. BOMless) UTF-16BE  
and UTF-16LE content is very rare, if not non-existent, on the Web. On  
the other hand, Russian pages that CJK-biased detector software can  
misdetect as UTF-16 are a more likely occurrence on the Web.

Henri Sivonen

Received on Friday, 21 March 2008 09:56:45 UTC