Re: Heuristic detection and non-ASCII superset encodings

On Fri, 21 Mar 2008, Henri Sivonen wrote:
> >
> > The user agent may attempt to autodetect the character encoding from 
> > applying frequency analysis or other algorithms to the data stream. If 
> > autodetection succeeds in determining a character encoding, then 
> > return that encoding, with the confidence tentative, and abort these 
> > steps.
> 
> I think only US-ASCII superset encodings should be allowed as outcomes 
> of heuristic encoding detection. If a page is misdetected as UTF-16, 
> there's no later meta recourse.

If the page is detected as UTF-16, the odds of it being anything else are 
extremely low, probably low enough that the likely benefit of detecting 
the page as UTF-16 is greater than the likely benefit of being able to 
recover from mis-detecting a page as UTF-16.


> Consider this case that I just programmed around:
> A Russian page is encoded as Windows-1251. The page fails the meta prescan. A
> heuristic detector misdetects the page as UTF-16 Chinese. A later meta gets
> garbled and the parser output is garbage.

Fix the heuristic. :-) Windows-1251 _really_ shouldn't ever get detected 
as UTF-16.

Also, if we do what you suggested, then the reverse situation is as 
likely: a page that is UTF-16, misdetected as Windows-1251, resulting in 
the document being garbled.


> I don't have statistics to back this up, but my educated guess based on 
> anecdotal evidence is that HTTP-unlabeled UTF-16BE and UTF-16LE (i.e. 
> BOMless) is very rare if not non-existent on the Web. On the other hand, 
> Russian pages that CJK-biased detector software can misdetect as UTF-16 
> are a more likely occurrence on the Web.

Well, the spec as it stands allows you to limit it to ASCII-superset-only 
if you want. However, I've heard from at least one vendor that they needed 
to detect UTF-16 (by looking for 00 3C 00 ?? or 3C 00 ?? 00 as the first 
four bytes; ?? != 00) to support some pages. I can't really see that 
heuristic being triggered by Windows-1251 pages.
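
For illustration, a minimal sketch of that four-byte check might look like 
the following. The function name and the returned labels are hypothetical, 
and this is only one reading of the byte patterns described above, not any 
vendor's actual code.

```python
from typing import Optional


def sniff_bomless_utf16(data: bytes) -> Optional[str]:
    """Return a tentative UTF-16 guess from the first four bytes, or None.

    Hypothetical helper: it only checks the two patterns mentioned above
    (00 3C 00 ?? and 3C 00 ?? 00, with ?? != 00), i.e. a '<' followed by
    one more character whose other byte is zero.
    """
    if len(data) < 4:
        return None
    b0, b1, b2, b3 = data[:4]
    # 00 3C 00 ??  ->  big-endian '<' followed by a non-NUL low byte
    if b0 == 0x00 and b1 == 0x3C and b2 == 0x00 and b3 != 0x00:
        return "utf-16be"
    # 3C 00 ?? 00  ->  little-endian '<' followed by a non-NUL low byte
    if b0 == 0x3C and b1 == 0x00 and b2 != 0x00 and b3 == 0x00:
        return "utf-16le"
    return None


if __name__ == "__main__":
    russian = "<html><title>Привет</title>"
    print(sniff_bomless_utf16("<html>".encode("utf-16-be")))
    print(sniff_bomless_utf16("<html>".encode("utf-16-le")))
    # A Windows-1251 page starts with 3C 68 74 6D, so neither pattern fits.
    print(sniff_bomless_utf16(russian.encode("cp1251")))
```

A Windows-1251 page that begins with markup has no zero bytes in its first 
four bytes, so a check of this shape returns no guess for it.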

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Thursday, 22 May 2008 22:39:31 UTC