Re: Heuristic detection and non-ASCII superset encodings

On May 23, 2008, at 01:38, Ian Hickson wrote:

> On Fri, 21 Mar 2008, Henri Sivonen wrote:
>>>
>>> The user agent may attempt to autodetect the character encoding from
>>> applying frequency analysis or other algorithms to the data stream. If
>>> autodetection succeeds in determining a character encoding, then return
>>> that encoding, with the confidence tentative, and abort these steps.
>>
>> I think only US-ASCII superset encodings should be allowed as outcomes of
>> heuristic encoding detection. If a page is misdetected as UTF-16, there's
>> no later meta recourse.
>
> If the page is detected as UTF-16, the odds of it being anything else are
> extremely low,

I found the situation with a real off-the-shelf detector to be different.
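
The kind of guard I have in mind is small, though. A sketch of it
(Python-flavoured; detect_encoding() and the other names are placeholders
of mine, not the detector I actually used):

def is_ascii_superset(encoding_name):
    # Crude check: every printable ASCII byte must decode to itself.
    ascii_bytes = bytes(range(0x20, 0x7F))
    try:
        decoded = ascii_bytes.decode(encoding_name)
        return decoded == ascii_bytes.decode('ascii')
    except (UnicodeDecodeError, LookupError):
        return False

def guarded_detect(stream_prefix):
    # detect_encoding() stands in for whatever off-the-shelf detector is
    # in use. The guard is the point: a non-ASCII-superset guess gets
    # discarded, so a later meta can still take effect.
    guess = detect_encoding(stream_prefix)  # hypothetical detector call
    if guess is not None and is_ascii_superset(guess):
        return guess  # confidence stays tentative, per the quoted spec text
    return None       # fall through to the default encoding instead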

> probably low enough that the likely benefit of detecting
> the page as UTF-16 is greater than the likely benefit of being able to
> recover from mis-detecting a page as UTF-16.

I highly doubt this, considering how relatively rare UTF-16 is on the Web.

>> Consider this case that I just programmed around: A Russian page is
>> encoded as Windows-1251. The page fails the meta prescan. A heuristic
>> detector misdetects the page as UTF-16 Chinese. A later meta gets garbled
>> and the parser output is garbage.
>
> Fix the heuristic. :-) Windows-1251 _really_ shouldn't ever get detected
> as UTF-16.
>
> Also, if we do what you suggested, then the reverse situation is as
> likely: a page that is UTF-16, misdetected as Windows-1251, resulting in
> the document being garbled.

Perhaps I'm being irrationally emotional here, but I think it's more
forgivable to make authoring mistakes with 8-bit legacy encodings than
with UTF-16, so I have far less sympathy for making bogus UTF-16 work. I
don't have real stats, but I'd expect undeclared Cyrillic 8-bit content to
be much more common than BOMless UTF-16 content.
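
To make the "no later meta recourse" point concrete, here is a minimal
sketch (Python; the markup is made up, not the actual page I was dealing
with) of why a UTF-16 misdetection is unrecoverable while an 8-bit one is
not:

# A Windows-1251 page whose only encoding declaration is a late meta.
html_1251 = ('<html><head><title>\u041f\u0440\u0438\u0432\u0435\u0442</title>'
             '<meta http-equiv="Content-Type" '
             'content="text/html; charset=windows-1251">'
             '</head><body>...</body></html>')
raw = html_1251.encode('windows-1251')

# If the detector guesses UTF-16 (little-endian, say), each byte pair
# becomes one code unit, so the ASCII "<meta" never reaches the tokenizer
# and the late declaration cannot take effect.
garbled = raw.decode('utf-16-le', errors='replace')
print('charset=' in garbled)   # False: the declaration is gone
print(garbled[:20])            # CJK-looking junk, much like what I saw

# Any ASCII-superset guess, even a wrong one, keeps the markup intact, so
# the later meta can still switch the decoder to Windows-1251.
recoverable = raw.decode('iso-8859-1')
print('charset=windows-1251' in recoverable)   # True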

>> I don't have statistics to back this up, but my educated guess based on
>> anecdotal evidence is that HTTP-unlabeled UTF-16BE and UTF-16LE (i.e.
>> BOMless) is very rare if not non-existent on the Web. On the other hand,
>> Russian pages that CJK-biased detector software can misdetect as UTF-16
>> are a more likely occurrence on the Web.
>
> Well, the spec as it stands allows you to limit it to ASCII-superset-only
> if you want. However, I've heard from at least one vendor that they needed
> to detect UTF-16 (by looking for 00 3C 00 ?? and 3C 00 ?? 00 as the first
> four bytes; ?? != 00) to support some pages. I can't really see that
> heuristic being triggered by Windows-1251 pages.


That's sad, but if the Web requires it, perhaps the spec should mandate
that exact heuristic.
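
For what it's worth, that check is small enough to state exactly. A sketch
of my reading of it (Python; the function name and return values are mine):

from typing import Optional

def sniff_utf16(first_bytes: bytes) -> Optional[str]:
    # Inspect only the first four bytes of the stream, per the heuristic
    # quoted above, and report an endianness or nothing.
    if len(first_bytes) < 4:
        return None
    b0, b1, b2, b3 = first_bytes[:4]
    # 00 3C 00 ??, ?? != 00: '<' as a big-endian UTF-16 code unit followed
    # by the start of a second, non-NUL character.
    if b0 == 0x00 and b1 == 0x3C and b2 == 0x00 and b3 != 0x00:
        return 'UTF-16BE'
    # 3C 00 ?? 00, ?? != 00: the same shape in little-endian byte order.
    if b0 == 0x3C and b1 == 0x00 and b2 != 0x00 and b3 == 0x00:
        return 'UTF-16LE'
    return None

Since ordinary Windows-1251 text contains no 0x00 bytes, this particular
check indeed can't fire on the Cyrillic pages I'm worried about, which
makes it much less risky than open-ended UTF-16 detection.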

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
