Re: WebVTT feedback

On Wed, 07 Dec 2011 21:32:17 +0100, Glenn Maynard <glenn@zewt.org> wrote:

> (We might want to shift lists if we want to pursue this, since the
> language-detection stuff isn't WebVTT-specific.)
>
> On Wed, Dec 7, 2011 at 10:22 AM, Philip Jägenstedt <philipj@opera.com> wrote:
>
>> I'm not an expert on this, but basically what we do is traverse the
>> input (as Unicode code points), count the number of hits for different
>> buckets of script families, and see which wins. For separating
>> simplified and traditional Chinese (and maybe Japanese), which have a
>> lot of overlap, I believe we look for common characters that are unique
>> to each script (like 国 or 國) and see which class of characters wins.
>>
>
> FWIW:
>
> Firefox apparently defaults to Japanese for UTF-8 CJK when the language
> isn't specified.  It doesn't do any complex heuristics, and it doesn't
> depend on the user's locale.  This seems like the optimal solution--no
> heuristics (making it more predictable for users), and none of the
> locale-specific behavior that plagues charsets.

That doesn't sound very good at all for unlabeled simplified or  
traditional Chinese.

> IE8 seems to be the opposite: the language defaults to the user's locale.
> When not in a CJK locale, IE9 appears to default to zh-CN.
>
> (WebKit/Chrome don't even seem to support @lang, so it's still in the
> dark ages, and probably one reason much of Japan is still using
> Shift-JIS.  On quick examination Chrome appears to use zh-CN for UTF-8
> on my Japanese-locale system, so it doesn't appear to have any
> locale-sensitivity, and doesn't appear to perform any codepoint
> heuristics.)
>
> If browsers were to converge here, Firefox's seems optimal, Opera's
> (assuming it has no locale sensitivity) seems okay, and IE's seems the
> worst.

I'm not aware of any locale-specific behavior here, but I think the
character encoding plays into this somehow, such that content served as
GBK is more likely to be treated as simplified Chinese and content served
as Big5 as traditional Chinese.
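
For concreteness, here is a rough Python sketch of the kind of bucket
counting described in the quoted text above. It is not Opera's actual
code. The function names, the code point ranges chosen for each bucket,
the two tiny character tables, and the use of the encoding only as a
tie-breaker are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical, deliberately tiny tables; a usable detector needs many more.
SIMPLIFIED_ONLY = set("国这们说时")
TRADITIONAL_ONLY = set("國這們說時")

# Assumed mapping from a legacy encoding label to a Chinese variant hint.
ENCODING_HINT = {"gbk": "zh-Hans", "gb2312": "zh-Hans", "big5": "zh-Hant"}


def script_bucket(ch):
    """Return a coarse script bucket for one character, or None."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF:   # Hiragana and Katakana
        return "kana"
    if 0xAC00 <= cp <= 0xD7A3:   # Hangul syllables
        return "hangul"
    if 0x4E00 <= cp <= 0x9FFF:   # CJK Unified Ideographs
        return "han"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return None


def guess_language(text, encoding=None):
    """Guess a language tag by counting script hits; 'und' if undecided."""
    buckets = Counter()
    simplified = traditional = 0
    for ch in text:
        bucket = script_bucket(ch)
        if bucket:
            buckets[bucket] += 1
        if ch in SIMPLIFIED_ONLY:
            simplified += 1
        elif ch in TRADITIONAL_ONLY:
            traditional += 1
    if not buckets:
        return "und"
    winner = buckets.most_common(1)[0][0]
    if winner == "kana":
        return "ja"
    if winner == "hangul":
        return "ko"
    if winner != "han":
        return "und"
    # Han won: compare the script-specific character counts.
    if simplified > traditional:
        return "zh-Hans"
    if traditional > simplified:
        return "zh-Hant"
    # Tie: fall back to the encoding hint, if any (an assumption).
    return ENCODING_HINT.get((encoding or "").lower(), "zh")
```

Something like guess_language("中文", encoding="Big5") falls through to the
encoding hint because neither table matches. Note that a sketch this small
would mislabel Japanese text written mostly in kanji, since 国 is also the
modern Japanese form.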

-- 
Philip Jägenstedt
Core Developer
Opera Software

Received on Friday, 9 December 2011 12:45:43 UTC