Re: WebVTT feedback from Glenn Maynard on 2011-12-07 (public-texttracks@w3.org from December 2011)

From: Glenn Maynard <glenn@zewt.org>
Date: Wed, 7 Dec 2011 15:32:17 -0500
To: Philip Jägenstedt <philipj@opera.com>
Cc: public-texttracks@w3.org
Message-ID: <CABirCh8E7dLkZGAYGWzc_3eXQi2GfMp9awSdGKNc_Ph0qvKcoA@mail.gmail.com>

(We might want to shift lists if we want to pursue this, since the
language-detection stuff isn't WebVTT-specific.)

On Wed, Dec 7, 2011 at 10:22 AM, Philip Jägenstedt <philipj@opera.com> wrote:

> I'm not an expert on this, but basically what we do is traverse the input
> (as unicode points) and count the number of hits for different buckets of
> script families and see which wins. For separating simplified and
> traditional Chinese (and maybe Japanese) that have a lot of overlap, I
> believe we look for common characters that are unique for each script (like
> 国 or 國) and see which class of characters wins.
>
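
A rough Python sketch of the bucket counting described in the quote above,
for concreteness. It is not Opera's actual code: the function names, the code
point ranges, and the tiny marker sets are all assumptions made for
illustration, and a real implementation would need much larger tables.

```python
from collections import Counter

# Hypothetical marker sets: a few common characters whose simplified and
# traditional forms differ. Note that 国 is also the standard form in
# Japanese, which is why kana is checked before these markers are consulted.
SIMPLIFIED_MARKERS = set("国们这说对")
TRADITIONAL_MARKERS = set("國們這說對")


def bucket(ch):
    """Map one character to a script bucket, or None if it is not CJK."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF:      # Hiragana and Katakana blocks
        return "kana"
    if 0xAC00 <= cp <= 0xD7A3:      # precomposed Hangul syllables
        return "hangul"
    if ch in SIMPLIFIED_MARKERS:
        return "simplified"
    if ch in TRADITIONAL_MARKERS:
        return "traditional"
    if 0x4E00 <= cp <= 0x9FFF:      # CJK Unified Ideographs (shared)
        return "han"
    return None


def guess_language(text):
    """Count hits per bucket and return a language tag, or None if unsure."""
    counts = Counter(b for b in map(bucket, text) if b is not None)
    if not counts:
        return None
    # Kana or Hangul settles the question regardless of ideograph counts.
    if counts["kana"] or counts["hangul"]:
        return "ja" if counts["kana"] >= counts["hangul"] else "ko"
    if counts["simplified"] > counts["traditional"]:
        return "zh-Hans"
    if counts["traditional"] > counts["simplified"]:
        return "zh-Hant"
    return None  # only shared ideographs, nothing to distinguish them
```

For example, guess_language("這是中國") lands in the traditional bucket, while
a string made up only of shared ideographs, such as "日本", returns None,
which is exactly the case where some default has to take over.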

FWIW:

Firefox apparently defaults to Japanese for UTF-8 CJK when the language
isn't specified.  It doesn't do any complex heuristics, and it doesn't
depend on the user's locale.  This seems like the optimal solution: it has
no heuristics (making it more predictable for users) and none of the
locale-specific behavior that plagues charsets.

IE8 seems to be the opposite: the language defaults to the user's locale.
When not in a CJK locale, IE9 appears to default to zh-CN.

(WebKit/Chrome don't even seem to support @lang, so they're still in the
dark ages, which is probably one reason much of Japan is still using
Shift-JIS.  On quick examination, Chrome appears to use zh-CN for UTF-8 on
my Japanese-locale system, so it doesn't appear to have any locale
sensitivity or to perform any codepoint heuristics.)

If browsers were to converge here, Firefox's behavior seems optimal,
Opera's (assuming it has no locale sensitivity) seems okay, and IE's seems
the worst.
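
To make the difference between the fallback policies concrete, here is a
small Python sketch of the two non-heuristic behaviors as observed above. It
only models the reported behavior and is not taken from any browser's source;
the function names and the locale parsing are assumptions for illustration.

```python
# Primary language subtags treated as "CJK locales" in this sketch.
CJK_LANGUAGES = {"ja", "ko", "zh"}


def fixed_default(user_locale):
    """Firefox-like policy as observed: always Japanese, locale ignored."""
    return "ja"


def locale_default(user_locale):
    """IE-like policy as observed: follow a CJK locale, else use zh-CN."""
    tag = user_locale.replace("_", "-")
    primary = tag.split("-")[0].lower()
    return tag if primary in CJK_LANGUAGES else "zh-CN"


# The same untagged UTF-8 page, viewed by users in different locales.
for loc in ("en-US", "ja-JP", "zh-TW"):
    print(loc, fixed_default(loc), locale_default(loc))
```

With the fixed policy every user gets the same result for the same page,
whereas the locale-based policy can give a different result to each user,
which is the kind of locale-dependent behavior that makes charsets painful.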

-- 
Glenn Maynard

Received on Wednesday, 7 December 2011 20:32:45 UTC