- From: Glenn Maynard <glenn@zewt.org>
- Date: Wed, 7 Dec 2011 15:32:17 -0500
- To: Philip Jägenstedt <philipj@opera.com>
- Cc: public-texttracks@w3.org
- Message-ID: <CABirCh8E7dLkZGAYGWzc_3eXQi2GfMp9awSdGKNc_Ph0qvKcoA@mail.gmail.com>
(We might want to shift lists if we want to pursue this, since the language-detection stuff isn't WebVTT-specific.)

On Wed, Dec 7, 2011 at 10:22 AM, Philip Jägenstedt <philipj@opera.com> wrote:

> I'm not an expert on this, but basically what we do is traverse the input
> (as unicode points) and count the number of hits for different buckets of
> script families and see which wins. For separating simplified and
> traditional Chinese (and maybe Japanese) that have a lot of overlap, I
> believe we look for common characters that are unique for each script (like
> 国 or 國) and see which class of characters wins.

FWIW: Firefox apparently defaults to Japanese for UTF-8 CJK when the language isn't specified. It doesn't do any complex heuristics, and it doesn't depend on the user's locale. This seems like the optimal solution: no heuristics (making it more predictable for users), and none of the locale-specific behavior that plagues charsets.

IE8 seems to be the opposite: the language defaults to the user's locale. When not in a CJK locale, IE9 appears to default to zh-CN.

(WebKit/Chrome don't even seem to support @lang, so they're still in the dark ages, which is probably one reason much of Japan is still using Shift-JIS. On quick examination, Chrome appears to use zh-CN for UTF-8 on my Japanese-locale system, so it doesn't appear to have any locale sensitivity or to perform any codepoint heuristics.)

If browsers were to converge here, Firefox's approach seems optimal, Opera's (assuming it has no locale sensitivity) seems okay, and IE's seems the worst.

--
Glenn Maynard
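
For reference, the bucket-counting heuristic described in the quoted text could be sketched roughly as follows. This is only an illustration under stated assumptions, not Opera's actual code: the function names, the code point ranges, the tiny marker sets, and the han_fallback parameter are all hypothetical, and a real detector would need far larger tables.

```python
from collections import Counter

# Hypothetical marker sets: a few characters whose simplified and
# traditional forms differ. A real table would be much larger.
SIMPLIFIED_MARKERS = set("国这们说")
TRADITIONAL_MARKERS = set("國這們說")

# Coarse code point ranges per bucket (assumption: main blocks only).
BUCKET_RANGES = {
    "kana": [(0x3040, 0x30FF)],    # Hiragana and Katakana
    "hangul": [(0xAC00, 0xD7A3)],  # Hangul syllables
    "han": [(0x4E00, 0x9FFF)],     # CJK Unified Ideographs
}


def script_bucket(ch):
    """Return the bucket name for a character, or None if it is in no bucket."""
    cp = ord(ch)
    for bucket, ranges in BUCKET_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return bucket
    return None


def guess_cjk_language(text, han_fallback="ja"):
    """Guess a language tag by counting hits per bucket and picking the winner.

    han_fallback is what gets returned when Han wins but the marker
    characters do not decide between simplified and traditional.
    """
    buckets = Counter()
    markers = Counter()
    for ch in text:
        bucket = script_bucket(ch)
        if bucket is not None:
            buckets[bucket] += 1
        if ch in SIMPLIFIED_MARKERS:
            markers["zh-Hans"] += 1
        elif ch in TRADITIONAL_MARKERS:
            markers["zh-Hant"] += 1

    if not buckets:
        return None  # no CJK characters seen at all

    winner = buckets.most_common(1)[0][0]
    if winner == "kana":
        return "ja"
    if winner == "hangul":
        return "ko"

    # Han won: see which class of marker characters wins.
    if markers["zh-Hans"] > markers["zh-Hant"]:
        return "zh-Hans"
    if markers["zh-Hant"] > markers["zh-Hans"]:
        return "zh-Hant"
    return han_fallback
```

Note that 国 is also the form used in Japanese, so kana-free Japanese text would be guessed as zh-Hans here, which is exactly the kind of overlap that makes a fixed default more predictable than a heuristic.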
Received on Wednesday, 7 December 2011 20:32:45 UTC