- From: Philip Jägenstedt <philipj@opera.com>
- Date: Fri, 09 Dec 2011 13:45:13 +0100
- To: public-texttracks@w3.org
On Wed, 07 Dec 2011 21:32:17 +0100, Glenn Maynard <glenn@zewt.org> wrote:

> (We might want to shift lists if we want to pursue this, since the
> language-detection stuff isn't WebVTT-specific.)
>
> On Wed, Dec 7, 2011 at 10:22 AM, Philip Jägenstedt <philipj@opera.com> wrote:
>
>> I'm not an expert on this, but basically what we do is traverse the input
>> (as unicode points) and count the number of hits for different buckets of
>> script families and see which wins. For separating simplified and
>> traditional Chinese (and maybe Japanese) that have a lot of overlap, I
>> believe we look for common characters that are unique for each script
>> (like 国 or 國) and see which class of characters wins.
>
> FWIW:
>
> Firefox apparently defaults to Japanese for UTF-8 CJK when the language
> isn't specified. It doesn't do any complex heuristics, and it doesn't
> depend on the user's locale. This seems like the optimal solution--no
> heuristics (making it more predictable for users), and has none of the
> locale-specific behavior that plague charsets.

That doesn't sound very good at all for unlabeled simplified or traditional Chinese.

> IE8 seems to be the opposite: the language defaults to the user's locale.
> When not in a CJK locale, IE9 appears to default to zh-CN.
>
> (WebKit/Chrome don't even seem to support @lang, so it's still in the dark
> ages, and probably one reason much of Japan is still using Shift-JIS. On
> quick examination Chrome appears to use zh-CN for UTF-8 on my
> Japanese-locale system, so it doesn't appear to have any
> locale-sensitivity, and doesn't appear to perform any codepoint
> heuristics.)
>
> If browsers were to converge here, Firefox's seems optimal, Opera's
> (assuming it has no locale sensitivity) seems okay, and IE's seems the
> worst.
I'm not aware of any locale specific stuff here, but I think that the character encoding plays into this somehow, such that content served as GBK or Big5 is more likely to be considered to be simplified and traditional Chinese respectively.

-- 
Philip Jägenstedt
Core Developer
Opera Software
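
For illustration only, the bucket-counting idea quoted above might be sketched roughly as follows. This is not Opera's code: the function name `guess_cjk_language`, the tiny character samples, the code point ranges chosen, and the use of the encoding purely as a tie-breaker are all assumptions made for the sketch.

```python
# Hypothetical sketch of a script-bucket heuristic; not any browser's actual code.
# Tiny illustrative samples of characters that differ between the two scripts.
# A real table would be far larger and more carefully chosen.
SIMPLIFIED_ONLY = set("国们这说时对学")
TRADITIONAL_ONLY = set("國們這說時對學")

# Assumed mapping from legacy encodings to a Chinese variant (tie-breaker only).
ENCODING_HINTS = {"gbk": "zh-Hans", "gb2312": "zh-Hans", "big5": "zh-Hant"}


def guess_cjk_language(text, encoding_hint=None):
    """Return 'ja', 'ko', 'zh-Hans', 'zh-Hant', 'zh', or None for non-CJK text."""
    kana = hangul = han = simplified = traditional = 0
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:            # Hiragana and Katakana
            kana += 1
        elif 0xAC00 <= cp <= 0xD7A3:          # Hangul syllables
            hangul += 1
        elif 0x4E00 <= cp <= 0x9FFF:          # CJK Unified Ideographs
            han += 1
            if ch in SIMPLIFIED_ONLY:
                simplified += 1
            elif ch in TRADITIONAL_ONLY:
                traditional += 1

    if kana == 0 and hangul == 0 and han == 0:
        return None                           # no CJK code points at all
    if kana or hangul:
        # Kana or Hangul settle the question before ideographs are compared,
        # since Japanese shares many ideographs with both Chinese scripts.
        return "ja" if kana >= hangul else "ko"
    if simplified > traditional:
        return "zh-Hans"
    if traditional > simplified:
        return "zh-Hant"
    # Tie: fall back to the encoding hint if one is known, else plain 'zh'.
    return ENCODING_HINTS.get((encoding_hint or "").lower(), "zh")


print(guess_cjk_language("这是中国"))
print(guess_cjk_language("這是中國"))
print(guess_cjk_language("これは日本の国です"))
```

Checking kana first matters in this sketch because 国 is also the standard form in Japanese, so the simplified/traditional comparison is only meaningful once Japanese has been ruled out.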
Received on Friday, 9 December 2011 12:45:43 UTC