- From: Philip Jägenstedt <philipj@opera.com>
- Date: Wed, 07 Dec 2011 16:22:00 +0100
- To: public-texttracks@w3.org
On Tue, 06 Dec 2011 01:38:14 +0100, Ian Hickson <ian@hixie.ch> wrote:

> On Sat, 3 Dec 2011, Philip Jägenstedt wrote:
>>
>> We're going to be doing the same script detection heuristics that we do
>> on web pages. Differentiating between simplified Chinese, traditional
>> Chinese and Japanese isn't particularly hard.
>
> Can we define these for interoperability, or are they proprietary? (I
> don't imagine people writing their own small WebVTT implementations are
> going to know how to do this if we don't have a spec.)

Yes, we'd love for script detection heuristics to be specified, both for HTML and WebVTT.

I'm not an expert on this, but basically what we do is traverse the input (as Unicode code points), count the number of hits for different buckets of script families, and see which bucket wins. For separating simplified and traditional Chinese (and maybe Japanese), which have a lot of overlap, I believe we look for common characters that are unique to each script (like 国 or 國) and see which class of characters wins.

What would be the appropriate way to proceed?

-- 
Philip Jägenstedt
Core Developer
Opera Software
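A minimal sketch of the bucket-counting heuristic described above, in Python. The script ranges and the simplified/traditional marker characters below are illustrative assumptions, not Opera's actual tables or anything a spec has defined:

```python
# Illustrative sketch only: rough Unicode ranges and a handful of marker
# characters chosen for the example, not a specified or shipping algorithm.

def detect_script(text: str) -> str:
    """Count code points per rough script bucket and return the winner."""
    buckets = {"latin": 0, "kana": 0, "hangul": 0, "cjk": 0}
    for ch in text:
        if not ch.isalpha():                 # skip digits, punctuation, spaces
            continue
        cp = ord(ch)
        if cp < 0x0250:                      # Basic Latin through Latin Extended-B
            buckets["latin"] += 1
        elif 0x3040 <= cp <= 0x30FF:         # Hiragana + Katakana
            buckets["kana"] += 1
        elif 0xAC00 <= cp <= 0xD7AF:         # Hangul syllables
            buckets["hangul"] += 1
        elif 0x4E00 <= cp <= 0x9FFF:         # CJK Unified Ideographs
            buckets["cjk"] += 1
    return max(buckets, key=buckets.get)

# For text dominated by CJK ideographs, compare characters that differ between
# simplified and traditional Chinese (e.g. 国 vs 國) and see which class wins.
SIMPLIFIED_MARKERS = set("国们这来说时为学会对")
TRADITIONAL_MARKERS = set("國們這來說時為學會對")

def detect_chinese_variant(text: str) -> str:
    simplified = sum(1 for ch in text if ch in SIMPLIFIED_MARKERS)
    traditional = sum(1 for ch in text if ch in TRADITIONAL_MARKERS)
    if simplified > traditional:
        return "zh-Hans"
    if traditional > simplified:
        return "zh-Hant"
    return "und"  # undetermined
```

A real implementation would of course use much larger character tables and cover more scripts; the point is only to show the counting-and-comparing structure the mail describes.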
Received on Wednesday, 7 December 2011 15:22:36 UTC