- From: Ian Hickson <ian@hixie.ch>
- Date: Tue, 5 Jun 2012 22:06:37 +0000 (UTC)
- To: Philip Jägenstedt <philipj@opera.com>
- cc: public-texttracks@w3.org
- Message-ID: <Pine.LNX.4.64.1206052205050.378@ps20323.dreamhostps.com>
On Wed, 7 Dec 2011, Philip Jägenstedt wrote:
> On Tue, 06 Dec 2011 01:38:14 +0100, Ian Hickson <ian@hixie.ch> wrote:
> > On Sat, 3 Dec 2011, Philip Jägenstedt wrote:
> > >
> > > We're going to be doing the same script detection heuristics that we
> > > do on web pages. Differentiating between simplified Chinese,
> > > traditional Chinese and Japanese isn't particularly hard.
> >
> > Can we define these for interoperability, or are they proprietary? (I
> > don't imagine people writing their own small WebVTT implementations
> > are going to know how to do this if we don't have a spec.)
>
> Yes, we'd love for script detection heuristics to be specified, both for
> HTML and WebVTT.
>
> I'm not an expert on this, but basically what we do is traverse the
> input (as unicode code points) and count the number of hits for
> different buckets of script families and see which wins. For separating
> simplified and traditional Chinese (and maybe Japanese), which have a
> lot of overlap, I believe we look for common characters that are unique
> to each script (like 国 or 國) and see which class of characters wins.
>
> What would be the appropriate way to proceed?

I considered trying to spec this myself, but I don't have the bandwidth
to take on something that size at the moment.

I think the best way to proceed would be for someone to write a
specification that defines the algorithm that does the script detection,
and then for me to update HTML and WebVTT to plug into that algorithm.

--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
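As a concrete starting point for the bucket-counting heuristic Philip describes, here is a minimal sketch in Python. The Unicode ranges, the marker-character sets, and the tie-breaking rules are illustrative assumptions on my part, not Opera's actual implementation or any specified algorithm:

```python
from collections import Counter

# Marker characters that are common in exactly one of the two Chinese
# scripts (simplified/traditional pairs, e.g. 国/國). These sets are
# illustrative only; a real detector would use much larger lists.
SIMPLIFIED_MARKERS = set("国们这来对时学会为说")
TRADITIONAL_MARKERS = set("國們這來對時學會為說")

def detect_script(text: str) -> str:
    """Guess the dominant script of `text` by bucket counting.

    Returns "Japanese", "Han-simplified", "Han-traditional", "Latin",
    or "Unknown". A real implementation would cover many more script
    families; this sketch handles only the cases discussed above.
    """
    buckets = Counter()
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:            # Hiragana and Katakana
            buckets["kana"] += 1
        elif 0x4E00 <= cp <= 0x9FFF:          # CJK Unified Ideographs
            buckets["han"] += 1
        elif ch.isalpha() and cp < 0x0250:    # Basic and extended Latin
            buckets["latin"] += 1

    if not buckets:
        return "Unknown"

    # Any kana at all is a strong signal for Japanese, since Chinese
    # text does not use it.
    if buckets["kana"] > 0:
        return "Japanese"

    if buckets["han"] >= buckets["latin"]:
        # Disambiguate simplified vs. traditional by counting the
        # characters unique to each script and seeing which class wins.
        simp = sum(1 for ch in text if ch in SIMPLIFIED_MARKERS)
        trad = sum(1 for ch in text if ch in TRADITIONAL_MARKERS)
        # Ties fall back to simplified arbitrarily in this sketch.
        return "Han-traditional" if trad > simp else "Han-simplified"

    return "Latin"
```

For example, `detect_script("今日は晴れです")` returns "Japanese" because of the kana, while `detect_script("这个国家")` returns "Han-simplified" because 这 and 国 hit the simplified marker set. A specified algorithm would need to pin down exactly these choices: which code-point ranges map to which buckets, which marker characters to use, and how ties are broken.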