Re: WebVTT feedback from Ian Hickson on 2012-06-05 (public-texttracks@w3.org from June 2012)

From: Ian Hickson <ian@hixie.ch>
Date: Tue, 5 Jun 2012 22:06:37 +0000 (UTC)
To: Philip Jägenstedt <philipj@opera.com>
cc: public-texttracks@w3.org
Message-ID: <Pine.LNX.4.64.1206052205050.378@ps20323.dreamhostps.com>

On Wed, 7 Dec 2011, Philip Jägenstedt wrote:
> On Tue, 06 Dec 2011 01:38:14 +0100, Ian Hickson <ian@hixie.ch> wrote:
> > On Sat, 3 Dec 2011, Philip Jägenstedt wrote:
> > > 
> > > We're going to be doing the same script detection heuristics that we 
> > > do on web pages. Differentiating between simplified Chinese, 
> > > traditional Chinese and Japanese isn't particularly hard.
> > 
> > Can we define these for interoperability, or are they proprietary? (I 
> > don't imagine people writing their own small WebVTT implementations 
> > are going to know how to do this if we don't have a spec.)
> 
> Yes, we'd love for script detection heuristics to be specified, both for 
> HTML and WebVTT.
> 
> I'm not an expert on this, but basically what we do is traverse the 
> input (as Unicode code points), count the number of hits for different 
> buckets of script families, and see which wins. For separating simplified 
> and traditional Chinese (and maybe Japanese), which overlap heavily, 
> I believe we look for common characters that are unique to each script 
> (like 国 or 國) and see which class of characters wins.
> 
> What would be the appropriate way to proceed?

I considered trying to spec this myself, but I don't have the bandwidth to 
take on something of that size at the moment. I think the best way to proceed 
would be for someone to write a specification that defines the algorithm 
that does the script detection, and then for me to update HTML and WebVTT 
to plug into that algorithm.
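
As a starting point for whoever takes this on, the counting heuristic Philip 
describes above could be sketched roughly as follows. This is only an 
illustration in Python, not Opera's implementation and not a proposed spec: 
the code point ranges, the marker character sets, the result labels and the 
names script_of and guess_script are all assumptions made up for the example.

```python
from collections import Counter

# Illustrative code point ranges only (an assumption for this sketch).
# A real table would be derived from the Unicode Script property data.
SCRIPT_RANGES = {
    "latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
    "cyrillic": [(0x0400, 0x04FF)],
    "kana": [(0x3040, 0x309F), (0x30A0, 0x30FF)],
    "han": [(0x4E00, 0x9FFF)],
    "hangul": [(0xAC00, 0xD7A3)],
}

# Tiny hand-picked marker sets (assumed, far from complete): common
# characters whose simplified and traditional forms differ.
SIMPLIFIED_MARKERS = set("国这们说会")
TRADITIONAL_MARKERS = set("國這們說會")


def script_of(ch):
    """Return the bucket name for one character, or None if unclassified."""
    cp = ord(ch)
    for script, ranges in SCRIPT_RANGES.items():
        for lo, hi in ranges:
            if lo <= cp <= hi:
                return script
    return None


def guess_script(text):
    """Count hits per script bucket and report which bucket wins."""
    counts = Counter()
    simplified = 0
    traditional = 0
    for ch in text:
        script = script_of(ch)
        if script is not None:
            counts[script] += 1
        if ch in SIMPLIFIED_MARKERS:
            simplified += 1
        elif ch in TRADITIONAL_MARKERS:
            traditional += 1

    if not counts:
        return "unknown"

    winner = counts.most_common(1)[0][0]
    if winner not in ("han", "kana"):
        return winner

    # Han-dominated text: kana is treated as a sign of Japanese, because
    # Japanese shares forms such as 国 with simplified Chinese.
    if counts["kana"] > 0:
        return "japanese"
    if traditional > simplified:
        return "traditional-chinese"
    if simplified > traditional:
        return "simplified-chinese"
    return "han-undetermined"
```

With these tables, guess_script("這是我們的國家") gives "traditional-chinese" 
and guess_script("这是我们的国家") gives "simplified-chinese". Text that is 
mostly Han or kana and contains any kana at all is reported as "japanese", 
since a character like 国 on its own cannot separate Japanese from simplified 
Chinese.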

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Tuesday, 5 June 2012 22:07:03 UTC