- From: Philip Jägenstedt <philipj@opera.com>
- Date: Wed, 07 Dec 2011 16:22:00 +0100
- To: public-texttracks@w3.org
On Tue, 06 Dec 2011 01:38:14 +0100, Ian Hickson <ian@hixie.ch> wrote:

> On Sat, 3 Dec 2011, Philip Jägenstedt wrote:
>>
>> We're going to be doing the same script detection heuristics that we do
>> on web pages. Differentiating between simplified Chinese, traditional
>> Chinese and Japanese isn't particularly hard.
>
> Can we define these for interoperability, or are they proprietary? (I
> don't imagine people writing their own small WebVTT implementations are
> going to know how to do this if we don't have a spec.)

Yes, we'd love for script detection heuristics to be specified, both for HTML and WebVTT.

I'm not an expert on this, but basically what we do is traverse the input (as Unicode code points), count the number of hits for different buckets of script families, and see which bucket wins. For separating simplified and traditional Chinese (and maybe Japanese), which have a lot of overlap, I believe we look for common characters that are unique to each script (like 国 or 國) and see which class of characters wins.

What would be the appropriate way to proceed?

-- 
Philip Jägenstedt
Core Developer
Opera Software
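A minimal sketch of the bucket-counting heuristic described above, in Python. The script ranges and the simplified/traditional marker characters below are illustrative assumptions, not Opera's actual tables or anything a spec has defined:

```python
# Illustrative sketch only: rough Unicode ranges and a handful of marker
# characters chosen for the example, not a specified or shipping algorithm.

def detect_script(text: str) -> str:
    """Count code points per rough script bucket and return the winner."""
    buckets = {"latin": 0, "kana": 0, "hangul": 0, "cjk": 0}
    for ch in text:
        if not ch.isalpha():                 # skip digits, punctuation, spaces
            continue
        cp = ord(ch)
        if cp < 0x0250:                      # Basic Latin through Latin Extended-B
            buckets["latin"] += 1
        elif 0x3040 <= cp <= 0x30FF:         # Hiragana + Katakana
            buckets["kana"] += 1
        elif 0xAC00 <= cp <= 0xD7AF:         # Hangul syllables
            buckets["hangul"] += 1
        elif 0x4E00 <= cp <= 0x9FFF:         # CJK Unified Ideographs
            buckets["cjk"] += 1
    return max(buckets, key=buckets.get)

# For text dominated by CJK ideographs, compare characters that differ between
# simplified and traditional Chinese (e.g. 国 vs 國) and see which class wins.
SIMPLIFIED_MARKERS = set("国们这来说时为学会对")
TRADITIONAL_MARKERS = set("國們這來說時為學會對")

def detect_chinese_variant(text: str) -> str:
    simplified = sum(1 for ch in text if ch in SIMPLIFIED_MARKERS)
    traditional = sum(1 for ch in text if ch in TRADITIONAL_MARKERS)
    if simplified > traditional:
        return "zh-Hans"
    if traditional > simplified:
        return "zh-Hant"
    return "und"  # undetermined
```

A real implementation would of course use much larger character tables and cover more scripts; the point is only to show the counting-and-comparing structure the mail describes.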
Received on Wednesday, 7 December 2011 15:22:36 UTC