- From: Philip Jägenstedt <philipj@opera.com>
- Date: Sat, 25 Jan 2014 23:47:19 +0700
- To: Brendan Long <B.Long@cablelabs.com>
- Cc: "public-texttracks@w3.org" <public-texttracks@w3.org>
On Sat, Jan 25, 2014 at 12:15 AM, Brendan Long <B.Long@cablelabs.com> wrote:
> On Fri, 2014-01-24 at 23:33 +0700, Philip Jägenstedt wrote:
>> Having looked at the original thread, I can only guess that you don't
>> want to involve scripts, since if you can rely on scripts it seems
>> like you could easily do what you're asking for. What is the reason
>> that you do not want to use scripts here?
>
> First, a philosophical reason: Requiring JavaScript to play a live video
> with captions seems like a huge hack. Technically, we could decode videos
> in JavaScript too, but that doesn't mean it's a good solution.
>
> Second, a practical reason: If we can produce a valid WebVTT document,
> then any web page can display it with a normal video tag. If we have to
> use JavaScript, then inevitably there will be several different ways of
> doing it, and any page that wants to use live captions from outside
> sources will need a list of JavaScript hacks to make them all work. Then,
> anytime a site that produces live captions changes its method, anything
> that depends on it will break until they update their JavaScript. It just
> seems like "live video" is a normal enough case that we shouldn't push
> all of this complexity on every site that plays them.

It's always difficult to decide which features warrant a declarative
solution and which should be left to scripts. I'm trying to understand
what the costs of either approach are.

It seems to me that when you do live streaming, you're going to be using
Media Source Extensions, which require rather a lot of JavaScript. In that
context, the script required to update the end times of cues when the next
cue comes in doesn't seem like much of a burden. In other words, the cost
of *not* solving this declaratively doesn't seem very high. (I've probably
misunderstood some part of the use case; in particular, "live captions
from outside sources" seems mysterious to me.)

So, what about the cost of solving this declaratively?

1. Is the special keyword NEXT for the end time the only new syntax that's
required?

2. When should the end time of a NEXTy cue be updated? Is it when a new
cue with a higher start time is parsed, or should e.g. a script modifying
the start time of an existing cue also do something?

3. Should the endTime IDL attribute actually be modified, or should it
simply be that a cue with end time NEXT is not considered active if there
are any cues with a later end time?

4. What happens when you have two cues with the same start time that both
have end time NEXT?

Depending on how this is supposed to work, it will add more or less
complexity to the spec and implementations.

Philip
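[Editor's note: the scripted alternative discussed above (closing the most recent open-ended cue when the next cue arrives) could look roughly like the sketch below. Plain objects stand in for VTTCue and an array stands in for a TextTrack; both are assumptions for illustration. On a real page you would construct VTTCue instances and call track.addCue().]

```javascript
// Sketch only: each incoming live cue starts open-ended, and the previous
// open cue is closed by setting its end time to the new cue's start time.
const OPEN_ENDED = Number.MAX_VALUE; // placeholder "until further notice" end time

function makeLiveCueAppender(track) {
  let openCue = null; // the most recent cue still awaiting its real end time
  return function appendCue(startTime, text) {
    if (openCue !== null && startTime > openCue.startTime) {
      openCue.endTime = startTime; // close the previous cue
    }
    const cue = { startTime, endTime: OPEN_ENDED, text };
    track.push(cue); // stand-in for track.addCue(cue) on a real TextTrack
    openCue = cue;
    return cue;
  };
}

// Usage: feed cues in as they arrive from the live stream.
const track = [];
const append = makeLiveCueAppender(track);
append(0, "First caption");
append(3, "Second caption");  // closes the first cue at t=3
append(7.5, "Third caption"); // closes the second cue at t=7.5
```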
Received on Saturday, 25 January 2014 16:47:48 UTC