Re: WebVTT

On Mon, Jun 17, 2013 at 4:46 PM, Sean Hayes <Sean.Hayes@microsoft.com> wrote:
>
>>> This is a point of divergence, TTML cues have no internal time
>>> structure, everything in one cue happens "in the now", any observable
>>> change in the on screen display happens from one cue to the next, and
>>> there is an event for that.
>>
>>How do you do karaoke then? Is every new word a new cue?
>
> So in TTML you don't write the cues directly, you effectively write a specification that gets turned into cues, somewhat like unrolling a for loop into inline assembler. For example you might write something like:
>
> <p begin="11.3s" end="13.4s">
>   <span><set begin="0.1s" style="past"/>Somewhere</span>
>   <span><set begin="0.6s" style="past"/>over</span>
>   <span><set begin="1.1s" style="past"/>the</span>
>   <span><set begin="1.7s" style="past"/>rainbow</span>
> </p>
>
> What I'm thinking about is how to represent this as serialised cues. Currently it will get turned into 4 cues, each of which contains the whole lyric but styled differently. It may be that the "wavefront" model is the right one to use here, and in some cases it would be possible to use it; however, the TTML model can express things that go beyond the wavefront model, so I'm not sure whether this is going to work in general.

Is the extra functionality useful for HTML? If not, it sounds like the
"wavefront" model would be sufficient.


>>> In contrast WebVTT has effectively a wavefront that passes through it
>>> dividing a cue into three regions past, present and future over time,
>
>>Only for the cues where text is successively displayed or changed. The text is already positioned and rendered in this case,
>>so only the style parameters are updated during this "wavefront pass-through".
>
> Right, although I can see authors using this to create follow-on captions:
>
> 00:00:00.000 --> 00:00:08.000
> <v john>So it was you I saw in the shrubbery!</v><00:00:06.000>
> <v mary>                   -- Yes.</v>
>
> So although "               --Yes" might already be in position, it might not be styled to appear until after 6s to model the pause.

Yes, that's a valid use.


> Paint-on captions in TTML work similarly to the karaoke example, but we don't need to use style animation in this case, as the timing will control presence in the render tree:
>
> <p begin="11.3s" end="13.4s">
>   <span begin="0.1s">Somewhere</span>
>   <span begin="0.6s">over</span>
>   <span begin="1.1s">the</span>
>   <span begin="1.7s">rainbow</span>
> </p>

So you represent karaoke differently from paint-on captions?

Also, how do you deal with the rendering problem when the future text
is not in the render tree? For example, say you render the first word
of a centered cue. If the other words are not yet in the render tree,
that first word is rendered in the middle. Then, as the remaining
words are added one by one, the line has to be re-balanced and the
first word moves further and further to the left.

>>(yes, I need to add an example: just filed a bug
>>https://www.w3.org/Bugs/Public/show_bug.cgi?id=22389)
>
> Yes, I think this needs a lot more explanation, as well as noting that it's illegal syntax per CSS2&3 to have a selector in a pseudo argument; so you will have to document the willful violation or get this put into CSS 4 selectors.

Oh!? Need to investigate...


>>Note that 'transform' is not an allowed CSS property on cues:
>>http://dev.w3.org/html5/webvtt/#the-cue-pseudo-element
>
> Right, but it is allowed on cue selectors with an argument, which is what I am using; I'm scaling the WebVTT Internal Node Object for the c tags.
>
> "The '::cue(selector)' pseudo-element with an argument must have an argument that consists of a group of selectors.
> It matches any WebVTT Internal Node Object constructed for the matched element that also matches the given group of selectors"

And further down:
The following properties apply to the '::cue()' pseudo-element with an argument:
'color'
'opacity'
'visibility'
'text-decoration'
'text-outline'
'text-shadow'
the properties corresponding to the 'background' shorthand
the properties corresponding to the 'outline' shorthand
properties relating to the transition and animation features


Thus it allows "properties relating to the transition and animation
features", but not transforms, though it would certainly be possible
to add those.


>>Nevertheless - I don't see a problem with this - it's perfectly legal markup. If that's how the JS dev wants it, then that's fine.
>>If they want events raised, they have to put the images into separate cues.
>>
>>But I still wonder what they would need the event raising for? The timing of the individual slides has already been set up -
>>there is no need to interrupt this flow in JS and do something.
>
> No, possibly not. As I said when we built the track model I don't think JS should really be used except for metadata tracks, but they are there. What I'm asking is whether they need to be preserved in translation.

Right. I don't think so.


>>>>Why would a JS dev need to be alerted of CSS changes? What's the use case?
>>>
>>> Well I'm not sure I've ever been sold on the use case for JS having
>>> access into the captions at all actually, except maybe for metadata
>>> tracks, but if they do, one would assume you would want to provide
>>> full access to the semantics of the format; of course you don't have
>>> to, that's really up to you and the community that sees this as a
>>> valuable format.  One specific use case might be putting the visible
>>> text into an ARIA live region at the right time.
>>
>>That's a use case. Right now, you would hook that up to the video timeline and not a raised event from the text track.
>>Also, because synthesized speech is successive anyway, you would likely just hand the full text from the cue to the
>>ARIA live region in one go and leave it to the natural passing of time to render the text successively.
>
> Well, how would you know what times to use to hook up to the video timeline without searching the getCueAsHTML() fragment for the time stamps?

Yes, you need to also look at the cue's content.

> How would the speech engine know to put a pause in before mary's response in the above example?

It wouldn't - there is no markup for pauses yet. And trying to control
voicing through the cue-internal time stamps is bound to fail because
we never know the speed of the speech synthesis voice. We certainly
have some work to do to render cues of kind=descriptions.


>>Don't get me wrong: we'd want to introduce an event if there is a large developer need. But I haven't come across one yet.
>>Also, it can still be done in the future.
>
> Well yes, anything can potentially be done in the future provided it's not violating the laws of physics; the point is I am trying to map the models now. So I'm asking the question: does this translation cause a need or not?

s/physics/compatibility/

If both TTML and WebVTT follow the same model, it will be easier to
introduce a consistent solution in the future should it be required.
In that case it would actually be really easy to just introduce a
cueTimestamp event.


>>> From my perspective though, it means that if I am translating the
>>> above example, I cannot use tools exposed by a browser, I'm going to
>>> have to grub around in the internal structure of the displayed HTML
>>> and try and find the timestamps and set up my own handler on the video
>>> timeline events and then reverse out the CSS on the HTML fragments.
>>> Makes life just a bit harder is all.
>
>>Yes it does. Question is really: is that any different for TTML and how do you manage that?
>
> Precisely. Currently we get an event for every visual change in TTML, because every visual change goes in a new cue.

There are some problems with that approach for rendering, which I've
tried to point out above. So, as you define your cue format for TTML,
it may be better to move such paint-on captions all into one cue.
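
A rough sketch of what that could look like, with illustrative
timings and an invented second line:

00:00:11.300 --> 00:00:15.000
<00:00:11.400>Somewhere over the rainbow
<00:00:13.500>way up high

Since all of the text is in the one cue, the full block is laid out
from the start and the not-yet-reached part just matches the
':future' pseudo-class; hiding it (e.g. via 'visibility', which is
in the allowed property list quoted above) keeps its space in the
layout, so nothing has to re-balance as the words appear.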


> If we are translating to VTT, should we attempt to use the wavefront model when it's possible, or perhaps even generalise that model so that cues can express more complex inner timings?

Yes, I would suggest using the wavefront model. How would you
generalise the inner timings model and for what use case?

> If the events aren't useful or used, then attempting to coalesce cues makes some sense; if authors would use and expect them, then how do we recreate them in the VTT?

Let's cross the bridge of having events when we see a user need.


> I don't have a strong opinion either way, but I would like to have at least thought through the options and take a decision deliberately.

Sure! See if my reasoning makes sense to you.

Cheers,
Silvia.
