RE: WebVTT

>So you represent Karaoke differently from paint-on captions?
Well, it depends on what you want it to look like; it would be possible to represent them the same way if you chose to style the display property manually for paint-on.
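For instance, something like this (a sketch only; I'm animating tts:display here the way the karaoke example animates its "past" style, and assuming set can animate tts:display):

<p begin="11.3s" end="13.4s">
  <span tts:display="none"><set begin="0.1s" tts:display="auto"/>Somewhere</span>
  <span tts:display="none"><set begin="0.6s" tts:display="auto"/>over</span>
  <span tts:display="none"><set begin="1.1s" tts:display="auto"/>the</span>
  <span tts:display="none"><set begin="1.7s" tts:display="auto"/>rainbow</span>
</p>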

>Also, how do you deal with the rendering problem when the future text is not in the render tree? For example, you render the first word in a centered cue. If the other words are not yet present in the render tree, the first word is rendered in the middle. Then successively the words are added, which means the text has to be re-balanced and the first word moves successively to the left.

There is no rendering problem here if you model it as we do. Each cue is simply a static piece of HTML; you take one out and replace it with the next. Your observation is, however, why paint-on styling is left-justified. But yes, it is possible to do text as you describe in TTML, along with a myriad of other possibilities, which is precisely why it's not possible to map general TTML to the wavefront model. To generalise it, I think, would be to have a mini TTML renderer in each cue, which then calls into question the cue model at all, so I don't think we'll go there.

>Is the extra functionality useful for HTML? If not, it sounds like the "wavefront" model would be sufficient.
That's not really the right question, unfortunately. The TTML author has the ability to write it, so the question is how to model it. It seems we aren't worried about the events, so I guess we'll just accept that converting to VTT from TTML will generate fewer events.

>properties relating to the transition and animation features
OK, yes, I tend to think of animations and transforms as one thing, but indeed they are different specs. It might be helpful to add text indicating precisely which properties can go on the animations.
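I'd expect the intent is the 'transition-*' and 'animation-*' properties, so that something like this would be valid (a sketch only):

::cue(.past) {
  color: gray;
  transition: color 250ms linear;
}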

>Yes, I would suggest using the wavefront model. How would you generalise the inner timings model and for what use case?

Well, that's a very good question; I'm not sure whether I would. As I said, you would essentially end up with a mini TTML in a cue, so I think I will leave it as it is.

>It wouldn't - there is no markup for pauses yet. And trying to control voicing through the cue-internal time stamps is bound to fail because we never know the speed of the speech synthesis voice. We certainly have some work to do to render cues of kind=descriptions.

Yes I agree there's interesting work to be done there.

Cheers,
Sean

-----Original Message-----
From: Silvia Pfeiffer [mailto:silviapfeiffer1@gmail.com] 
Sent: 17 June 2013 08:48
To: Sean Hayes
Cc: John Birch; public-tt@w3.org
Subject: Re: WebVTT

On Mon, Jun 17, 2013 at 4:46 PM, Sean Hayes <Sean.Hayes@microsoft.com> wrote:
>
>>> This is a point of divergence, TTML cues have no internal time 
>>> structure, everything in one cue happens "in the now", any 
>>> observable change in the on screen display happens from one cue to 
>>> the next, and there is an event for that.
>>
>>How do you do karaoke then? Is every new word a new cue?
>
> So in TTML you don't write the cues directly; you effectively write a specification that gets turned into cues, somewhat like unrolling a for loop into inline assembler. For example, you might write something like:
>
> <p begin="11.3s" end="13.4s">
>   <span><set begin="0.1s" style="past"/>Somewhere</span>
>   <span><set begin="0.6s" style="past"/>over</span>
>   <span><set begin="1.1s" style="past"/>the</span>
>   <span><set begin="1.7s" style="past"/>rainbow</span>
> </p>
>
> What I'm thinking about is how to represent this as serialised cues. Currently it will get turned into 4 cues, each of which contains the whole lyric but styled differently. It may be that the "wavefront" model is the right one to use here, and in some cases it would be possible to use it; however, the TTML model can express things that go beyond the wavefront model, so I'm not sure whether this is going to work in general.
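> To illustrate with WebVTT syntax (a sketch, times approximate), the unrolled cues would look something like:
>
> 00:00:11.400 --> 00:00:11.900
> <c.past>Somewhere</c> over the rainbow
>
> 00:00:11.900 --> 00:00:12.400
> <c.past>Somewhere over</c> the rainbow
>
> ...and so on for each highlight change, each cue carrying the whole lyric.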

Is the extra functionality useful for HTML? If not, it sounds like the "wavefront" model would be sufficient.


>>> In contrast WebVTT has effectively a wavefront that passes through 
>>> it dividing a cue into three regions past, present and future over 
>>> time,
>
>>Only for the cues where text is successively displayed or changed. The 
>>text is already positioned and rendered in this case, so only the style parameters are updated during this "wavefront pass-through".
>
> Right, although I can see authors using this to create follow-on captions:
>
> 00:00:00.000 --> 00:00:08.000
> <v john>So it was you I saw in the shrubbery!</v><00:00:06.000>
> <v mary>                   -- Yes.</v>
>
> So although "               --Yes" might already be in position, it might not be styled to appear until after 6s to model the pause.

Yes, that's a valid use.


> Paint-on captions in TTML work similarly to the karaoke example, but we don't need to use style animation in this case, as the timing will control presence in the render tree:
>
> <p begin="11.3s" end="13.4s">
>   <span begin="0.1s">Somewhere</span>
>   <span begin="0.6s">over</span>
>   <span begin="1.1s">the</span>
>   <span begin="1.7s">rainbow</span>
> </p>

So you represent Karaoke differently from paint-on captions?

Also, how do you deal with the rendering problem when the future text is not in the render tree? For example, you render the first word in a centered cue. If the other words are not yet present in the render tree, the first word is rendered in the middle. Then successively the words are added, which means the text has to be re-balanced and the first word moves successively to the left.

>>(yes, I need to add an example: just filed a bug
>>https://www.w3.org/Bugs/Public/show_bug.cgi?id=22389)
>
> Yes, I think this needs a lot more explanation, as well as noting that it's illegal syntax per CSS 2 and 3 to have a selector as a pseudo-element argument; so you will have to document the wilful violation or get this put into CSS Selectors Level 4.
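> To be concrete, a WebVTT style rule such as
>
> ::cue(c.past) { color: gray; }
>
> puts a full selector inside the pseudo-element's argument, and that form is what the CSS 2/3 grammar doesn't allow.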

Oh!? Need to investigate...


>>Note that 'transform' is not an allowed CSS property on cues:
>>http://dev.w3.org/html5/webvtt/#the-cue-pseudo-element
>
> Right, but it is allowed on cue selectors with an argument, which is what I am using; I'm scaling the WebVTT Internal Node Object for the c tags.
>
> "The '::cue(selector)' pseudo-element with an argument must have an argument that consists of a group of selectors.
> It matches any WebVTT Internal Node Object constructed for the matched element that also matches the given group of selectors"

And further down:
The following properties apply to the '::cue()' pseudo-element with an argument:
'color'
'opacity'
'visibility'
'text-decoration'
'text-outline'
'text-shadow'
the properties corresponding to the 'background' shorthand
the properties corresponding to the 'outline' shorthand
properties relating to the transition and animation features


Thus it allows "properties relating to the transition and animation features", but not transforms, though it would certainly be possible.
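For illustration, the kind of rule you're describing would be something like this (hypothetical, since 'transform' is not in the allowed list today):

::cue(c) {
  transform: scale(1.5);
}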


>>Nevertheless - I don't see a problem with this - it's perfectly legal markup. If that's how the JS dev wants it, then that's fine.
>>If they want events raised, they have to put the images into separate cues.
>>
>>But I still wonder what they would need the event raising for? The 
>>timing of the individual slides has already been set up - there is no need to interrupt this flow in JS and do something.
>
> No, possibly not. As I said when we built the track model, I don't think JS should really be used except for metadata tracks, but the events are there. What I'm asking is whether they need to be preserved in translation.

Right. I don't think so.


>>>>Why would a JS dev need to be alerted of CSS changes? What's the use case?
>>>
>>> Well I'm not sure I've ever been sold on the use case for JS having 
>>> access into the captions at all actually, except maybe for metadata 
>>> tracks, but if they do, one would assume you would want to provide 
>>> full access to the semantics of the format; of course you don't have 
>>> to, that's really up to you and the community that sees this as a 
>>> valuable format.  One specific use case might be putting the visible 
>>> text into an ARIA live region at the right time.
>>
>>That's a use case. Right now, you would hook that up to the video timeline and not a raised event from the text track.
>>Also, because synthesized speech is successive anyway, you would 
>>likely just hand the full text from the cue to the ARIA live region in one go and leave it to the natural passing of time to render the text successively.
>
> Well, how would you know what times to use to hook up to the video timeline without searching the getCueAsHTML() fragment for the time stamps?

Yes, you need to also look at the cue's content.
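Something like this sketch, assuming cue timestamps come back as ProcessingInstruction nodes with target "timestamp" in the returned fragment:

var frag = cue.getCueAsHTML();
var walker = document.createTreeWalker(frag, NodeFilter.SHOW_PROCESSING_INSTRUCTION);
var node;
while ((node = walker.nextNode())) {
  if (node.target === 'timestamp') {
    // node.data is the timestamp string, e.g. "00:00:06.000"
  }
}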

> How would the speech engine know to put a pause in before mary's response in the above example?

It wouldn't - there is no markup for pauses yet. And trying to control voicing through the cue-internal time stamps is bound to fail because we never know the speed of the speech synthesis voice. We certainly have some work to do to render cues of kind=descriptions.


>>Don't get me wrong: we'd want to introduce an event if there is a large developer need. But I haven't come across one yet.
>>Also, it can still be done in the future.
>
> Well, yes, anything can potentially be done in the future provided it's not violating the laws of physics; the point is I am trying to map the models now. So I'm asking the question: does this translation cause a need or not?

s/physics/compatibility/

If both the TTML and the WebVTT follow the same model, it will be easier to introduce a consistent solution in the future if one is required. In this case it would actually be really easy to just introduce a cueTimestamp event.
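Purely hypothetically, it could be as simple as this (no such event exists today; the name is made up):

cue.addEventListener('cuetimestamp', function (evt) {
  // would fire as playback crosses each cue-internal timestamp
});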


>>> From my perspective though, it means that if I am translating the 
>>> above example, I cannot use tools exposed by a browser, I'm going to 
>>> have to grub around in the internal structure of the displayed HTML 
>>> and try and find the timestamps and set up my own handler on the 
>>> video timeline events and then reverse out the CSS on the HTML fragments.
>>> Makes life just a bit harder is all.
>
>>Yes it does. Question is really: is that any different for TTML and how do you manage that?
>
> Precisely. Currently we get an event for every visual change in TTML, because every visual change goes in a new cue.

There are some problems with that approach for rendering, which I've tried to point to above. So, as you define your cue format for TTML, it may be better to move such paint-on captions all into one cue.


> If we are translating to VTT, should we attempt to use the wavefront model when it's possible, or perhaps even generalise that model so that cues can express more complex inner timings?

Yes, I would suggest using the wavefront model. How would you generalise the inner timings model and for what use case?

> If the events aren't useful or used, then attempting to coalesce cues makes some sense; if authors would use and expect them, then how do we recreate them in the VTT?

Let's cross the bridge of having events when we see a user need.


> I don't have a strong opinion either way, but I would like to have at least thought through the options and take a decision deliberately.

Sure! See if my reasoning makes sense to you.

Cheers,
Silvia.
