RE: WebVTT

>> This is a point of divergence, TTML cues have no internal time 
>> structure, everything in one cue happens "in the now", any observable 
>> change in the on screen display happens from one cue to the next, and 
>> there is an event for that.
>
>How do you do karaoke then? Is every new word a new cue?

So in TTML you don't write the cues directly, you effectively write a specification that gets turned into cues, somewhat like unrolling a for loop into inline assembler. For example you might write something like:

<p begin="11.3s" end="13.4s">
  <span><set begin="0.1s" style="past"/>Somewhere</span>
  <span><set begin="0.6s" style="past"/>over</span>
  <span><set begin="1.1s" style="past"/>the</span>
  <span><set begin="1.7s" style="past"/>rainbow</span>
</p>

What I'm thinking about is how to represent this as serialised cues. Currently it gets turned into four cues, each of which contains the whole lyric but styled differently. It may be that the "wavefront" model is the right one to use here, and in some cases it would be possible to use it; however, the TTML model can express things that go beyond the wavefront model, so I'm not sure whether this is going to work in general.
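For the example above, and assuming the highlight is carried on a hypothetical "past" class, one plausible unrolling (ignoring the initial 0.1s before any word is marked) would be:

00:00:11.400 --> 00:00:11.900
<c.past>Somewhere</c> over the rainbow

00:00:11.900 --> 00:00:12.400
<c.past>Somewhere over</c> the rainbow

00:00:12.400 --> 00:00:13.000
<c.past>Somewhere over the</c> rainbow

00:00:13.000 --> 00:00:13.400
<c.past>Somewhere over the rainbow</c>

Each cue repeats the whole lyric; only the extent of the styled region changes from one cue to the next.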

>> In contrast WebVTT has effectively a wavefront that passes through it 
>> dividing a cue into three regions past, present and future over time,

>Only for the cues where text is successively displayed or changed. The text is already positioned and rendered in this case, 
>so only the style parameters are updated during this "wavefront pass-through".

Right, although I can see authors using this to create follow-on captions:

00:00:00.000 --> 00:00:08.000
<v john>So it was you I saw in the shrubbery!</v><00:00:06.000>
<v mary>                   -- Yes.</v>

So although "-- Yes." might already be in position, it might not be styled to appear until after 6s, to model the pause.
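For instance, the pause could be modelled by hiding text the wavefront has not yet reached; a sketch (visibility is among the properties the spec allows on the cue pseudo-element):

::cue(:future) {
            visibility: hidden;
}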

>> therefore observable changes can happen within a cue, and there are no 
>> events for those changes.
>>
>> In order to convert from one to the other we have to ascertain whether 
>> this is significant, and it may not be. But in my opinion a caption 
>> format expresses its semantics through visual presentation over time, 
>> (well at least today it does - we have discussed in the past a higher 
>> level semantic encoding, but that's never really got anywhere). Thus 
>> time based semantic information is being conveyed within WebVTT cues 
>> which appears to be hidden;

>It's not hidden - for the in-cue time stamps it's just a bit more difficult. 
>You can use the timeupdate event of the <video> element and then compare the currentTime with the time 
>stamps in the cue to find out where the current display position is.

Yes, that's what I'm saying; you also have to reverse engineer the display styles to see when the text is visible.
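As a sketch of the comparison step, assuming the in-cue timestamp offsets have already been extracted into a sorted array (wavefrontIndex is a hypothetical helper, not part of any API):

```javascript
// Return how many in-cue timestamps the playhead has passed, i.e. the
// position of the "wavefront" within the cue at the given currentTime.
function wavefrontIndex(timestamps, currentTime) {
  let i = 0;
  while (i < timestamps.length && timestamps[i] <= currentTime) i++;
  return i;
}
```

Hooked up to the video's timeupdate event this tells you which spans are currently :past; knowing whether they are actually visible still means inspecting the computed styles.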

>> this could be a problem if I am using AT to follow the captions for example.
>> Anyway the question becomes how do we translate this.

>OK, if this is the key issue, then we need to understand how this works in TTML. How do you mark up 
>paint-in captions? And what do you need the events for?

I don't need the events per se; as I say, I'm not approaching this as a user. But they are defined in the HTML5 track model, so what I'm trying to ascertain is whether we need to preserve the same set of events when translating.

Paint-in captions in TTML work similarly to the karaoke example, but we don't need to use style animation in this case, as the timing controls presence in the render tree:

<p begin="11.3s" end="13.4s">
  <span begin="0.1s">Somewhere</span>
  <span begin="0.6s">over</span>
  <span begin="1.1s">the</span>
  <span begin="1.7s">rainbow</span>
</p>
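The natural WebVTT counterpart, assuming inline cue timestamps plus a rule that hides :future text, would be something like:

00:00:11.300 --> 00:00:13.400
<00:00:11.400>Somewhere <00:00:11.900>over <00:00:12.400>the <00:00:13.000>rainbow

with a stylesheet rule such as ::cue(:future) { visibility: hidden; } doing the painting-in.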

>>>Also, these CSS classes don't necessarily make text invisible - they 
>>>may  just change its color or font weight (think of karaoke).
>>
>> I can see what they can do, and amongst the things they can do is make 
>> text visible/invisible. Part of the exercise here is to, as much as 
>> possible, predict what a user might do and come to rely on. One thing 
>> I have learned over the years is never to underestimate the creativity 
>> of users to completely blindside you with what they do with the tools you create.
>>
>> For example, it would be perfectly legal, given the current text, for 
>> a user have just a single cue and rely only on CSS to convey a slideshow of images.
>>
>> 00:00:00.000 --> 00:10:00.000
>>
>> <c.a>&nbsp;</c><00:00:01.000><c.b>&nbsp;</c><00:00:02.000><c.c>&nbsp;</c><00:00:03.000><c.d>&nbsp;</c>...
>>
>> Where
>> ::cue(c)::past {
>>             transform:scale(0,0);
>> }

>This won't do anything - it should be:
>::cue(:past) {...}

Yes, I spotted the typo and fixed it in my follow-up, although to be honest, since this is not very well documented, I could plausibly read the trailing :past as being mapped over the selector in the argument.

>(yes, I need to add an example: just filed a bug
>https://www.w3.org/Bugs/Public/show_bug.cgi?id=22389)

Yes, I think this needs a lot more explanation, as well as a note that it's illegal syntax per CSS 2 and 3 to have a selector in a pseudo-class argument; so you will have to document the willful violation or get this put into CSS Selectors Level 4.

>> ::cue(c)::future {
>>             transform:scale(0,0);
>> }
>
>Same here: needs to be:
>::cue(:future) {...}

Yup


>Note that 'transform' is not an allowed CSS property on cues:
>http://dev.w3.org/html5/webvtt/#the-cue-pseudo-element

Right, but it is allowed on cue selectors with an argument, which is what I am using; I'm scaling the WebVTT Internal Node Object for the c tags.

"The '::cue(selector)' pseudo-element with an argument must have an argument that consists of a group of selectors. 
It matches any WebVTT Internal Node Object constructed for the matched element that also matches the given group of selectors"
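So the form I was reaching for, assuming a pseudo-class is permitted inside the argument's group of selectors, would be something like:

::cue(c:past) {
            transform: scale(0,0);
}

which scales away the internal node objects for the c tags once the wavefront has passed them.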

>Nevertheless - I don't see a problem with this - it's perfectly legal markup. If that's how the JS dev wants it, then that's fine.
>If they want events raised, they have to put the images into separate cues.
>
>But I still wonder what they would need the event raising for? The timing of the individual slides has already been set up - 
>there is no need to interrupt this flow in JS and do something.

No, possibly not. As I said when we built the track model, I don't think JS should really be used except for metadata tracks, but the events are there. What I'm asking is whether they need to be preserved in translation.

>>>Why would a JS dev need to be alerted of CSS changes? What's the use case?
>>
>> Well I'm not sure I've ever been sold on the use case for JS having 
>> access into the captions at all actually, except maybe for metadata 
>> tracks, but if they do, one would assume you would want to provide 
>> full access to the semantics of the format; of course you don't have 
>> to, that's really up to you and the community that sees this as a 
>> valuable format.  One specific use case might be putting the visible 
>> text into an ARIA live region at the right time.
>
>That's a use case. Right now, you would hook that up to the video timeline and not a raised event from the text track. 
>Also, because synthesized speech is successive anyway, you would likely just hand the full text from the cue to the 
>ARIA live region in one go and leave it to the natural passing of time to render the text successively.

Well, how would you know what times to use to hook up to the video timeline without searching the getCueAsHTML() fragment for the time stamps? How would the speech engine know to put a pause in before Mary's response in the above example?
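Even once the timestamps are dug out of the fragment, they arrive as strings; a sketch of turning a WebVTT timestamp string into seconds (parseVttTimestamp is a hypothetical helper):

```javascript
// Parse a WebVTT timestamp ("hh:mm:ss.mmm" or "mm:ss.mmm") into seconds.
function parseVttTimestamp(ts) {
  const parts = ts.split(":");
  const [secs, millis] = parts.pop().split(".");
  let seconds = parseInt(secs, 10) + parseInt(millis, 10) / 1000;
  let mult = 60;
  while (parts.length) {
    seconds += parseInt(parts.pop(), 10) * mult;
    mult *= 60;
  }
  return seconds;
}
```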

>Don't get me wrong: we'd want to introduce an event if there is a large developer need. But I haven't come across one yet. 
>Also, it can still be done in the future.

Well, yes, anything can potentially be done in the future provided it's not violating the laws of physics; the point is that I am trying to map the models now. So I'm asking the question: does this translation create a need or not?

>> From my perspective though, it means that if I am translating the 
>> above example, I cannot use tools exposed by a browser, I'm going to 
>> have to grub around in the internal structure of the displayed HTML 
>> and try and find the timestamps and set up my own handler on the video 
>> timeline events and then reverse out the CSS on the HTML fragments. 
>> Makes life just a bit harder is all.

>Yes it does. Question is really: is that any different for TTML and how do you manage that?

Precisely. Currently we get an event for every visual change in TTML, because every visual change goes in a new cue. 
If we are translating to VTT, should we attempt to use the wavefront model where possible, or perhaps even generalise that model so that cues can express more complex inner timings? If the events aren't useful or used, then attempting to coalesce cues makes some sense; if authors would use and expect them, then how do we recreate them in the VTT?
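If we did decide to recreate the events, the obvious fallback is to synthesise them from timeupdate: on each tick, report which in-cue timestamps were crossed since the previous tick (crossedTimestamps is a hypothetical helper, not part of any API):

```javascript
// Return the in-cue timestamps crossed between two successive
// timeupdate ticks, in order; each one would correspond to a
// synthetic "in-cue change" event.
function crossedTimestamps(timestamps, lastTime, currentTime) {
  return timestamps.filter(t => lastTime < t && t <= currentTime);
}
```

The granularity is only as good as timeupdate's (which can be as coarse as 250ms between ticks), which is part of why I'd rather settle the question at the model level.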

I don't have a strong opinion either way, but I would like to have at least thought through the options and taken a decision deliberately.

Cheers,
Sean.

Received on Monday, 17 June 2013 06:47:10 UTC