Re: WebVTT

On Sun, Jun 16, 2013 at 10:03 PM, Sean Hayes <Sean.Hayes@microsoft.com> wrote:
>> Oh right, you misunderstand how :past and :future are applied. They only
>> work if you have provided time stamps in your markup, e.g. <00:00:00.000>.
>> And time matching happens according to the video, so video.currentTime
>> provides the JS dev the current time, and the timeupdate event provides
>> regular hooks into that playback loop.
>
> Actually I think I understand pretty well how these are applied. I think you
> maybe misunderstand what I’m trying to do here.

Sure, that's possible, too.

> I’m not really interested in
> WebVTT as a user might be. My purpose here is to determine exactly what
> WebVTT is capable of, which involves digging into all the corner cases, and
> to what extent a full fidelity translation is possible between the two
> formats, where they share a common model and where they diverge.
>
> This is a point of divergence, TTML cues have no internal time structure,
> everything in one cue happens “in the now”, any observable change in the on
> screen display happens from one cue to the next, and there is an event for
> that.

How do you do karaoke then? Is every new word a new cue?

> In contrast WebVTT has effectively a wavefront that passes through it
> dividing a cue into three regions past, present and future over time,

Only for the cues where text is successively displayed or changed. The
text is already positioned and rendered in this case, so only the
style parameters are updated during this "wavefront pass-through".
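For concreteness, a karaoke-style cue with in-cue time stamps (a made-up example, not from this thread) might look like:

```
00:00:05.000 --> 00:00:10.000
One <00:00:06.000>word <00:00:07.000>at <00:00:08.000>a <00:00:09.000>time
```

At currentTime 6.5s, "One" matches :past, "word" is the current segment, and "at a time" matches :future - the "wavefront" is just the set of time stamps already passed.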

> therefore observable changes can happen within a cue, and there are no
> events for those changes.
>
> In order to convert from one to the other we have to ascertain whether this
> is significant, and it may not be. But in my opinion a caption format
> expresses its semantics through visual presentation over time, (well at
> least today it does – we have discussed in the past a higher level semantic
> encoding, but that’s never really got anywhere). Thus time based semantic
> information is being conveyed within WebVTT cues which appears to be hidden;

It's not hidden - for the in-cue time stamps it's just a bit more
difficult to get at. You can use the timeupdate event of the <video>
element and then compare currentTime with the time stamps in the cue to
find out where the current display position is.
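As a rough sketch of that approach (the function name and sample text below are mine, not part of any API):

```javascript
// Sketch only: count how many in-cue time stamps lie at or before the
// video's currentTime. The time stamp syntax is WebVTT's; the function
// name and the sample text are made up for illustration.
function countPastTimestamps(cueText, currentTime) {
  const re = /<(\d{2}):(\d{2}):(\d{2})\.(\d{3})>/g;
  let count = 0;
  let m;
  while ((m = re.exec(cueText)) !== null) {
    const t = (+m[1]) * 3600 + (+m[2]) * 60 + (+m[3]) + (+m[4]) / 1000;
    if (t <= currentTime) count += 1; // this time stamp is already :past
  }
  return count;
}

// In a page you would drive this from the video's timeupdate event, e.g.:
//   video.addEventListener('timeupdate', () => {
//     for (const cue of track.activeCues) {
//       const pos = countPastTimestamps(cue.text, video.currentTime);
//       // pos = index of the segment currently being "painted in"
//     }
//   });

const sample = 'One <00:00:06.000>word <00:00:07.000>at <00:00:08.000>a time';
console.log(countPastTimestamps(sample, 6.5)); // → 1
```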

> this could be a problem if I am using AT to follow the captions for example.
> Anyway the question becomes how do we translate this.

OK, if this is the key issue, then we need to understand how this
works in TTML. How do you mark up paint-in captions? And what do you
need the events for?


>> Also, these CSS classes don't necessarily make text invisible - they may
>> just change its color or font weight (think of karaoke).
>
> I can see what they can do, and amongst the things they can do is make text
> visible/invisible. Part of the exercise here is to, as much as possible,
> predict what a user might do and come to rely on. One thing I have learned
> over the years is never to underestimate the creativity of users to
> completely blindside you with what they do with the tools you create.
>
> For example, it would be perfectly legal, given the current text, for a user
> to have just a single cue and rely only on CSS to convey a slideshow of images.
>
> 00:00:00.000 --> 00:10:00.000
>
> <c.a>&nbsp;</c><00:00:01.000><c.b>&nbsp;</c><00:00:02.000><c.c>&nbsp;</c><00:00:03.000><c.d>&nbsp;</c>…
>
>
>
> Where
>
>
>
> ::cue(c)::past {
>
>             transform:scale(0,0);
>
> }

This won't do anything - it should be:
::cue(:past) {...}

(yes, I need to add an example: just filed a bug
https://www.w3.org/Bugs/Public/show_bug.cgi?id=22389)


> ::cue(c)::future {
>
>             transform:scale(0,0);
>
> }

Same here: needs to be:
::cue(:future) {...}
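For instance, a corrected karaoke-style rule pair (my sketch, using only properties the WebVTT spec allows on cues, such as 'color') would look like:

```css
/* text the wavefront has already passed: highlighted */
::cue(:past) {
  color: yellow;
}

/* text still to come: dimmed */
::cue(:future) {
  color: gray;
}
```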

> ::cue(c.a) {
>
>             background-size:contain;
>
>             background-image: url(A.jpg);
>
>             transform:scale(60,60);
>
>             transform-origin:0% 0%;
>
>             background-repeat:no-repeat;
>
> }
>
> ::cue(c.b) {
>
>             background-size:contain;
>
>             background-image: url(B.jpg);
>
>             transform:scale(60,60);
>
>             transform-origin:0% 0%;
>
>             background-repeat:no-repeat;
>
> }
>
> ::cue(c.c) {
>
>             background-size:contain;
>
>             background-image: url(C.jpg);
>
>             transform:scale(60,60);
>
>             transform-origin:0% 0%;
>
>             background-repeat:no-repeat;
>
> }

Note that 'transform' is not an allowed CSS property on cues:
http://dev.w3.org/html5/webvtt/#the-cue-pseudo-element

Nevertheless - I don't see a problem with this - it's perfectly legal
markup. If that's how the JS dev wants it, then that's fine.
If they want events raised, they have to put the images into separate cues.

But I still wonder what they would need the events for: the timing of
the individual slides has already been set up, so there is no need to
interrupt this flow in JS and do something.


>> Why would a JS dev need to be alerted of CSS changes? What's the use case?
>
> Well I’m not sure I’ve ever been sold on the use case for JS having access
> into the captions at all actually, except maybe for metadata tracks, but if
> they do, one would assume you would want to provide full access to the
> semantics of the format; of course you don’t have to, that’s really up to
> you and the community that sees this as a valuable format.  One specific use
> case might be putting the visible text into an ARIA live region at the right
> time.

That's a use case. Right now, you would hook that up to the video
timeline and not a raised event from the text track. Also, because
synthesized speech is successive anyway, you would likely just hand
the full text from the cue to the ARIA live region in one go and leave
it to the natural passing of time to render the text successively.
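A minimal sketch of that wiring (the helper is pure so it can be shown standalone; the element id and track index are my assumptions):

```javascript
// Sketch only: hand the full text of the active cues to an ARIA live
// region whenever the active set changes. All names here are made up
// for illustration.
function liveRegionText(activeCues) {
  // AT reads a live region at its own pace, so the whole cue text can
  // go in at once - no per-word events needed.
  return Array.from(activeCues, (cue) => cue.text).join('\n');
}

// Browser wiring (assumes <div id="captions" aria-live="polite">):
//   const track = video.textTracks[0];
//   track.addEventListener('cuechange', () => {
//     document.getElementById('captions').textContent =
//       liveRegionText(track.activeCues);
//   });

console.log(liveRegionText([{ text: 'Hello' }, { text: 'world' }]));
// prints "Hello" and "world" on two lines
```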

Don't get me wrong: we'd want to introduce an event if there is a
large developer need. But I haven't come across one yet. Also, it can
still be done in the future.


> From my perspective though, it means that if I am translating the above
> example, I cannot use tools exposed by a browser, I’m going to have to grub
> around in the internal structure of the displayed HTML and try and find the
> timestamps and set up my own handler on the video timeline events and then
> reverse out the CSS on the HTML fragments. Makes life just a bit harder is
> all.

Yes, it does. The question really is: is that any different for TTML,
and how do you manage it there?

Cheers,
Silvia.

Received on Monday, 17 June 2013 02:57:44 UTC