Re: A new proposal for how to deal with text track cues

On Tue, Jun 18, 2013 at 9:33 PM, Philip Jägenstedt <philipj@opera.com> wrote:
> On Fri, 14 Jun 2013 10:57:17 +0200, Silvia Pfeiffer
> <silviapfeiffer1@gmail.com> wrote:
>
>> On Fri, Jun 14, 2013 at 6:20 PM, Philip Jägenstedt <philipj@opera.com>
>> wrote:
>>>
>>>
>>>>> Making the parser depend on attributes on the track element is
>>>>> unnecessary
>>>>> coupling, and requiring later re-parsing means that the WebVTT file
>>>>> must
>>>>> be
>>>>> pinned in cache, even if the HTTP cache headers don't approve. Note
>>>>> also
>>>>> that re-parsing would throw away all existing cues together with any
>>>>> modifications made by scripts.
>>>>
>>>>
>>>>
>>>> I think those are all positive consequences: changing the @kind on a
>>>> <track> should not become something that programmers frequently use -
>>>> I don't see a common use case for it. If it requires re-fetching the
>>>> WebVTT file, then so be it. And re-parsing makes sense, because you
>>>> may have made changes because you thought the cues were of a
>>>> particular type, but they are not, so it's better to reset that.
>>>
>>>
>>>
>>> The way I see it, re-parsing serves no purpose, because the WebVTT file
>>> is
>>> still the same and will be parsed into the same result, it's just the
>>> interpretation of the resulting cues that is different between kinds.
>>> This
>>> looks like clean layering to me, is it unsightly from some other
>>> perspective?
>>
>>
>> Re-parsing the cues will have to happen anyway, because the parsing
>> and the rendering algorithm both depend on what the cues are being
>> interpreted as. For example, a kind=descriptions cue that has SSML
>> markup, in contrast to a kind=captions cue that has WebVTT caption cue
>> markup. When rendering the first one, an SSML parser will be activated
>> and then an SSML descriptions renderer. When rendering the second one,
>> the WebVTT caption parser will be activated and then the WebVTT
>> caption renderer.
>>
>> The difference is that right now we shove all this into a single
>> object and attach all the different parsing and rendering algorithms
>> that are possible with the same object. This is bound to eventually
>> end up in a complicated mess with statements such as "these attributes
>> and these parsing and rendering algorithms are to be used when the cue
>> is interpreted as a caption cue, these other ones for interpretation
>> as descriptions, etc etc". Doesn't look like clean layering to me.
>
>
> I don't understand the above example, are we expecting SSML markup in
> WebVTT?

Yes, potentially. It's just an example, but we will have different
formats in cues.

> Even if that's the case, the WebVTT file doesn't need to be
> reparsed, it's just the text content of each cue that needs to be
> interpreted differently.

And the cue settings potentially, too.

> Currently, the kind is not an input to the WebVTT
> parsing algorithm, and as long as that remains true there can't be any
> reason to reparse.

That's the way the HTML spec is currently written (implicitly), but it
will have to change when the cue text can contain different types of
content.

We're already starting to do this for rendering:

The browser has "rules for updating the text track rendering"
associated with a list of text track cues. These rules, for a WebVTT
file, are currently the "rules for updating the display of WebVTT text
tracks", which currently map to "Rendering cues with video" in the
WebVTT spec: http://dev.w3.org/html5/webvtt/#cues-with-video. However,
in the WebVTT spec, we have started a separate section for "Rendering
cues in isolation", which will describe how to render chapter cues:
http://dev.w3.org/html5/webvtt/#cues-in-isolation. So, kind will be an
input into which rendering algorithm to choose.
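To illustrate the idea, here is a hypothetical sketch of a user agent
dispatching on track kind to pick a rendering algorithm. The function
names stand in for the spec's "Rendering cues with video" and
"Rendering cues in isolation" sections; none of these identifiers come
from either spec.

```javascript
// Stand-in for the spec's "Rendering cues with video" algorithm.
function renderCuesWithVideo(cues) {
  return cues.map((cue) => `overlay:${cue.text}`);
}

// Stand-in for the spec's "Rendering cues in isolation" algorithm.
function renderCuesInIsolation(cues) {
  return cues.map((cue) => `standalone:${cue.text}`);
}

// Hypothetical dispatcher: the track's kind selects the algorithm.
function updateTextTrackRendering(kind, cues) {
  switch (kind) {
    case "subtitles":
    case "captions":
      return renderCuesWithVideo(cues);   // overlaid on the video
    case "chapters":
      return renderCuesInIsolation(cues); // e.g. a navigation menu
    default:
      return []; // descriptions, metadata: no visual rendering
  }
}
```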


>>>> It's easier to simply turn off all other tracks when debugging a
>>>> specific track than having to edit each cue of a WebVTT file just to
>>>> debug its content.
>>>
>>>
>>>
>>> True. Still, the settings can still be there and will be parsed, so
>>> it's just a matter of hiding them in the interface.
>>
>>
>> How do you hide them in the interface?
>
>
> By using an interface which doesn't expose the attributes, an interface
> which is effectively a subset of WebVTTCue. Is that not what you're
> proposing?

Partially right - since the objects may have different attributes, it
may not be a subset of WebVTTCue.


>>>> Note also that we're about to write a rendering algorithm for
>>>> chapters, so there's no need to turn them into captions/subtitles just
>>>> to make them visible.
>>>
>>>
>>> Can you tell me more about this? Aren't chapters used only in the UI?
>>
>>
>> We have to write a rendering algorithm for chapters at
>>
>> http://dev.w3.org/html5/webvtt/#cues-in-isolation
>> so we get interoperable display of chapters.
>>
>> I'm going to propose to add them as a list into a menu on the video
>> controls. But it is possible to introduce other displays like the
>> chapter markers in the examples here:
>>
>> http://wiki.whatwg.org/wiki/Use_cases_for_API-level_access_to_timed_tracks#Chapter_Markers
>> . We should discuss this separately.
>
>
> Chapters shouldn't be rendered on top of the video at all except for
> debugging, should they? I still don't understand what this rendering
> algorithm is, or why it would require a special interface for the cues.


It was suggested to add an image thumbnail to a chapter cue, which is
something very common. The chapters could be represented through dots
on the timeline, and a mouseover could render the thumbnails. Or the
chapters could be represented in a menu on the controls with the
thumbnail as an icon.

So, a cue that represents a chapter would have:
* no vertical, snapToLines, line, position, size or align attributes
* no <c>, <i>, <b>, <u>, <v>, or <00:00:00.000> elements, and
potentially no <ruby>, <rt>, or <lang> elements
* a URL-encoded image or icon somewhere in the content of the cue

If we are still to represent a chapter in a WebVTTCue object, we'd
have to introduce parsing of a URL-encoded image into either the cue
settings or the cue text.
If (as has been argued in the past) captions and subtitles should not
have images in them, we'd have to further disallow the image data from
being interpreted for cues of kind captions and subtitles, while
allowing it for cues of kind chapters.

We end up having to put exceptions into the parser based on the kind
of the track that the cue is a part of.
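A minimal sketch of what a dedicated chapter cue object could look
like instead. ChapterCue and its imageUrl field are assumptions for
illustration only, not part of any spec; the point is that none of
WebVTTCue's positioning attributes need to exist on it.

```javascript
// Hypothetical ChapterCue: only the fields a chapter actually needs.
// Deliberately omits vertical, snapToLines, line, position, size and
// align; "imageUrl" is an assumed extension for thumbnails.
class ChapterCue {
  constructor(startTime, endTime, title, imageUrl = null) {
    this.startTime = startTime;
    this.endTime = endTime;   // may be null for open-ended chapters
    this.title = title;       // plain text, no <c>/<i>/<b>/<v> markup
    this.imageUrl = imageUrl; // e.g. a data: URL for a thumbnail
  }
}

const intro = new ChapterCue(0, 60, "Introduction", "data:image/png;base64,...");
```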


>>>> You're confusing me - are you supporting the introduction of other
>>>> interfaces for other cue formats?
>>>
>>>
>>>
>>> I think that for each sufficiently different serialization format for
>>> which
>>> there is implementor interest, a cue interface able to well represent the
>>> underlying format should be added.
>>>
>>>
>>>> Don't get me wrong, though: I still believe that TTMLCaptionCue will
>>>> get created and it will get created, because it follows a different
>>>> caption model than VTTCaptionCue. However, VTTChapterCue and
>>>> TTMLChapterCue should not be different and should instead just result
>>>> in a ChapterCue object, because we want chapters represented the same
>>>> way independent of what serialisation introduced them into the
>>>> browser.
>>>
>>>
>>>
>>> OK, so I guess this is the crux of the matter: unifying the
>>> representation
>>> of chapter cues. What formats other than WebVTT are able to represent
>>> chapters?
>>
>>
>> Plenty of others, including DVD chapters, and chapters in QuickTime,
>> MP2, or MP4 files. But they all parse down to a start time, an
>> optional end time, and a plain text string.
>
>
> Taking MPEG-4 as the example, how are normal cues represented and what kind
> of cue interface would they use? If that interface includes a text property,
> then surely it can also be used for chapters in MPEG-4?

Are you suggesting to use WebVTTCue as the object to represent
chapters from other formats in?
Why then are we calling it "WebVTT"Cue?
It's exactly this kind of name misrepresentation that I am trying to fix.


>>> I can't find anything in the TTML spec.
>>
>>
>> I was told that TTML indeed supports chapters, though I haven't seen
>> any TTML files in use for that purpose. They would also just be timed
>> cues with plain text, I was told.
>
>
> If that's so, then why not use TTMLCue for them?

A JS developer is parsing a TTML file and a WebVTT file, each into a
list of cues. They get a list of chapters from both files that are of
identical format. Both the TTML cues and the WebVTT cues could have
come from an MP4 file, just from different tracks that were encoded
from these original file formats. Since the browser knows that, it
returns a list of TTMLCue chapters and a list of WebVTTCue chapters to
the JS developer, both of which have different attributes to deal
with. Could he simply concatenate these two lists into one that
contains all his chapters? No, he'd have to invent a new type for it.
Is that a logical API for a JS developer? I wouldn't say so.
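With a common chapter shape, the merge the developer wants would be
trivial. A sketch under assumed names (parseWebVTTChapters and
parseTTMLChapters are hypothetical stand-ins for the two parsers, and
the plain { startTime, title } objects stand in for a shared
ChapterCue type):

```javascript
// Hypothetical parsers that both yield the same chapter shape.
function parseWebVTTChapters() {
  return [{ startTime: 0, title: "Intro" }, { startTime: 120, title: "Credits" }];
}
function parseTTMLChapters() {
  return [{ startTime: 60, title: "Main" }];
}

// Because both lists are homogeneous, concatenating and sorting them
// gives one usable chapter list, regardless of the source format.
const chapters = [...parseWebVTTChapters(), ...parseTTMLChapters()]
  .sort((a, b) => a.startTime - b.startTime);
```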


>>> If TTML chapters look like
>>> normal TTML cues, I think it would make more sense to just use a common
>>> TTMLCue interface for all TTML cues, like for WebVTT. Unifying the
>>> processing of chapters can be layered on top of that, simply by letting
>>> each
>>> cue format define how to extract a chapter name and whatever other
>>> information is needed. Would that not be simpler?
>>
>>
>> I don't think so. I think we should distinguish between Cue formats
>> based on semantics and not based on the name of the serialisation file
>> format that provides it, because there are many file formats that will
>> provide the same information to the browser.
>>
>> Captions are indeed a bit more complicated than all the other timed
>> cue formats, which is why I think there will be a TTMLCaptionCue
>> object that will be substantially different from a WebVTTCaptionCue.
>> It would, though, be nice if we were able to define a CaptionCue
>> object that can be filled from a TTML, a WebVTT, a CEA-708, or
>> another caption format (unfortunately, WebVTTCue isn't it - it has
>> too many WebVTT specifics in it).
>
>
> I don't see the merit in distinguishing based on semantics, especially if
> the main motivation is chapters and if for each format, the chapter cues and
> normal cues have the same internal representation. Unless there's an actual
> format with actual implementor interest which requires splitting of
> interfaces along the lines you suggest, I think it's just complicating
> things.

See if you still think so with the above example.

Silvia.

Received on Wednesday, 19 June 2013 04:41:54 UTC