Re: A new proposal for how to deal with text track cues from Philip Jägenstedt on 2013-06-18 (public-texttracks@w3.org from June 2013)

From: Philip Jägenstedt <philipj@opera.com>
Date: Tue, 18 Jun 2013 13:33:10 +0200
To: "Silvia Pfeiffer" <silviapfeiffer1@gmail.com>
Cc: public-texttracks@w3.org
Message-ID: <op.wyvg1kyisr6mfa@kirk>
On Fri, 14 Jun 2013 10:57:17 +0200, Silvia Pfeiffer  
<silviapfeiffer1@gmail.com> wrote:

> On Fri, Jun 14, 2013 at 6:20 PM, Philip Jägenstedt <philipj@opera.com>  
> wrote:
>>
>>>> Making the parser depend on attributes on the track element is
>>>> unnecessary coupling, and requiring later re-parsing means that the
>>>> WebVTT file must be pinned in cache, even if the HTTP cache headers
>>>> don't approve. Note also that re-parsing would throw away all
>>>> existing cues together with any modifications made by scripts.
>>>
>>>
>>> I think those are all positive consequences: changing the @kind on a
>>> <track> should not become something that programmers frequently use -
>>> I don't see a common use case for it. If it requires re-fetching the
>>> WebVTT file, then so be it. And re-parsing makes sense, because you
>>> may have made changes because you thought the cues were of a
>>> particular type, but they are not, so it's better to reset that.
>>
>>
>> The way I see it, re-parsing serves no purpose, because the WebVTT
>> file is still the same and will be parsed into the same result, it's
>> just the interpretation of the resulting cues that is different
>> between kinds. This looks like clean layering to me, is it unsightly
>> from some other perspective?
>
> Re-parsing the cues will have to happen anyway, because the parsing
> and the rendering algorithm both depend on what the cues are being
> interpreted as. For example, a kind=descriptions cue that has SSML
> markup, in contrast to a kind=captions cue that has WebVTT caption cue
> markup. When rendering the first one, a SSML parser will be activated
> and then a SSML descriptions renderer. When rendering the second one,
> the WebVTT caption parser will be activated and then the WebVTT
> caption renderer.
>
> The difference is that right now we shove all this into a single
> object and attach all the different parsing and rendering algorithms
> that are possible with the same object. This is bound to eventually
> end up in a complicated mess with statements such as "these attributes
> and these parsing and rendering algorithms are to be used when the cue
> is interpreted as a caption cue, these other ones for interpretation
> as descriptions, etc etc". Doesn't look like clean layering to me.

I don't understand the above example. Are we expecting SSML markup in
WebVTT? Even if that's the case, the WebVTT file doesn't need to be
reparsed; it's just the text content of each cue that needs to be
interpreted differently. Currently, the kind is not an input to the WebVTT
parsing algorithm, and as long as that remains true there can't be any
reason to reparse.
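
To make the layering concrete, here is a rough sketch, assuming a toy
parser that only understands timing lines and payload text. None of the
names below (ParsedCue, parseCues, interpret) come from the spec; they are
made up for illustration. The point is only that the parser takes no kind
argument, so a change of kind can only affect the interpretation step.

```typescript
// Hypothetical types for illustration; not the WebVTT spec's data model.
type Kind = "subtitles" | "captions" | "descriptions" | "chapters" | "metadata";

interface ParsedCue {
  startTime: number; // seconds
  endTime: number; // seconds
  text: string; // raw cue payload, left uninterpreted by the parser
}

// Converts "HH:MM:SS.mmm" or "MM:SS.mmm" into seconds.
function parseTimestamp(ts: string): number {
  return ts.split(":").map(Number).reduce((acc, part) => acc * 60 + part, 0);
}

// Toy parser: it has no `kind` parameter, so its result cannot depend on kind.
function parseCues(source: string): ParsedCue[] {
  const cues: ParsedCue[] = [];
  for (const block of source.split(/\n\n+/)) {
    const lines = block.split("\n");
    const timing = lines.find((l) => l.includes("-->"));
    if (timing === undefined) continue; // header or other non-cue block
    // Keep only the timestamp itself, dropping any cue settings after it.
    const parts = timing.split("-->").map((t) => t.trim().split(/\s+/)[0] ?? "");
    cues.push({
      startTime: parseTimestamp(parts[0] ?? "0"),
      endTime: parseTimestamp(parts[1] ?? "0"),
      text: lines.slice(lines.indexOf(timing) + 1).join("\n"),
    });
  }
  return cues;
}

// Interpretation is a separate layer, selected by kind when the cue is used.
function interpret(cue: ParsedCue, kind: Kind): string {
  if (kind === "subtitles" || kind === "captions") {
    return cue.text.replace(/<[^>]*>/g, ""); // placeholder: strip cue tags
  }
  return cue.text; // placeholder: hand the payload on untouched
}
```

With this split, switching a track from captions to chapters means calling
interpret() again on the same ParsedCue objects; parseCues() is not
involved, and script modifications to the cues survive.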

>>> It's easier to simply turn off all other tracks when debugging a
>>> specific track than having to edit each cue of a WebVTT file just to
>>> debug its content.
>>
>>
>> True. Still, the settings can be there and will be parsed, so it's
>> just a matter of hiding them in the interface.
>
> How do you hide them in the interface?

By using an interface which doesn't expose the attributes, an interface  
which is effectively a subset of WebVTTCue. Is that not what you're  
proposing?
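
As a sketch of what I mean by a subset, assuming a cue object that carries
the layout settings: FullCue and PlainCueView are hypothetical names, and
the property list is only illustrative, not the actual WebVTTCue
definition.

```typescript
// Hypothetical interfaces for illustration only. The setting names mirror
// WebVTT cue settings, but neither interface name comes from a spec.
interface FullCue {
  startTime: number;
  endTime: number;
  text: string;
  vertical: "" | "rl" | "lr";
  line: number | "auto";
  position: number;
  size: number;
  align: string;
}

// The narrower interface: same members as FullCue, minus the settings.
type PlainCueView = Pick<FullCue, "startTime" | "endTime" | "text">;

// Returns the same object, typed so that the settings are not visible.
function asPlainCueView(cue: FullCue): PlainCueView {
  return cue;
}

const parsedCue: FullCue = {
  startTime: 0,
  endTime: 30,
  text: "Introduction",
  vertical: "",
  line: "auto",
  position: 50,
  size: 100,
  align: "middle",
};

const chapterCue = asPlainCueView(parsedCue);
// chapterCue.text is accessible; chapterCue.size is a compile-time error,
// even though the parsed setting is still stored on the underlying object.
```

The settings are parsed and kept either way; only what the interface
exposes differs.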

>>> Note also that we're about to write a rendering algorithm for
>>> chapters, so there's no need to turn them into captions/subtitles just
>>> to make them visible.
>>
>> Can you tell me more about this? Aren't chapters used only in the UI?
>
> We have to write a rendering algorithm for chapters at
> http://localhost/~silvia/html5/text-tracks/webvtt/webvtt.html#cues-in-isolation
> so we get interoperable display of chapters.
>
> I'm going to propose to add them as a list into a menu on the video
> controls. But it is possible to introduce other displays like the
> chapter markers in the examples here:
> http://wiki.whatwg.org/wiki/Use_cases_for_API-level_access_to_timed_tracks#Chapter_Markers
> . We should discuss this separately.

Chapters shouldn't be rendered on top of the video at all except for  
debugging, should they? I still don't understand what this rendering  
algorithm is, or why it would require a special interface for the cues.

>>> You're confusing me - are you supporting the introduction of other
>>> interfaces for other cue formats?
>>
>>
>> I think that for each sufficiently different serialization format for
>> which there is implementor interest, a cue interface able to well
>> represent the underlying format should be added.
>>
>>
>>> Don't get me wrong, though: I still believe that TTMLCaptionCue will
>>> get created, because it follows a different
>>> caption model than VTTCaptionCue. However, VTTChapterCue and
>>> TTMLChapterCue should not be different and should instead just result
>>> in a ChapterCue object, because we want chapters represented the same
>>> way independent of what serialisation introduced them into the
>>> browser.
>>
>>
>> OK, so I guess this is the crux of the matter: unifying the
>> representation of chapter cues. What formats other than WebVTT are
>> able to represent chapters?
>
> Plenty of others, including DVD chapters, chapters in QuickTime files, in
> MP2, or in MP4 files. But they all parse down to a start time (and an
> optional end time) and a plain text string.

Taking MPEG-4 as the example, how are normal cues represented and what  
kind of cue interface would they use? If that interface includes a text  
property, then surely it can also be used for chapters in MPEG-4?

>> I can't find anything in the TTML spec.
>
> I was told that TTML indeed supports chapters, though I haven't seen
> any TTML files in use for that purpose. They would also just be timed
> cues with plain text, I was told.

If that's so, then why not use TTMLCue for them?

>> If TTML chapters look like
>> normal TTML cues, I think it would make more sense to just use a common
>> TTMLCue interface for all TTML cues, like for WebVTT. Unifying the
>> processing of chapters can be layered on top of that, simply by letting
>> each cue format define how to extract a chapter name and whatever other
>> information is needed. Would that not be simpler?
>
> I don't think so. I think we should distinguish between Cue formats
> based on semantics and not based on the name of the serialisation file
> format that provides it, because there are many file formats that will
> provide the same information to the browser.
>
> Captions are indeed a bit more complicated than all the other timed
> cue formats, which is why I think there will be a TTMLCaptionCue
> object that will be substantially different from a WebVTTCaptionCue.
> It would, though, be nice if we were able to define a CaptionCue
> object that can be filled either from a TTML or a WebVTT or from a
> CEA708 file or other caption format (unfortunately, WebVTTCue isn't it
> - it has too many WebVTT specifics in it).

I don't see the merit in distinguishing based on semantics, especially if  
the main motivation is chapters and if for each format, the chapter cues  
and normal cues have the same internal representation. Unless there's an  
actual format with actual implementor interest which requires splitting of  
interfaces along the lines you suggest, I think it's just complicating  
things.
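
For what it's worth, the layering I suggested earlier could look roughly
like the following. This is only a sketch: WebVTTCueLike and TTMLCueLike
are hypothetical stand-ins (no TTML cue interface exists, and the begin,
end, and body names are invented), and the Chapter shape is just the start
time, end time, and plain text string mentioned above.

```typescript
// Format-independent result: what the chapter UI would consume.
interface Chapter {
  startTime: number;
  endTime: number;
  title: string;
}

// Hypothetical per-format cue shapes; each keeps its own representation.
interface WebVTTCueLike {
  startTime: number;
  endTime: number;
  text: string;
}

interface TTMLCueLike {
  begin: number; // invented property name
  end: number; // invented property name
  body: string; // invented property name
}

// Each cue format defines how to extract chapter information from its cues.
type ChapterExtractor<C> = (cue: C) => Chapter;

const chapterFromWebVTT: ChapterExtractor<WebVTTCueLike> = (cue) => ({
  startTime: cue.startTime,
  endTime: cue.endTime,
  title: cue.text.trim(),
});

const chapterFromTTML: ChapterExtractor<TTMLCueLike> = (cue) => ({
  startTime: cue.begin,
  endTime: cue.end,
  title: cue.body.trim(),
});

// The unified chapter processing sits on top and never sees the cue format.
function buildChapterList<C>(cues: C[], extract: ChapterExtractor<C>): Chapter[] {
  return cues.map(extract).sort((a, b) => a.startTime - b.startTime);
}
```

The chapter list comes out the same regardless of the source format,
without needing a separate ChapterCue interface per semantics.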

-- 
Philip Jägenstedt
Opera Software
Received on Tuesday, 18 June 2013 11:33:44 UTC