Re: Resolving TextTrackCue issues

Hi Silvia,

It is a bit hard to follow this long discussion, spread as it is across
this list, the blink-dev list, the bug tracker, ... I'll give my
understanding in the hope that it helps and does not add more confusion.

My understanding is that we should distinguish the process which 
generates cues from the process that consumes the cues and draft the 
interface(s) with both processes in mind.

There are 2 ways to generate cue objects:

A. created by some JS code
The content of the cue may be generated client-side or received via
XHR. The format of the cue content may be anything: plain text, XML,
binary data, base64-encoded or not. The data has at least a start time
(possibly an end time) and should have an associated MIME type. Then you
have 2 sub-cases:

   A.1 The browser is capable of creating specific objects from the cue
content according to the MIME type (e.g. WebVTT Node objects, TTML
objects, ...). In that case, there should be a way (for instance a
dedicated interface) for a JS app to have the cue content parsed and the
objects created by the browser. For example, if the content type of the
cue I want to generate is text/CueFormatX, I will check whether the
browser supports parsing CueFormatX, invoke the parser (via a
constructor or another method) to get a specialized object, and then
access CueFormatX.propertyY if needed.

   A.2 The browser is not capable of creating specific objects from the
cue content (e.g. proprietary binary data) or the MIME type is unknown.
In that case, the JS can use a generic constructor or method to store
the timed cue content for later use.
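
To make the A.1/A.2 split concrete, here is a minimal sketch in
TypeScript. Every name in it (canParseCue, createCue, FormatXCue, the
parser registry) is hypothetical and only stands in for whatever
interface the specs end up with; GenericCue here is a simplified shape,
not the IDL quoted below. The registry plays the role of "formats the
browser can parse".

```typescript
// Hypothetical shapes for illustration; not taken from any spec.
interface BaseCue {
  startTime: number;
  endTime: number;
  mimeType: string;
}

// A.2: opaque payload stored as-is for later use by the application.
interface GenericCue extends BaseCue {
  kind: "generic";
  data: string;
}

// A.1: a cue parsed into a format-specific object.
interface FormatXCue extends BaseCue {
  kind: "formatX";
  propertyY: string;
}

type Cue = GenericCue | FormatXCue;
type CueParser = (start: number, end: number, payload: string) => Cue;

// Stand-in for the set of cue formats the browser knows how to parse.
const parsers = new Map<string, CueParser>([
  ["text/CueFormatX", (start, end, payload) => ({
    kind: "formatX",
    startTime: start,
    endTime: end,
    mimeType: "text/CueFormatX",
    propertyY: payload.trim(),
  })],
]);

// Feature detection: can a specialized object be built for this type?
function canParseCue(mime: string): boolean {
  return parsers.has(mime);
}

// A.1 when a parser exists for the MIME type, A.2 otherwise.
function createCue(mime: string, start: number, end: number, payload: string): Cue {
  const parse = parsers.get(mime);
  if (parse !== undefined) {
    return parse(start, end, payload);
  }
  return { kind: "generic", startTime: start, endTime: end, mimeType: mime, data: payload };
}
```

The point of the sketch is only that the application asks one question
(is this type parseable?) and always gets a timed cue back, specialized
or not.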

B. created by the browser
   The content of the cues is generated and received, outside of any JS
processing, from resources in a format that is understood by the browser
(e.g. plain WebVTT files, TTML files, MP4 files, MPEG-2 TS, WebM, ...).
As above, the browser will generate cue objects, ideally as specialized
as possible: i.e. if the resource is of type text/vtt, it should create
VTTCue objects; and similarly for text/CueFormatX.

Then, there are 2 ways to consume the cue objects:

C. The browser is capable of producing a renderable representation of
the cue content (ideally this is detectable, e.g. through a method (or
equivalent) isRenderableTextTrack(mime) that returns true). Then:
   C.1 If the rendering is left to the browser natively, the track kind 
is set to subtitles or captions.
   C.2 If the rendering needs to be altered by the JS, the track kind is 
set to metadata, the JS code calls getCueAsHTML when needed, the result 
is modified and displayed.

D. The browser is not capable of producing a renderable representation 
of the cue content
    The JS code should handle the rendering of the cue content from the
given cue objects (specialized or not).
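
The consumption side can be sketched the same way. This is only an
illustration under assumptions: isRenderableTextTrack is the method
wished for above (it does not exist in either spec as far as I know),
planRendering and RenderingPlan are made-up names, "subtitles" stands
in for "subtitles or captions", and using the metadata kind for case D
is my assumption, since the text above only states the kind for C.1
and C.2.

```typescript
type TrackKind = "subtitles" | "captions" | "metadata";

interface RenderingPlan {
  path: "C.1" | "C.2" | "D";
  trackKind: TrackKind;
  renderedBy: "browser" | "script";
}

// Stand-in for the formats the browser can turn into something renderable.
const renderableTypes = new Set<string>(["text/vtt"]);

// Hypothetical feature-detection method, as wished for in case C.
function isRenderableTextTrack(mime: string): boolean {
  return renderableTypes.has(mime);
}

function planRendering(mime: string, scriptAltersOutput: boolean): RenderingPlan {
  if (!isRenderableTextTrack(mime)) {
    // D: the script renders from the cue objects itself.
    // Assumption: the track is exposed with kind "metadata".
    return { path: "D", trackKind: "metadata", renderedBy: "script" };
  }
  if (scriptAltersOutput) {
    // C.2: the script calls getCueAsHTML, modifies the result and displays it.
    return { path: "C.2", trackKind: "metadata", renderedBy: "script" };
  }
  // C.1: rendering is left entirely to the browser.
  return { path: "C.1", trackKind: "subtitles", renderedBy: "browser" };
}
```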

Of course, you could mix how the cues are received with how they are 
rendered and have:
- B+C (e.g. the browser supports parsing of WebVTT into cue nodes and 
the rendering)
- or B+D (e.g. receiving an unknown track from an MP4 file, such as 3GPP
Timed Text, and having JS convert it to WebVTT cues),
- or A.1+C
- or A.1+D
- or A.2+D
I don't see use cases for A.2+C: if a browser is not capable of creating
specialized objects for a format, it is probably not capable of
rendering the cue either.
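
For reference, the combinations above can be enumerated mechanically.
This is a small TypeScript sketch whose names (Generation, Consumption,
isPlausible) are made up for the purpose; it encodes only the single
exclusion argued above.

```typescript
type Generation = "A.1" | "A.2" | "B";
type Consumption = "C" | "D";

function isPlausible(gen: Generation, cons: Consumption): boolean {
  // A.2 means the browser could not build specialized objects, so it is
  // assumed unable to render the cue either; that rules out A.2+C.
  return !(gen === "A.2" && cons === "C");
}

const generations: Generation[] = ["A.1", "A.2", "B"];
const consumptions: Consumption[] = ["C", "D"];

// Collect every generation/consumption pair that survives the exclusion.
const plausible: string[] = [];
for (const g of generations) {
  for (const c of consumptions) {
    if (isPlausible(g, c)) {
      plausible.push(`${g}+${c}`);
    }
  }
}
```

That leaves five pairs, and any proposed interface design could be
checked against each of them in turn.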

I don't have a clear opinion on which design is the best (new cue
interfaces with/without constructor, methods on the TextTrack interface,
...), but I would like all of these use cases to be possible. Is that
the case with the W3C approach? With the WHATWG approach? Could we
compare example code?


On 31/08/2013 09:26, Silvia Pfeiffer wrote:
> Hi all,
> Recent changes to the TextTrackCue interface had led to a fork with
> the WHATWG spec [1] when resolving bug 21851 [2].
> This caused extensive discussion on blink-dev [3] when an intent to
> implement was proposed.
> In the W3C WG we recognize the need for a generic cue interface type
> with a constructor and a text attribute. It allows browsers to expose
> cues in text tracks of video or audio files for which browsers don't
> intend to implement parsers. It also allows JavaScript developers to
> create time-synchronized data for media elements in any format they
> require.
> The discussion on blink-dev exposed that the currently specified
> solution of bug 21851 [2] in the HTML5 spec is flawed in several ways:
> (1) TextTrackCue objects that are not fully abstract create hard to
> debug issues of backwards compatibility due to existing code that
> assumes "new TextTrackCue()" constructs a cue with VTT semantics;
> (2) in order to transition old TextTrackCue interface usage to "new
> VTTCue()", it is better to remove the existing TextTrackCue
> constructor causing hard failure (easily recognizable) instead of soft
> failure (more difficult to recognize);
> (3) the abstract TextTrackCue interface of the WHATWG is desirable for
> extensibility to non-text-based cue interfaces of the future;
> (4) the interface fork between the WHATWG and W3C spec should be removed.
> An alternative resolution to bug 21851 [2] has previously been
> proposed and discussed: create a new interface that has the text
> attribute and the constructor and inherits from the abstract
> interface.
> This will result in the following interfaces:
> interface TextTrackCue : EventTarget {
>    readonly attribute TextTrack? track;
>             attribute DOMString id;
>             attribute double startTime;
>             attribute double endTime;
>             attribute boolean pauseOnExit;
>             attribute EventHandler onenter;
>             attribute EventHandler onexit;
> };
> [Constructor(double startTime, double endTime, DOMString text)]
> interface GenericCue : TextTrackCue {
>             attribute DOMString text;
> };
> Whether VTTCue will inherit from GenericCue or from TextTrackCue will
> be resolved in the TextTrack CG once this change has been applied to
> the HTML5 spec.
> It is my understanding that this proposed change resolves all the
> above listed issues. I will therefore apply these changes next week
> unless there are any further concerns.
> Regards,
> Silvia (as HTML spec editor).
> [1]
> [2]
> [3]

Cyril Concolato
Maître de Conférences/Associate Professor
Groupe Multimedia/Multimedia Group
Telecom ParisTech
46 rue Barrault
75 013 Paris, France

Received on Wednesday, 4 September 2013 15:03:28 UTC