Re: Resolving TextTrackCue issues from Glenn Adams on 2013-09-05 (public-html@w3.org from September 2013)

From: Glenn Adams <glenn@skynav.com>
Date: Thu, 5 Sep 2013 10:43:24 -0600
To: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Cc: Cyril Concolato <cyril.concolato@telecom-paristech.fr>, public-html <public-html@w3.org>
Message-ID: <CACQ=j+ddUXe0Y_suRtfUzFKti1bJpBZPS6KEz0v9tZ9_EmigkA@mail.gmail.com>
On Thu, Sep 5, 2013 at 8:19 AM, Silvia Pfeiffer
<silviapfeiffer1@gmail.com> wrote:

> On Thu, Sep 5, 2013 at 1:03 AM, Cyril Concolato
> <cyril.concolato@telecom-paristech.fr> wrote:
> > Hi Silvia,
> >
> > It is a bit hard to follow this long discussion spread across this list,
> > the blink-dev list, the bug tracker, ... I'll give my understanding in the
> > hope that it helps and that it won't add more confusion.
>
> Thanks. It's nice to see the requirements summarised by somebody else, too.
>
>
> > My understanding is that we should distinguish the process which generates
> > cues from the process that consumes the cues and draft the interface(s)
> > with both processes in mind.
> >
> > There are 2 ways to generate cue objects:
> >
> > A. created by some JS code
> > The content of the cue may be generated client-side or received via XHR.
> > The format of the cue content may be anything: plain text, XML, binary
> > data, base64 encoded or not. The data has at least a start time (possibly
> > an end time) and should have an associated MIME type. Then you have 2
> > sub-cases:
> >
> >   A.1 The browser is capable of creating specific objects from the cue
> > content following the MIME type (e.g. WebVTT Node objects, TTML objects,
> > ...). In that case, there should be a way (for instance a dedicated
> > interface) for a JS app to have the cue content parsed and have the
> > objects created by the browser: i.e. if the content type of the cue I want
> > to generate is text/CueFormatX, I will check if the browser supports the
> > parsing of CueFormatX, and invoke the parser (via a constructor or
> > another method) to get a specialized object and then access
> > CueFormatX.propertyY if needed.
>
> VTTCue satisfies this.
>
> >   A.2 The browser is not capable of creating specific objects from the
> > cue content (e.g. proprietary binary data) or the MIME type is unknown. In
> > that case, the JS can use a generic constructor or method to store the
> > timed cue content for later use.
>
> VTTCue with @kind=metadata would satisfy this, as would the new
> GenericCue interface for any @kind.
>
>
> > B. created by the browser
> >   The content of the cues is generated and received, outside of any JS
> > processing, from resources in a format that is understood by the browser
> > (e.g. plain WebVTT files, TTML files, MP4 files, MPEG-2 TS, WebM, ...).
> > Same as above, the browser will generate cue objects, ideally as
> > specialized as possible: i.e. if the resource is of type text/vtt, it
> > should create VTTCue; or similar for text/CueFormatX.
> >
> > Then, there are 2 ways to consume the cue objects:
>
> Recent discussion has exposed a third way to consume the cue objects:
>
> E. The browser is able to convert the cue content to a format for
> which it is able to produce a renderable representation. It basically
> pretends to the JS developer that the parsed data is a WebVTT cue.
>
>
> > C. The browser is capable of producing a renderable representation of the
> > cue content (e.g. ideally there is a method (or equivalent)
> > isRenderableTextTrack(mime) which returns true), then:
> >   C.1 If the rendering is left to the browser natively, the track kind is
> > set to subtitles or captions.
>
> VTTCue provides for this. No other rendering algorithm for TextTrack
> cues has been specified.
>

FYI, TTML2 will specify a TTMLCue object and a rendering algorithm. It is
not expected to make use of VTTCue.


>
> >   C.2 If the rendering needs to be altered by the JS, the track kind is
> > set to metadata, the JS code calls getCueAsHTML when needed, the result is
> > modified and displayed.
>
> JS is able to get an HTML representation of VTTCue text content, but
> why would there need to be a change of @kind?
>
>
> > D. The browser is not capable of producing a renderable representation of
> > the cue content
> >    The JS code should handle the rendering of the cue content from the
> > given cue objects (specialized or not)
>
> It's this use case D which is at the core of our discussion (assuming
> you include parsing as part of rendering). The W3C spec proposal for
> the GenericCue interface provides for cue content to be exposed by the
> browser and rendered by JS, satisfying your use case D. However, there
> is a position that if browsers are not capable of parsing and
> rendering cue content, they should not expose it to JS at all - in
> particular for captions and subtitles.


I think this position is better described as "parsing and rendering
renderable cue content", namely cues associated with UA renderable track
@kind, i.e., specifically, @kind != "metadata". I do not get the sense
there is opposition to exposing @kind="metadata" cue content to script.
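
A minimal sketch of that distinction, in TypeScript. The function name
`exposeToScript` and the `uaCanRender` flag are hypothetical, chosen only for
illustration; neither spec defines them.

```typescript
// Track kinds as defined for HTML text tracks.
type TrackKind =
  | "subtitles"
  | "captions"
  | "descriptions"
  | "chapters"
  | "metadata";

// Hypothetical predicate for the position described above: cues on a track of
// a UA-renderable kind are exposed to script only when the UA can parse and
// render them, while metadata cues are always exposed.
function exposeToScript(kind: TrackKind, uaCanRender: boolean): boolean {
  if (kind === "metadata") {
    return true; // no opposition to exposing metadata cue content
  }
  return uaCanRender;
}
```

Under the GenericCue proposal, by contrast, the second branch would simply
return true for every kind.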


> If they won't, then we can
> simply pretend everything is a WebVTT cue and when not rendered, it's
> of @kind=metadata (even if it's actually caption content).
>

We should treat this as an implementation strategy (on the part of particular
UAs), and not something we codify in the spec, though it wouldn't hurt to
mention it as a possible implementation strategy along with text that warns
that this may result in dropping semantics of the source format.


We definitely should not presume this is a strategy that will be
universally followed.
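
To make the loss of semantics concrete, here is a rough sketch of that
strategy in TypeScript. The `InBandCue` and `PretendVttCue` records and the
`asPretendVttCue` helper are hypothetical names used for illustration only;
they do not come from any spec or implementation.

```typescript
// Hypothetical record for a cue the UA has extracted from a media container.
interface InBandCue {
  kind: string;      // kind declared by the source track, e.g. "captions"
  mimeType: string;  // source format, e.g. "application/ttml+xml"
  startTime: number;
  endTime: number;
  payload: string;   // raw cue content
}

// Hypothetical stand-in for what script would see: a WebVTT-style cue.
interface PretendVttCue {
  kind: string;
  startTime: number;
  endTime: number;
  text: string;
}

// Sketch of the "pretend everything is a WebVTT cue" strategy: when the UA
// cannot render the cue, the track kind is relabelled as metadata.
function asPretendVttCue(cue: InBandCue, uaCanRender: boolean): PretendVttCue {
  return {
    kind: uaCanRender ? cue.kind : "metadata",
    startTime: cue.startTime,
    endTime: cue.endTime,
    text: cue.payload,
  };
}
```

Note that the result carries neither the source MIME type nor, in the
unrendered case, the original kind, so script can no longer tell that the
content was caption data in another format.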


>
>
> > Of course, you could mix how the cues are received with how they are
> > rendered and have:
> > - B+C (e.g. the browser supports parsing of WebVTT into cue nodes and the
> > rendering)
> > - or B+D (receiving an unknown track from an MP4 file (e.g. 3GPP Timed
> > Text) and having JS convert it to WebVTT cues),
> > - or A.1+C
> > - or A.1+D
> > - or A.2+D
> > I don't see use cases for A.2+C: if a browser is not capable of creating
> > specialized objects for a format it is probably not capable of rendering
> > the cue.
> >
> > I don't have a clear opinion on which design is the best (new cue
> > interfaces with/without constructor, methods on the TextTrack interface,
> > ...), but I would like to have all use cases possible. Is it the case with
> > the W3C approach?
>
> Yes.
>
> > With the WHATWG approach?
>
> Case D is not supported in the WHATWG approach.
>
> > Could we compare example code?
>
> I can give you an example: if you have TTML in-band in MP4, it's
> caption content, a browser has no parser and renderer for it, but can
> in theory extract the cues from the MP4 encapsulation -
>
> - the WHATWG spec would either not expose them to JS at all, or expect
> them to be exposed as VTTCue objects with @kind=metadata
>

This would not work, since VTTCue interprets cues of kind metadata as *WebVTT
metadata text* [1], which is most definitely incompatible with TTML that
has been serialized into intermediate synchronic document instances, each
of which is effectively an XML document.

[1] https://dvcs.w3.org/hg/ttml/raw-file/tip/ttml2/spec/ttml2.html#extension-designations
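
For comparison, a sketch of how the GenericCue alternative would look to
script for this TTML-in-MP4 case. The classes mirror the IDL proposed earlier
in this thread, but the `Sketch` prefixes, the `SketchTrack` type, the
`cuesForScriptRendering` helper, and the sample TTML payload are illustrative
assumptions, not spec text.

```typescript
// Mirrors the abstract TextTrackCue interface from the proposed IDL
// (event handlers and the track back-reference are omitted).
abstract class SketchTextTrackCue {
  id = "";
  pauseOnExit = false;
  startTime: number;
  endTime: number;

  constructor(startTime: number, endTime: number) {
    this.startTime = startTime;
    this.endTime = endTime;
  }
}

// Mirrors the proposed GenericCue: a constructor plus an uninterpreted
// text attribute.
class SketchGenericCue extends SketchTextTrackCue {
  text: string;

  constructor(startTime: number, endTime: number, text: string) {
    super(startTime, endTime);
    this.text = text;
  }
}

// Hypothetical stand-in for a text track: kind lives on the track, not the cue.
interface SketchTrack {
  kind: "subtitles" | "captions" | "descriptions" | "chapters" | "metadata";
  cues: SketchTextTrackCue[];
}

// Illustrative payload only: a tiny TTML document, i.e. XML.
const ttmlPayload =
  '<tt xmlns="http://www.w3.org/ns/ttml"><body><div><p>Hello</p></div></body></tt>';

// The track keeps @kind=captions even though the UA cannot render the cue.
const track: SketchTrack = {
  kind: "captions",
  cues: [new SketchGenericCue(0, 2.5, ttmlPayload)],
};

// Script picks out the cues whose content it must parse and render itself.
function cuesForScriptRendering(t: SketchTrack): SketchGenericCue[] {
  return t.cues.filter(
    (c): c is SketchGenericCue => c instanceof SketchGenericCue
  );
}
```

Here the text attribute is passed through untouched, so the XML is not
reinterpreted as WebVTT metadata text.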


>
> - the W3C spec as proposed on this thread would expose them to JS as
> GenericCue objects with @kind=captions
>
>
> HTH,
> Silvia.
>
>
> > HTH,
> > Cyril
> >
> >
> > On 31/08/2013 09:26, Silvia Pfeiffer wrote:
> >
> >> Hi all,
> >>
> >> Recent changes to the TextTrackCue interface have led to a fork with
> >> the WHATWG spec [1] when resolving bug 21851 [2].
> >>
> >> This caused extensive discussion on blink-dev [3] when an intent to
> >> implement was proposed.
> >>
> >> In the W3C WG we recognize the need for a generic cue interface type
> >> with a constructor and a text attribute. It allows browsers to expose
> >> cues in text tracks of video or audio files for which browsers don't
> >> intend to implement parsers. It also allows JavaScript developers to
> >> create time-synchronized data for media elements in any format they
> >> require.
> >>
> >> The discussion on blink-dev exposed that the currently specified
> >> solution of bug 21851 [2] in the HTML5 spec is flawed in several ways:
> >>
> >> (1) TextTrackCue objects that are not fully abstract create hard to
> >> debug issues of backwards compatibility due to existing code that
> >> assumes "new TextTrackCue()" constructs a cue with VTT semantics;
> >> (2) in order to transition old TextTrackCue interface usage to "new
> >> VTTCue()", it is better to remove the existing TextTrackCue
> >> constructor causing hard failure (easily recognizable) instead of soft
> >> failure (more difficult to recognize);
> >> (3) the abstract TextTrackCue interface of the WHATWG is desirable for
> >> extensibility to non-text-based cue interfaces of the future;
> >> (4) the interface fork between the WHATWG and W3C spec should be removed.
> >>
> >> An alternative resolution to bug 21851 [2] has previously been
> >> proposed and discussed: create a new interface that has the text
> >> attribute and the constructor and inherits from the abstract
> >> interface.
> >>
> >> This will result in the following interfaces:
> >>
> >> interface TextTrackCue : EventTarget {
> >>    readonly attribute TextTrack? track;
> >>
> >>             attribute DOMString id;
> >>             attribute double startTime;
> >>             attribute double endTime;
> >>             attribute boolean pauseOnExit;
> >>
> >>             attribute EventHandler onenter;
> >>             attribute EventHandler onexit;
> >> };
> >>
> >> [Constructor(double startTime, double endTime, DOMString text)]
> >> interface GenericCue : TextTrackCue {
> >>             attribute DOMString text;
> >> };
> >>
> >> Whether VTTCue will inherit from GenericCue or from TextTrackCue will
> >> be resolved in the TextTrack CG once this change has been applied to
> >> the HTML5 spec.
> >>
> >> It is my understanding that this proposed change resolves all the
> >> above listed issues. I will therefore apply these changes next week
> >> unless there are any further concerns.
> >>
> >> Regards,
> >> Silvia (as HTML spec editor).
> >>
> >> [1] https://www.w3.org/Bugs/Public/show_bug.cgi?id=22903
> >> [2] https://www.w3.org/Bugs/Public/show_bug.cgi?id=21851
> >> [3] https://groups.google.com/a/chromium.org/d/msg/blink-dev/-VHGnuNNUxM/Yibbv2TgDoYJ
> >>
> >
> >
> > --
> > Cyril Concolato
> > Maître de Conférences/Associate Professor
> > Groupe Multimedia/Multimedia Group
> > Telecom ParisTech
> > 46 rue Barrault
> > 75 013 Paris, France
> > http://concolato.wp.mines-telecom.fr/
> >
> >
>
>
Received on Thursday, 5 September 2013 16:44:13 UTC