Re: A new proposal for how to deal with text track cues from Goldstein, Glenn on 2013-06-12 (public-texttracks@w3.org from June 2013)

From: Goldstein, Glenn <glenn.goldstein@viacom.com>
Date: Wed, 12 Jun 2013 13:11:09 +0000
To: Silvia Pfeiffer <silviapfeiffer1@gmail.com>, "public-texttracks@w3.org" <public-texttracks@w3.org>
Message-ID: <CDDDEB46.7B02%Glenn.Goldstein@viacom.com>
makes sense silvia.

breaking the tight coupling between TextTrack Cues and the specifics of
WebVTT is the way to go.

thanks,
glenn

glenn goldstein
vice president, media technology strategy
Tel: (212) 846-3210
Email: glenn.goldstein@viacom.com






On 6/12/13 1:12 AM, "Silvia Pfeiffer" <silviapfeiffer1@gmail.com> wrote:

>[Note: this was sent to the HTML WG and WHATWG, too - sorry if you
>receive it multiple times]
>
>Hi all,
>
>The model in which we have looked at text tracks (<track> element of
>media elements) thus far has some issues that I would like to point
>out in this email and I would like to suggest a new way to look at
>tracks. This will result in changes to the HTML and WebVTT specs and
>has an influence on others specifying text track cue formats, so I am
>sharing this information widely.
>
>Current situation
>=================
>Text tracks provide lists of timed cues for media elements, i.e. each
>cue has a start time, an end time, and some content that is to be
>interpreted in sync with the media element's timeline.
>
>WebVTT is the file format that we chose to define as a serialisation
>for the cues (just like audio files serialize audio samples/frames and
>video files serialize video frames).
>
>The way in which we currently parse WebVTT files into JS objects has
>us create objects of type WebVTTCue. These objects contain information
>about any kind of cue that could be included in a WebVTT file -
>captions, subtitles, descriptions, chapters, metadata and whatnot.
>
>The WebVTTCue object looks like this:
>
>enum AutoKeyword { "auto" };
>[Constructor(double startTime, double endTime, DOMString text)]
>interface WebVTTCue : TextTrackCue {
>           attribute DOMString vertical;
>           attribute boolean snapToLines;
>           attribute (long or AutoKeyword) line;
>           attribute long position;
>           attribute long size;
>           attribute DOMString align;
>           attribute DOMString text;
>  DocumentFragment getCueAsHTML();
>};
>
>There are attributes in the WebVTTCue object that relate only to cues
>of kind captions and subtitles (vertical, snapToLines etc). For cues
>of other kinds, the only relevant attribute right now is the text
>attribute.
>
>This works for now, because cues of kind descriptions and chapters are
>only regarded as plain text, and the structure of the content of cues
>of kind metadata is not parsed by the browser. So, for cues of kind
>descriptions, chapters and metadata, that .text attribute is
>sufficient.
>
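A minimal TypeScript sketch of the situation described above, assuming
the attribute list from the quoted IDL plus the start and end times that
every cue carries. The names CueKind, WebVTTCueShape, LAYOUT_ATTRIBUTES
and relevantAttributes are hypothetical and exist only for illustration;
they are not part of any spec. It shows that only caption and subtitle
cues make use of the layout attributes, while every other kind relies on
timing and text alone.

```typescript
// Illustrative only: CueKind, WebVTTCueShape, LAYOUT_ATTRIBUTES and
// relevantAttributes are hypothetical names, not part of any spec.
type CueKind =
  | "captions" | "subtitles" | "descriptions" | "chapters" | "metadata";

// Mirrors the attributes of the WebVTTCue IDL quoted above, plus the
// start and end times that every cue carries.
interface WebVTTCueShape {
  startTime: number;
  endTime: number;
  vertical: string;
  snapToLines: boolean;
  line: number | "auto";
  position: number;
  size: number;
  align: string;
  text: string;
}

type CueAttribute = keyof WebVTTCueShape;

// Attributes that only matter when a cue is laid out as a caption or subtitle.
const LAYOUT_ATTRIBUTES: CueAttribute[] = [
  "vertical", "snapToLines", "line", "position", "size", "align",
];

// Lists the attributes that carry meaning for a cue on a track of the given kind.
function relevantAttributes(kind: CueKind): CueAttribute[] {
  const common: CueAttribute[] = ["startTime", "endTime", "text"];
  if (kind === "captions" || kind === "subtitles") {
    return common.concat(LAYOUT_ATTRIBUTES);
  }
  return common;
}
```

Under these assumptions, relevantAttributes("chapters") yields only the
timing and text attributes, so six of the nine fields on such a cue are
dead weight.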
>
>The consequence
>===============
>As we continue to evolve the functionality of text tracks, we will
>introduce more complex, structured content into cues and we will
>want browsers to parse and interpret it.
>
>For example, I expect that once we have support for speech synthesis
>in browsers [1], cues of kind descriptions will be voiced by speech
>synthesis, and eventually we want to influence that speech synthesis
>with markup (possibly a subpart of SSML [2] or some other simpler
>markup that influences prosody).
>
>Since we have set ourselves up for parsing all cue content that comes
>out of WebVTT files into WebVTTCue objects, we now have to expand the
>WebVTTCue object with attributes for speech synthesis. For example, I
>can imagine cue settings for descriptions containing a field called
>"channelMask" that specifies which audio channels a particular cue
>should be rendered into, with the values center, left, and right.
>
>Another example is that eventually somebody may want to introduce
>ThumbnailCues that contain data URLs for images and may have a
>"transparency" cue setting. Or somebody wants to formalize
>MidrollAdCues that contain data URLs for short video ads and may have
>a "skippableAfterSecs" cue setting.
>
>All of these new cue settings would end up as new attributes on the
>WebVTTCue object. This is a dangerous design path that we have taken.
>
>[1] https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#tts-section
>[2] http://www.w3.org/TR/speech-synthesis/#S3.2
>
>
>Problem analysis
>================
>By restricting ourselves to a single WebVTTCue object to represent all
>types of cues that come from a WebVTT file, we have ignored that WebVTT
>is just a serialisation format for cues, and that it is the cues that
>provide the different types of timed content to the browser. The
>browser should not have to care about the serialisation format. But it
>should care about the different types of content that a track cue
>could contain.
>
>For example, it is possible that a WebVTT caption cue (one with all
>the markup and cue settings) can be provided to the browser through a
>WebM file or through an MPEG file or in fact (gasp!) through a TTML
>file. Such a cue should always end up in a WebVTTCue object (will need
>a better name) and not in an object that is specific to the
>serialisation format.
>
>What we have done with WebVTT is actually two-fold:
>1. we have created a file format that serializes arbitrary content
>that is time-synchronized with a media element.
>2. and we have created a simple caption/subtitle cue format.
>
>That both are called "WebVTT" is the cause of a lot of confusion and
>not a good design approach.
>
>
>The solution
>============
>We thus need to distinguish between cue formats in the browser and not
>between serialisation formats (we don't distinguish between different
>image formats or audio formats in the browser either - we just handle
>audio samples or image pixels).
>
>Once a WebVTT file is parsed into a list of cues, the browser should
>not have to care any more that the list of cues came from a WebVTT
>file or anywhere else. It's a list of cues with a certain type of
>content that has a parsing and a rendering algorithm attached.
>
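A rough TypeScript sketch of that idea, under loud assumptions: both
parsers below are toys (the first accepts only a simplified
"MM:SS.mmm --> MM:SS.mmm" block and is not a WebVTT parser, the second
reads a made-up JSON layout), and ParsedCue, parseVttLikeBlock,
parseJsonCue and renderCue are hypothetical names. The point is only
that the renderer receives a cue and cannot tell which serialisation it
came from.

```typescript
// Hypothetical cue shape shared by every serialisation format.
interface ParsedCue {
  type: "chapter";
  startTime: number; // seconds
  endTime: number;   // seconds
  text: string;
}

// Toy parser for one simplified cue block of the form
//   "MM:SS.mmm --> MM:SS.mmm\ncue text"
// This is NOT a conforming WebVTT parser.
function parseVttLikeBlock(block: string): ParsedCue {
  const match = /^(\d+):(\d+\.\d+) --> (\d+):(\d+\.\d+)\n([\s\S]*)$/.exec(block);
  if (!match) {
    throw new Error("not a cue block");
  }
  return {
    type: "chapter",
    startTime: Number(match[1]) * 60 + Number(match[2]),
    endTime: Number(match[3]) * 60 + Number(match[4]),
    text: match[5] ?? "",
  };
}

// Toy parser for a made-up JSON serialisation of the same cue.
function parseJsonCue(json: string): ParsedCue {
  const raw = JSON.parse(json) as { start: number; end: number; title: string };
  return { type: "chapter", startTime: raw.start, endTime: raw.end, text: raw.title };
}

// The renderer sees only the cue, never the format it was parsed from.
function renderCue(cue: ParsedCue): string {
  return `[${cue.startTime}-${cue.endTime}] ${cue.text}`;
}
```

Feeding the same chapter through either parser gives renderCue an
identical object, which is the property the proposal asks for.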
>
>Spec consequences
>=================
>What needs to change in the specs to deal with this different approach
>to text tracks is not hard to deduce.
>
>
>Firstly, there are consequences on the WebVTT spec.
>
>I suggest we rename WebVTTCue [1] to VTTCaptionCue and allow such cues
>only on tracks of kind={captions, subtitles}.
>Also, we separate out the WebVTT serialisation format syntax
>specification from the cue syntax specification [2] and introduce
>separate parsers [3] for the different cue syntax formats.
>The rendering section [4] has already started distinguishing between
>cue rendering for chapters and for captions/subtitles. This will
>easily fit with the now separated cue syntax formats.
>
>We will then introduce a ChapterCue which adds a .text attribute and a
>constructor onto AbstractCue for cues (in WebVTT or from elsewhere)
>that are interpreted as chapters and have their own rendering
>algorithm.
>Similarly, we introduce a DescriptionCue which adds a .text attribute
>and a constructor onto AbstractCue and we define a rendering algorithm
>that makes use of the new speech synthesis API [5].
>Similarly, we introduce a MetadataCue which adds a .content attribute
>and a constructor onto AbstractCue with no rendering algorithm.
>I think it would make even more sense to define these new cue objects,
>including their rendering algorithms, in HTML rather than in the WebVTT
>spec, because they are generic and we don't want chapters to be
>rendered differently just because they have originated from a
>different serialisation format.
>
>[1] http://dev.w3.org/html5/webvtt/#webvtt-api
>[2] http://dev.w3.org/html5/webvtt/#syntax
>[3] http://dev.w3.org/html5/webvtt/#parsing
>[4] http://dev.w3.org/html5/webvtt/#rendering
>[5] https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#tts-section
>
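One way to picture the proposed cue types is the TypeScript sketch
below. The class names AbstractCue, ChapterCue, DescriptionCue and
MetadataCue come from the proposal; everything else is an assumption
made for illustration, in particular the render() method and the strings
it returns, which merely stand in for the real rendering algorithms
(a chapter list, speech synthesis, and no rendering at all).

```typescript
// Base type: only what every cue shares. render() is a hypothetical
// stand-in for "the rendering algorithm attached to this cue type".
abstract class AbstractCue {
  startTime: number;
  endTime: number;
  constructor(startTime: number, endTime: number) {
    this.startTime = startTime;
    this.endTime = endTime;
  }
  abstract render(): string | null;
}

// Adds a .text attribute; rendered as a chapter entry.
class ChapterCue extends AbstractCue {
  text: string;
  constructor(startTime: number, endTime: number, text: string) {
    super(startTime, endTime);
    this.text = text;
  }
  render(): string {
    return `chapter: ${this.text}`;
  }
}

// Adds a .text attribute; the returned string stands in for handing
// the text to a speech synthesis engine.
class DescriptionCue extends AbstractCue {
  text: string;
  constructor(startTime: number, endTime: number, text: string) {
    super(startTime, endTime);
    this.text = text;
  }
  render(): string {
    return `speak: ${this.text}`;
  }
}

// Adds a .content attribute and has no rendering algorithm.
class MetadataCue extends AbstractCue {
  content: string;
  constructor(startTime: number, endTime: number, content: string) {
    super(startTime, endTime);
    this.content = content;
  }
  render(): null {
    return null;
  }
}
```

With this shape, a new cue format adds a new subclass instead of new
attributes on a shared object.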
>
>
>Secondly, there are consequences for the TextTrackCue object hierarchy
>in the HTML spec.
>
>I suggest we rename TextTrackCue [6] to AbstractCue (or just Cue). It
>is simply the abstract result of parsing a serialisation of cues (e.g.
>a WebVTT file) into its individual cues.
>
>Similarly TextTrackCueList [7] should be renamed to CueList and should
>be a cue list of only one particular type of cue. Thus, the parsing
>and rendering algorithm in use for all cues in a CueList is fixed.
>Also, a CueList of e.g. ChapterCues should only be allowed to be
>attached to a track of kind=chapters, etc.
>
>[6] http://www.w3.org/html/wg/drafts/html/master/single-page.html#texttrackcue
>[7] http://www.w3.org/html/wg/drafts/html/master/single-page.html#texttrackcuelist
>
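The CueList constraint could be enforced roughly as in this TypeScript
sketch. CueList is the name used in the proposal; ListKind, KindedCue,
SketchTrack, and the add() and attach() methods are hypothetical and
assumed here purely for illustration. The sketch checks kinds at run
time and throws a TypeError on a mismatch, which is a design choice of
the sketch and not something the proposal specifies.

```typescript
// Hypothetical names throughout, except CueList.
type ListKind = "chapters" | "descriptions" | "metadata";

interface KindedCue {
  readonly kind: ListKind;
  startTime: number;
  endTime: number;
}

// A list that accepts cues of exactly one kind.
class CueList {
  readonly kind: ListKind;
  private readonly cues: KindedCue[] = [];

  constructor(kind: ListKind) {
    this.kind = kind;
  }

  add(cue: KindedCue): void {
    if (cue.kind !== this.kind) {
      throw new TypeError(`cannot add a ${cue.kind} cue to a ${this.kind} list`);
    }
    this.cues.push(cue);
  }

  get length(): number {
    return this.cues.length;
  }
}

// A track only accepts a cue list whose kind matches its own.
class SketchTrack {
  readonly kind: ListKind;
  cueList: CueList | null = null;

  constructor(kind: ListKind) {
    this.kind = kind;
  }

  attach(list: CueList): void {
    if (list.kind !== this.kind) {
      throw new TypeError(`cannot attach a ${list.kind} list to a ${this.kind} track`);
    }
    this.cueList = list;
  }
}
```

Because every cue in a list shares one kind, the parsing and rendering
algorithm for the whole list is fixed once the list is created.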
>
>Doing this will make WebVTT and the TextTrack API extensible for new
>cue formats, such as cues in SSML format, or ThumbnailCues, or
>MidrollAdCues or whatever else we may find necessary in the future.
>
>This may look like a lot of changes, but it's really just some
>renaming and an introduction of a small number of semantically clean
>new objects. I'm happy to prepare the patches for the WebVTT and
>HTML5.1 specs if this is agreeable.
>
>Feedback welcome.
>
>Regards,
>Silvia.
>
Received on Wednesday, 12 June 2013 13:11:47 UTC