Re: A new proposal for how to deal with text track cues

Please note that this was also written in response to the
concerns raised in
https://www.w3.org/Bugs/Public/show_bug.cgi?id=21851 and as an
extension to the proposals of
https://www.w3.org/Bugs/Public/show_bug.cgi?id=21851#c21 .

Regards,
Silvia.

On Wed, Jun 12, 2013 at 3:11 PM, Silvia Pfeiffer
<silviapfeiffer1@gmail.com> wrote:
> Hi all,
>
> The model through which we have looked at text tracks (the <track>
> element of media elements) thus far has some issues that I would like
> to point out in this email, and I would like to suggest a new way to
> look at tracks. This will result in changes to the HTML and WebVTT
> specs and has an influence on others specifying text track cue
> formats, so I am sharing this information widely.
>
> Current situation
> =================
> Text tracks provide lists of timed cues for media elements, i.e. each
> cue has a start time, an end time, and some content that is to be
> interpreted in sync with the media element's timeline.
>
> WebVTT is the file format that we chose to define as a serialisation
> for the cues (just like audio files serialize audio samples/frames and
> video files serialize video frames).
>
> The way in which we currently parse WebVTT files into JS objects has
> us create objects of type WebVTTCue. These objects contain information
> about any kind of cue that could be included in a WebVTT file -
> captions, subtitles, descriptions, chapters, metadata and whatnot.
>
> The WebVTTCue object looks like this:
>
> enum AutoKeyword { "auto" };
> [Constructor(double startTime, double endTime, DOMString text)]
> interface WebVTTCue : TextTrackCue {
>            attribute DOMString vertical;
>            attribute boolean snapToLines;
>            attribute (long or AutoKeyword) line;
>            attribute long position;
>            attribute long size;
>            attribute DOMString align;
>            attribute DOMString text;
>   DocumentFragment getCueAsHTML();
> };
>
> There are attributes in the WebVTTCue object that relate only to cues
> of kind captions and subtitles (vertical, snapToLines etc). For cues
> of other kinds, the only relevant attribute right now is the text
> attribute.
>
> This works for now, because cues of kind descriptions and chapters are
> only regarded as plain text, and the structure of the content of cues
> of kind metadata is not parsed by the browser. So, for cues of kind
> descriptions, chapters and metadata, that .text attribute is
> sufficient.
>
>
> The consequence
> ===============
> As we continue to evolve the functionality of text tracks, we will
> introduce other, more complex structured content into cues and we
> will want browsers to parse and interpret it.
>
> For example, I expect that once we have support for speech synthesis
> in browsers [1], cues of kind descriptions will be voiced by speech
> synthesis, and eventually we want to influence that speech synthesis
> with markup (possibly a subpart of SSML [2] or some other simpler
> markup that influences prosody).
>
> Since we have set ourselves up for parsing all cue content that comes
> out of WebVTT files into WebVTTCue objects, we now have to expand the
> WebVTTCue object with attributes for speech synthesis. E.g. I can
> imagine cue settings for descriptions including a field called
> "channelMask" that specifies which audio channels a particular cue
> should be rendered into, with values center, left and right.
>
> Another example is that eventually somebody may want to introduce
> ThumbnailCues that contain data URLs for images and may have a
> "transparency" cue setting. Or somebody wants to formalize
> MidrollAdCues that contain data URLs for short video ads and may have
> a "skippableAfterSecs" cue setting.
>
> All of these new cue settings would end up as new attributes on the
> WebVTTCue object. This is a dangerous design path that we have taken.
>
> [1] https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#tts-section
> [2] http://www.w3.org/TR/speech-synthesis/#S3.2
>
>
> Problem analysis
> ================
> By restricting ourselves to a single WebVTTCue object to represent
> all types of cues that come from a WebVTT file, we have ignored that
> WebVTT is just a serialisation format for cues, and that it is the
> cues that provide the different types of timed content to the
> browser. The browser should not have to care about the serialisation
> format. But it should care about the different types of content that
> a track cue could contain.
>
> For example, it is possible that a WebVTT caption cue (one with all
> the markup and cue settings) can be provided to the browser through a
> WebM file or through an MPEG file or in fact (gasp!) through a TTML
> file. Such a cue should always end up in a WebVTTCue object (will need
> a better name) and not in an object that is specific to the
> serialisation format.
>
> What we have done with WebVTT is actually two-fold:
> 1. we have created a file format that serializes arbitrary content
> that is time-synchronized with a media element.
> 2. and we have created a simple caption/subtitle cue format.
>
> That both are called "WebVTT" is the cause of a lot of confusion and
> not a good design approach.
>
>
> The solution
> ============
> We thus need to distinguish between cue formats in the browser and not
> between serialisation formats (we don't distinguish between different
> image formats or audio formats in the browser either - we just handle
> audio samples or image pixels).
>
> Once a WebVTT file is parsed into a list of cues, the browser should
> not have to care any more that the list of cues came from a WebVTT
> file or anywhere else. It's a list of cues with a certain type of
> content that has a parsing and a rendering algorithm attached.
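
To make this concrete, here is a rough TypeScript sketch of rendering
that depends only on the cue type and never on where the cue came
from. The "type" tag, the render() function and its return strings are
all made up for illustration; nothing here is spec text.

```typescript
// Hypothetical cue shapes: the "type" tag and field names are assumptions.
interface CaptionCue { type: "caption"; startTime: number; endTime: number; text: string; }
interface ChapterCue { type: "chapter"; startTime: number; endTime: number; text: string; }
interface MetadataCue { type: "metadata"; startTime: number; endTime: number; content: string; }
type AnyCue = CaptionCue | ChapterCue | MetadataCue;

// Pick the rendering behaviour from the cue type alone. The function
// receives no information about the serialisation format.
function render(cue: AnyCue): string | null {
  switch (cue.type) {
    case "caption":
      return "overlay: " + cue.text;
    case "chapter":
      return "menu entry: " + cue.text;
    case "metadata":
      return null; // metadata cues have no rendering algorithm
  }
}

// The same caption cue, whether it was parsed out of a WebVTT file or
// demuxed from a WebM file, is rendered identically.
const fromWebVTTFile: CaptionCue = { type: "caption", startTime: 0, endTime: 4, text: "Hello" };
const fromWebMFile: CaptionCue = { type: "caption", startTime: 0, endTime: 4, text: "Hello" };
const sameRendering = render(fromWebVTTFile) === render(fromWebMFile);
```

The point of the sketch is only that the dispatch key is the cue
type; how a real browser would tag or store cues is left open.
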
>
>
> Spec consequences
> =================
> What needs to change in the specs to deal with this different approach
> to text tracks is not hard to deduce.
>
>
> Firstly, there are consequences on the WebVTT spec.
>
> I suggest we rename WebVTTCue [1] to VTTCaptionCue and allow such cues
> only on tracks of kind={captions, subtitles}.
> Also, we separate out the WebVTT serialisation format syntax
> specification from the cue syntax specification [2] and introduce
> separate parsers [3] for the different cue syntax formats.
> The rendering section [4] has already started distinguishing between
> cue rendering for chapters and for captions/subtitles. This will
> easily fit with the now separated cue syntax formats.
>
> We will then introduce a ChapterCue which adds a .text attribute and a
> constructor onto AbstractCue for cues (in WebVTT or from elsewhere)
> that are interpreted as chapters and have their own rendering
> algorithm.
> Similarly, we introduce a DescriptionCue which adds a .text attribute
> and a constructor onto AbstractCue and we define a rendering algorithm
> that makes use of the new speech synthesis API [5].
> Similarly, we introduce a MetadataCue which adds a .content attribute
> and a constructor onto AbstractCue with no rendering algorithm.
> I think these new cue objects, including their rendering algorithms,
> would make even more sense defined in HTML rather than in the WebVTT
> spec, because they are generic and we don't want chapters to be
> rendered differently just because they have originated from a
> different serialisation format.
>
> [1] http://dev.w3.org/html5/webvtt/#webvtt-api
> [2] http://dev.w3.org/html5/webvtt/#syntax
> [3] http://dev.w3.org/html5/webvtt/#parsing
> [4] http://dev.w3.org/html5/webvtt/#rendering
> [5] https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#tts-section
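
As a rough illustration of the proposed object hierarchy, here is a
minimal TypeScript sketch. The class names follow the proposal above;
the constructor shapes and default values are assumptions, and only a
few of the caption layout attributes are shown.

```typescript
// Hypothetical sketch of the proposed hierarchy. Class names follow the
// proposal; constructor shapes and field defaults are assumptions.
class AbstractCue {
  startTime: number;
  endTime: number;
  constructor(startTime: number, endTime: number) {
    this.startTime = startTime;
    this.endTime = endTime;
  }
}

// Caption/subtitle cue: the only cue type that carries layout settings.
class VTTCaptionCue extends AbstractCue {
  text: string;
  vertical = "";
  snapToLines = true;
  line: number | "auto" = "auto";
  constructor(startTime: number, endTime: number, text: string) {
    super(startTime, endTime);
    this.text = text;
  }
}

// Chapter cue: plain text, rendered by a chapter-specific algorithm.
class ChapterCue extends AbstractCue {
  text: string;
  constructor(startTime: number, endTime: number, text: string) {
    super(startTime, endTime);
    this.text = text;
  }
}

// Description cue: plain text, intended for speech synthesis.
class DescriptionCue extends AbstractCue {
  text: string;
  constructor(startTime: number, endTime: number, text: string) {
    super(startTime, endTime);
    this.text = text;
  }
}

// Metadata cue: opaque content, no rendering algorithm.
class MetadataCue extends AbstractCue {
  content: string;
  constructor(startTime: number, endTime: number, content: string) {
    super(startTime, endTime);
    this.content = content;
  }
}

const caption = new VTTCaptionCue(0, 4, "Hello");
const chapter = new ChapterCue(0, 30, "Introduction");
```

In this shape, a new cue setting such as the hypothetical
"channelMask" would be added to DescriptionCue only and would not
appear on caption, chapter or metadata cues.
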
>
>
>
> Secondly, there are consequences for the TextTrackCue object hierarchy
> in the HTML spec.
>
> I suggest we rename TextTrackCue [6] to AbstractCue (or just Cue). It
> is simply the abstract result of parsing a serialisation of cues (e.g.
> a WebVTT file) into its individual cues.
>
> Similarly TextTrackCueList [7] should be renamed to CueList and should
> be a cue list of only one particular type of cue. Thus, the parsing
> and rendering algorithm in use for all cues in a CueList is fixed.
> Also, a CueList of e.g. ChapterCues should only be allowed to be
> attached to a track of kind=chapters, etc.
>
> [6] http://www.w3.org/html/wg/drafts/html/master/single-page.html#texttrackcue
> [7] http://www.w3.org/html/wg/drafts/html/master/single-page.html#texttrackcuelist
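
The CueList constraint could look roughly like the following
TypeScript sketch: a list is bound to one cue class and to the track
kinds it may be attached to. The names CueClass, allowedKinds, size()
and canAttach() are made up for this example and are not proposed API.

```typescript
type TrackKind = "subtitles" | "captions" | "descriptions" | "chapters" | "metadata";

class Cue {
  startTime: number;
  endTime: number;
  constructor(startTime: number, endTime: number) {
    this.startTime = startTime;
    this.endTime = endTime;
  }
}

class ChapterCue extends Cue {
  text: string;
  constructor(startTime: number, endTime: number, text: string) {
    super(startTime, endTime);
    this.text = text;
  }
}

class MetadataCue extends Cue {
  content: string;
  constructor(startTime: number, endTime: number, content: string) {
    super(startTime, endTime);
    this.content = content;
  }
}

// Hypothetical helper type: any constructor producing cues of type C.
type CueClass<C extends Cue> = new (...args: any[]) => C;

// A list that holds cues of exactly one cue class.
class CueList<C extends Cue> {
  private cues: C[] = [];
  readonly cueType: CueClass<C>;
  readonly allowedKinds: readonly TrackKind[];

  constructor(cueType: CueClass<C>, allowedKinds: readonly TrackKind[]) {
    this.cueType = cueType;
    this.allowedKinds = allowedKinds;
  }

  // Reject cues of another class at run time, then keep start-time order.
  add(cue: C): void {
    if (!(cue instanceof this.cueType)) {
      throw new TypeError("cue type does not match this list");
    }
    this.cues.push(cue);
    this.cues.sort((a, b) => a.startTime - b.startTime);
  }

  size(): number {
    return this.cues.length;
  }
}

// A list may only be attached to a track whose kind it allows.
function canAttach(kind: TrackKind, list: CueList<Cue>): boolean {
  return list.allowedKinds.indexOf(kind) !== -1;
}

const chapters = new CueList(ChapterCue, ["chapters"]);
chapters.add(new ChapterCue(30, 60, "Part two"));
chapters.add(new ChapterCue(0, 30, "Part one"));
```

With this, canAttach("chapters", chapters) holds while
canAttach("captions", chapters) does not, and adding a MetadataCue to
the chapter list is refused.
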
>
> Doing this will make WebVTT and the TextTrack API extensible for new
> cue formats, such as cues in SSML format, or ThumbnailCues, or
> MidrollAdCues or whatever else we may find necessary in the future.
>
> This may look like a lot of changes, but it's really just some
> renaming and an introduction of a small number of semantically clean
> new objects. I'm happy to prepare the patches for the WebVTT and
> HTML5.1 specs if this is agreeable.
>
> Feedback welcome.
>
> Regards,
> Silvia.

Received on Wednesday, 12 June 2013 13:19:52 UTC