- From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
- Date: Wed, 12 Jun 2013 23:19:04 +1000
- To: public-html <public-html@w3.org>
Please note that this was written also as a consequence to the concerns raised in https://www.w3.org/Bugs/Public/show_bug.cgi?id=21851 and as an extension to the proposals of https://www.w3.org/Bugs/Public/show_bug.cgi?id=21851#c21 . Regards, Silvia. On Wed, Jun 12, 2013 at 3:11 PM, Silvia Pfeiffer <silviapfeiffer1@gmail.com> wrote: > Hi all, > > The model in which we have looked at text tracks (<track> element of > media elements) thus far has some issues that I would like to point > out in this email and I would like to suggest a new way to look at > tracks. This will result in changes to the HTML and WebVTT specs and > has an influence on others specifying text track cue formats, so I am > sharing this information widely. > > Current situation > ============= > Text tracks provide lists of timed cues for media elements, i.e. they > have a start time, an end time, and some content that is to be > interpreted in sync with the media element's timeline. > > WebVTT is the file format that we chose to define as a serialisation > for the cues (just like audio files serialize audio samples/frames and > video files serialize video frames). > > The means in which we currently parse WebVTT files into JS objects has > us create objects of type WebVTTCue. These objects contain information > about any kind of cue that could be included in a WebVTT file - > captions, subtitles, descriptions, chapters, metadata and whatnot. > > The WebVTTCue object looks like this: > > enum AutoKeyword { "auto" }; > [Constructor(double startTime, double endTime, DOMString text)] > interface WebVTTCue : TextTrackCue { > attribute DOMString vertical; > attribute boolean snapToLines; > attribute (long or AutoKeyword) line; > attribute long position; > attribute long size; > attribute DOMString align; > attribute DOMString text; > DocumentFragment getCueAsHTML(); > }; > > There are attributes in the WebVTTCue object that relate only to cues > of kind captions and subtitles (vertical, snapToLines etc). For cues > of other kinds, the only relevant attribute right now is the text > attribute. > > This works for now, because cues of kind descriptions and chapters are > only regarded as plain text, and the structure of the content of cues > of kind metadata is not parsed by the browser. So, for cues of kind > descriptions, chapters and metadata, that .text attribute is > sufficient. > > > The consequence > =============== > As we continue to evolve the functionality of text tracks, we will > introduce more complex other structured content into cues and we will > want browsers to parse and interpret them. > > For example, I expect that once we have support for speech synthesis > in browsers [1], cues of kind descriptions will be voiced by speech > synthesis, and eventually we want to influence that speech synthesis > with markup (possibly a subpart of SSML [2] or some other simpler > markup that influences prosody). > > Since we have set ourselves up for parsing all cue content that comes > out of WebVTT files into WebVTTCue objects, we now have to expand the > WebVTTCue object with attributes for speech synthesis, e.g. I can > imagine cue settings for descriptions to contain a field called > "channelMask" to contain which audio channels a particular cue should > be rendered into with values being center, left, right. > > Another example is that eventually somebody may want to introduce > ThumbnailCues that contain data URLs for images and may have a > "transparency" cue setting. Or somebody wants to formalize > MidrollAdCues that contain data URLs for short video ads and may have > a "skippableAfterSecs" cue setting. > > All of these new cue settings would end up as new attributes on the > WebVTTCue object. This is a dangerous design path that we have taken. > > [1] https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#tts-section > [2] http://www.w3.org/TR/speech-synthesis/#S3.2 > > > Problem analysis > ================ > What we have done by restricting ourselves to a single WebVTTCue > object to represent all types of cues that come from a WebVTT file is > to ignore that WebVTT is just a serialisation format for cues, but > that cues are the ones that provide the different types of timed > content to the browser. The browser should not have to care about the > serialisation format. But it should care about the different types of > content that a track cue could contain. > > For example, it is possible that a WebVTT caption cue (one with all > the markup and cue settings) can be provided to the browser through a > WebM file or through a MPEG file or in fact (gasp!) through a TTML > file. Such a cue should always end up in a WebVTTCue object (will need > a better name) and not in an object that is specific to the > serialisation format. > > What we have done with WebVTT is actually two-fold: > 1. we have created a file format that serializes arbitrary content > that is time-synchronized with a media element. > 2. and we have created a simple caption/subtitle cue format. > > That both are called "WebVTT" is the cause of a lot of confusion and > not a good design approach. > > > The solution > =========== > We thus need to distinguish between cue formats in the browser and not > between serialisation formats (we don't distinguish between different > image formats or audio formats in the browser either - we just handle > audio samples or image pixels). > > Once a WebVTT file is parsed into a list of cues, the browser should > not have to care any more that the list of cues came from a WebVTT > file or anywhere else. It's a list of cues with a certain type of > content that has a parsing and a rendering algorithm attached. > > > Spec consequences > ================== > What needs to change in the specs to deal with this different approach > to text tracks is not hard to deduct. > > > Firstly, there are consequences on the WebVTT spec. > > I suggest we rename WebVTTCue [1] to VTTCaptionCue and allow such cues > only on tracks of kind={caption, subtitle}. > Also, we separate out the WebVTT serialisation format syntax > specification from the cue syntax specification [2] and introduce > separate parsers [3] for the different cue syntax formats. > The rendering section [4] has already started distinguishing between > cue rendering for chapters and for captions/subtitles. This will > easily fit with the now separated cue syntax formats. > > We will then introduce a ChapterCue which adds a .text attribute and a > constructor onto AbstractCue for cues (in WebVTT or from elsewhere) > that are interpreted as chapters and have their own rendering > algorithm. > Similarly, we introduce a DescriptionCue which adds a .text attribute > and a constructor onto AbstractCue and we define a rendering algorithm > that makes use of the new speech synthesis API [5]. > Similarly, we introduce a MetadataCue which adds a .content attribute > and a constructor onto AbstractCue with no rendering algorithm. > I think these new cue objects would even make more sense being defined > in HTML including their rendering algorithms rather than in the WebVTT > spec, because they are generic and we don't want chapters to be > rendered differently just because they have originated from a > different serialisation format. > > [1] http://dev.w3.org/html5/webvtt/#webvtt-api > [2] http://dev.w3.org/html5/webvtt/#syntax > [3] http://dev.w3.org/html5/webvtt/#parsing > [4] http://dev.w3.org/html5/webvtt/#rendering > [5] https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#tts-section > > > > Secondly, there are consequences for the TextTrackCue object hierarchy > in the HTML spec. > > I suggest we rename TextTrackCue [6] to AbstractCue (or just Cue). It > is simply the abstract result of parsing a serialisation of cues (e.g. > a WebVTT file) into its individual cues. > > Similarly TextTrackCueList [7] should be renamed to CueList and should > be a cue list of only one particular type of cue. Thus, the parsing > and rendering algorithm in use for all cues in a CueList is fixed. > Also, a CueList of e.g. ChapterCues should only be allowed to be > attached to a track of kind=chapters, etc. > > [6] http://www.w3.org/html/wg/drafts/html/master/single-page.html#texttrackcue > [7] http://www.w3.org/html/wg/drafts/html/master/single-page.html#texttrackcuelist > > Doing this will make WebVTT and the TextTrack API extensible for new > cue formats, such as cues in SSML format, or ThumbnailCues, or > MidrollAdCues or whatnot else we may see necessary in the future. > > This may look like a lot of changes, but it's really just some > renaming and an introduction of a small number of semantically clean > new objects. I'm happy to prepare the patches for the WebVTT and > HTML5.1 specs if this is agreeable. > > Feedback welcome. > > Regards, > Silvia.
Received on Wednesday, 12 June 2013 13:19:52 UTC