Re: A new proposal for how to deal with text track cues from Silvia Pfeiffer on 2013-06-14 (public-html@w3.org from June 2013)

From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Date: Fri, 14 Jun 2013 17:19:05 +1000
To: Pierre-Anthony Lemieux <pal@sandflow.com>
Cc: public-html <public-html@w3.org>
Message-ID: <CAHp8n2=3J_=mKpTxUmL-X_=sY=SnLh48QWtj1xZGHUma4Ekeaw@mail.gmail.com>

Ah yes, you are correct. And indeed, these two sections need adjusting.
Can you register a bug so we don't forget?

Thanks,
Silvia.

On Fri, Jun 14, 2013 at 2:56 PM, Pierre-Anthony Lemieux
<pal@sandflow.com> wrote:
>> You may be looking at HTML5.0. HTML5.1 doesn't contain these any more.
>
> I pulled these two paragraphs from [1], which is HTML 5.1 nightly, right?
>
> [1] http://www.w3.org/html/wg/drafts/html/master/single-page.html
>
> -- Pierre
>
> On Thu, Jun 13, 2013 at 9:54 PM, Silvia Pfeiffer
> <silviapfeiffer1@gmail.com> wrote:
>> On Fri, Jun 14, 2013 at 1:50 AM, Pierre-Anthony Lemieux
>> <pal@sandflow.com> wrote:
>>> Hi Silvia,
>>>
>>> I like the idea of making the HTML cue interface independent from the
>>> underlying serialization format, and move discussions on the latter to
>>> the TTWG, as suggested by others.
>>
>> So you agree that this group should rename TextTrackCue to AbstractCue
>> (or just Cue) and TextTrackCueList to CueList?
>>
>>
>>> In fact, along the same lines, I would move paragraphs [a] and [b]
>>> (see below) of Section 4.8.9 to the WebVTT specification. I think this
>>> would remove the last normative provisions tied to a specific
>>> serialization format.
>>
>> You may be looking at HTML5.0. HTML5.1 doesn't contain these any more.
>>
>> I would indeed suggest that we adjust HTML5.0 to contain the same text
>> as HTML5.1 for tracks.
>>
>>> Hope it makes sense.
>>
>> Indeed.
>> Thanks,
>> Silvia.
>>
>>
>>> Best,
>>>
>>> -- Pierre
>>>
>>> [a] If the element's track URL identifies a WebVTT resource, and the
>>> element's kind attribute is not in the metadata state, then the WebVTT
>>> file must be a WebVTT file using cue text. [WEBVTT]
>>>
>>> [b] Furthermore, if the element's track URL identifies a WebVTT
>>> resource, and the element's kind attribute is in the chapters state,
>>> then the WebVTT file must be both a WebVTT file using chapter title
>>> text and a WebVTT file using only nested cues. [WEBVTT]
>>>
>>> On Tue, Jun 11, 2013 at 10:11 PM, Silvia Pfeiffer
>>> <silviapfeiffer1@gmail.com> wrote:
>>>> Hi all,
>>>>
>>>> The model in which we have looked at text tracks (<track> element of
>>>> media elements) thus far has some issues that I would like to point
>>>> out in this email and I would like to suggest a new way to look at
>>>> tracks. This will result in changes to the HTML and WebVTT specs and
>>>> has an influence on others specifying text track cue formats, so I am
>>>> sharing this information widely.
>>>>
>>>> Current situation
>>>> =============
>>>> Text tracks provide lists of timed cues for media elements, i.e. each
>>>> cue has a start time, an end time, and some content that is to be
>>>> interpreted in sync with the media element's timeline.
>>>>
>>>> WebVTT is the file format that we chose to define as a serialisation
>>>> for the cues (just like audio files serialize audio samples/frames and
>>>> video files serialize video frames).
>>>>
>>>> The way in which we currently parse WebVTT files into JS objects has
>>>> us create objects of type WebVTTCue. These objects contain information
>>>> about any kind of cue that could be included in a WebVTT file -
>>>> captions, subtitles, descriptions, chapters, metadata and whatnot.
>>>>
>>>> The WebVTTCue object looks like this:
>>>>
>>>> enum AutoKeyword { "auto" };
>>>> [Constructor(double startTime, double endTime, DOMString text)]
>>>> interface WebVTTCue : TextTrackCue {
>>>>            attribute DOMString vertical;
>>>>            attribute boolean snapToLines;
>>>>            attribute (long or AutoKeyword) line;
>>>>            attribute long position;
>>>>            attribute long size;
>>>>            attribute DOMString align;
>>>>            attribute DOMString text;
>>>>   DocumentFragment getCueAsHTML();
>>>> };
>>>>
>>>> There are attributes in the WebVTTCue object that relate only to cues
>>>> of kind captions and subtitles (vertical, snapToLines etc). For cues
>>>> of other kinds, the only relevant attribute right now is the text
>>>> attribute.
>>>>
>>>> This works for now, because cues of kind descriptions and chapters are
>>>> only regarded as plain text, and the structure of the content of cues
>>>> of kind metadata is not parsed by the browser. So, for cues of kind
>>>> descriptions, chapters and metadata, that .text attribute is
>>>> sufficient.
>>>>
>>>>
>>>> The consequence
>>>> ===============
>>>> As we continue to evolve the functionality of text tracks, we will
>>>> introduce other, more complex structured content into cues and we will
>>>> want browsers to parse and interpret it.
>>>>
>>>> For example, I expect that once we have support for speech synthesis
>>>> in browsers [1], cues of kind descriptions will be voiced by speech
>>>> synthesis, and eventually we want to influence that speech synthesis
>>>> with markup (possibly a subpart of SSML [2] or some other simpler
>>>> markup that influences prosody).
>>>>
>>>> Since we have set ourselves up for parsing all cue content that comes
>>>> out of WebVTT files into WebVTTCue objects, we now have to expand the
>>>> WebVTTCue object with attributes for speech synthesis. For example, I
>>>> can imagine cue settings for descriptions containing a field called
>>>> "channelMask" that specifies which audio channels a particular cue
>>>> should be rendered into, with values being center, left, right.
>>>>
>>>> Another example is that eventually somebody may want to introduce
>>>> ThumbnailCues that contain data URLs for images and may have a
>>>> "transparency" cue setting. Or somebody wants to formalize
>>>> MidrollAdCues that contain data URLs for short video ads and may have
>>>> a "skippableAfterSecs" cue setting.
>>>>
>>>> All of these new cue settings would end up as new attributes on the
>>>> WebVTTCue object. This is a dangerous design path that we have taken.
>>>>
>>>> [1] https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#tts-section
>>>> [2] http://www.w3.org/TR/speech-synthesis/#S3.2
>>>>
>>>>
>>>> Problem analysis
>>>> ================
>>>> By restricting ourselves to a single WebVTTCue object to represent all
>>>> types of cues that come from a WebVTT file, we have ignored that WebVTT
>>>> is just a serialisation format for cues, and that it is the cues that
>>>> provide the different types of timed content to the browser. The
>>>> browser should not have to care about the serialisation format. But it
>>>> should care about the different types of content that a track cue
>>>> could contain.
>>>>
>>>> For example, it is possible that a WebVTT caption cue (one with all
>>>> the markup and cue settings) can be provided to the browser through a
>>>> WebM file or through an MPEG file or in fact (gasp!) through a TTML
>>>> file. Such a cue should always end up in a WebVTTCue object (will need
>>>> a better name) and not in an object that is specific to the
>>>> serialisation format.
>>>>
>>>> What we have done with WebVTT is actually two-fold:
>>>> 1. we have created a file format that serializes arbitrary content
>>>> that is time-synchronized with a media element.
>>>> 2. and we have created a simple caption/subtitle cue format.
>>>>
>>>> That both are called "WebVTT" is the cause of a lot of confusion and
>>>> not a good design approach.
>>>>
>>>>
>>>> The solution
>>>> ===========
>>>> We thus need to distinguish between cue formats in the browser and not
>>>> between serialisation formats (we don't distinguish between different
>>>> image formats or audio formats in the browser either - we just handle
>>>> audio samples or image pixels).
>>>>
>>>> Once a WebVTT file is parsed into a list of cues, the browser should
>>>> not have to care any more that the list of cues came from a WebVTT
>>>> file or anywhere else. It's a list of cues with a certain type of
>>>> content that has a parsing and a rendering algorithm attached.
>>>>
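To make the point above concrete, here is a minimal TypeScript sketch in which rendering is chosen by cue kind and never by serialisation format. Every name in it (`CueKind`, `ParsedCue`, `renderers`, `renderCue`, the `source` field and the output strings) is an assumption made for illustration; none of it comes from the HTML or WebVTT specs.

```typescript
// Hypothetical names throughout: none of these types are spec'd interfaces.
type CueKind = "captions" | "chapters" | "metadata";

interface ParsedCue {
  kind: CueKind;
  startTime: number;
  endTime: number;
  payload: string;
  source: string; // kept only to show that rendering never consults it
}

// One rendering function per cue kind. Nothing here branches on `source`.
const renderers: Record<CueKind, (cue: ParsedCue) => string> = {
  captions: (cue) => `[caption] ${cue.payload}`,
  chapters: (cue) => `[chapter] ${cue.payload}`,
  metadata: () => "", // metadata is handed to scripts, not rendered
};

function renderCue(cue: ParsedCue): string {
  return renderers[cue.kind](cue);
}

// The same chapter cue, delivered through two different containers.
const fromWebVTT: ParsedCue = {
  kind: "chapters",
  startTime: 0,
  endTime: 60,
  payload: "Introduction",
  source: "webvtt",
};
const fromWebM: ParsedCue = { ...fromWebVTT, source: "webm" };

console.log(renderCue(fromWebVTT) === renderCue(fromWebM));
```

The only place the container format would matter in such a design is the parser that produces `ParsedCue` values; everything downstream sees cues of a given kind.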
>>>>
>>>> Spec consequences
>>>> ==================
>>>> What needs to change in the specs to deal with this different approach
>>>> to text tracks is not hard to deduce.
>>>>
>>>>
>>>> Firstly, there are consequences on the WebVTT spec.
>>>>
>>>> I suggest we rename WebVTTCue [1] to VTTCaptionCue and allow such cues
>>>> only on tracks of kind={captions, subtitles}.
>>>> Also, we separate out the WebVTT serialisation format syntax
>>>> specification from the cue syntax specification [2] and introduce
>>>> separate parsers [3] for the different cue syntax formats.
>>>> The rendering section [4] has already started distinguishing between
>>>> cue rendering for chapters and for captions/subtitles. This will
>>>> easily fit with the now separated cue syntax formats.
>>>>
>>>> We will then introduce a ChapterCue which adds a .text attribute and a
>>>> constructor onto AbstractCue for cues (in WebVTT or from elsewhere)
>>>> that are interpreted as chapters and have their own rendering
>>>> algorithm.
>>>> Similarly, we introduce a DescriptionCue which adds a .text attribute
>>>> and a constructor onto AbstractCue and we define a rendering algorithm
>>>> that makes use of the new speech synthesis API [5].
>>>> Similarly, we introduce a MetadataCue which adds a .content attribute
>>>> and a constructor onto AbstractCue with no rendering algorithm.
>>>> I think these new cue objects would make even more sense if defined
>>>> in HTML including their rendering algorithms rather than in the WebVTT
>>>> spec, because they are generic and we don't want chapters to be
>>>> rendered differently just because they have originated from a
>>>> different serialisation format.
>>>>
>>>> [1] http://dev.w3.org/html5/webvtt/#webvtt-api
>>>> [2] http://dev.w3.org/html5/webvtt/#syntax
>>>> [3] http://dev.w3.org/html5/webvtt/#parsing
>>>> [4] http://dev.w3.org/html5/webvtt/#rendering
>>>> [5] https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#tts-section
>>>>
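The cue hierarchy proposed above could be sketched roughly as follows. This is a TypeScript approximation under stated assumptions, not WebIDL and not spec text: the class names follow the proposal, the VTTCaptionCue attribute names are copied from the WebVTTCue interface quoted earlier, the initial attribute values are placeholders, and the `activeAt` helper is hypothetical.

```typescript
// Hypothetical TypeScript rendering of the proposed hierarchy.
abstract class AbstractCue {
  startTime: number;
  endTime: number;
  constructor(startTime: number, endTime: number) {
    this.startTime = startTime;
    this.endTime = endTime;
  }
}

// Caption/subtitle cues keep the layout-related attributes.
class VTTCaptionCue extends AbstractCue {
  // Initial values are placeholders for this sketch.
  vertical: string = "";
  snapToLines: boolean = true;
  line: number | "auto" = "auto";
  position: number = 50;
  size: number = 100;
  align: string = "middle";
  text: string;
  constructor(startTime: number, endTime: number, text: string) {
    super(startTime, endTime);
    this.text = text;
  }
}

// Chapter and description cues add only .text to the abstract cue.
class ChapterCue extends AbstractCue {
  text: string;
  constructor(startTime: number, endTime: number, text: string) {
    super(startTime, endTime);
    this.text = text;
  }
}

class DescriptionCue extends AbstractCue {
  text: string;
  constructor(startTime: number, endTime: number, text: string) {
    super(startTime, endTime);
    this.text = text;
  }
}

// Metadata cues add .content and have no rendering.
class MetadataCue extends AbstractCue {
  content: string;
  constructor(startTime: number, endTime: number, content: string) {
    super(startTime, endTime);
    this.content = content;
  }
}

// Hypothetical helper: timing logic is written once against AbstractCue.
function activeAt(cues: AbstractCue[], time: number): AbstractCue[] {
  return cues.filter((cue) => cue.startTime <= time && time < cue.endTime);
}
```

With this shape, a new cue format such as a ThumbnailCue would be one more subclass rather than a further set of attributes on a single catch-all object.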
>>>>
>>>>
>>>> Secondly, there are consequences for the TextTrackCue object hierarchy
>>>> in the HTML spec.
>>>>
>>>> I suggest we rename TextTrackCue [6] to AbstractCue (or just Cue). It
>>>> is simply the abstract result of parsing a serialisation of cues (e.g.
>>>> a WebVTT file) into its individual cues.
>>>>
>>>> Similarly TextTrackCueList [7] should be renamed to CueList and should
>>>> be a cue list of only one particular type of cue. Thus, the parsing
>>>> and rendering algorithm in use for all cues in a CueList is fixed.
>>>> Also, a CueList of e.g. ChapterCues should only be allowed to be
>>>> attached to a track of kind=chapters, etc.
>>>>
>>>> [6] http://www.w3.org/html/wg/drafts/html/master/single-page.html#texttrackcue
>>>> [7] http://www.w3.org/html/wg/drafts/html/master/single-page.html#texttrackcuelist
>>>>
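The CueList restriction described above (one cue type per list, and a list attachable only to a track of a matching kind) might look like this in TypeScript. It is only a sketch: `CueConstructor`, `allowedKinds`, `add`, `Track` and `attach` are hypothetical names, and the checks are done at run time because TypeScript's structural typing would not tell two cue classes of identical shape apart.

```typescript
// All names are hypothetical; they follow the renaming proposed above.
type TrackKind = "subtitles" | "captions" | "descriptions" | "chapters" | "metadata";

abstract class AbstractCue {
  startTime: number;
  endTime: number;
  constructor(startTime: number, endTime: number) {
    this.startTime = startTime;
    this.endTime = endTime;
  }
}

class ChapterCue extends AbstractCue {
  text: string;
  constructor(startTime: number, endTime: number, text: string) {
    super(startTime, endTime);
    this.text = text;
  }
}

class MetadataCue extends AbstractCue {
  content: string;
  constructor(startTime: number, endTime: number, content: string) {
    super(startTime, endTime);
    this.content = content;
  }
}

type CueConstructor<C extends AbstractCue> = new (...args: any[]) => C;

// A list that holds cues of exactly one type.
class CueList<C extends AbstractCue> {
  readonly cueType: CueConstructor<C>;
  readonly allowedKinds: readonly TrackKind[];
  private items: C[] = [];

  constructor(cueType: CueConstructor<C>, allowedKinds: readonly TrackKind[]) {
    this.cueType = cueType;
    this.allowedKinds = allowedKinds;
  }

  get length(): number {
    return this.items.length;
  }

  // Rejects any cue that is not an instance of this list's cue type.
  add(cue: AbstractCue): void {
    if (!(cue instanceof this.cueType)) {
      throw new TypeError("cue type does not match this list");
    }
    this.items.push(cue as C);
  }
}

class Track {
  readonly kind: TrackKind;
  cues: CueList<AbstractCue> | null = null;

  constructor(kind: TrackKind) {
    this.kind = kind;
  }

  // Rejects a cue list whose cue type is not meant for this track kind.
  attach(list: CueList<AbstractCue>): void {
    if (list.allowedKinds.indexOf(this.kind) === -1) {
      throw new TypeError("cue list does not match track kind");
    }
    this.cues = list;
  }
}

const chapterList = new CueList(ChapterCue, ["chapters"]);
chapterList.add(new ChapterCue(0, 60, "Introduction"));
new Track("chapters").attach(chapterList);
```

Because every cue in a list has the same type, a single parsing and rendering algorithm can be associated with the list as a whole.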
>>>> Doing this will make WebVTT and the TextTrack API extensible for new
>>>> cue formats, such as cues in SSML format, or ThumbnailCues, or
>>>> MidrollAdCues or whatever else we may deem necessary in the future.
>>>>
>>>> This may look like a lot of changes, but it's really just some
>>>> renaming and an introduction of a small number of semantically clean
>>>> new objects. I'm happy to prepare the patches for the WebVTT and
>>>> HTML5.1 specs if this is agreeable.
>>>>
>>>> Feedback welcome.
>>>>
>>>> Regards,
>>>> Silvia.
>>>>
Received on Friday, 14 June 2013 07:19:52 UTC