RE: Tech Discussions on the Multitrack Media (issue-152)

See inline

From: David Singer [mailto:singer@apple.com]
Sent: Wednesday, February 23, 2011 5:23 PM
To: Bob Lund
Cc: public-html@w3.org
Subject: Re: Tech Discussions on the Multitrack Media (issue-152)


On Feb 17, 2011, at 10:06, Bob Lund wrote:


I have several comments on the proposal alternatives on the wiki (which have been very informative). As a first-time poster, let me introduce myself - I represent CableLabs, where we've been analyzing commercial video service provider requirements and how HTML5, including timed text tracks and multitrack media, can be used to meet those requirements.

Overloading the existing track element, which represents Timed Text Tracks, to also represent media tracks would mix two fundamentally different models. Timed Text Tracks have cues, whose semantics are substantially different from those of continuous media tracks. Side condition 8 notes this. I think it's a good idea to keep Timed Text Tracks separate from continuous audio and video tracks, which would seem to rule out 1), 2) and 7).

I don't understand your point here.  Timed text, audio, and video all lay out a timed presentation along a timeline.  What is 'fundamentally different' about text other than the way it is encoded (and the way its timing is expressed)?  In QuickTime and 3GPP, we have used 'text formatted' tracks for years with some success, so my curiosity is piqued.

[Bob Lund] I agree with your observation that timed text, audio and video all lay out a presentation along a timeline. In the context of HTML5, though, "Timed Text Tracks" expose "cues" with a start time, an end time, and data - either text or metadata. The proposed multi-track media APIs expose only the presence of additional tracks, along with the ability to denote whether a track is showing. There is no "cue" and there is no access to the data in the track.
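
To make the distinction concrete, here is a rough sketch in script (the text-track half follows the current cue proposal; the audioTracks name and shape are illustrative only, since the multi-track API is exactly what is under discussion):

    var video = document.querySelector('video');

    // A Timed Text Track exposes cues: start time, end time, and data the page can read.
    var captions = video.textTracks[0];
    captions.oncuechange = function () {
      var cue = captions.activeCues[0];
      if (cue) {
        console.log(cue.startTime, cue.endTime, cue.text);
      }
    };

    // The proposed media-track APIs expose only the tracks themselves and an
    // enabled/selected state; there is no cue list and no access to the samples.
    var audioTracks = video.audioTracks;   // illustrative name, not settled
    for (var i = 0; i < audioTracks.length; i++) {
      audioTracks[i].enabled = (audioTracks[i].language === 'en');
    }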


I think it's great to keep separate the concepts, in a timed presentation track:
* the general media type (video, audio, text, metadata, and so on)
* the encoding type (WebM video, AAC audio, EBCDIC text, and so on)
* the function or need that the track is meeting (primary audio, captions, sign-language video, flicker-reduced video, and so on)


Comparing captions and sign-language, I note
* they both need a visual display area
* one is typically encoded as text, the other as video (though burned-in captions, or graphic captions in Japanese, are closer to video)
* they are both optional timed sequences of semantic information that is auxiliary to the main program

[Bob Lund] I agree with the conceptual similarities you note, but again the differences between "Timed Text Tracks" and the proposed multi-track media APIs seem significant.

In the case of captions exposed via "Timed Text Tracks", the web application has access to the caption text and display metadata, which it can use to coordinate the visual display area with the underlying video. Presumably the media player in the user agent could do something similar.
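
For example, a page could do something along these lines (element ids are hypothetical, and the cue interface details are still being worked out):

    var video = document.getElementById('program');        // hypothetical id
    var overlay = document.getElementById('caption-area'); // element the page positions over the video
    var captions = video.textTracks[0];

    captions.oncuechange = function () {
      var cue = captions.activeCues[0];
      // Because the cue text and timing are exposed to script, the page itself
      // decides where and how the caption is drawn relative to the video.
      overlay.textContent = cue ? cue.text : '';
    };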

In the case of a sign-language video track, there does not appear to be any information that gives the user agent insight into how the signing video could or should be displayed over the underlying video. The option I suggested to address this is to let the application have the control necessary to do the overlay. An alternative suggestion was to predefine how various kinds of video overlays should be presented, so the player can do it.
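
Under the application-controlled option, the page might overlay a second video element and keep it in sync itself, roughly like this (ids, CSS positioning, and the sync tolerance are all illustrative):

    var program = document.getElementById('program');  // main video, hypothetical id
    var signing = document.getElementById('signing');  // sign-language video, positioned over a corner via CSS

    // The application, not the user agent, keeps the two renditions in step.
    program.addEventListener('timeupdate', function () {
      if (Math.abs(signing.currentTime - program.currentTime) > 0.25) {
        signing.currentTime = program.currentTime;      // 0.25 s tolerance is arbitrary
      }
    });
    program.addEventListener('play',  function () { signing.play();  });
    program.addEventListener('pause', function () { signing.pause(); });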


David Singer
Multimedia and Software Standards, Apple Inc.

Received on Thursday, 24 February 2011 17:07:40 UTC