Re: Tech Discussions on the Multitrack Media (issue-152)

On Fri, Feb 25, 2011 at 12:12 PM, David Singer <singer@apple.com> wrote:
>
> On Feb 24, 2011, at 16:57 , Silvia Pfeiffer wrote:
>
>> On Fri, Feb 25, 2011 at 10:24 AM, David Singer <singer@apple.com> wrote:
>>>
>>> On Feb 24, 2011, at 8:57 , Bob Lund wrote:
>>>
>>> [Bob Lund] I agree with your observation that timed text, audio and video
>>> all lay out a presentation along a timeline. In the context of HTML5, though,
>>> “Timed Text Tracks” expose “cues” with a start and end time, and data –
>>> either text or metadata. The proposed multi-track media APIs expose the
>>> presence of additional tracks along with the ability to denote whether the
>>> track is showing. There is no “cue” and there is no access to the data in
>>> the track.
>>>
>>>
>>> True, access to data in the track would be useful for any media type, not
>>> just text.  (Audio processing, for example, or extracting an image from
>>> video and painting it onto a Canvas).  So rather than treating text tracks
>>> as special, I'd prefer to see all tracks treated powerfully enough to meet
>>> the text and other needs.
>>
>>
>> That's not really possible.
>>
>> The main feature of text tracks is that their data are sparse chunks
>> along the timeline with relatively little data, therefore it is
>> possible to parse all of this data into a cue list, keep it in memory
>> and make it available as a TextTrackCueList to JS, as well as throw an
>> event on the track when cues change, and on the activated and
>> deactivated cues themselves. We need this kind of flexibility on the
>> text cues to allow people to build their own interfaces around the
>> text cues.
>>
>> Introducing an API of this kind for audio and video tracks is,
>> however, not really possible. Assuming we take the concept of a "cue"
>> to mean a "group of samples" for audio and video, then we'd have to
>> hold a large amount of data in memory for an AudioCueList or a
>> VideoCueList. And since the individual cues would typically relate to
>> a small amount of time (for video I would think they relate to a
>> frame, for audio maybe to 40ms), then as we play, we'd have constantly
>> firing events both on the track and the cues and we'd have to
>> constantly adapt the CueList making it not useful to the JS programmer
>> anyway and probably exploding the browser.
>>
>> So, basically, the audio and video API for data can only really be a
>> polling API, while for text it is and totally should be a push API.
>> Even with in-band streams: as soon as the browser finds a text cue, it
>> needs to make it available to the browser to add to the
>> TextTrackCueList, so that the browser and JS developer get sufficient
>> early notice and can do something with it.
>>
>> The discontinuous and sparse nature of text tracks makes them a very
>> different beast to audio and video tracks. If we did want to deal with
>> audio and video tracks as well as text tracks through the TextTrack
>> API, I would think we can only make it such that this part of the
>> TextTrack API is disabled for audio and video tracks:
>>    readonly attribute TextTrackCueList cues;
>>    readonly attribute TextTrackCueList activeCues;
>>                   attribute Function oncuechange;
>>
>> I don't know what the downsides of such an approach would be, but it
>> certainly feels a bit clunky.
>>
>
> I agree that asking for all the times that a normal video track changes its data is likely to be a large list, and a similar request of a text track is likely to be smaller.  But there are uses for video tracks that are low frame-rate also, notably chapter images.  So to me, that suggests that callers should exercise care in calling this API (e.g. maybe check the frame rate, the rate at which data changes, before asking for the list).
>
> For audio, it's harder, I agree.  Asking for one sample in the uncompressed domain makes no sense.
>
> It would then work for text tracks which give chapter titles, for example.  Video tracks that are slide shows.  And all sorts of things.  "Tell me the times things change" and "Tell me the data at this time (or at the current time)" are applicable to almost anything except audio.

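To make sure I read those two queries the same way you do, here is a
minimal sketch of "tell me the times things change" and "tell me the
data at this time" over a sparse track. Everything in it is an
assumption for illustration: TimedEntry, changeTimes and dataAt are
made-up names, not part of any proposed API, and the slide data is
invented.

```typescript
// Hypothetical model of one sparse item on the timeline (a cue, a slide, ...).
interface TimedEntry<T> {
  start: number; // seconds
  end: number;   // seconds, exclusive
  data: T;
}

// "Tell me the times things change": every distinct start or end time, sorted.
function changeTimes<T>(entries: TimedEntry<T>[]): number[] {
  const times = new Set<number>();
  for (const e of entries) {
    times.add(e.start);
    times.add(e.end);
  }
  return Array.from(times).sort((a, b) => a - b);
}

// "Tell me the data at this time": all entries whose interval covers `time`.
function dataAt<T>(entries: TimedEntry<T>[], time: number): T[] {
  return entries
    .filter((e) => time >= e.start && time < e.end)
    .map((e) => e.data);
}

// Invented example: a two-slide "slide show" track.
const slides: TimedEntry<string>[] = [
  { start: 0, end: 30, data: "slide-1" },
  { start: 30, end: 90, data: "slide-2" },
];
```

The list returned by changeTimes grows with the number of entries, so
it stays small for sparse tracks and becomes very large for ordinary
video, which is the concern raised earlier in the thread.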

When you talk about videos that are slide shows, are you actually
talking about videos or about a sequence of images (photos) that are
also sparse along the timeline, so not really "moving images"? If they
are encoded as video, it would be impossible to distinguish the data
as a sequence of photos. The other case, basically an "image track",
is not something we've ever discussed before, and not all containers
have a format for it AFAIK.

As for chapter images - I am not sure how they are encoded in
QuickTime/MPEG, so if you know, please share. I would have thought
that a text track with image urls could be sufficient for this.
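
Roughly what I have in mind, as a sketch only: each chapter cue carries
an image URL as its text, and a script turns the cue list into
thumbnail navigation. The ChapterCue shape and the chapterThumbnails
helper are hypothetical names, the cues are plain objects so this runs
outside a browser, and the URLs are placeholders.

```typescript
// Hypothetical cue shape: start/end times plus text, where the text is
// assumed to be the URL of the chapter image.
interface ChapterCue {
  startTime: number; // seconds
  endTime: number;   // seconds
  text: string;      // assumed: image URL
}

// One entry of a thumbnail menu: where to seek and which image to show.
interface ChapterThumbnail {
  seekTo: number;
  imageUrl: string;
}

// Build the menu from the cues, ordered by chapter start time.
function chapterThumbnails(cues: ChapterCue[]): ChapterThumbnail[] {
  return [...cues]
    .sort((a, b) => a.startTime - b.startTime)
    .map((cue) => ({ seekTo: cue.startTime, imageUrl: cue.text.trim() }));
}

// Invented example data, deliberately out of order.
const chapterCues: ChapterCue[] = [
  { startTime: 60, endTime: 150, text: "chapter2.png" },
  { startTime: 0, endTime: 60, text: " chapter1.png " },
];
```

With this approach the images stay outside the media resource, so
nothing new is needed from the container format.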

Cheers,
Silvia.

Received on Friday, 25 February 2011 01:29:54 UTC