Re: Survey ready on Media Multitrack API proposal

Hi Silvia and Philip,

Sorry for the late response: I'm in the middle of an EU project review
that is soaking up my time.

I realize that we've already spent more time writing about the
alternatives than it would take to implement them, but I'd like to point
out a few important points.

First, I understand the urge to manage a complex problem with a simple
solution. The complex problem is synchronizing media objects that have
different internal time bases -- such as when the objects live in
separate files.

You write:
> There is no difference if these tracks come from within a file or from
> external, except that with external tracks we have all the data
> available at once, while with internal tracks we get them in chunks
> together with the media data. In either case you have a pipeline for
> decoding the video, a pipeline for decoding the audio and a "pipeline"
> for decoding the text and your media system synchronises them. The
> need for temporal and spatial composition is no different whether the
> text comes from within the file or from an external source. In
> fact, it is easier to deal with the external case because its data is
> available fully from the start and temporal synchronisation, seeking
> etc is easier to solve. 

Thanks for the primer on basic synchronization, but what you say is not
really true: there is a fundamental difference between managing content
within a single container and across independent objects. When all media
is in one container, you can get away with making lots of simplifying
assumptions about timing. For example, if you are streaming the content
and there is a network delay, ALL of the media is delayed. This is not
true if the tracks are in separate containers, coming from separate
sources (on separate servers, etc.). In the case of separate files, if
there is a delay in one object (say the video), you need to know:
- who is the sync master
- how tight the synchronization relationship is
(If you don't know this, you can't do any form of intelligent recovery.)
You then need to decide who gets blocked, or which content gets skipped
in order to get things back in sync. Also, if the durations don't
match, you've got to decide if something gets cut off, or extended, or
slowed down, or sped up. For simple text there are shortcuts, but if
you want to support broader forms of accessible content (such as audio
captions, or semantically scalable content), there are no easy
one-size-fits-all solutions.
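
To make this concrete: SMIL already has declarative vocabulary for
exactly these decisions. A minimal sketch (the sources are invented;
syncMaster, syncBehavior and syncTolerance come from the SMIL 2.1
Timing and Synchronization module):

  <par>
    <!-- The video is the sync master: everything else adjusts to it. -->
    <video src="http://serverA.example/movie.ogv"
           syncMaster="true" syncBehavior="locked"/>
    <!-- Audio description from another server: it may drift up to half
         a second before recovery (blocking or skipping) kicks in. -->
    <audio src="http://serverB.example/description.oga"
           syncBehavior="locked" syncTolerance="0.5s"/>
    <!-- Captions are loosely coupled: let them slip rather than stall
         the video. -->
    <text src="http://serverC.example/captions.xml"
          syncBehavior="canSlip"/>
  </par>

I'm not saying HTML5 must adopt this syntax; the point is that the
master/tolerance decisions have to be expressible somewhere, whatever
the markup ends up looking like.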

In your current proposal, the <video>/<audio> objects have a dual role 
as both media objects and time containers. In my view, these roles 
should be separated. Of course, whether you introduce a separate timing 
container (like par, seq, excl) or overload the video and audio objects 
with time container semantics (as you propose), you STILL need to
worry about these issues. The advantage of using a separate time 
container is that you get to reuse the implementation work -- and that 
the video and audio elements can simply worry about their functional 
roles. EVEN if you initially restrict the scope of the time containers 
to managing video or audio, you have cleanly separated the notions of 
presentation structure and object functionality. You can then much more 
cleanly extend support for things like audio captions for the blind, or 
separate rendering output streams for assistive devices. If you bundle
this functionality into the audio and video elements, you simply have
to do the same work over and over again.
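
For contrast, here is roughly what the separated design might look like
(illustrative syntax only, not a concrete proposal; the file names are
invented). The container carries the temporal composition, while each
child keeps its single functional role:

  <par>
    <video src="movie.ogv"/>
    <!-- Spoken captions for blind users: just another child of the
         time container, not a special case wired into <video>. -->
    <audio src="audio-captions.oga"/>
    <!-- A sequence of per-chapter caption documents. -->
    <seq>
      <text src="chapter1-captions.xml"/>
      <text src="chapter2-captions.xml"/>
    </seq>
  </par>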


> Applications such as VLC, mplayer and many
> others that are able to synchronise external captions with media
> resources have shown it for years. They do not need a sophisticated
> framework for activating objects from multiple sources.

One of the reasons that this works for VLC is that they are NOT also
rendering a full page's content: the entire temporal scope is restricted
to only the video -- that's all they do. An HTML5 browser does a whole
lot more (such as managing multiple <video>/<audio> elements on a
page). It is also why timeline manipulation is all you can do in VLC:
there is no structure, so there is no content-based navigation, no
selective content inclusion, and no adaptability for people with
different needs. A missed opportunity.

>> > As an aside: One of the benefits of this approach is that it means that you
>> > get broader selectivity for free on other objects -- this increases
>> > accessibility options at no extra cost. (Note that whether you view the
>> > controlling syntax in declarative or scripted terms is an orthogonal
>> > concern.)
> 
> Now, I think you are under-estimating the complexity. Marking this up
> is the easy bit. But implementing it is the hard bit.

You know, people have been building these things for 15 years in desktop 
players, in telephones, in set-top boxes, and in handheld media players.
It isn't rocket science -- but it does require looking a bit further
than the easiest possible path. (Have you ever used HTML+TIME in IE, or
integrated Petri's timesheets, or looked at the dozen or so JavaScript 
SMIL timing implementations? These provide examples of syntax and
implementations.)

There is a misperception that supporting time containers as top-level
objects makes life harder. This is not true: localizing timing behavior
to time containers actually makes your life easier! It separates content
from control, and it provides a systematic model that works for all
sorts of media. Combining timing and content containers -- although
seemingly easier -- simply means that there is no growth path.
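
As one example of that growth path: once selection lives in containers,
the 'broader selectivity' I mentioned earlier falls out of the same
machinery. In SMIL terms (file names invented; switch, systemCaptions
and systemAudioDesc are the standard content-selection mechanism):

  <par>
    <video src="movie.ogv"/>
    <switch>
      <!-- The first alternative whose test evaluates true is played. -->
      <audio src="audio-description.oga" systemAudioDesc="on"/>
      <text  src="captions.xml"          systemCaptions="on"/>
    </switch>
  </par>

The same containers work unchanged whether the children are text, audio
or video -- which is exactly what a systematic model buys you.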

> Honestly, keeping the composing of several media elements separate
> from dealing with basically a single media resource as we are right
> now is the right way to go. It follows the divide and conquer
> principle: once this is solved and solid, it will be easier to develop
> an approach to solving your requirements.

If you select special-purpose 'solutions' that don't scale, you are not
using divide and conquer -- you are forcing the next generation of
developers to invent work-arounds because you have several inconsistent
timing models in a document. Without a consistent model, you are
creating throw-away solutions. Now THAT's a waste of effort, since you
have to implement the temporal semantics anyway -- the problems don't go
away.

cheers,
-d.
