Re: Survey ready on Media Multitrack API proposal

On Tue, 16 Mar 2010 17:26:01 +0800, Dick Bulterman <Dick.Bulterman@cwi.nl>  
wrote:

> Hi Silvia and Philip,
>
> Sorry for the late response: I'm in the middle of an EU project review
> that is soaking up my time.
>
> I realize that we've already spent more time writing about the
> alternatives than it would take to implement them, but I'd like to point
> out a few items of importance.
>
> First, I understand the urge to manage a complex problem with a simple
> solution. The complex problem is synchronizing media objects that have
> different internal time bases -- such as when the objects live in
> separate files.
>
> You write:
>> There is no difference if these tracks come from within a file or from
>> external, except that with external tracks we have all the data
>> available at once, while with internal tracks we get them in chunks
>> together with the media data. In either case you have a pipeline for
>> decoding the video, a pipeline for decoding the audio and a "pipeline"
>> for decoding the text and your media system synchronises them. The
> need for temporal and spatial composition is no different whether the
> text comes from within the file or from an external source. In
>> fact, it is easier to deal with the external case because its data is
> available fully from the start and temporal synchronisation, seeking,
> etc. are easier to solve.
>
> Thanks for the primer on basic synchronization, but what you say is not
> really true: there is a fundamental difference between managing content
> within a single container and across independent objects. When all media
> is in one container, you can get away with making lots of simplifying
> assumptions on timing. For example, if you are streaming the content and
> there is a network delay, ALL of the media is delayed. This is not true
> if they are in separate containers, coming from separate sources (on
> separate servers, etc.). In the case of separate files, if there is a
> delay in one object (say the video), you need to know:
> - who is the sync master
> - how tight is the synchronization relationship
> (If you don't know this, you can't do any form of intelligent recovery.)
> You then need to decide who gets blocked, or which content gets skipped
> in order to get things back in sync. Also, if durations don't match,  
> you've got to decide if something gets cut off, or extended, or slowed  
> down, or sped up. For simple text, there are short cuts, but if you want  
> to support broader forms of accessible content (such as audio captions,  
> or semantically scalable content), there are no easy one-size-fits-all  
> solutions.
>
> In your current proposal, the <video>/<audio> objects have a dual role  
> as both media objects and time containers. In my view, these roles  
> should be separated. Of course, whether you introduce a separate timing  
> container (like par, seq, excl) or overload the video and audio objects  
> with time container semantics (such as you propose), you STILL need to  
> worry about these issues. The advantage of using a separate time  
> container is that you get to reuse the implementation work -- and that  
> the video and audio elements can simply worry about their functional  
> roles. EVEN if you initially restrict the scope of the time containers  
> to managing video or audio, you have cleanly separated the notions of  
> presentation structure and object functionality. You can then much more  
> cleanly extend support for things like audio captions for the blind, or  
> separate rendering output streams for assistive devices. If you bundle  
> this functionality into audio or video elements, you simply have to do the
> same work over and over again.
>
>
>> Applications such as VLC, mplayer and many
>> others that are able to synchronise external captions with media
>> resources have shown it for years. They do not need a sophisticated
>> framework for activating objects from multiple sources.
>
> One of the reasons that this works for VLC is that they are NOT also
> rendering a full page's content: the entire temporal scope is restricted
> to only the video -- that's all they do. An HTML5 browser does a whole
> lot more (such as perhaps managing multiple <video>/<audio> elements on
> a page). It is also the reason that you can only do timeline  
> manipulation in VLC: there is no structure, so you can't do  
> content-based navigation, selective content inclusion, or any kind of  
> adaptability for people with different needs. A missed opportunity.
>
>>> As an aside: One of the benefits of this approach is that it means
>>> that you get broader selectivity for free on other objects -- this
>>> increases accessibility options at no extra cost. (Note that whether
>>> you view the controlling syntax in declarative or scripted terms is
>>> an orthogonal concern.)
>> Now, I think you are underestimating the complexity. Marking this up
>> is the easy bit. But implementing it is the hard bit.
>
> You know, people have been building these things for 15 years in desktop  
> players, in telephones, in set-top-boxes, in handheld media players. It  
> isn't rocket science -- but it does require looking a bit further than  
> the easiest possible path. (Have you ever used HTML+Time in IE, or
> integrated Petri's timesheets, or looked at the dozen or so JavaScript  
> SMIL timing implementations? These provide examples of syntax and
> implementations.)
>
> There is a misperception that supporting time containers as top-level
> objects makes life harder. This is not true: localizing timing behavior
> to time containers actually makes your life easier! It separates content
> from control, and it provides a systematic model that works for all
> sorts of media. Combining timing and content containers -- although
> seemingly easier -- simply means that there is no growth path.
>
>> Honestly, keeping the composition of several media elements separate
>> from dealing with basically a single media resource as we are right
>> now is the right way to go. It follows the divide and conquer
>> principle: once this is solved and solid, it will be easier to develop
>> an approach to solving your requirements.
>
> If you select special-purpose 'solutions' that don't scale, you are not
> using divide and conquer -- you are forcing the next generation of
> developers to invent work-arounds because you have several inconsistent
> timing models in a document. Without a consistent model, you are
> creating throw-away solutions. Now THAT's a waste of effort, since you
> have to implement the temporal semantics anyway -- the problems don't go
> away.

I'll try not to repeat what Silvia said, but will note that I agree with  
her, especially regarding her suggestions about how to move this forward.

Please outline exactly what syntax it is you are suggesting, with HTML  
examples. Also, please explain the precise use cases you have in mind and  
why they are specifically related to a11y. If they aren't a11y-specific,  
then we should hand the <track> proposal over to the main HTML WG and  
continue the discussion there.

Some random comments about synchronization:

No matter what syntax we use, implementors will have to sync video with a  
text track from an external resource. While text tracks logically have  
their own timeline, actual implementations will be driven by the video  
clock. I see no reason to think that there would be a concept of an  
independently playing text track that could *not* be in sync with the  
video.
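
To make this concrete, here is a rough sketch of cue rendering driven by
the video clock, with the video acting as the sync master. It is purely
illustrative: the Cue structure and the function names are invented for
the example, not taken from any proposal.

  interface Cue { start: number; end: number; text: string; }

  function attachCues(video: HTMLVideoElement, cues: Cue[],
                      render: (text: string) => void): void {
    // The track has no clock of its own: whenever the video's clock
    // advances, look up which cues are active at the new time.
    video.addEventListener('timeupdate', () => {
      const now = video.currentTime;
      const active = cues.filter(c => c.start <= now && now < c.end);
      render(active.map(c => c.text).join('\n'));
    });
  }

Seeking falls out for free: the first timeupdate after a seek selects the
cues for the new position. And if the video stalls, the track stalls with
it, so there is never an independent recovery decision to make.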

From a theoretical point of view, it's not insane to think of <track> and  
<video> as being independent resources and wanting to explicitly bind them  
together, perhaps like this:

<video id="v"></video>
<track clockref="v"></track> <!-- hypothetical attribute tying the track to v's clock -->

We might even want to do that for <audio> and <video> eventually, and you  
might even add a virtual clock without an actual resource, etc.
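
For illustration only, such a virtual clock might look something like
this (no <clock> element exists anywhere; the names are invented here):

<clock id="c"></clock>
<video clockref="c"></video>
<track clockref="c"></track>

Both the video and the track would then slave to the same external
timeline instead of the track slaving to the video.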

However, I can't see that any of this makes any difference to the problem  
at hand. It's very natural to be able to associate a <track> with a  
<video> just by having it as a child element, and if we ever want to do it  
differently it seems very easy to solve, syntax-wise. In the meantime,  
implementors obviously won't do more work than required to support the  
syntax we have.
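
Concretely, the child-element association is just something like this
(the attribute names follow the <track> proposal under discussion; the
file names are made up):

<video src="video.ogv" controls>
  <track kind="captions" src="captions.srt" srclang="en">
</video>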

If you still think that the suggested model will be impossible to extend  
to support the use cases you have, please say precisely what problems you  
think we will encounter.

I don't mean to sound dismissive; I don't pretend to know everything and  
would like to be corrected where my assumptions or conclusions are  
incorrect, in very specific terms so that even I can understand.

-- 
Philip Jägenstedt
Core Developer
Opera Software

Received on Tuesday, 16 March 2010 13:53:44 UTC