
Re: Survey ready on Media Multitrack API proposal

From: Philip Jägenstedt <philipj@opera.com>
Date: Sat, 13 Mar 2010 12:20:14 +0800
To: "Dick Bulterman" <Dick.Bulterman@cwi.nl>, "Silvia Pfeiffer" <silviapfeiffer1@gmail.com>
Cc: "Michael(tm) Smith" <mike@w3.org>, "HTML Accessibility Task Force" <public-html-a11y@w3.org>
Message-ID: <op.u9hnn0akatwj1d@philip-pc>
On Sat, 13 Mar 2010 07:25:59 +0800, Silvia Pfeiffer  
<silviapfeiffer1@gmail.com> wrote:

> On Sat, Mar 13, 2010 at 12:32 AM, Dick Bulterman <Dick.Bulterman@cwi.nl>  
> wrote:
>> Hi Silvia,
>> You wrote:
>>> If we really wanted to have a possibility of compositing multimedia
>>> presentations from multiple resources in a flexible manner, we should
>>> not be using <audio> or <video> for it, but rather introduce SMIL -
>>> just like we are not compositing image resources in the <img> element,
>>> but in other elements, such as <canvas> or through JavaScript.
>> I think you are under-estimating what you are trying to achieve (at least
>> as far as I can follow the discussions so far).
>> If 'all' you were trying to do was to turn on one of the two or three
>> pre-recorded, pre-packaged, pre-composed optional tracks within an Ogg or
>> mp4 container, then I'd say: fine, this is a local media issue that can be
>> handled within the context of a single <video> or <audio> element. But this
>> is NOT what you are doing: you are referencing external text files (srt,
>> smilText, DFXP, whatever) -- these need to be composed temporally and
>> spatially with the content in the video or audio object. Once you head down
>> this path, you are no longer looking at local manipulations WITHIN a media
>> object, but activating objects from multiple sources.
> There is no difference if these tracks come from within a file or from
> external, except that with external tracks we have all the data
> available at once, while with internal tracks we get them in chunks
> together with the media data. In either case you have a pipeline for
> decoding the video, a pipeline for decoding the audio and a "pipeline"
> for decoding the text and your media system synchronises them. The
> need for temporal and spatial composition is the same whether the
> text comes from within the file or from an external resource. In
> fact, it is easier to deal with the external case because its data is
> fully available from the start, so temporal synchronisation, seeking
> etc. are easier to solve. Applications such as VLC, mplayer and many
> others that are able to synchronise external captions with media
> resources have shown it for years. They do not need a sophisticated
> framework for activating objects from multiple sources.
>> This is why I really believe that you need to look at a more scalable
>> solution to this problem -- not because I want to impose SMIL on you, but
>> because you are imposing temporal and spatial problems on yourself in the
>> context of composing separate media objects.
> Believe me I know what I am talking about - I have done this many
> times before. Going down your line of thought will introduce a
> complexity for the media elements that they are not built for and not
> intended for. An <img> element is not built for spatial combination of
> multiple images and text objects either, but it does have a solution
> for accessibility. This is our focus here, too.
> Your line of thought really has to be done by introducing a different
> element and it may indeed be an idea to consider something like a
> mediacanvas for composing separate media objects - the current canvas
> is not built for dealing with the temporal synchronisation issues that
> such a solution requires and neither are the current media elements.
>> As an aside: One of the benefits of this approach is that it means that you
>> get broader selectivity for free on other objects -- this increases
>> accessibility options at no extra cost. (Note that whether you view the
>> controlling syntax in declarative or scripted terms is an orthogonal
>> concern.)
> Now, I think you are under-estimating the complexity. Marking this up
> is the easy bit. But implementing it is the hard bit.
> There is indeed a massive "extra cost" for introducing <seq>, <par>,
> <excl> etc. You need to introduce a whole media and network
> synchronisation sub-system that e.g. tracks what time we are up to in
> every single resource, whether they are all buffered correctly, no
> network connection is stalling, enough data available from each,
> determine which are to be activated based on several dimensions of
> conditions, make sure the overlapping is done correctly and what to do
> with gaps. As it stands, the number of events on a single media
> resource is already enormous (see
> http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#mediaevents).
> With multiple dynamically composed resources and no main resource this
> will explode exponentially. It is a complexity that comes at a high
> price and is overkill for the current media elements.
> In contrast, the current media elements have one master and it is
> providing the temporal timeline. Everything else synchronises to it.
> It provides the main audio or video and there are no gaps. Activation
> is simple and needs to be calculated only once at the beginning of the
> playback and changed only when the user interacts. It's massively less
> complex.
> Honestly, keeping the composing of several media elements separate
> from dealing with basically a single media resource as we are right
> now is the right way to go. It follows the divide and conquer
> principle: once this is solved and solid, it will be easier to develop
> an approach to solving your requirements.

Thanks Silvia for explaining this better than I could have. Here is my  
view on the two issues mentioned:

1. Temporal composition

As Silvia says, there is a main timeline to which everything is synced, so no  
extra syntax is needed to express this. Without any examples or use cases  
I can only guess that there might be two other things of interest (in  

1.1. Scaling or offsetting the <track> relative to the original resource,  
e.g. to fix broken caption timing or include a sign-language video track  
for a small part of a long video.

I don't suggest including this now, but if we need it later the solution  
is simple:

<track offset="40" rate="1.5">
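
To make the idea concrete, here is a rough sketch of how cue times might be
mapped onto the main timeline. The semantics are my assumption, since the
attributes above are not defined anywhere: the track starts `offset` seconds
into the main resource and runs at `rate` times normal speed. The function
name and the cue shape (`start`, `end`, `text`) are hypothetical.

```javascript
// Hypothetical helper: maps a cue's own times onto the main resource's timeline.
// Assumed semantics (not defined in the proposal): the track starts `offset`
// seconds into the main resource and runs at `rate` times normal speed.
function mapCueToMainTimeline(cue, offset, rate) {
  if (!(rate > 0)) {
    throw new RangeError("rate must be a positive number");
  }
  return {
    ...cue,
    start: offset + cue.start / rate,
    end: offset + cue.end / rate,
  };
}

// A cue at 3s-6s in the track, with offset="40" rate="1.5".
const cue = { start: 3, end: 6, text: "Hello" };
const mapped = mapCueToMainTimeline(cue, 40, 1.5);
console.log(mapped.start, mapped.end); // 42 44
```

If `rate` were instead meant as a stretch factor on the cue times, the
division would become a multiplication; nothing else would change.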

1.2. Concatenating several media resources back-to-back with gapless  

This would be a big implementation burden. If implementations aren't  
gapless, then it will be no better than doing it via script, as such:

<video id="video1" src="video1.ogv"  
<video id="video2" src="video2.ogv"></video>
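
A minimal sketch of the script-driven approach, assuming the `ended` event of
one element is used to start the next. The helper name `chainPlayback` is made
up for illustration. It gives no control over the gap between the clips, which
is exactly the limitation described above.

```javascript
// Start each element when the previous one fires "ended".
// Works with any objects exposing addEventListener() and play(),
// such as HTMLMediaElement instances in a browser.
function chainPlayback(elements) {
  for (let i = 0; i + 1 < elements.length; i++) {
    const next = elements[i + 1];
    elements[i].addEventListener("ended", () => {
      // play() may return a promise; swallow rejections in this sketch.
      Promise.resolve(next.play()).catch(() => {});
    });
  }
}

// Browser usage (ids taken from the markup above):
// chainPlayback([
//   document.getElementById("video1"),
//   document.getElementById("video2"),
// ]);
```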

In the very unlikely case that gapless playback becomes a must-have  
feature and implementors are actually willing to support it, we can just  
let <audio> and <video> sync to another timeline. The mediacanvas Silvia  
suggests can be the HTML document itself:

<video id="video1" src="video1.ogv"></video>
<video id="video2" src="video2.ogv" timeline="video1"  

I just made this up. We could copy whatever terminology SMIL uses if it is  
better. The point is, and I quote Silvia:

> Marking this up is the easy bit. But implementing it is the hard bit.

2. Spatial composition

CSS, always!

Philip Jägenstedt
Core Developer
Opera Software
Received on Saturday, 13 March 2010 04:21:00 UTC
