
Re: Survey ready on Media Multitrack API proposal

From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Date: Sat, 13 Mar 2010 10:25:59 +1100
Message-ID: <2c0e02831003121525p2ee9df1bm1af5c129e57eec62@mail.gmail.com>
To: Dick Bulterman <Dick.Bulterman@cwi.nl>
Cc: Philip Jägenstedt <philipj@opera.com>, "Michael(tm) Smith" <mike@w3.org>, HTML Accessibility Task Force <public-html-a11y@w3.org>
On Sat, Mar 13, 2010 at 12:32 AM, Dick Bulterman <Dick.Bulterman@cwi.nl> wrote:
> Hi Silvia,
> You wrote:
>> If we really wanted to have a possibility of compositing multimedia
>> presentations from multiple resources in a flexible manner, we should
>> not be using <audio> or <video> for it, but rather introduce SMIL -
>> just like we are not compositing image resources in the <img> element,
>> but in other elements, such as <canvas> or through JavaScript.
> I think you are under-estimating what you are trying to achieve (at least as
> far as I can follow the discussions so far).
> If 'all' you were trying to do was to turn on one of the two or three
> pre-recorded, pre-packaged, pre-composed optional tracks within an Ogg or
> mp4 container, then I'd say: fine, this is a local media issue that can be
> handled within the context of a single <video> or <audio> element. But this
> is NOT what you are doing: you are referencing external text files (srt,
> smilText, DFXP, whatever) -- these need to be composed temporally and
> spatially with the content in the video or audio object. Once you head down
> this path, you are no longer looking at local manipulations WITHIN a media
> object, but activating objects from multiple sources.

There is no difference whether these tracks come from within a file or
from an external file, except that with external tracks we have all the data
available at once, while with internal tracks we get them in chunks
together with the media data. In either case you have a pipeline for
decoding the video, a pipeline for decoding the audio and a "pipeline"
for decoding the text and your media system synchronises them. The
need for temporal and spatial composition is no different whether the
text comes from within the file or from an external one. In
fact, the external case is easier to deal with because its data is
fully available from the start, so temporal synchronisation, seeking,
etc. are easier to solve. Applications such as VLC, mplayer and many
others that are able to synchronise external captions with media
resources have shown this for years. They do not need a sophisticated
framework for activating objects from multiple sources.
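
To make the external-caption case concrete, here is a minimal sketch of how
a player might load a complete external SRT file up front and then look up
the caption for any playback time. It is an illustration only, not how VLC
or mplayer actually implement it. The names Cue, parse_srt and cue_at are
hypothetical, and the parser assumes well-formed SRT with three-digit
milliseconds and non-overlapping cues.

```python
import re
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Cue(NamedTuple):
    start: float  # seconds on the media timeline
    end: float
    text: str


_TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")


def _seconds(stamp: str) -> float:
    """Convert an 'HH:MM:SS,mmm' stamp to seconds (assumes 3-digit ms)."""
    h, m, s, ms = _TIME.match(stamp.strip()).groups()
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(data: str) -> List[Cue]:
    """Parse a whole SRT document; all data is available at once."""
    cues = []
    for block in re.split(r"\n\s*\n", data.strip()):
        lines = block.splitlines()
        # lines[0] is the cue counter, lines[1] the 'start --> end' line.
        if len(lines) < 3 or "-->" not in lines[1]:
            continue
        start, end = lines[1].split("-->")
        cues.append(Cue(_seconds(start), _seconds(end), "\n".join(lines[2:])))
    return sorted(cues)


def cue_at(cues: List[Cue], t: float) -> Optional[str]:
    """Return the caption text showing at media time t, or None."""
    starts = [c.start for c in cues]
    i = bisect_right(starts, t) - 1  # last cue starting at or before t
    if i >= 0 and t < cues[i].end:
        return cues[i].text
    return None
```

Because the whole cue list exists before playback starts, seeking is just
another call to cue_at with the new media time; nothing has to be re-fetched
or re-demultiplexed.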

> This is why I really believe that you need to look at a more scalable
> solution to this problem -- not because I want to impose SMIL on you, but
> because you are imposing temporal and spatial problems on yourself in the
> context of composing separate media objects.

Believe me, I know what I am talking about - I have done this many
times before. Going down your line of thought will introduce a
complexity for the media elements that they are not built for and not
intended for. An <img> element is not built for spatial combination of
multiple images and text objects either, but it does have a solution
for accessibility. This is our focus here, too.

Your line of thought really has to be pursued by introducing a different
element, and it may indeed be worth considering something like a
mediacanvas for composing separate media objects. The current canvas
is not built for dealing with the temporal synchronisation issues that
such a solution requires, and neither are the current media elements.

> As an aside: One of the benefits of this approach is that it means that you
> get broader selectivity for free on other objects -- this increases
> accessibility options at no extra cost. (Note that whether you view the
> controlling syntax in declarative or scripted terms is an orthogonal
> concern.)

Now, I think you are under-estimating the complexity. Marking this up
is the easy bit. But implementing it is the hard bit.

There is indeed a massive "extra cost" for introducing <seq>, <par>,
<excl> etc. You need to introduce a whole media and network
synchronisation sub-system that, e.g., tracks what time we are up to in
every single resource, whether they are all buffered correctly, whether
any network connection is stalling and whether enough data is available
from each; determines which are to be activated based on several
dimensions of conditions; makes sure the overlapping is done correctly;
and decides what to do with gaps. As it stands, the number of events on a single media
resource is already enormous (see
With multiple dynamically composed resources and no main resource this
will explode exponentially. It is a complexity that comes at a high
price and is overkill for the current media elements.
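
A rough sketch of just two of the checks such a sub-system would need,
stalled resources and gaps, shows how the bookkeeping moves from one
resource to every resource. This is an illustration under simplifying
assumptions, not a SMIL implementation: the Resource class, its fields and
both helper functions are hypothetical, and real buffering is far less
tidy than a single "buffered_until" number.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Resource:
    name: str
    begin: float           # start position on the composite timeline (s)
    duration: float        # length of the resource (s)
    buffered_until: float  # local seconds of data already fetched

    def active_at(self, t: float) -> bool:
        """True if this resource should be playing at composite time t."""
        return self.begin <= t < self.begin + self.duration

    def ready_at(self, t: float) -> bool:
        """True if data for composite time t has already been buffered."""
        return (t - self.begin) < self.buffered_until


def stalled_resources(resources: List[Resource], t: float) -> List[str]:
    """Names of resources that are active at t but lack buffered data."""
    return [r.name for r in resources if r.active_at(t) and not r.ready_at(t)]


def gap_at(resources: List[Resource], t: float) -> bool:
    """True if no resource at all covers composite time t."""
    return not any(r.active_at(t) for r in resources)
```

With no main resource, every one of these questions has to be asked of
every resource at every point in time, and the answers have to be
reconciled with each other before playback can advance.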

In contrast, the current media elements have one master and it is
providing the temporal timeline. Everything else synchronises to it.
It provides the main audio or video and there are no gaps. Activation
is simple and needs to be calculated only once at the beginning of the
playback and changed only when the user interacts. It's massively less
complex.
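
The single-master model can be sketched in a few lines. This is a minimal
illustration, assuming text tracks are plain lists of timed cues; the
MasterTimeline class and its method names are hypothetical and are not the
HTML5 media API.

```python
from typing import Dict, List, Optional, Tuple

TextCue = Tuple[float, float, str]  # (start, end, text), in seconds


class MasterTimeline:
    """One main resource owns the clock; text tracks only follow it."""

    def __init__(self, duration: float, text_tracks: Dict[str, List[TextCue]]):
        self.duration = duration
        self.current_time = 0.0
        self.text_tracks = text_tracks
        self.active_track: Optional[str] = None

    def select_track(self, name: Optional[str]) -> None:
        """Activation: decided at start-up, changed only on user interaction."""
        if name is not None and name not in self.text_tracks:
            raise KeyError(name)
        self.active_track = name

    def seek(self, t: float) -> None:
        """Only the master clock moves; it is clamped to the main resource."""
        self.current_time = min(max(t, 0.0), self.duration)

    def visible_text(self) -> List[str]:
        """Dependent tracks are simply sampled at the master's current time."""
        if self.active_track is None:
            return []
        t = self.current_time
        cues = self.text_tracks[self.active_track]
        return [text for start, end, text in cues if start <= t < end]
```

The text tracks carry no clock of their own here: seeking touches only the
master, and switching tracks touches only the selection.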

Honestly, keeping the composing of several media elements separate
from dealing with basically a single media resource as we are right
now is the right way to go. It follows the divide and conquer
principle: once this is solved and solid, it will be easier to develop
an approach to solving your requirements.

Received on Friday, 12 March 2010 23:26:53 UTC
