Re: Survey ready on Media Multitrack API proposal from Philip Jägenstedt on 2010-03-13 (public-html-a11y@w3.org from March 2010)

From: Philip Jägenstedt <philipj@opera.com>
Date: Sat, 13 Mar 2010 12:20:14 +0800
To: "Dick Bulterman" <Dick.Bulterman@cwi.nl>, "Silvia Pfeiffer" <silviapfeiffer1@gmail.com>
Cc: "Michael(tm) Smith" <mike@w3.org>, "HTML Accessibility Task Force" <public-html-a11y@w3.org>
Message-ID: <op.u9hnn0akatwj1d@philip-pc>
On Sat, 13 Mar 2010 07:25:59 +0800, Silvia Pfeiffer  
<silviapfeiffer1@gmail.com> wrote:

> On Sat, Mar 13, 2010 at 12:32 AM, Dick Bulterman <Dick.Bulterman@cwi.nl>  
> wrote:
>> Hi Silvia,
>>
>> You wrote:
>>>
>>> If we really wanted to have a possibility of compositing multimedia
>>> presentations from multiple resources in a flexible manner, we should
>>> not be using <audio> or <video> for it, but rather introduce SMIL -
>>> just like we are not compositing image resources in the <img> element,
>>> but in other elements, such as <canvas> or through JavaScript.
>>
>> I think you are under-estimating what you are trying to achieve (at least
>> as far as I can follow the discussions so far).
>>
>> If 'all' you were trying to do was to turn on one of the two or three
>> pre-recorded, pre-packaged, pre-composed optional tracks within an Ogg or
>> mp4 container, then I'd say: fine, this is a local media issue that can be
>> handled within the context of a single <video> or <audio> element. But this
>> is NOT what you are doing: you are referencing external text files (srt,
>> smilText, DFXP, whatever) -- these need to be composed temporally and
>> spatially with the content in the video or audio object. Once you head down
>> this path, you are no longer looking at local manipulations WITHIN a media
>> object, but activating objects from multiple sources.
>
> There is no difference if these tracks come from within a file or from
> external, except that with external tracks we have all the data
> available at once, while with internal tracks we get them in chunks
> together with the media data. In either case you have a pipeline for
> decoding the video, a pipeline for decoding the audio and a "pipeline"
> for decoding the text and your media system synchronises them. The
> need for temporal and spatial composition is no different whether the
> text comes from within the file or from an external source. In
> fact, it is easier to deal with the external case because its data is
> available fully from the start and temporal synchronisation, seeking
> etc. is easier to solve. Applications such as VLC, mplayer and many
> others that are able to synchronise external captions with media
> resources have shown it for years. They do not need a sophisticated
> framework for activating objects from multiple sources.
>
>
>> This is why I really believe that you need to look at a more scalable
>> solution to this problem -- not because I want to impose SMIL on you, but
>> because you are imposing temporal and spatial problems on yourself in the
>> context of composing separate media objects.
>
> Believe me I know what I am talking about - I have done this many
> times before. Going down your line of thought will introduce a
> complexity for the media elements that they are not built for and not
> intended for. An <img> element is not built for spatial combination of
> multiple images and text objects either, but it does have a solution
> for accessibility. This is our focus here, too.
>
> Your line of thought really has to be done by introducing a different
> element and it may indeed be an idea to consider something like a
> mediacanvas for composing separate media objects - the current canvas
> is not built for dealing with the temporal synchronisation issues that
> such a solution requires and neither are the current media elements.
>
>
>> As an aside: One of the benefits of this approach is that it means that you
>> get broader selectivity for free on other objects -- this increases
>> accessibility options at no extra cost. (Note that whether you view the
>> controlling syntax in declarative or scripted terms is an orthogonal
>> concern.)
>
> Now, I think you are under-estimating the complexity. Marking this up
> is the easy bit. But implementing it is the hard bit.
>
> There is indeed a massive "extra cost" for introducing <seq>, <par>,
> <excl> etc. You need to introduce a whole media and network
> synchronisation sub-system that e.g. tracks what time we are up to in
> every single resource, whether they are all buffered correctly, no
> network connection is stalling, enough data available from each,
> determine which are to be activated based on several dimensions of
> conditions, make sure the overlapping is done correctly and what to do
> with gaps. As it stands, the number of events on a single media
> resource is already enormous (see
> http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#mediaevents).
> With multiple dynamically composed resources and no main resource this
> will explode exponentially. It is a complexity that comes at a high
> price and is overkill for the current media elements.
>
> In contrast, the current media elements have one master and it is
> providing the temporal timeline. Everything else synchronises to it.
> It provides the main audio or video and there are no gaps. Activation
> is simple and needs to be calculated only once at the beginning of the
> playback and changed only when the user interacts. It's massively less
> complex.
>
> Honestly, keeping the composing of several media elements separate
> from dealing with basically a single media resource as we are right
> now is the right way to go. It follows the divide and conquer
> principle: once this is solved and solid, it will be easier to develop
> an approach to solving your requirements.

Thanks Silvia for explaining this better than I could have. Here is my  
view on the two issues mentioned:

1. Temporal composition

As Silvia says, there is a main timeline to which everything is synced, so
no extra syntax is needed to express this. Without any examples or use
cases I can only guess that there might be two other things of interest
(in theory):

1.1. Scaling or offsetting the <track> relative to the original resource,  
e.g. to fix broken caption timing or include a sign-language video track  
for a small part of a long video.

I don't suggest including this now, but if we need it later the solution  
is simple:

<track offset="40" rate="1.5">
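To make the idea concrete, here is a minimal sketch of the time mapping
such attributes might imply. The semantics are an assumption, since offset
and rate are made-up attributes: the track is taken to start "offset"
seconds into the main resource, and its own clock is taken to run "rate"
times as fast as the main timeline. The function names are hypothetical.

```javascript
// Assumed semantics (not from any spec): the track begins `offset` seconds
// into the main resource and its clock runs `rate` times as fast.

// Convert a time on the main timeline to a time on the track's own clock.
function mainTimeToTrackTime(mainTime, offset, rate) {
  return (mainTime - offset) * rate;
}

// Convert a time on the track's own clock to a time on the main timeline.
function trackTimeToMainTime(trackTime, offset, rate) {
  return offset + trackTime / rate;
}

// Rebase caption cues ({start, end, text}, in track seconds) onto the
// main timeline without modifying the input array.
function rebaseCues(cues, offset, rate) {
  return cues.map((cue) => ({
    ...cue,
    start: trackTimeToMainTime(cue.start, offset, rate),
    end: trackTimeToMainTime(cue.end, offset, rate),
  }));
}
```

Under these assumptions, a cue running from 0 to 3 seconds in the track
would be shown from 40 to 42 seconds on the main timeline.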

1.2. Concatenating several media resources back-to-back with gapless  
playback.

This would be a big implementation burden. If implementations aren't  
gapless, then it will be no better than doing it via script, as such:

<video id="video1" src="video1.ogv"  
onended="document.getElementById('video2').play()"></video>
<video id="video2" src="video2.ogv"></video>
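The same scripted approach generalises to any number of resources. The
following is only a sketch: chainPlayback is a hypothetical helper name,
and like the inline handler above it makes no attempt at gapless playback,
since each element starts only after the previous one has fired "ended".

```javascript
// Start each player when the previous one fires "ended".
// `players` is a list of objects that offer addEventListener() and play(),
// for example HTMLMediaElement instances taken from the document.
function chainPlayback(players) {
  for (let i = 0; i < players.length - 1; i++) {
    const next = players[i + 1];
    players[i].addEventListener("ended", () => next.play());
  }
}
```

In a page it could be called as
chainPlayback([document.getElementById('video1'),
document.getElementById('video2')]), after which playing video1 leads
into video2.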

In the very unlikely case that gapless playback becomes a must-have  
feature and implementors are actually willing to support it, we can just  
let <audio> and <video> sync to another timeline. The mediacanvas Silvia  
suggests can be the HTML document itself:

<video id="video1" src="video1.ogv"></video>
<video id="video2" src="video2.ogv" timeline="video1"  
offset="end:0"></video>

I just made this up. We could copy whatever terminology SMIL uses if it is  
better. The point is, and I quote Silvia:

> Marking this up is the easy bit. But implementing it is the hard bit.

2. Spatial composition

CSS, always!

-- 
Philip Jägenstedt
Core Developer
Opera Software
Received on Saturday, 13 March 2010 04:21:00 UTC