Re: [media] handling multitrack audio / video

On Tue, 19 Oct 2010 19:51:31 +0200, Silvia Pfeiffer  
<silviapfeiffer1@gmail.com> wrote:

> Hi all,
>
> This is to start a technical discussion on how to solve the multitrack
> audio / video requirements in HTML5.
>
> We've got the following related bug and I want to make a start on
> discussing the advantages / disadvantages of different approaches:
> http://www.w3.org/Bugs/Public/show_bug.cgi?id=9452
>
> Ian's comment on this was this - and I agree that his conclusion
> should be a general goal in the technical solution that we eventually
> propose:
>> The ability to control multiple internal media tracks (sign language video
>> overlays, alternate angles, dubbed audio, etc) seems like something we'd
>> want to do in a way consistent with handling of multiple external tracks,
>> much like how internal subtitle tracks and external subtitle tracks should
>> use the same mechanism so that they can be enabled and disabled and
>> generally manipulated in a consistent way.
>
> I can think of the following different mark-up approaches towards
> solving this issue:
>
>
> 1. Overload <track>
>
> For example synchronizing external audio description and sign language
> video with main video:
> <video id="v1" poster="video.png" controls>
>   <source src="video.ogv" type="video/ogg">
>   <source src="video.mp4" type="video/mp4">
>   <track kind="subtitles" srclang="fr" src="sub_fr.wsrt">
>   <track kind="subtitles" srclang="ru" src="sub_ru.wsrt">
>   <track kind="chapters" srclang="en" src="chapters.wsrt">
>   <track src="audesc.ogg" kind="descriptions" type="audio/ogg"
> srclang="en" label="English Audio Description">
>   <track src="signlang.ogv" kind="signings" type="video/ogg"
> srclang="asl" label="American Sign Language">
> </video>
>
> This adds a @type attribute to the <track> element, allowing it to
> reference not just text tracks but also audio and video resources.
>
> There are a number of problems with such an approach:
>
> * How do we reference alternative encodings?
>    It would probably require introducing <source> elements inside
> <track>, making <track> more complex with respect to selecting
> currentSrc etc. Also, if we needed different encodings for different
> devices, a @media attribute would be necessary.
>
> * How do we control synchronization issues?
>    The main resource would probably always be the one whose timeline
> dominates, with the others kept in sync with it on a best-effort
> basis. So what happens if a user does not want to miss anything from
> one of the auxiliary tracks, e.g. wants the sign language track to be
> the timekeeper? That is not possible with this approach.
>
> * How do we design the JavaScript API?
>    There are no cues, so the TimedTrack cues and activeCues lists
> would be empty and the cuechange event would never fire. The audio and
> video tracks would also end up in the same TimedTrack list as the text
> ones, possibly creating confusion, for example in an accessibility
> menu for track selection, in particular where the track @kind goes
> beyond mere accessibility, such as alternate viewing angles or a
> director's commentary (see the sketch after this list).
>
> * What about other a/v related features, such as width/height and
> placement of the sign language video, or the volume of the audio
> description?
>    Having control over such extra features would be rather difficult
> to specify, since the data is only regarded as abstract alternative
> content to the main video. The rendering algorithm would become a lot
> more complex, and attributes from the <audio> and <video> elements
> might need to be introduced onto the <track> element as well. That
> would lead to quite some duplication of functionality between
> different elements.
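>
> A rough script sketch of the menu problem just described (the combined
> track list is assumed to be exposed as video.tracks, and
> addToTrackMenu() is a placeholder; neither is meant as the actual
> TimedTrack API):
> var video = document.getElementById("v1");
> for (var i = 0; i < video.tracks.length; i++) {
>   var t = video.tracks[i];
>   // Text and audio/video tracks look alike here: @kind alone cannot
>   // distinguish a text "descriptions" track from an audio one, cues
>   // and activeCues would simply be empty for audio/video tracks, and
>   // cuechange would never fire for them, so a menu script would need
>   // some extra flag to group them sensibly.
>   addToTrackMenu(t.kind, t.label, t.language);
> }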
>
>
> 2. Introduce <audiotrack> and <videotrack>
>
> Instead of overloading <track>, one could consider creating new track
> elements for audio and video, such as <audiotrack> and <videotrack>.
>
> This allows keeping different attributes on these elements and having
> audio / video / text track lists separate in JavaScript.
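>
> As a rough sketch, such separate lists might look like this from
> script (the audioTracks and videoTracks property names are
> assumptions, by analogy with the text track list):
> var v = document.getElementById("v1");
> v.audioTracks.length;  // the <audiotrack> children only
> v.videoTracks.length;  // the <videotrack> children only
> v.tracks.length;       // the text tracks, as currently drafted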
>
> Also, it more easily allows for <source> elements inside these track
> elements, e.g.:
> <video id="v1" poster="video.png" controls>
>   <source src="video.ogv" type="video/ogg">
>   <source src="video.mp4" type="video/mp4">
>   <track kind="subtitles" srclang="fr" src="sub_fr.wsrt">
>   <track kind="subtitles" srclang="ru" src="sub_ru.wsrt">
>   <track kind="chapters" srclang="en" src="chapters.wsrt">
>   <audiotrack kind="descriptions" srclang="en">
>     <source src="description.ogg" type="audio/ogg">
>     <source src="description.mp3" type="audio/mp3">
>   </audiotrack>
> </video>
> But fundamentally we have the same issues as with approach 1, in
> particular the need to replicate some of the audio / video
> functionality of the <audio> and <video> elements.
>
>
> 3. Introduce a <par>-like element
>
> The fundamental challenge that we are facing is to find a way to
> synchronise multiple audio-visual media resources, be they in-band,
> where the overall timeline is clear, or separate external resources,
> where the overall timeline has to be defined. Then we are suddenly no
> longer talking about a master resource and auxiliary resources, but
> about audio-visual resources that are equals. This is more along the
> SMIL way of thinking, which is why I called this section the
> "<par>-like element".
>
> An example markup for synchronizing external audio description and
> sign language video with a main video could then be something like:
> <par>
>   <video id="v1" poster="video.png" controls>
>     <source src="video.ogv" type="video/ogg">
>     <source src="video.mp4" type="video/mp4">
>     <track kind="subtitles" srclang="fr" src="sub_fr.wsrt">
>     <track kind="subtitles" srclang="ru" src="sub_ru.wsrt">
>     <track kind="chapters" srclang="en" src="chapters.wsrt">
>   </video>
>   <audio controls>
>     <source src="audesc.ogg" type="audio/ogg">
>     <source src="audesc.mp3" type="audio/mp3">
>   </audio>
>   <video controls>
>     <source src="signing.ogv" type="video/ogg">
>     <source src="signing.mp4" type="video/mp4">
>   </video>
> </par>
>
> This synchronisation element could of course be called something else:
> <mastertime>, <coordinator>, <sync>, <timeline>, <container>,  
> <timemaster> etc.
>
> The synchronisation element needs to provide the main timeline. It
> would make sure that the elements play and seek in parallel.
>
> Audio and video elements can then be styled individually as their own
> CSS block elements and deactivated with "display: none".
>
> The sync element could have an attribute that decides whether, when
> the main timeline progresses but some contained elements starve, those
> elements simply drop out, or whether the whole group goes into
> buffering mode as soon as one of its elements does. It could also
> designate one element as the master, whose timeline must not be
> interrupted, and the others as slaves, whose buffering situations
> would be ignored. Something like @synchronize=[block/ignore] and
> @master="v1" attributes.
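>
> Applied to the example above, that could look something like this (the
> attribute names and values are of course just placeholders):
> <par synchronize="block" master="v1">
>   <video id="v1" poster="video.png" controls>
>     <source src="video.ogv" type="video/ogg">
>   </video>
>   <audio>
>     <source src="audesc.ogg" type="audio/ogg">
>   </audio>
>   <video>
>     <source src="signing.ogv" type="video/ogg">
>   </video>
> </par>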
>
> Also, a decision would need to be made about what to do with
> @controls. Should there be a controls display on the first/master
> element if any of them has a @controls attribute? Should the slave
> elements not have controls displayed?
>
>
> 4. Nest media elements
>
> An alternative means of re-using <audio> and <video> elements for
> synchronisation is to put the "slave" elements inside the "master"
> element like so:
>
> <video id="v1" poster="video.png" controls>
>   <source src="video.ogv" type="video/ogg">
>   <source src="video.mp4" type="video/mp4">
>   <track kind="subtitles" srclang="fr" src="sub_fr.wsrt">
>   <track kind="subtitles" srclang="ru" src="sub_ru.wsrt">
>   <track kind="chapters" srclang="en" src="chapters.wsrt">
>   <par>
>     <audio controls>
>       <source src="audesc.ogg" type="audio/ogg">
>       <source src="audesc.mp3" type="audio/mp3">
>     </audio>
>     <video controls>
>       <source src="signing.ogv" type="video/ogg">
>       <source src="signing.mp4" type="video/mp4">
>     </video>
>   </par>
> </video>
>
> This makes clear whose timeline the elements are following. But it
> certainly looks recursive, and we would have to define that elements
> inside a <par> cannot themselves contain another <par> in order to
> prevent that.
>
> ===
>
> These are some of the thoughts I had on this topic. I am not yet
> decided on which of the above proposals - or an alternative proposal -
> makes the most sense. I have a gut feeling that it is probably useful
> to be able to define both a dominant container for synchronization and
> one where all contained elements are valued equally. So maybe the
> third approach would be the most flexible, but it certainly needs a
> bit more thinking.
>
> Cheers,
> Silvia.
>

I think that if we want to synchronize several video tracks with  
non-trivial styling, then the only sensible option is to have multiple  
<video> elements which are linked together by some attribute. Otherwise  
we'd be limited to displaying one video over the other, or similar. A  
benefit of this approach is that it's easy to fake to within 100s of  
milliseconds in existing browsers, while <audiotrack> or nested <video>s  
would require more elaborate tricks to emulate (much like <track>).
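
To be concrete about the kind of faking I have in mind, here's a rough
sketch (it assumes a master <video id="main"> and a slave <video
id="signing"> already in the document; the 0.1 second drift threshold is
arbitrary):

var master = document.getElementById("main");
var slave = document.getElementById("signing");
// Mirror play/pause/seek on the slave and nudge its currentTime back
// towards the master whenever the two drift apart.
master.addEventListener("play", function() { slave.play(); }, false);
master.addEventListener("pause", function() { slave.pause(); }, false);
master.addEventListener("seeked", function() {
  slave.currentTime = master.currentTime;
}, false);
master.addEventListener("timeupdate", function() {
  if (Math.abs(slave.currentTime - master.currentTime) > 0.1)
    slave.currentTime = master.currentTime;
}, false);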

I can see the requirements on what to synchronize having a rather serious  
impact on the complexity. Mainly, these are the options:

1. Only synchronize tracks at their starting points, typically for extra  
audio tracks. This is very much like <track>.

2. Synchronize tracks at arbitrary offsets, including synchronizing the  
end of one track to the start of another. This is rather more SMIL-like.

For option 1, something like this would do:

<video id="bla"></video>
<video sync="bla"></video>

For option 2, things would be rather more complicated and I'm not going to  
make suggestions unless it's clear that we need it.

-- 
Philip Jägenstedt
Core Developer
Opera Software

Received on Thursday, 28 October 2010 07:13:19 UTC