[media] handling multitrack audio / video

Hi all,

This is to start a technical discussion on how to solve the multitrack
audio / video requirements in HTML5.

We've got the following related bug and I want to make a start on
discussing the advantages / disadvantages of different approaches:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=9452

Ian's comment on this was the following - and I agree that his
conclusion should be a general goal for the technical solution that we
eventually propose:
> The ability to control multiple internal media tracks (sign language video
> overlays, alternate angles, dubbed audio, etc) seems like something we'd want
> to do in a way consistent with handling of multiple external tracks, much like
> how internal subtitle tracks and external subtitle tracks should use the same
> mechanism so that they can be enabled and disabled and generally manipulated in
> a consistent way.

I can think of the following different mark-up approaches towards
solving this issue:


1. Overload <track>

For example synchronizing external audio description and sign language
video with main video:
<video id="v1" poster=“video.png” controls>
  <source src=“video.ogv” type=”video/ogg”>
  <source src=“video.mp4” type=”video/mp4”>
  <track kind=”subtitles” srclang=”fr” src=”sub_fr.wsrt”>
  <track kind=”subtitles” srclang=”ru” src=”sub_ru.wsrt”>
  <track kind=”chapters” srclang=”en” src=”chapters.wsrt”>
  <track src="audesc.ogg" kind="descriptions" type="audio/ogg"
srclang="en" label="English Audio Description">
  <track src="signlang.ogv" kind="signings" type="video/ogg"
srclang="asl" label="American Sign Language">
</video>

This approach adds a @type attribute to the <track> element, allowing
it to reference audio and video resources and not just text tracks.

There are a number of problems with such an approach:

* How do we reference alternative encodings?
   It would probably require introducing <source> elements inside
<track>, making <track> more complex, e.g. for selecting a currentSrc.
Also, if we needed different encodings for different devices, a @media
attribute would become necessary (see the sketch below).
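   For illustration only, this is hypothetical markup of where that
leads - not a proposal:
   <track kind="descriptions" srclang="en" label="English Audio Description">
     <source src="audesc.ogg" type="audio/ogg" media="(max-width: 480px)">
     <source src="audesc.mp3" type="audio/mpeg">
   </track>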

* How do we deal with synchronization issues?
   The main resource would probably always be the one whose timeline
dominates, with a best effort made to keep the others in sync with it.
So what happens if a user does not want to miss anything from one of
the auxiliary tracks, e.g. wants the sign language track to be the
timekeeper? That's not possible with this approach.

* How do we design the JavaScript API?
  There are no cues, so the TimedTrack cues and activeCues lists would
be empty and the cuechange event would never fire. Also, the audio and
video tracks would sit in the same TimedTrack list as the text ones,
possibly creating confusion, for example in an accessibility menu for
track selection, in particular where the track @kind goes beyond mere
accessibility, such as alternate viewing angles or a director's
comment (see the sketch below).
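  As a hypothetical sketch - assuming such audio / video tracks ended
up in the media element's tracks list alongside the text tracks - a
script building a track selection menu would have to special-case
them:
  var video = document.getElementById("v1");
  for (var i = 0; i < video.tracks.length; i++) {
    var track = video.tracks[i];
    if (track.kind == "descriptions" || track.kind == "signings") {
      // a/v track: cues and activeCues stay empty and cuechange
      // never fires, so cue-based code does nothing useful here
    } else {
      // genuine text track: subtitles, chapters etc.
    }
  }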

* What about other a/v related features, such as the width/height and
placement of the sign language video, or the volume of the audio
description?
  Having control over such extra features would be rather difficult to
specify, since the data is only regarded as abstract alternative
content to the main video. The rendering algorithm would become a lot
more complex, and attributes from the audio and video elements might
need to be introduced onto the <track> element, too. That would lead
to quite some duplication of functionality between different elements
(see the sketch below).
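  Purely for illustration - @width, @height and @volume on <track>
are made-up attributes here:
  <track src="signlang.ogv" kind="signings" type="video/ogg"
         srclang="asl" width="320" height="240" volume="0.5">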


2. Introduce <audiotrack> and <videotrack>

Instead of overloading <track>, one could consider creating new track
elements for audio and video, such as <audiotrack> and <videotrack>.

This allows keeping different attributes on these elements and keeping
the audio / video / text track lists separate in JavaScript.

Also, it more naturally allows for <source> elements inside such track
elements, e.g.:
<video id="v1" poster=“video.png” controls>
  <source src=“video.ogv” type=”video/ogg”>
  <source src=“video.mp4” type=”video/mp4”>
  <track kind=”subtitles” srclang=”fr” src=”sub_fr.wsrt”>
  <track kind=”subtitles” srclang=”ru” src=”sub_ru.wsrt”>
  <track kind=”chapters” srclang=”en” src=”chapters.wsrt”>
  <audiotrack kind=”descriptions” srclang=”en”>
    <source src=”description.ogg” type=”audio/ogg”>
    <source src=”description.mp3” type=”audio/mp3”>
  </audiotrack>
</video>
But fundamentally we have the same issues as with approach 1, in
particular the need to replicate some of the audio / video
functionality of the <audio> and <video> elements.
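For illustration only - the presentation attributes on <videotrack>
here are hypothetical, re-grown from <video>:
<videotrack kind="signings" srclang="asl" width="320" height="240" muted>
  <source src="signing.ogv" type="video/ogg">
  <source src="signing.mp4" type="video/mp4">
</videotrack>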


3. Introduce a <par>-like element

The fundamental challenge that we are facing is to find a way to
synchronise multiple audio-visual media resources, be they in-band,
where the overall timeline is clear, or separate external resources,
for which the overall timeline has to be defined. Then we are suddenly
no longer talking about a master resource with auxiliary resources,
but about audio-visual resources that are equals. This is more along
the SMIL way of thinking, which is why I called this section the
"<par>-like element".

An example markup for synchronizing external audio description and
sign language video with a main video could then be something like:
<par>
  <video id="v1" poster=“video.png” controls>
    <source src=“video.ogv” type=”video/ogg”>
    <source src=“video.mp4” type=”video/mp4”>
    <track kind=”subtitles” srclang=”fr” src=”sub_fr.wsrt”>
    <track kind=”subtitles” srclang=”ru” src=”sub_ru.wsrt”>
    <track kind=”chapters” srclang=”en” src=”chapters.wsrt”>
  </video>
  <audio controls>
    <source src="audesc.ogg" type="audio/ogg">
    <source src="audesc.mp3" type="audio/mp3">
  </audio>
  <video controls>
    <source src="signing.ogv" type="video/ogg">
    <source src="signing.mp4" type="video/mp4">
  </video>
</par>

This synchronisation element could of course be called something else:
<mastertime>, <coordinator>, <sync>, <timeline>, <container>, <timemaster> etc.

The synchronisation element needs to provide the main timeline. It
would make sure that the contained elements play and seek in parallel.
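To make the intended behaviour concrete, here is a rough script
emulation of what a <par> element would do natively (the wiring is of
course hypothetical - natively no script would be needed):
function playAll(par) {
  // start all grouped media elements together
  var elements = par.querySelectorAll("audio, video");
  for (var i = 0; i < elements.length; i++) {
    elements[i].play();
  }
}
function seekAll(par, time) {
  // propagate a seek to every media element in the group
  var elements = par.querySelectorAll("audio, video");
  for (var i = 0; i < elements.length; i++) {
    elements[i].currentTime = time;
  }
}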

Audio and video elements can then be styled individually as their own
CSS block elements and deactivated with "display: none".
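For example, assuming the sign language video in the example above
were given id="signing":
video#signing { display: none; }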

The sync element could have an attribute that decides whether to
accept drop-outs in contained elements when the main timeline
progresses but some of them have starved, or whether to put the whole
group into buffering mode when one of the elements goes into buffering
mode. It could also designate one element as the master whose timeline
must not be interrupted, and the others as slaves whose buffering
situations would be ignored. Something like @synchronize=[block/ignore]
and @master="v1" attributes.
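In markup - with nothing more implied than the attribute names
suggested above:
<par synchronize="ignore" master="v1">
  <video id="v1" src="video.ogv" controls></video>
  <audio src="audesc.ogg"></audio>
  <video src="signing.ogv"></video>
</par>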

Also, a decision would need to be made about what to do with
@controls. Should a controls display appear on the first/master
element if any of the elements has a @controls attribute? Should the
slave elements display no controls at all?


4. Nest media elements

An alternative means of re-using <audio> and <video> elements for
synchronisation is to put the "slave" elements inside the "master"
element like so:

<video id="v1" poster=“video.png” controls>
  <source src=“video.ogv” type=”video/ogg”>
  <source src=“video.mp4” type=”video/mp4”>
  <track kind=”subtitles” srclang=”fr” src=”sub_fr.wsrt”>
  <track kind=”subtitles” srclang=”ru” src=”sub_ru.wsrt”>
  <track kind=”chapters” srclang=”en” src=”chapters.wsrt”>
  <par>
    <audio controls>
      <source src="audesc.ogg" type="audio/ogg">
      <source src="audesc.mp3" type="audio/mp3">
    </audio>
    <video controls>
      <source src="signing.ogv" type="video/ogg">
      <source src="signing.mp4" type="video/mp4">
    </video>
  </par>
</video>

This makes clear whose timeline each element is following. But it
certainly looks recursive, and we would have to define that elements
inside a <par> cannot have another <par> inside them to stop that.

===

These are some of the thoughts I had on this topic. I am not yet
decided which of the above proposals - or which alternative proposal -
makes the most sense. I have a gut feeling that it is probably useful
to be able to define both: a dominant container that drives the
synchronization, and a grouping in which all containers are valued
equally. So maybe the third approach would be the most flexible, but
it certainly needs a bit more thinking.

Cheers,
Silvia.
