Re: [media] handling multitrack audio / video

On Thu, Oct 28, 2010 at 6:12 PM, Philip Jägenstedt <philipj@opera.com> wrote:
> On Tue, 19 Oct 2010 19:51:31 +0200, Silvia Pfeiffer
> <silviapfeiffer1@gmail.com> wrote:
>
>> Hi all,
>>
>> This is to start a technical discussion on how to solve the multitrack
>> audio / video requirements in HTML5.
>>
>> We've got the following related bug and I want to make a start on
>> discussing the advantages / disadvantages of different approaches:
>> http://www.w3.org/Bugs/Public/show_bug.cgi?id=9452
>>
>> Ian's comment on this was this - and I agree that his conclusion
>> should be a general goal in the technical solution that we eventually
>> propose:
>>>
>>> The ability to control multiple internal media tracks (sign language
>>> video
>>> overlays, alternate angles, dubbed audio, etc) seems like something we'd
>>> want
>>> to do in a way consistent with handling of multiple external tracks, much
>>> like
>>> how internal subtitle tracks and external subtitle tracks should use the
>>> same
>>> mechanism so that they can be enabled and disabled and generally
>>> manipulated in
>>> a consistent way.
>>
>> I can think of the following different mark-up approaches towards
>> solving this issue:
>>
>>
>> 1. Overload <track>
>>
>> For example synchronizing external audio description and sign language
>> video with main video:
>> <video id="v1" poster="video.png" controls>
>>  <source src="video.ogv" type="video/ogg">
>>  <source src="video.mp4" type="video/mp4">
>>  <track kind="subtitles" srclang="fr" src="sub_fr.wsrt">
>>  <track kind="subtitles" srclang="ru" src="sub_ru.wsrt">
>>  <track kind="chapters" srclang="en" src="chapters.wsrt">
>>  <track src="audesc.ogg" kind="descriptions" type="audio/ogg"
>> srclang="en" label="English Audio Description">
>>  <track src="signlang.ogv" kind="signings" type="video/ogg"
>> srclang="asl" label="American Sign Language">
>> </video>
>>
>> This adds a @type attribute to the <track> element, allowing it to
>> be used not just for text tracks, but also for audio and video tracks.
>>
>> There are a number of problems with such an approach:
>>
>> * How do we reference alternative encodings?
>>   It would probably require introducing <source> elements inside
>> <track>, making <track> more complex, e.g. for selecting currentSrc.
>> Also, if we needed different encodings for different devices, a
>> @media attribute would be necessary.
>>
>> * How do we control synchronization issues?
>>   The main resource would probably always be the one whose timeline
>> dominates, with the others kept in sync with it on a best-effort
>> basis. So what happens if a user does not want to miss anything from
>> one of the auxiliary tracks, e.g. wants the sign language track to be
>> the timekeeper? That is not possible with this approach.
>>
>> * How do we design the JavaScript API?
>>  There are no cues, so the TimedTrack cues and activeCues lists would
>> be empty and the cuechange event would never fire. The audio and
>> video tracks would sit in the same TimedTrack list as the text
>> tracks, possibly creating confusion, for example in an accessibility
>> menu for track selection, in particular where the track @kind goes
>> beyond mere accessibility, such as alternate viewing angles or a
>> director's commentary. The sketch below illustrates the problem.
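>>
>> A rough JavaScript sketch of that confusion, assuming the single
>> mixed "tracks" list that this approach implies (the property and
>> attribute names here are illustrative only, not anything specified):
>>
>>  var video = document.getElementById("v1");
>>  for (var i = 0; i < video.tracks.length; i++) {
>>    var track = video.tracks[i];
>>    // the only way to tell the tracks apart is to sniff the
>>    // proposed @type attribute:
>>    if (track.type && track.type.indexOf("audio/") == 0) {
>>      // audio track: cues/activeCues empty, cuechange never fires
>>    } else if (track.type && track.type.indexOf("video/") == 0) {
>>      // video track: same problem, plus rendering questions
>>    } else {
>>      // text track: cues and activeCues are meaningful
>>    }
>>  }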
>>
>> * What about other a/v related features, such as width/height and
>> placement of the sign language video, or the volume of the audio
>> description?
>>  Having control over such extra features would be rather difficult to
>> specify, since the data is regarded only as abstract alternative
>> content to the main video. The rendering algorithm would become a lot
>> more complex, and attributes from the <audio> and <video> elements
>> might have to be introduced on the <track> element, too. That would
>> lead to considerable duplication of functionality between the
>> elements.
>>
>>
>> 2. Introduce <audiotrack> and <videotrack>
>>
>> Instead of overloading <track>, one could consider creating new track
>> elements for audio and video, such as <audiotrack> and <videotrack>.
>>
>> This allows these elements to carry their own attributes and keeps
>> the audio / video / text track lists separate in JavaScript, as
>> sketched below.
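>>
>> In script, that separation could look something like this (the list
>> names audioTracks / videoTracks / textTracks are pure invention for
>> illustration):
>>
>>  var video = document.getElementById("v1");
>>  var descriptions = video.audioTracks;  // from <audiotrack> elements
>>  var signings = video.videoTracks;      // from <videotrack> elements
>>  var subtitles = video.textTracks;      // from <track> elements
>>  // an accessibility menu can then be built per list, without
>>  // having to sniff the type of every track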
>>
>> Also, it more easily accommodates <source> elements inside these
>> track elements, e.g.:
>> <video id="v1" poster="video.png" controls>
>>  <source src="video.ogv" type="video/ogg">
>>  <source src="video.mp4" type="video/mp4">
>>  <track kind="subtitles" srclang="fr" src="sub_fr.wsrt">
>>  <track kind="subtitles" srclang="ru" src="sub_ru.wsrt">
>>  <track kind="chapters" srclang="en" src="chapters.wsrt">
>>  <audiotrack kind="descriptions" srclang="en">
>>    <source src="description.ogg" type="audio/ogg">
>>    <source src="description.mp3" type="audio/mpeg">
>>  </audiotrack>
>> </video>
>> But fundamentally we have the same issues as with approach 1, in
>> particular the need to replicate some of the audio / video
>> functionality of the <audio> and <video> elements.
>>
>>
>> 3. Introduce a <par>-like element
>>
>> The fundamental challenge that we are facing is to find a way to
>> synchronise multiple audio-visual media resources, be they in-band,
>> where the overall timeline is clear, or separate external resources,
>> where the overall timeline has to be defined. Then we are suddenly no
>> longer talking about a master resource and auxiliary resources, but
>> about audio-visual resources that are equals. This is more along the
>> SMIL way of thinking, which is why I called this section the
>> "<par>-like element".
>>
>> An example markup for synchronizing external audio description and
>> sign language video with a main video could then be something like:
>> <par>
>>  <video id="v1" poster="video.png" controls>
>>    <source src="video.ogv" type="video/ogg">
>>    <source src="video.mp4" type="video/mp4">
>>    <track kind="subtitles" srclang="fr" src="sub_fr.wsrt">
>>    <track kind="subtitles" srclang="ru" src="sub_ru.wsrt">
>>    <track kind="chapters" srclang="en" src="chapters.wsrt">
>>  </video>
>>  <audio controls>
>>    <source src="audesc.ogg" type="audio/ogg">
>>    <source src="audesc.mp3" type="audio/mp3">
>>  </audio>
>>  <video controls>
>>    <source src="signing.ogv" type="video/ogg">
>>    <source src="signing.mp4" type="video/mp4">
>>  </video>
>> </par>
>>
>> This synchronisation element could of course be called something else:
>> <mastertime>, <coordinator>, <sync>, <timeline>, <container>, <timemaster>
>> etc.
>>
>> The synchronisation element needs to provide the main timeline. It
>> would make sure that the elements play and seek in parallel.
>>
>> Audio and video elements can then be styled individually as their own
>> CSS block elements and deactivated with "display: none".
>>
>> The sync element could have an attribute that decides between two
>> policies: either contained elements drop out when the main timeline
>> progresses while they are starved of data, or the whole group goes
>> into buffering mode as soon as one element does. It could also
>> designate one element as the master whose timeline must not be
>> interrupted, and the others as slaves whose buffering situations
>> would be ignored. Something like @synchronize=[block/ignore] and
>> @master="v1" attributes; see the behavioural sketch below.
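>>
>> As a behavioural sketch only (the <par> element and its attributes
>> are of course hypothetical here), the block/ignore policy could
>> amount to something like:
>>
>>  var par = document.querySelector("par");
>>  var master = document.getElementById(par.getAttribute("master"));
>>  var slaves = [].slice.call(par.querySelectorAll("audio, video"))
>>      .filter(function (el) { return el != master; });
>>  slaves.forEach(function (slave) {
>>    slave.addEventListener("waiting", function () {
>>      // block: the whole group waits for the starved slave;
>>      // ignore: the slave drops out and the master plays on
>>      if (par.getAttribute("synchronize") == "block") master.pause();
>>    }, false);
>>    slave.addEventListener("canplaythrough", function () {
>>      if (par.getAttribute("synchronize") == "block" && master.paused)
>>        master.play();
>>    }, false);
>>  });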
>>
>> Also, a decision would need to be made about what to do with
>> @controls. Should controls be displayed on the first/master element
>> if any of the elements has a @controls attribute? Should the slave
>> elements display no controls at all?
>>
>>
>> 4. Nest media elements
>>
>> An alternative means of re-using <audio> and <video> elements for
>> synchronisation is to put the "slave" elements inside the "master"
>> element like so:
>>
>> <video id="v1" poster="video.png" controls>
>>  <source src="video.ogv" type="video/ogg">
>>  <source src="video.mp4" type="video/mp4">
>>  <track kind="subtitles" srclang="fr" src="sub_fr.wsrt">
>>  <track kind="subtitles" srclang="ru" src="sub_ru.wsrt">
>>  <track kind="chapters" srclang="en" src="chapters.wsrt">
>>  <par>
>>    <audio controls>
>>      <source src="audesc.ogg" type="audio/ogg">
>>      <source src="audesc.mp3" type="audio/mp3">
>>    </audio>
>>    <video controls>
>>      <source src="signing.ogv" type="video/ogg">
>>      <source src="signing.mp4" type="video/mp4">
>>    </video>
>>  </par>
>> </video>
>>
>> This makes it clear whose timeline the elements are following. But it
>> does look recursive, and we would have to specify that elements
>> inside a <par> cannot themselves contain another <par>.
>>
>> ===
>>
>> These are some of the thoughts I had on this topic. I am not yet
>> decided on which of the above proposals - or an alternative proposal -
>> makes the most sense. I have a gut feeling that it is probably useful
>> to be able to define both: a dominant container for synchronisation
>> and one where all containers are valued equally. So maybe the third
>> approach would be the most flexible, but it certainly needs more
>> thought.
>>
>> Cheers,
>> Silvia.
>>
>
> I think that if we want to synchronize several video tracks with
> non-trivial styling, then the only sensible option is to have multiple
> <video> elements which are linked together by some attribute.
> Otherwise we'd be limited to displaying one video over the other, or
> similar. A benefit of this approach is that it's easy to fake to
> within hundreds of milliseconds in existing browsers, while
> <audiotrack> or nested <video>s would require more elaborate tricks to
> emulate (much like <track>).
>
> I can see the requirements on what to synchronize having a rather serious
> impact on the complexity. Mainly, these are the options:
>
> 1. Only synchronize tracks at their starting points, typically for extra
> audio tracks. This is very much like <track>.
>
> 2. Synchronize tracks at arbitrary offsets, including synchronizing the end
> of one track to the start of another. This is rather more SMIL-like.
>
> For option 1, something like this would do:
>
> <video id="bla"></video>
> <video sync="bla"></video>
>
> For option 2, things would be rather more complicated and I'm not going to
> make suggestions unless it's clear that we need it.

How would you suggest we solve the issue of audio descriptions that
are provided as audio files separate from the main video resource?
They have to be synchronized with the main video resource since the
spoken description is timed to fit exactly into the pauses of the
main audio. I have previously made an experiment at
http://www.annodex.net/~silvia/itext/elephant_separate_audesc_dub.html
to demonstrate some of the issues involved, in particular the
synchronization problem. It can be solved with frequent re-syncing,
along the lines of the sketch below. Is this not an acceptable use
case for you?
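
To be concrete, the re-syncing amounts to something like the following
(a sketch only; the element ids and the 0.1 second drift threshold are
arbitrary illustrations, not necessarily what the experiment page
uses):

  var master = document.getElementById("v");      // main video
  var desc = document.getElementById("audesc");   // audio description

  // mirror transport controls from the master to the description
  master.addEventListener("play", function () { desc.play(); }, false);
  master.addEventListener("pause", function () { desc.pause(); }, false);
  master.addEventListener("seeked", function () {
    desc.currentTime = master.currentTime;
  }, false);

  // correct drift whenever the master fires a timeupdate event
  master.addEventListener("timeupdate", function () {
    if (Math.abs(desc.currentTime - master.currentTime) > 0.1)
      desc.currentTime = master.currentTime;
  }, false);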

Cheers,
Silvia.

Received on Thursday, 28 October 2010 10:07:12 UTC