Re: [media] handling multitrack audio / video

On Thu, 28 Oct 2010 12:06:19 +0200, Silvia Pfeiffer  
<silviapfeiffer1@gmail.com> wrote:

> On Thu, Oct 28, 2010 at 6:12 PM, Philip Jägenstedt <philipj@opera.com>  
> wrote:
>> On Tue, 19 Oct 2010 19:51:31 +0200, Silvia Pfeiffer
>> <silviapfeiffer1@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> This is to start a technical discussion on how to solve the multitrack
>>> audio / video requirements in HTML5.
>>>
>>> We've got the following related bug and I want to make a start on
>>> discussing the advantages / disadvantages of different approaches:
>>> http://www.w3.org/Bugs/Public/show_bug.cgi?id=9452
>>>
>>> Ian's comment on this was the following - and I agree that his
>>> conclusion should be a general goal for the technical solution that
>>> we eventually propose:
>>>>
>>>> The ability to control multiple internal media tracks (sign language
>>>> video overlays, alternate angles, dubbed audio, etc) seems like
>>>> something we'd want to do in a way consistent with handling of
>>>> multiple external tracks, much like how internal subtitle tracks and
>>>> external subtitle tracks should use the same mechanism so that they
>>>> can be enabled and disabled and generally manipulated in a consistent
>>>> way.
>>>
>>> I can think of the following different mark-up approaches towards
>>> solving this issue:
>>>
>>>
>>> 1. Overload <track>
>>>
>>> For example synchronizing external audio description and sign language
>>> video with main video:
>>> <video id="v1" poster="video.png" controls>
>>>  <source src="video.ogv" type="video/ogg">
>>>  <source src="video.mp4" type="video/mp4">
>>>  <track kind="subtitles" srclang="fr" src="sub_fr.wsrt">
>>>  <track kind="subtitles" srclang="ru" src="sub_ru.wsrt">
>>>  <track kind="chapters" srclang="en" src="chapters.wsrt">
>>>  <track src="audesc.ogg" kind="descriptions" type="audio/ogg"
>>> srclang="en" label="English Audio Description">
>>>  <track src="signlang.ogv" kind="signings" type="video/ogg"
>>> srclang="asl" label="American Sign Language">
>>> </video>
>>>
>>> This adds a @type attribute to the <track> element, allowing it to
>>> also be used with audio and video and not just text tracks.
>>>
>>> There are a number of problems with such an approach:
>>>
>>> * How do we reference alternative encodings?
>>>   It would probably require the introduction of <source> elements
>>> inside <track>, making <track> more complex, e.g. for selecting
>>> currentSrc. Also, if we needed different encodings for different
>>> devices, a @media attribute would be necessary.
>>>
>>> * How do we control synchronization issues?
>>>   The main resource would probably always be the one whose timeline
>>> dominates, while for the others we make a best effort to keep them in
>>> sync with it. So, what happens if a user does not want to miss
>>> anything from one of the auxiliary tracks, e.g. wants the sign
>>> language track to be the timekeeper? That's not possible with this
>>> approach.
>>>
>>> * How do we design the JavaScript API?
>>>  There are no cues, so the TimedTrack cues and activeCues lists would
>>> be empty and the cuechange event would never fire. The audio and
>>> video tracks would also sit in the same TimedTrack list as the text
>>> ones, possibly creating confusion, for example in an accessibility
>>> menu for track selection (see the sketch after this list), in
>>> particular where the track @kind goes beyond mere accessibility, such
>>> as alternate viewing angles or a director's commentary.
>>>
>>> * What about other a/v related features, such as width/height and
>>> placement of the sign language video, or the volume of the audio
>>> description?
>>>  Having control over such extra features would be rather difficult to
>>> specify, since the data is only regarded as abstract alternative
>>> content to the main video. The rendering algorithm would become a lot
>>> more complex, and attributes from the audio and video elements might
>>> need to be introduced onto the <track> element, too. That would lead
>>> to considerable duplication of functionality between different
>>> elements.
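>>>
>>> For illustration only, here is a rough sketch of the kind of script a
>>> page might use to build such a track selection menu, assuming the
>>> draft TimedTrack API (the tracks, kind and label names may of course
>>> differ, and the trackMenu element is hypothetical):
>>>
>>> var video = document.getElementById("v1");
>>> var menu = document.getElementById("trackMenu"); // a <select> element
>>> for (var i = 0; i < video.tracks.length; i++) {
>>>   var track = video.tracks[i];
>>>   // With overloaded <track>, text, audio and video tracks all end up
>>>   // in this one list, so subtitles get mixed with e.g. alternate
>>>   // camera angles or a sign language video.
>>>   var option = document.createElement("option");
>>>   option.textContent = track.kind + ": " + track.label;
>>>   menu.appendChild(option);
>>> }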
>>>
>>>
>>> 2. Introduce <audiotrack> and <videotrack>
>>>
>>> Instead of overloading <track>, one could consider creating new track
>>> elements for audio and video, such as <audiotrack> and <videotrack>.
>>>
>>> This allows keeping different attributes on these elements and having
>>> audio / video / text track lists separate in JavaScript.
>>>
>>> It also more easily allows for <source> elements inside these new
>>> track elements, e.g.:
>>> <video id="v1" poster="video.png" controls>
>>>  <source src="video.ogv" type="video/ogg">
>>>  <source src="video.mp4" type="video/mp4">
>>>  <track kind="subtitles" srclang="fr" src="sub_fr.wsrt">
>>>  <track kind="subtitles" srclang="ru" src="sub_ru.wsrt">
>>>  <track kind="chapters" srclang="en" src="chapters.wsrt">
>>>  <audiotrack kind="descriptions" srclang="en">
>>>    <source src="description.ogg" type="audio/ogg">
>>>    <source src="description.mp3" type="audio/mp3">
>>>  </audiotrack>
>>> </video>
>>> But fundamentally we have the same issues as with approach 1, in
>>> particular the need to replicate some of the audio / video
>>> functionality of the <audio> and <video> elements.
>>>
>>>
>>> 3. Introduce a <par>-like element
>>>
>>> The fundamental challenge that we are facing is to find a way to
>>> synchronise multiple audio-visual media resources, be they in-band,
>>> where the overall timeline is clear, or separate external resources,
>>> where the overall timeline has to be defined. Then we are suddenly no
>>> longer talking about a master resource and auxiliary resources, but
>>> about audio-visual resources that are equals. This is more along the
>>> SMIL way of thinking, which is why I called this section the
>>> "<par>-like element".
>>>
>>> An example markup for synchronizing external audio description and
>>> sign language video with a main video could then be something like:
>>> <par>
>>>  <video id="v1" poster="video.png" controls>
>>>    <source src="video.ogv" type="video/ogg">
>>>    <source src="video.mp4" type="video/mp4">
>>>    <track kind="subtitles" srclang="fr" src="sub_fr.wsrt">
>>>    <track kind="subtitles" srclang="ru" src="sub_ru.wsrt">
>>>    <track kind="chapters" srclang="en" src="chapters.wsrt">
>>>  </video>
>>>  <audio controls>
>>>    <source src="audesc.ogg" type="audio/ogg">
>>>    <source src="audesc.mp3" type="audio/mp3">
>>>  </audio>
>>>  <video controls>
>>>    <source src="signing.ogv" type="video/ogg">
>>>    <source src="signing.mp4" type="video/mp4">
>>>  </video>
>>> </par>
>>>
>>> This synchronisation element could of course be called something else:
>>> <mastertime>, <coordinator>, <sync>, <timeline>, <container>,  
>>> <timemaster>
>>> etc.
>>>
>>> The synchronisation element needs to provide the main timeline. It
>>> would make sure that the elements play and seek in parallel.
>>>
>>> Audio and video elements can then be styled individually as their own
>>> CSS block elements and deactivated with "display: none".
>>>
>>> The sync element could have an attribute to decide whether contained
>>> elements simply drop out when the main timeline progresses but they
>>> are starved, or whether the whole group goes into buffering mode as
>>> soon as one of the elements does. It could also designate one element
>>> as the master whose timeline must be respected, and the others as
>>> slaves whose buffering situations would be ignored. Something like
>>> @synchronize=[block/ignore] and @master="v1" attributes, e.g. as
>>> sketched below.
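>>>
>>> For illustration only (the attribute names are just placeholders, and
>>> the child elements are abbreviated from the example above):
>>>
>>> <par synchronize="ignore" master="v1">
>>>  <video id="v1" src="video.ogv" controls></video>
>>>  <audio src="audesc.ogg"></audio>
>>>  <video src="signing.ogv"></video>
>>> </par>
>>>
>>> Here the group would keep following the timeline of "v1" and would
>>> simply skip over parts of the audio description or sign language
>>> video if those were still buffering.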
>>>
>>> Also, a decision would need to be made about what to do with
>>> @controls. Should there be a controls display on the first/master
>>> element if any of them has a @controls attribute? Should the slave
>>> elements not have controls displayed?
>>>
>>>
>>> 4. Nest media elements
>>>
>>> An alternative means of re-using <audio> and <video> elements for
>>> synchronisation is to put the "slave" elements inside the "master"
>>> element like so:
>>>
>>> <video id="v1" poster="video.png" controls>
>>>  <source src="video.ogv" type="video/ogg">
>>>  <source src="video.mp4" type="video/mp4">
>>>  <track kind="subtitles" srclang="fr" src="sub_fr.wsrt">
>>>  <track kind="subtitles" srclang="ru" src="sub_ru.wsrt">
>>>  <track kind="chapters" srclang="en" src="chapters.wsrt">
>>>  <par>
>>>    <audio controls>
>>>      <source src="audesc.ogg" type="audio/ogg">
>>>      <source src="audesc.mp3" type="audio/mp3">
>>>    </audio>
>>>    <video controls>
>>>      <source src="signing.ogv" type="video/ogg">
>>>      <source src="signing.mp4" type="video/mp4">
>>>    </video>
>>>  </par>
>>> </video>
>>>
>>> This makes it clear whose timeline the slave elements are following.
>>> But it certainly looks recursive, and we would have to define that
>>> elements inside a <par> cannot themselves contain another <par> to
>>> prevent that.
>>>
>>> ===
>>>
>>> These are some of the thoughts I had on this topic. I have not yet
>>> decided which of the above proposals - or which alternative proposal -
>>> makes the most sense. I have a gut feeling that it is probably useful
>>> to be able to define both: a dominant container for synchronization,
>>> and one where all contained resources are treated as equals. So,
>>> maybe the third approach would be the most flexible, but it certainly
>>> needs a bit more thinking.
>>>
>>> Cheers,
>>> Silvia.
>>>
>>
>> I think that if we want to synchronize several video tracks with
>> non-trivial styling, then the only sensible option is to have multiple
>> <video> elements which are linked together by some attribute.
>> Otherwise we'd be limited to displaying one video over the other, or
>> similar. A benefit of this approach is that it's easy to fake to
>> within 100s of milliseconds in existing browsers, while <audiotrack>
>> or nested <video>s would require more elaborate tricks to emulate
>> (much like <track>).
>>
>> I can see the requirements on what to synchronize having a rather
>> serious impact on the complexity. Mainly, these are the options:
>>
>> 1. Only synchronize tracks at their starting points, typically for extra
>> audio tracks. This is very much like <track>.
>>
>> 2. Synchronize tracks at arbitrary offsets, including synchronizing
>> the end of one track to the start of another. This is rather more
>> SMIL-like.
>>
>> For option 1, something like this would do:
>>
>> <video id="bla"></video>
>> <video sync="bla"></video>
>>
>> For option 2, things would be rather more complicated and I'm not
>> going to make suggestions unless it's clear that we need it.
>
> How would you suggest we solve the issue of audio descriptions that
> are provided as separate audio files from the main video resource?
> They have to be synchronized to the main video resource since the
> spoken description is timed to be given at the exact times where the
> main audio resource has pauses. I have previously made an experiment
> at http://www.annodex.net/~silvia/itext/elephant_separate_audesc_dub.html
> to demonstrate some of the involved issues, in particular the
> synchronization problem. It can be solved with frequent re-sync-ing.

Assuming that the external audio track has the same timeline as the main
video file, something like this would do:

<video id="v" src="video.webm"></video>
<audio sync="v" src="description.webm"></audio>

This wouldn't be particularly hard to implement, I think; most of the
complexity seems to be in what the state of the video should be when the
audio has stopped for buffering, and vice versa.
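
For comparison, the frequent re-syncing you mention can be approximated
in script today. A rough, untested sketch (the "sync" attribute above is
of course not read by anything yet, so the script wires the two elements
together itself):

<video id="v" src="video.webm"></video>
<audio id="d" src="description.webm"></audio>
<script>
  var video = document.getElementById("v");
  var desc = document.getElementById("d");
  // Mirror play/pause onto the description track.
  video.addEventListener("play", function () { desc.play(); }, false);
  video.addEventListener("pause", function () { desc.pause(); }, false);
  // Re-sync whenever the drift grows beyond roughly 100 ms.
  video.addEventListener("timeupdate", function () {
    if (Math.abs(desc.currentTime - video.currentTime) > 0.1)
      desc.currentTime = video.currentTime;
  }, false);
</script>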

> Is this not an acceptable use case for you?

Perhaps I was unclear: if we support syncing separate <audio>/<video>
elements, then the above is actually the very least we could do. It's
beyond this most basic case that I'd like to understand the actual use
cases. To clarify, option 2 would allow things like this, borrowing SMIL
syntax as seen in SVG:

<video id="v" src="video.webm"></video>
<video begin="v.begin+10s" src="video2.webm"></video>
<!-- video and video2 should be synchronized with a 10s offset -->

or

<video id="v" src="video.webm"></video>
<video begin="v.end" src="video2.webm"></video>
<!-- video and video2 should play gapless back-to-back -->

Are there compelling reasons to complicate things to this extent? The
last example could be abused to achieve gapless playback between chunks
in an HTTP live streaming setup, but I'm not a fan of that solution
myself.

-- 
Philip Jägenstedt
Core Developer
Opera Software
