Re: [media] handling multitrack audio / video from Silvia Pfeiffer on 2010-12-02 (public-html-a11y@w3.org from December 2010)

From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Date: Thu, 2 Dec 2010 14:18:17 +1100
To: Geoff Freed <geoff_freed@wgbh.org>
Cc: "public-html-a11y@w3.org" <public-html-a11y@w3.org>, "Frank.Olivier@microsoft.com" <Frank.Olivier@microsoft.com>
Message-ID: <AANLkTimESuA-jvoP+umvU=ga4LacxmFCo1DKUm4uQRFb@mail.gmail.com>
Hi Geoff,

in which way are "open" (full-mix) audio descriptions like you
describe them currently typically delivered?

1. as an extra track on the resource, i.e. the main resource has three
tracks: video, audio, full-mix audio

2. as a separate audio resource, i.e. there are two resources, one of
video & audio, and a separate full-mix audio

3. as a separate video resource, i.e. there are two resources, one of
video & audio, and a separate video & full-mix audio

I am asking, because the technical implications of these are rather different.

Cheers,
Silvia.


On Thu, Dec 2, 2010 at 1:18 PM, Geoff Freed <geoff_freed@wgbh.org> wrote:
>
> Apologies for a somewhat late response to this thread.
>
> I’d just like to drive in the point that it’s going to be very important to
> provide support for, and easy UI access to, embedded audio-description
> tracks.  Pre-recorded, human-narrated tracks are the norm now for described
> video, and having a mechanism for supporting and accessing those additional
> audio tracks is necessary in order to support described broadcast video that
> is moved to the Web.  Pre-recorded descriptions aren’t going to go away any
> time soon, even when TTS support is possible.
>
> One other thing to note: the current practice is to supply a single audio
> track that contains both the program audio as well as the descriptions.
>  This full-mix track was necessary in analog broadcasts and is necessary now
> on the Web because most multimedia players do not support the playback of
> two tracks simultaneously (one being the the program audio and the other
> being only the descriptions).  Therefore, to support existing described
> broadcast video that is moved to the Web, we’re going to need to provide the
> capability (or the option) of toggling between a full-mix track and an
> undescribed track, when both are available.  (This will be in addition to
> the new option of playing two tracks (program audio + descriptions)
> simultaneously.)
>
> Geoff Freed
> WGBH/NCAM
>
>
> ========
>
>
> If that is indeed the solution we favor, we should move fast before
> any implementations of <track> are made and <track> is confirmed as a
> content-less element, seeing as we will need <source> elements inside
> <track> then.
>
> Also, a big problem with this approach is that we lose all the
> functionality that is otherwise available to audio and video
> resources, such as setting he volume, width, height, placement etc.
> For example, did you have any thoughts on how to do the display of
> sign language video in this scenario?
>
> Cheers,
> Silvia.
>
> On Thu, Oct 28, 2010 at 11:37 AM, Frank Olivier
> <Frank.Olivier@microsoft.com> wrote:
>> Overall feedback from the Internet Explorer team:
>>
>> Option 1 - Overloading <track> and keeping the DOM parsing very simple
>> would be the most elegant way to proceed, with every track having a 'kind'
>> attribute - Generally, I expect to be able to activate a single alternate
>> track of a particular type (Also, this would also sync up well with a simple
>> javascript API - I would expect that I (as an author) would be able to
>> enumerate a flat list of alternate representations, with metadata that
>> indicates the kind.
>>
>> Additionally, it would be prudent to leave exact conformance (when
>> enabling/disabling alternate tracks) as a 'quality of implementation' issue
>> - For example, with multiple alternate audio tracks, there will certainly be
>> some (mobile) devices where the user agent is only able to play back one
>> audio or video track at a time due to hardware constraints; some devices may
>> only have enough screen real estate to display a single caption track.
>>
>> Text captioning and alternate audio tracks seem to be core requirements
>> for HTML5 at this point. The user requirements document does do a good job
>> of enumerating all issues that users face, but it will take significant time
>> to fully spec and prototype all features - I expect that user agents will be
>> able to implement the two areas mentioned; the features beyond this
>> certainly require a lot more discussion and speccing that is best addressed
>> in future versions of the HTML spec.
>>
>> Thanks
>> Frank Olivier
>>
>>
>> -----Original Message-----
>> From: public-html-a11y-request@w3.org
>> [mailto:public-html-a11y-request@w3.org] On Behalf Of Silvia Pfeiffer
>> Sent: Tuesday, October 19, 2010 10:52 AM
>> To: HTML Accessibility Task Force
>> Subject: [media] handling multitrack audio / video
>>
>> Hi all,
>>
>> This is to start a technical discussion on how to solve the multitrack
>> audio / video requirements in HTML5.
>>
>> We've got the following related bug and I want to make a start on
>> discussing the advantages / disadvantages of different approaches:
>> http://www.w3.org/Bugs/Public/show_bug.cgi?id=9452
>>
>> Ian's comment on this was this - and I agree that his conclusion should be
>> a general goal in the technical solution that we eventually
>> propose:
>>> The ability to control multiple internal media tracks (sign language
>>> video overlays, alternate angles, dubbed audio, etc) seems like
>>> something we'd want to do in a way consistent with handling of
>>> multiple external tracks, much like how internal subtitle tracks and
>>> external subtitle tracks should use the same mechanism so that they
>>> can be enabled and disabled and generally manipulated in a consistent
>>> way.
>>
>> I can think of the following different mark-up approaches towards solving
>> this issue:
>>
>>
>> 1. Overload <track>
>>
>> For example synchronizing external audio description and sign language
>> video with main video:
>> <video id="v1" poster="video.png" controls>
>>  <source src="video.ogv" type="video/ogg">
>>  <source src="video.mp4" type="video/mp4">
>>  <track kind="subtitles" srclang="fr" src="sub_fr.wsrt">
>>  <track kind="subtitles" srclang="ru" src="sub_ru.wsrt">
>>  <track kind="chapters" srclang="en" src="chapters.wsrt">
>>  <track src="audesc.ogg" kind="descriptions" type="audio/ogg"
>> srclang="en" label="English Audio Description">
>>  <track src="signlang.ogv" kind="signings" type="video/ogg"
>> srclang="asl" label="American Sign Language"> </video>
>>
>> This adds a @type attribute to the <track> element, allowing it to also be
>> used with audio and video and not just text tracks.
>>
>> There are a number of problems with such an approach:
>>
>> * How do we reference alternative encodings?
>>   It would probably require the introduction of <source> elements inside
>> <track>, making <track> more complex for selecting currentSrc etc. Also, if
>> we needed different encodings for different devices, a @media attribute will
>> be necessary.
>>
>> * How do we control synchronization issues?
>>   The main resource would probably always be the one whose timeline
>> dominates and for the others we do a best effort to keep in sync with that
>> one. So, what happens if a user wants to not miss anything from one of the
>> auxiliary tracks, e.g. wants the sign language track to be the time keeper?
>> That's not possible with this approach.
>>
>> * How do we design the JavaScript API?
>>  There are no cues, so TimedTrack cues and activeCues would  be empty
>> elements and the cuechange would not ever be activated. The audio and video
>> tracks will be in the same TimedTrack list as the text ones and possibly
>> creating confusion for example in a accessibility menu for track selection,
>> in particular where the track @kind goes beyond mere accessibility such as
>> alternate viewing angles or director's comment.
>>
>> * What about other a/v related features, such as width/height and
>> placement of the sign language video or volume of the audio description?
>>  Having control over such extra features would be rather difficult to
>> specify, since the data is only regarded as an abstract alternative content
>> to the main video. The rendering algorithm would become a lot more complex
>> and attributes from audio and video elements may be necessary to introduce
>> onto the <track> element, too. It seems that would lead to quite some
>> duplication of functionality between different elements.
>>
>>
>> 2. Introduce <audiotrack> and <videotrack>
>>
>> Instead of overloading <track>, one could consider creating new track
>> elements for audio and video, such as <audiotrack> and <videotrack>.
>>
>> This allows keeping different attributes on these elements and having
>> audio / video / text track lists separate in JavaScript.
>>
>> Also, it allows for <source> elements inside <track> more easily, e.g.:
>> <video id="v1" poster="video.png" controls>
>>  <source src="video.ogv" type="video/ogg">
>>  <source src="video.mp4" type="video/mp4">
>>  <track kind="subtitles" srclang="fr" src="sub_fr.wsrt">
>>  <track kind="subtitles" srclang="ru" src="sub_ru.wsrt">
>>  <track kind="chapters" srclang="en" src="chapters.wsrt">
>>  <audiotrack kind="descriptions" srclang="en">
>>    <source src="description.ogg" type="audio/ogg">
>>    <source src="description.mp3" type="audio/mp3">
>>  </audiotrack>
>> </video>
>> But fundamentally we have the same issues as with approach 1, in
>> particular a replication need of some of the audio / video functionality
>> from the <audio> and <video> elements.
>>
>>
>> 3. Introduce a <par>-like element
>>
>> The fundamental challenge that we are facing is to find a way to
>> synchronise multiple audio-visual media resources, be that from in-band
>> where the overall timeline is clear or be that with separate external
>> resources where the overall timeline has to be defined. Then we are suddenly
>> not talking any more about a master resource and auxiliary resources, but
>> audio-visual resources that are equals. This is more along the SMIL way of
>> thinking, which is why I called this section the "<par>-like element".
>>
>> An example markup for synchronizing external audio description and sign
>> language video with a main video could then be something like:
>> <par>
>>  <video id="v1" poster="video.png" controls>
>>    <source src="video.ogv" type="video/ogg">
>>    <source src="video.mp4" type="video/mp4">
>>    <track kind="subtitles" srclang="fr" src="sub_fr.wsrt">
>>    <track kind="subtitles" srclang="ru" src="sub_ru.wsrt">
>>    <track kind="chapters" srclang="en" src="chapters.wsrt">
>>  </video>
>>  <audio controls>
>>    <source src="audesc.ogg" type="audio/ogg">
>>    <source src="audesc.mp3" type="audio/mp3">
>>  </audio>
>>  <video controls>
>>    <source src="signing.ogv" type="video/ogg">
>>    <source src="signing.mp4" type="video/mp4">
>>  </video>
>> </par>
>>
>> This synchronisation element could of course be called something else:
>> <mastertime>, <coordinator>, <sync>, <timeline>, <container>, <timemaster>
>> etc.
>>
>> The synchronisation element needs to provide the main timeline. It would
>> make sure that the elements play and seek in parallel.
>>
>> Audio and video elements can then be styled individually as their own CSS
>> block elements and deactivated with "display: none".
>>
>> The sync element could have an attribute to decide whether to have
>> drop-outs in elements if the main timeline progresses, but some contained
>> elements starved, or whether to go into overall buffering mode if one of the
>> elements goes into buffering mode. It could also define one as the main
>> element whose timeline should not be ignored and the others as slaves for
>> which buffering situations would be ignored. Something like
>> @synchronize=[block/ignore] and @master="v1"
>> attributes.
>>
>> Also, a decision would need to be made about what to do with @controls.
>> Should there be a controls display on the first/master element if any of
>> them has a @controls attribute? Should the slave elements not have controls
>> displayed?
>>
>>
>> 4. Nest media elements
>>
>> An alternative means of re-using <audio> and <video> elements for
>> synchronisation is to put the "slave" elements inside the "master"
>> element like so:
>>
>> <video id="v1" poster="video.png" controls>
>>  <source src="video.ogv" type="video/ogg">
>>  <source src="video.mp4" type="video/mp4">
>>  <track kind="subtitles" srclang="fr" src="sub_fr.wsrt">
>>  <track kind="subtitles" srclang="ru" src="sub_ru.wsrt">
>>  <track kind="chapters" srclang="en" src="chapters.wsrt">
>>  <par>
>>    <audio controls>
>>      <source src="audesc.ogg" type="audio/ogg">
>>      <source src="audesc.mp3" type="audio/mp3">
>>    </audio>
>>    <video controls>
>>      <source src="signing.ogv" type="video/ogg">
>>      <source src="signing.mp4" type="video/mp4">
>>    </video>
>>  </par>
>> </video>
>>
>> This makes clear whose timeline the element is following. But it sure
>> looks recursive and we would have to define that elements inside a <par>
>> cannot have another <par> inside them to stop that.
>>
>> ===
>>
>> These are some of the thoughts I had on this topic. I am not yet decided
>> on which of the above proposals - or an alternative proposal - makes the
>> most sense. I have a gut feeling that it is probably useful to be able to
>> define both, a dominant container for synchronization and one where all
>> containers are valued the same. So, maybe the third approach would be the
>> most flexible, but it certainly needs a bit more thinking.
>>
>> Cheers,
>> Silvia.
>>
>>
Received on Thursday, 2 December 2010 03:19:13 UTC