RE: [media] handling multitrack audio / video from Sean Hayes on 2010-12-06 (public-html-a11y@w3.org from December 2010)

From: Sean Hayes <Sean.Hayes@microsoft.com>
Date: Mon, 6 Dec 2010 14:01:43 +0000
To: Silvia Pfeiffer <silviapfeiffer1@gmail.com>, Maciej Stachowiak <mjs@apple.com>
CC: Eric Carlson <eric.carlson@apple.com>, Geoff Freed <geoff_freed@wgbh.org>, HTML Accessibility Task Force <public-html-a11y@w3.org>, Frank Olivier <Frank.Olivier@microsoft.com>
Message-ID: <8DEFC0D8B72E054E97DC307774FE4B91330D091F@DB3EX14MBXC313.europe.corp.microsoft.c>
I've been thinking quite a bit about this while trying to create an accessible video page with HTML5. I would really prefer that an external group go off and study a proper manifest file, so that all this complexity is taken out of HTML5 and dealt with in a more independent fashion. Even so, I have mostly been able to achieve the goals with what is there, but a few things don't seem optimal yet.

I'm becoming increasingly of the opinion that we should not try to create any specific processing algorithm for the selection of the additional resources, but simply provide adequate labeling (for human and machine) so that the UA can present the options to the user. Here I've presented a fairly modest set of changes that probably covers the majority of situations.

The kind attribute: As formulated, this seems too simplistic and not discriminating enough. For the kinds captions, subtitles, text descriptions (when presented as text) and similar, the processing is largely the same, so these kinds could be merged. The user, however, might prefer to have subtitles and descriptions read out (spoken subtitles are becoming increasingly used in the EU), so different types of processing may be required for the same kind. In some cases the user may want more than one presentation of a resource, e.g. both spoken output and captions/descriptions displayed as braille. Moreover, the assumption that a resource contains only one kind seems somewhat flawed.

So kind is currently not a very good discriminator for the processing that the user might want on a resource. I propose we change it. As a replacement, and absent requiring this kind of metadata in the actual resource, I propose a 'contains' attribute, which describes the types of resource present in the referenced media (there may be more than one) that the user can select, and that this apply equally to audio, video and text media tracks.
The kind (or contains) attribute might contain:
     resources:  (("open-")? type ("-" mode)? S)*
     type:  "captions" | "subtitles" | "signing" | "dubbing" | "description" | "extended-description" | "enhanced-captions" | ...
     mode:  "audio" | "video" | "text"
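
As a rough illustration only (not part of the proposal), a value of this form could be decoded along the following lines. The function and class names are hypothetical, the mode suffix is assumed to be optional because the examples in this message omit it in places, and the type list is only the part of the open-ended list that is spelled out above.

```python
from typing import List, NamedTuple, Optional

# Names copied from the grammar in this message. The type list there ends in
# "...", so this tuple is illustrative, not exhaustive.
TYPES = ("captions", "subtitles", "signing", "dubbing",
         "description", "extended-description", "enhanced-captions")
MODES = ("audio", "video", "text")


class Resource(NamedTuple):
    is_open: bool          # True when the token starts with "open-"
    name: str              # one of TYPES
    mode: Optional[str]    # one of MODES, or None when no mode is given


def parse_contains(value: str) -> List[Resource]:
    """Split a contains value on whitespace and decode each token."""
    resources = []
    for token in value.split():
        rest = token
        is_open = rest.startswith("open-")
        if is_open:
            rest = rest[len("open-"):]
        mode = None
        for candidate in MODES:
            if rest.endswith("-" + candidate):
                mode = candidate
                rest = rest[:-(len(candidate) + 1)]  # drop "-mode"
                break
        if rest not in TYPES:
            raise ValueError("unrecognised contains token: " + token)
        resources.append(Resource(is_open, rest, mode))
    return resources
```

For instance, parse_contains("description-audio description-text") yields two entries of type "description", one with mode "audio" and one with mode "text", which is the information a UA would need in order to offer both presentations to the user.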

For media resources, the main lack in the spec today is any kind of "sync to" operation, but there are actually multiple ways in which media resources can be combined:
a) Replacement: Video B (e.g. with open subtitles) is a replacement for Video A (without).
b) Partial replacement: Audio B (e.g. with pre-mixed descriptions) is a replacement for the audio track of Video A.
c) Layering: Audio B (e.g. with unmixed descriptions) is to be mixed with the audio track of Video A.
d) Interstitial: Video (or audio) B is to be inserted in Video A at a specific time, extending the playback (e.g. extended descriptions).
e) Coexistence: Video B (e.g. ASL translation) is to be played contemporaneously with Video A.

Each of these has been used to provide an accessibility mode (although maybe they don't all need to be in the spec). To capture these operations, I propose a set of combine-xx attributes for the track element which cover the potential use cases, e.g. combine-all, combine-audio, combine-video and combine-text, plus a plain combine attribute (equivalent to combine-all).

The track combination algorithm would happen after the source selection algorithm, and can use the choice made there to guide any of its own source selections.
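
To make the relationship between the combine-xx attributes concrete, here is a minimal sketch of how a UA might resolve them per medium. The function name is hypothetical, and the precedence rule (a medium-specific attribute wins over combine-all or combine) is my assumption for illustration; the proposal does not state one.

```python
from typing import Dict, Optional

MEDIA = ("audio", "video", "text")


def effective_combine(attrs: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return the combine operation that applies to each medium of a track.

    `attrs` maps attribute names on the track element to their values.
    Assumption: "combine" is shorthand for "combine-all", and a
    medium-specific attribute overrides both.
    """
    general = attrs.get("combine-all", attrs.get("combine"))
    result = {}
    for medium in MEDIA:
        value = attrs.get("combine-" + medium, general)
        # Tolerate stray whitespace inside the attribute value.
        result[medium] = value.strip() if value is not None else None
    return result
```

With only combine="replace" present, all three media resolve to "replace"; with only combine-audio="mix", audio resolves to "mix" and the other two are left unset.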

Studying the cases above:

a) Replacement. Using a second main resource as a means of providing captions is suboptimal, as the captions cannot then be styled to the user's requirements (e.g. large font, high contrast), but it is certainly used. This approach can be effective for both sign-translated and audio-described material, and as this seems like a relatively easy use case from a UA perspective, we should probably support it. The UA can create a set of choices from the @contains attribute and also from a languages attribute, which can take a set of language labels. So in cases where the main resource can have multiple choices, use a track element, but add the codec type attribute to the track.

Example markup for an English movie with one option that has simultaneous open English captions and traditional and simplified Chinese subtitles, and another option that has a Chinese dubbed soundtrack:
<video ... >
         <source src="v1.mp4" ...  />
         <track type="video/mp4" lang="en" combine="replace" src="v1cc.mp4"
                      contains="open-captions open-subtitles" languages="en zh-hant zh-hans"
                      label="English Captions and Chinese subtitles" />
         <track type="video/mp4" lang="zh" combine="replace" src="v1db.mp4"
                      contains="open-dubbing" label="Chinese dubbed" />
</video>

Note here the assumption that the replacement is in the same encoding; the next example deals with the case of two codecs.

b) Partial replacement. Most media players support this for in-band resources; support for out-of-band resources is harder. The main requirements here are synchronization and the ability to switch off the replaced track.

E.g.
<video >
     <source src="v1.mp4"... />
     <source src="v1.ogg"... />
     <track contains="description-audio" combine-audio="replace" label="English descriptions">
           <source src="v1ds.ogg" type="audio/ogg" />
           <source src="v1ds.mp3" type="audio/mp3" />
     </track>
</video>
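
The nested source elements imply that the UA picks one encoding per track, much as it does for the main resource. A minimal sketch of that choice, assuming the UA simply takes the first source whose declared type it can play (the function name and the playable set are hypothetical):

```python
from typing import Iterable, List, Optional, Tuple


def pick_source(sources: List[Tuple[str, str]],
                playable: Iterable[str]) -> Optional[str]:
    """Return the src of the first (src, type) pair the UA can play.

    `sources` lists the <source> children of a track in document order;
    `playable` is the set of MIME types this hypothetical UA supports.
    Returns None when no source is usable.
    """
    supported = set(playable)
    for src, mime in sources:
        if mime in supported:
            return src
    return None


# The two description sources from the example above.
description_sources = [("v1ds.ogg", "audio/ogg"), ("v1ds.mp3", "audio/mp3")]
```

A UA that plays only MP3 audio would get "v1ds.mp3" from pick_source(description_sources, {"audio/mp3"}), while one that plays both would get "v1ds.ogg" because it is listed first.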

c) Layering. I think it is unlikely that this option is going to be supported by many media players or browsers any time soon, but if included the markup would be as above with @combine-audio="mix". Picture-in-picture style closed signing could use combine-video="overlay". The problem with this mode is all the attendant additional parameters, such as relative volume and positioning.

d) Interstitial. This is awkward to do using the existing media API, both because of network delays and because it involves a set of resources. But it is possible: I've done it by embedding the audio URIs in timed text. There is, however, a need to indicate whether the text resource is just "descriptions as text", "descriptions as audio" or both, and whether the audio should be mixed with the media or pause it.
E.g.
<video >
     <source src="v1.mp4" label="main movie" />
     <track src="captions.mu" label="English descriptions" type="text/mu"
                  contains="description-audio description-text" combine-audio="intersperse" combine-text="overlay" />
</video>

combine-text="overlay" is the default, and could probably be implied. Whether options like combine-text="replace" make any sense I'm not sure.
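
The effect of interstitial insertion on the timeline can be worked through with a little arithmetic. The sketch below assumes the main media pauses for the full length of each inserted clip; the function names and the sample cue list are hypothetical and are not taken from any existing API.

```python
from typing import List, Tuple

# Each cue is (time in the main media at which it pauses, clip duration),
# both in seconds.
Cue = Tuple[float, float]


def extended_duration(main_duration: float, cues: List[Cue]) -> float:
    """Total playback time once every inserted clip has been played."""
    return main_duration + sum(duration for _, duration in cues)


def first_reached_at(media_time: float, cues: List[Cue]) -> float:
    """Elapsed playback time when the main media first reaches media_time.

    Only clips inserted strictly before media_time have delayed it.
    """
    delay = sum(duration for at, duration in cues if at < media_time)
    return media_time + delay


# Two hypothetical extended descriptions: 4 s at 0:10 and 6 s at 0:30.
cues = [(10.0, 4.0), (30.0, 6.0)]
```

With these cues a 60 second movie plays for 70 seconds in total, and the frame at 0:20 in the main media is reached 24 seconds into playback.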

e) Coexistence. For additional videos, the easiest approach would be to have a whole other <video> element and use a syncto attribute; however, this gets awkward because the author might embed tracks within it. So perhaps the simpler approach is to deal with it like audio and partial replacement above and embed it as a track, with implied synchronization to the selected source:
<video >
     <source src="v1.mp4" ... />
     <source src="v1.ogg" ... />
      <track combine="coexist" contains="open-signing-video" lang="sgn-uk" label="BSL translation">
           <source src="v1sgn.ogg" type="video/ogg" />
           <source src="v1sgn.mp4" type="video/mp4" />
      </track>
</video>

For this to work, the tracks in a media element would potentially need to be independently visible in the display tree, and CSS would need to be able to select the embedded track:
video > track[contains~="open-signing-video"] {
	...
}
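
The ~= attribute selector used above matches when the attribute value, read as a whitespace-separated list of words, contains the given word exactly, which is why it suits a multi-valued contains attribute. A small sketch of that matching rule (the function name is hypothetical):

```python
def matches_word_selector(attribute_value: str, word: str) -> bool:
    """Mimic the CSS [attr~="word"] test on a single attribute value.

    The selector never matches when the word is empty or itself contains
    whitespace; otherwise it matches if the word is one of the
    whitespace-separated tokens of the attribute value.
    """
    if not word or any(ch.isspace() for ch in word):
        return False
    return word in attribute_value.split()
```

So a track with contains="description-audio open-signing-video" would be selected by the rule above, while one with contains="open-signing" would not.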

I think that's a reasonably minimal set of changes to the existing text, whilst still covering the important use cases. I'll work through the matrix of combine vs contains options some more to see if the attribute set can be simplified any further. If we want to get fancier, particularly if we want to deal with adaptive streaming, then I do think it needs a media manifest WG.

Sean.
Received on Monday, 6 December 2010 14:02:48 UTC