Re: [media] what to do about "clean audio" from David Singer on 2011-05-19 (public-html-a11y@w3.org from May 2011)

From: David Singer <singer@apple.com>
Date: Thu, 19 May 2011 16:54:49 -0700
To: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Cc: HTML Accessibility Task Force <public-html-a11y@w3.org>
Message-id: <8380016F-1CB0-4157-A437-7C13E44639CD@apple.com>
Hi

this might be longer than you want...

I think the terminology "clean audio" is appropriate, in that it conveys the truth -- this is the 'significant' audio component of the program, without distracting music/background-noise etc. About 90% of the time that will indeed be speech, but occasionally it might be something else.

(Think new age video of whales, with the primary audio being whale-song, with schlocky baroque music in the background).

The challenge for me is finding a clean way to describe these situations.  Originally when everything was "in the multiplex" I had hoped we'd make this a "media player" level issue -- sources that could satisfy an accessibility need would be marked as "capable of clean audio", and then it was the (more complex) labelling and configuring of, say, the MP4 player that would be responsible for delivering "clean audio".  But we're doing it ourselves, for the most part, so this is not the case.

I agree we might be able to special-case this and have the web audio API handle it.  But it would be good if some general processing rules existed for "how do I deliver an experience that best matches needs X, Y and Z by examining the track tags" (without the UA/HTML engine really having to understand much about X, Y or Z or what they mean).

I think 'meeting a need' might happen a variety of ways:

NOTHING) the program naturally meets this accessibility need with nothing needing to be done;
  examples: programs that are repetitive stimulus safe;  when open captions are the only available video;  
ADD) there is an extra track that, when enabled, meets the need: 
   example: text track delivering captions
SWITCH) there is an alternative track, so the matching main track gets disabled, the alternative is enabled, and that meets the need;
   examples: open captions; audio description of video (where re-timing did not occur); a single audio track re-authored for clean audio; repetitive stimulus avoidance;
REDUCE) the main program consists of more than one track normally, and one or more are disabled to meet an accessibility need;
   example: the background audio and the clean audio are in separate tracks normally mixed together; the background is disabled for the user needing clean audio;

We could usefully do with a simple labelling technique that has a matching one-pass algorithm which, given a set of "user needs" and a set of tracks with their tags, ends up enabling and disabling the right set to meet those needs.

One not fully-formed proposal from the past is to mark tracks with one of
+X -- enable this track if you have need X
-X -- disable this track if you have need X
and have those operate against an 'initial state' of the main program enabling and disabling, or have role/kind labels for 'main' to label those tracks.

so, working through those cases above:
NOTHING)
  the main track says "+main +captions" -- enable me if you need main and/or captions, and you'll get both
ADD)
  the main track says "+main", and the captions track "+captions"
SWITCH)
  the main track says "+main -captions" and the captions track says "+captions"
REDUCE)
  the background track says "+main -cleanaudio" and the primary audio track says "+main +cleanaudio"

Missing in this discussion is what the combinatorial rule is (do 'disables' win over 'enables', are they 'or' or 'and', or do you process left-to-right and let the last one win, or...?)
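
To make the open question concrete, here is a minimal sketch of one possible one-pass rule, in Python. It is not a settled design: it assumes that 'disables' win over 'enables', that tags are 'or'-ed within a track, and that "main" is always treated as a need. The function name select_tracks and the track names are made up for illustration.

```python
from typing import Dict, Iterable, Set


def select_tracks(tracks: Dict[str, str], user_needs: Iterable[str]) -> Set[str]:
    """Return the names of the tracks to enable.

    tracks maps a (hypothetical) track name to its tag string, e.g. "+main -captions".
    Assumed rule: a track is enabled if at least one +X matches a need
    and no -X matches a need (a matching disable always wins).
    """
    # Assumption: every user implicitly has the need "main".
    needs = {"main", *user_needs}
    enabled = set()
    for name, tags in tracks.items():
        wanted = False
        vetoed = False
        for tag in tags.split():
            sign, need = tag[0], tag[1:]
            if need not in needs:
                continue  # tags for needs the user does not have are ignored
            if sign == "+":
                wanted = True
            elif sign == "-":
                vetoed = True
        if wanted and not vetoed:
            enabled.add(name)
    return enabled


# The SWITCH and REDUCE cases from above, for users with the relevant need:
switch = {"video": "+main -captions", "captioned_video": "+captions"}
reduce_case = {"background": "+main -cleanaudio", "primary": "+main +cleanaudio"}
print(select_tracks(switch, ["captions"]))
print(select_tracks(reduce_case, ["cleanaudio"]))
```

Under these assumptions all four cases above come out as described, but a different rule (for example left-to-right with the last tag winning) would need the examples re-checked.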

There is also something slightly ugly here, in that if I re-author a program after first publishing it so that there is now a cleanaudio *alternative* available, I have to edit the tagging of the *main* audio track to say "-cleanaudio"; some think that is not very clean.

An alternative design would be to have indications for these four kinds of action (nothing, add, switch, reduce) and let the UA work out what the 'other' track is.  I think this is behind the thinking of labelling something as an 'alternative' -- that it's implicitly saying 'and if you turn this on, turn off the track labelled "main" that has the same media type'.
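
A rough sketch of that implicit 'alternative' rule, again in Python and only for the SWITCH case. The Track fields, the role values "main" and "alternative", and the function name are assumptions made for illustration; nothing here is taken from an existing API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Track:
    name: str
    media_type: str   # e.g. "audio", "video", "text"
    role: str         # "main" or "alternative" (hypothetical labels)
    need: str = ""    # the need an alternative track satisfies, if any


def enable_with_alternatives(tracks: List[Track], user_needs: Iterable[str]) -> Set[str]:
    """Enable each alternative the user needs, and turn off the 'main'
    track of the same media type; the main track's labels never change."""
    needs = set(user_needs)
    chosen = [t for t in tracks if t.role == "alternative" and t.need in needs]
    replaced_types = {t.media_type for t in chosen}
    enabled = {t.name for t in chosen}
    for t in tracks:
        # A main track stays on unless an enabled alternative shares its media type.
        if t.role == "main" and t.media_type not in replaced_types:
            enabled.add(t.name)
    return enabled


program = [
    Track("video", "video", "main"),
    Track("mixed_audio", "audio", "main"),
    Track("clean_audio", "audio", "alternative", need="cleanaudio"),
]
print(enable_with_alternatives(program, ["cleanaudio"]))
```

The point of the sketch is that adding the clean_audio alternative later does not require editing the tags on mixed_audio. It does not cover REDUCE, where the UA would need some further hint about which of several main tracks to drop.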





On May 18, 2011, at 17:36 , Silvia Pfeiffer wrote:

> In today's call we discussed what to do about the new information
> about "clean audio" as dug up by Sean.
> 
> We are considering the following and would like to ask for your
> thoughts / concerns / objections.
> 
> 
> 1. Dealing with "clean audio" in the Web Audio API
> 
> "Clean audio" is a technology that is being standardized by ETSI [1]
> and EBU [2] in Europe. It is being defined on the audio channel level,
> i.e. the level in which stereo is handled, too. For example, it
> defines that the center channel in a 5.1 audio channel mix would
> consist only of speech and could be amplified separately to all the
> other channels. It is important to understand that it is being defined
> on the audio channel level and not the audio track level. Therefore,
> trying to add a track kind of "clean audio" to the track API of HTML
> media elements is not going to work since this doesn't allow us to
> separately control the volume of the "clean audio" channel in
> comparison to the other channels.
> 
> The suggestion here is that this problem will be solved through the
> Web Audio API. The Web Audio API indeed has controls for separately
> addressing channels and changing the mix of different channels [3].
> Therefore, the "clean audio" use case seems to be appropriately solved
> through Web Audio API means.
> 
> 
> 2. Rename @track="clear audio" to @kind="speech"
> 
> While "clean audio" refers to a particular type of technology which
> works on the channel level, the idea of providing a separate stream of
> speech-only audio data that can be controlled separately in volume to
> the normal/background audio is appealing and also allows to achieve
> the "clean audio" effect, since it allows users to separately control
> the volume of a speech track. This is not the "clean audio"
> technology, so we should not use that same name for such a track.
> Therefore, the suggestion is to call such a track "speech", which
> seems semantically appropriate.
> 
> This will then require us to change the request in the @kind bug [4]
> from asking for @kind="clearaudio" to @kind="speech". It would also
> close the multitrack bug [5].
> 
> We plan to make those changes to these bugs after next week's phone
> call, so hereby ask for input from everyone. In particular if you know
> a "clean audio" expert, please ask him/her for advice on whether this
> is the right way to approach it.
> 
> Best Regards,
> Silvia.
> 
> [1] http://www.etsi.org/deliver/etsi_ts/101100_101199/101154/01.09.01_60/ts_101154v010901p.pdf
> [2] http://tech.ebu.ch/docs/tech/tech3333.pdf
> [3] http://www.w3.org/2011/audio/
> [4] http://www.w3.org/Bugs/Public/show_bug.cgi?id=12544
> [5] http://www.w3.org/Bugs/Public/show_bug.cgi?id=11593
> 

David Singer
Multimedia and Software Standards, Apple Inc.
Received on Thursday, 19 May 2011 23:55:21 UTC