Fwd: Format Requirements for Text Audio Descriptions (was Re: HTML5 TF from my team)

Not sure if this came through to the list, but here is a copy.
Silvia.


---------- Forwarded message ----------
From: Masatomo Kobayashi <MSTM@jp.ibm.com>
Date: 2010/5/5
Subject: Re: Format Requirements for Text Audio Descriptions (was Re:
HTML5 TF from my team)
To: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Cc: Hironobu Takagi <TAKAGIH@jp.ibm.com>, John Foliot
<jfoliot@stanford.edu>, public-html-a11y@w3.org


Hi Silvia and John,

I am happy to join this discussion and sorry for the late response.
I have just read the wikis you mentioned and the recent discussions on the a11y TF mailing list.

I emailed Mike, but I think I am not on the TF members list yet,
so please forward this reply to the mailing list if necessary.



At this time, the additional requirements based on our research include:
  a) behavior when overflowing
  b) extended audio descriptions
  c) support for SSML
If you have already discussed these topics, I would appreciate it if you
could send me any related links.



a) Behavior when overflowing

The current proposals do not seem to explicitly address the case in which
the screen reader does not finish reading out a description sentence
by its 'end time'. This can happen for at least three reasons:
- A typical author of textual audio descriptions does not have a
screen reader, so s/he cannot check whether the sentence fits within
the time frame. Even if s/he has one, a different screen reader may
take longer to read out the same sentence;
- Some screen reader users (e.g., elderly people and people with learning
disabilities) may slow down the speech rate; or
- A visually complicated scene (e.g., figures on a blackboard in an
online physics class) may not be sufficiently describable within any of
the available intervals in the original audio track.

So, the specification should allow authors to specify the behavior for
this case. The options could include:
- none -- continue reading out the sentence even after the end time.
This may overlap important information in the video.
- clip -- stop reading out the sentence at the end time. This
may cause the user to miss important information in the sentence.
- extend -- pause the video at the end time until the screen reader
finishes reading out the sentence. This may require an additional
mechanism beyond "aria-live: assertive", but at least our prototype
aiBrowser can do it.

This option could be specified as an attribute:
  <track src="..." type="text/srt" role="textaudesc" overflow="extend"></track>
Or using CSS, like the 'overflow' property for a visual element:
  track[role=textaudesc] { audio-overflow: extend }
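
For illustration, here is a minimal script-level sketch of what 'extend'
could mean in practice, assuming a speechSynthesis-style TTS interface
(as in the Web Speech API) is available to the page; this is only a
sketch, not how aiBrowser actually implements it:

  // Sketch only: 'extend' overflow behavior for one description cue.
  // Assumes a speechSynthesis-style TTS interface and an HTML5 video
  // element; names and structure are illustrative.
  function speakDescriptionCue(video: HTMLVideoElement,
                               cueText: string, endTime: number): void {
    const utterance = new SpeechSynthesisUtterance(cueText);
    let videoPaused = false;

    // If playback reaches the cue's end time while the TTS is still
    // speaking, pause the video until the utterance finishes.
    const onTimeUpdate = () => {
      if (video.currentTime >= endTime && speechSynthesis.speaking) {
        video.pause();
        videoPaused = true;
      }
    };

    utterance.onend = () => {
      video.removeEventListener('timeupdate', onTimeUpdate);
      if (videoPaused) {
        video.play();  // resume playback once the description is done
      }
    };

    video.addEventListener('timeupdate', onTimeUpdate);
    speechSynthesis.speak(utterance);
  }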

For now only 'textaudesc' tracks need this mechanism, but in the
future other types of tracks, such as synthesized sign language, may
also need to be covered.



b) Extended audio descriptions

As mentioned above, a visually complicated scene may not be fully
described within a silent space in the original audio track. For that
case, guidelines for audio descriptions recommend using extended
descriptions, and WCAG 2.0 also includes them at level AAA.

In our experiments, the use of extended descriptions provided two
important advantages. First, it was nearly impossible to sufficiently
describe certain kinds of instructional video without extended
descriptions. Second, extended descriptions allowed a novice describer
to effectively describe at least a short video, because no special skill
was needed to fit an appropriate description into a very limited time
frame. I think these advantages strongly encourage us to include
extended descriptions in the proposal.

If we have the 'overflow=extend' attribute introduced above, we will
be able to make a fully-extended description simply by specifying the
same time for both the 'begin' and 'end' times.
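
For example, a hypothetical SRT-style cue (the exact syntax depends on
the format we choose) with identical begin and end times allots no time
at all, so with 'overflow=extend' the video would simply pause for the
whole description:

  3
  00:01:30,000 --> 00:01:30,000
  The lecturer draws a free-body diagram on the blackboard, with arrows
  for gravity, the normal force, and friction.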



c) Support for SSML

The SubRip SRT format can produce minimal audio descriptions, and even
minimal descriptions can greatly help people who are blind or have low
vision. But just as captions sometimes need TTML instead of SRT, textual
audio descriptions may need SSML (and EmotionML) to produce richer
descriptions, covering voice gender, speech rate, volume, prosody,
emotions, pre-recorded audio files, etc.
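
As a rough sketch (how such a fragment would be carried in a cue format
is a separate question, and the text and file name below are invented),
a single richer description could look like:

  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
    <voice gender="female">
      <prosody rate="fast" volume="soft">
        The graph on the slide shows sales rising sharply after March.
      </prosody>
    </voice>
    <audio src="descriptions/scene12.wav">
      A pre-recorded human description could replace the synthesized text.
    </audio>
  </speak>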

Our experiments indicated that speech quality is more critical for audio
descriptions than for usual screen reading, for two reasons. First, users
often want to relax while watching a video. Second, they are forced to
frequently switch their attention between the synthesized descriptions
and the human conversations in the original audio track.

As common TTS engines such as Cepstral already support SSML, we can
consider it for the proposal. I think all a Web browser needs to do is
pass the SSML through to the TTS engine.



If you are interested in our prototype software that supports textual
audio descriptions, see:
  ScriptEditor, http://www.eclipse.org/actf/downloads/tools/ScriptEditor/
  aiBrowser, http://www.eclipse.org/actf/downloads/tools/aiBrowser/
The XML format we are using includes, for each sentence, a boolean field
marking whether the description is extended, the speech rate, the voice
gender, and an optional pointer to a pre-recorded alternative, in
addition to the time and text.
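
As a rough illustration only (the element and attribute names below are
invented for this email, not the actual ScriptEditor format), one entry
in such a format might look like:

  <description begin="00:01:30.000" end="00:01:35.000"
               extended="true" rate="fast" gender="female"
               wavPath="prerecorded/scene12.wav">
    The lecturer writes the wave equation on the blackboard.
  </description>
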
I think "extended" descriptions should be supported in at least some
way, as described above. The latter three fields would be covered by
SSML; I am not sure whether they should be explicitly supported if SSML
is not included in the specification.

Please let me know if you have any questions or comments on my thoughts.

Regards,
Masatomo



Silvia Pfeiffer <silviapfeiffer1@gmail.com> wrote on 2010/04/27 07:41:48:

> From:
>
> Silvia Pfeiffer <silviapfeiffer1@gmail.com>
>
> To:
>
> Hironobu Takagi/Japan/IBM@IBMJP, Masatomo Kobayashi/Japan/IBM@IBMJP,
> John Foliot <jfoliot@stanford.edu>
>
> Cc:
>
> public-html-a11y@w3.org
>
> Date:
>
> 2010/04/27 07:42
>
> Subject:
>
> Format Requirements for Text Audio Descriptions (was Re: HTML5 TF
> from my team)
>
> On Tue, Apr 27, 2010 at 4:52 AM, John Foliot <jfoliot@stanford.edu> wrote:
> > Hironobu Takagi wrote:
> >>
> >> Finally, one of my team members will officially
> >> join the HTML5 Accessibility Task Force.
> >> Masatomo has been working for the audio description
> >> project, especially for the experiments in Japan
> >> and in US with WGBH.
> >> He can be the bridge between the open source authoring
> >> tool work and aiBrowser on Eclipse.org.
> >> He is now checking resources on the Web.
> >> If you have any suggestion for our involvement (beyond
> >> the mailing-list), please let us know.
> >> We are looking forward to working with you.
> >
> >
> > Hiro, this is great news! Hello and welcome to Masatomo. (For those
> > unaware or who do not remember, Hiro presented IBM Research – Tokyo's work on
> > descriptive audio using synthesized voice at the Face-to-Face here at
> > Stanford last November, as well he and his team were at the CSUN
> > conference in March in San Diego. It is - IMHO - wicked cool! A Word Doc
> > can be found here:
> > http://www.letsgoexpo.com/utilities/File/viewfile.cfm?LCID=4091&eID=80000218
> > and perhaps Hiro you could point us to web-based [HTML] resources too?)
> >
> >
> >
> > Masatomo, you might want to start by reviewing the draft specifications
> > that are currently under discussion:
> >
> >        http://www.w3.org/WAI/PF/HTML/wiki/Media_MultitrackAPI
> > and
> >        http://www.w3.org/WAI/PF/HTML/wiki/Media_TextAssociations.
> >
> >
> > Silvia Pfeiffer recently wrote a blog post that is more easily readable as
> > an introduction, but as she notes it is not as technically accurate. It
> > can be found at
> > http://blog.gingertech.net/2010/04/11/introducing-media-accessibilit-into-html5-media/.
> >
> > There has also been a fair bit of discussion recently about choosing one
> > or more appropriate time-stamp formats to be referenced in the HTML5
> > Specification/Standard - this discussion is very much up in the air at
> > this time.
> >
> >
> > As well, while not officially 'W3C', the WHATWG has started collecting
> > examples of time-aligned text displays (captions, subtitles, chapter
> > markers etc) and is extrapolating requirements from these in their wiki
> > at:
> >
> >        http://wiki.whatwg.org/wiki/Timed_tracks
> > http://wiki.whatwg.org/wiki/Use_cases_for_timed_tracks_rendered_over_video_by_the_UA
> >
> > http://wiki.whatwg.org/wiki/Use_cases_for_API-level_access_to_timed_tracks
> >
> >
> > (I believe screen captures, etc. of your work with descriptive text would
> > be relevant here!)
> >
>
>
> Let me chime in here, since it is right now particularly relevant to
> you with textual audio descriptions.
>
> You will find at http://wiki.whatwg.org/wiki/Timed_tracks several
> mentions of "text audio descriptions".
>
> The assumption of that page is that textual audio descriptions do not
> require more than the following information in a format:
> * start time
> * end time
> * text
> * possibly the voice to choose to read it back
>
> Are there any other requirements that you have come across in your
> work with textual audio descriptions? What do the files that you are
> using as input to your speech synthesis system for audio descriptions
> look like? Do they have any special fields that would need to be taken
> care of in a standardised storage format for textual audio
> descriptions?
>
> Cheers,
> Silvia.

Received on Tuesday, 4 May 2010 23:47:51 UTC