Re: Format Requirements for Text Audio Descriptions (was Re: HTML5 TF from my team)

Hi Masatomo,

Thank you very much for this feedback. It is indeed very valuable and
highly appreciated.

I have some questions inline below.


2010/5/5 Masatomo Kobayashi <MSTM@jp.ibm.com>:
>
> At this time, the additional requirements based on our research will
> include:
>   a) behavior when overflowing
>   b) extended audio descriptions
>   c) support of SSML
> If you have already discussed these topics, I would appreciate if you could
> send me any related links.
>
>
>
> a) Behavior when overflowing
>
> The current proposals seem not to explicitly mention the case in which the
> screen reader does not finish reading out a description sentence by the 'end
> time'. This can happen for at least three reasons:
> - A typical author of textual audio descriptions does not have a screen
> reader. This means s/he cannot check whether the sentence fits within the
> time frame. Even with a screen reader, a different screen reader may take
> longer to read out the same sentence;
> - Some screen reader users (e.g., elderly people and people with learning
> disabilities) may slow down the speech rate; or
> - A visually-complicated scene (e.g., figures on a blackboard in an online
> physics class) may not be sufficiently described within any time interval in
> the original audio track.
>
> So, the specification should allow authors to specify the behavior for this
> case. The options would include:
> - none -- continue to read out the sentence even after the end time. This
> may overlap with important audio in the video.
> - clip -- force the screen reader to stop reading the sentence at the end
> time. This may cause the user to miss important information in the sentence.
> - extend -- pause the video at the end time until the screen reader finishes
> reading out the sentence. This may require an additional mechanism beyond
> "aria-live: assertive", but at least our prototype aiBrowser can do it.
>
> This option could be specified as an attribute:
>   <track src="..." type="text/srt" role="textaudesc"
> overflow="extend"></track>
> Or using CSS like the 'overflow' property for a visual element:
>   track[role=textaudesc] { audio-overflow: extend }
>
> For now only 'textaudesc' tracks need this mechanism, but in the future
> other types of tracks, such as synthesized sign language, may need to be
> covered as well.


This is an interesting requirement.

Thinking about it in more depth, we may even want to use such an
attribute on captions and subtitles. It would indicate what happens
when a caption overlaps into the next caption text cue, i.e. just
display both (which would be the default) or clip the cue.
Pausing the video probably doesn't make sense for caption text.

I like this attribute.
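
To make the overflow tradeoff concrete, here is a rough sketch of how a
player might estimate whether a description cue overflows its time window
and how long an 'extend' pause would need to be. All names and the
words-per-second constant are invented for illustration; they are not from
any proposal, and a real player would query the TTS engine for the actual
speech duration rather than estimating it:

```typescript
type OverflowMode = "none" | "clip" | "extend";

interface Cue {
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

// Rough estimate of speech duration: assume ~2.5 words per second at the
// default rate, scaled by the user's rate multiplier (slower users have a
// multiplier below 1, so the estimate grows).
function estimateSpeechSeconds(text: string, rateMultiplier = 1.0): number {
  const words = text.trim().split(/\s+/).length;
  return words / (2.5 * rateMultiplier);
}

// How long (in seconds) the video should pause at cue.end. Only the
// "extend" mode ever pauses; "none" keeps speaking past the end time and
// "clip" cuts the speech off instead.
function pauseNeeded(cue: Cue, mode: OverflowMode, rateMultiplier = 1.0): number {
  const speech = estimateSpeechSeconds(cue.text, rateMultiplier);
  const window = cue.end - cue.start;
  if (mode !== "extend" || speech <= window) return 0;
  return speech - window;
}
```

With a slowed-down speech rate the estimated duration grows, so the same
cue that fits for one user would trigger a pause for another, which is why
the behavior belongs in the player rather than in the authored file.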


> b) Extended audio descriptions
>
> As mentioned above, a visually-complicated scene may not be fully described
> within a silent space in the original audio track. For that case, guidelines
> for audio descriptions recommend using extended descriptions. WCAG 2.0
> (level AAA) also includes them.
>
> In our experiments, the use of extended descriptions offered two important
> advantages. First, it was nearly impossible to sufficiently describe certain
> kinds of instructional videos without extended descriptions. Second, it
> allowed a novice describer to effectively describe at least a short video,
> because it did not require the special skill of fitting an appropriate
> description within a very limited time frame. I think these advantages
> strongly encourage us to include extended descriptions in the proposal.
>
> If we have the 'overflow=extend' attribute introduced above, we will be able
> to make a fully-extended description simply by specifying the same time for
> both the 'begin' and 'end' times.


Yeah, I agree, the overflow attribute would satisfy this requirement.
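
For illustration, a hypothetical SRT cue whose begin and end times coincide
(the cue text here is made up) would, under overflow="extend", pause the
video at that timestamp until the entire description has been spoken:

```
12
00:01:23,000 --> 00:01:23,000
The lecturer draws a force diagram on the blackboard,
with arrows labelled F, m and a.
```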


> c) Support of SSML
>
> The SubRip (srt) format can express minimal audio descriptions, and even
> minimal descriptions can greatly help people who are blind or have low
> vision.
> But just like captions sometimes need TTML instead of srt, textual audio
> descriptions may need SSML (and EmotionML) to produce richer descriptions.
> This includes voice gender, speech rate, volume, prosody, emotions,
> pre-recorded audio files, etc.
>
> Our experiments indicated that speech quality is more critical for audio
> descriptions than for usual screen reading, for two reasons. First, users
> often want to be relaxed when watching a video.
> Second, they are forced to frequently switch their attention between
> synthesized descriptions and human conversations in the original audio
> track.
>
> As common TTS engines such as Cepstral already support SSML, we can consider
> it in the proposal. I think what a Web browser needs to do is simply pass
> SSML to the TTS engine.


That would be one way to support it. Do you know if Web browsers
support SSML natively?

Also, there is Speech CSS (see http://www.w3.org/TR/css3-speech/),
which seems to provide for the same functionality. Have you
experimented with Speech CSS? Do you know if TTS engines support it?
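
For reference, a minimal SSML 1.0 fragment along the lines you describe
might look like this (the description text and attribute choices are
invented for this example); the browser would hand the whole fragment to
the TTS engine:

```
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <voice gender="female">
    <!-- slightly faster and softer than the surrounding dialogue -->
    <prosody rate="fast" volume="soft">
      The lecturer draws a force diagram on the blackboard.
    </prosody>
  </voice>
</speak>
```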


Cheers,
Silvia.

Received on Wednesday, 5 May 2010 00:12:01 UTC