Re: Questions about WCAG 1.3 from Marja-Riitta Koivunen on 1999-11-29 (w3c-wai-ua@w3.org from October to December 1999)

From: Marja-Riitta Koivunen <marja@w3.org>
Date: Mon, 29 Nov 1999 10:37:40 -0500
To: Wendy A Chisholm <wendy@w3.org>, w3c-wai-gl@w3.org
Cc: w3c-wai-ua@w3.org
Message-Id: <3.0.5.32.19991129103740.009e31d0@localhost>

Thanks for your answer, Wendy!

I'm a little confused. Did you really mean that the text equivalent of a
visual track can be an auditory description? I thought it would be text if
it is a text equivalent. If it is audio, why is it then important to do
automatic text-to-speech processing (read the text aloud)?

Marja


At 10:03 AM 11/29/99 -0500, Wendy A Chisholm wrote:
>Hello Marja,
>
>
>>Checkpoint:
>>1.3 Until user agents can automatically read aloud the text equivalent of a
>>visual track, provide an auditory description of the important information
>>of the visual track of a multimedia presentation. [Priority 1]  Synchronize
>>the auditory description with the audio track as per checkpoint 1.4. Refer
>>to checkpoint 1.1 for information about textual equivalents for visual
>>information.  Techniques for checkpoint 1.3
>>
>>Questions:
>>I was trying to think how this checkpoint could be implemented
>>in the user agent.
>
>
>
>>The first question is what the author actually provides when
>>he provides the text equivalent of a visual track. It seems that it is
>>something that can be used to create an auditory description. So it needs to
>>be a continuous text stream that is synchronized to the video, as it is
>>describing the contents of the video.
>
>the text equivalent of a visual track is what WGBH/NCAM call "Descriptive 
>Video."  This is an auditory track that describes visual information during 
>breaks in dialog and other auditory events.   Examples are available from 
>the NCAM web site [1].
>
>If it is recorded by a human, then it can be a secondary audio track or it 
>can be broken up into multiple files and played on cue (as with SMIL; see 
>examples from NCAM's beta MagPie tool).  If it is synthesized by a machine, 
>then the text can be synthesized to speech on cue.
>
>>Why would someone create such a textstream? A collated text transcript
>>that can be read independently from the video would make more sense to me.
>
>I think you are assuming that the textstream appears as text and not 
>spoken.  If spoken, the auditory description can help a person who can't 
>see the video stream by giving them the visual cues they are missing.
>
>>If a user can see text, why not look at the video rather than the description?
>
>the text should be spoken because they can't see the video.
>
>>Are there users who have a hard time interpreting the video, and is that why?
>
>this will be another beneficial use, but primarily it is to be used by 
>people who cannot see the video.
>
>>When the device does not have a screen on which to show the video, it seems
>>that collated text makes more sense.
>
>yes.  there are cases where the collated text transcript does make 
>sense.  the future ideal is for there to be one text document and the 
>appropriate pieces are synchronized or synthesized on cue.
>
>>A textstream needs to be synchronized so that there is enough time to read
>>it. The synchronization needed for text that is read visually might be
>>different from that needed for automatically created audio. So
>>automatically creating an audio description from a text stream, and
>>synchronizing it correctly with the audio, might be difficult, especially as
>>the synchronization might change the timing of the original video as well
>>if the natural pauses are not long enough to include the audio
>>descriptions. Are there any ideas about how the synchronization could be
>>created automatically from the textstream?
>
>I'm not sure how much work has been done on this, but I know it 
>has been talked about.  One of the exciting things about digital 
>presentations is that the visual presentation could be paused during a long 
>auditory description.  Currently, auditory descriptions are recorded by 
>humans to fit during the appropriate pauses in dialog (and other auditory 
>events).  This often means that not all of the info is given or that it is 
>given much earlier than the actual event.
>
>
>>In short: what does the author actually provide as the text equivalent, and
>>how should the UA or the media player create the audio description from that?
>
>the checkpoint says, "until user agents..."  Therefore TODAY, the author 
>has to provide the prerecorded auditory description in an additional audio 
>track.  In the future, the author should either provide a secondary audio 
>track or a text transcript with time codes.  A user agent could take this 
>information and send it to a speech synthesizer on cue.  This would be 
>similar to how SMIL presentations can currently show captions on cue.
>
>does this help?
>--wendy
>
>[1]  http://www.wgbh.org/wgbh/pages/ncam/webaccess/captionedmovies.html
><>
>wendy a chisholm (wac)
>world wide web consortium (w3c)
>web accessibility initiative (wai)
>madison, wisconsin (madcity, wi)
>united states of america (usa)
>tel: +1 608 663 6346
></>
>
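The two ideas in the quoted reply, sending time-coded description text to a
speech synthesizer on cue and pausing the visual presentation when a
description is longer than the available gap, can be sketched roughly as
follows. This is only an illustration under stated assumptions: the names
Cue and plan_playback, the gap field, and the speaking rate of 2.5 words per
second are hypothetical and do not come from WCAG, SMIL, or any real player.
A real user agent would ask its synthesizer for the actual duration of each
utterance instead of estimating it from a word count.

```python
from dataclasses import dataclass

# Assumed average speaking rate of the synthesizer (hypothetical value).
WORDS_PER_SECOND = 2.5


@dataclass
class Cue:
    start: float  # time code in seconds at which the description should begin
    gap: float    # seconds of silence available in the original audio track
    text: str     # description text to hand to the speech synthesizer


def speech_duration(text: str, words_per_second: float = WORDS_PER_SECOND) -> float:
    """Estimate how long the text takes to speak, from its word count."""
    return len(text.split()) / words_per_second


def plan_playback(cues: list[Cue]) -> list[tuple[float, str, float]]:
    """Return (start, text, video_pause) for each cue, ordered by time code.

    video_pause is how long the visual presentation would need to be paused
    so that the spoken description does not overlap the following dialog.
    It is 0.0 when the description fits inside the natural gap.
    """
    plan = []
    for cue in sorted(cues, key=lambda c: c.start):
        needed = speech_duration(cue.text)
        pause = max(0.0, needed - cue.gap)
        plan.append((cue.start, cue.text, pause))
    return plan


if __name__ == "__main__":
    cues = [
        Cue(start=30.0, gap=1.0,
            text="He opens the letter slowly and reads it by the window"),
        Cue(start=12.0, gap=4.0, text="A woman enters the room"),
    ]
    for start, text, pause in plan_playback(cues):
        print(f"at {start:.1f}s speak {text!r}, pause video for {pause:.1f}s")
```

A player following this plan would speak each cue at its time code and, where
the pause is greater than zero, hold the video for that long before resuming.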
Received on Monday, 29 November 1999 10:38:56 UTC