- From: Marja-Riitta Koivunen <marja@w3.org>
- Date: Mon, 29 Nov 1999 10:37:40 -0500
- To: Wendy A Chisholm <wendy@w3.org>, w3c-wai-gl@w3.org
- Cc: w3c-wai-ua@w3.org
Thanks for your answer, Wendy! I'm a little confused. Did you really mean
that the text equivalent of a visual track can be an auditory description?
I thought that if it is a text equivalent, it would be text? And if it is
audio, why is it then important to do automatic text-to-speech processing
(read the text aloud)?

Marja

At 10:03 AM 11/29/99 -0500, Wendy A Chisholm wrote:
>Hello Marja,
>
>>Checkpoint:
>>1.3 Until user agents can automatically read aloud the text equivalent of
>>a visual track, provide an auditory description of the important
>>information of the visual track of a multimedia presentation. [Priority 1]
>>Synchronize the auditory description with the audio track as per
>>checkpoint 1.4. Refer to checkpoint 1.1 for information about textual
>>equivalents for visual information. Techniques for checkpoint 1.3
>>
>>Questions:
>>I was trying to think how this checkpoint could be implemented in the
>>user agent.
>>
>>The first question is: what does the author actually provide when
>>providing the text equivalent of a visual track? It seems to be
>>something that can be used to create an auditory description. So it
>>needs to be a continuous text stream that is synchronized to the video,
>>as it is describing the contents of the video.
>
>The text equivalent of a visual track is what WGBH/NCAM call "Descriptive
>Video." This is an auditory track that describes visual information during
>breaks in dialog and other auditory events. Examples are available from
>the NCAM web site [1].
>
>If it is recorded by a human, then it can be a secondary audio track, or
>it can be broken up into multiple files and played on cue (as with SMIL;
>see examples from NCAM's beta MagPie tool). If it is synthesized by a
>machine, then the text can be synthesized to speech on cue.
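(If I follow, a SMIL presentation along these lines could play prerecorded
description clips on cue during pauses in the dialog. This is only a
sketch; the file names and begin times are invented:

  <smil>
    <head>
      <layout>
        <root-layout width="320" height="240"/>
        <region id="v" width="320" height="240"/>
      </layout>
    </head>
    <body>
      <par>
        <!-- the main presentation; dialog is in the video's audio -->
        <video src="movie.rm" region="v"/>
        <!-- description clips cued into pauses in the dialog -->
        <audio src="desc-01.rm" begin="12s"/>
        <audio src="desc-02.rm" begin="47s"/>
      </par>
    </body>
  </smil>

Each clip would have to be short enough to fit the pause it is cued into.)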
>>Why would someone create such a text stream? A collated text transcript
>>that can be read independently from the video would make more sense to me.
>
>I think you are assuming that the text stream appears as text and is not
>spoken. If it is spoken, the auditory description can help a person who
>can't see the video stream by giving them the visual cues they are
>missing.
>
>>If a user can see text, why not look at the video rather than the
>>description?
>
>The text should be spoken because they can't see the video.
>
>>Are there users who have a hard time interpreting the video, and is that
>>why?
>
>That will be another beneficial use, but primarily it is to be used by
>people who cannot see the video.
>
>>When the device does not have a screen on which to show the video, it
>>seems that collated text makes more sense.
>
>Yes, there are cases where the collated text transcript does make sense.
>The future ideal is for there to be one text document from which the
>appropriate pieces are synchronized or synthesized on cue.
>
>>A text stream needs to be synchronized so that there is enough time to
>>read it. The right synchronization for text that is read visually might
>>be different from that for audio created automatically from the text. So
>>automatically creating an audio description from a text stream, while
>>keeping it correctly synchronized with the audio, might be difficult,
>>especially as the synchronization might also change the timing of the
>>original video if the natural pauses are not long enough to hold the
>>audio descriptions. Are there any ideas on how the synchronization could
>>be created automatically from the text stream?
>
>I'm not sure what amount of work has been done on this, but I know it has
>been talked about. One of the exciting things about digital presentations
>is that the visual presentation could be paused during a long auditory
>description. Currently, auditory descriptions are recorded by humans to
>fit during the appropriate pauses in dialog (and other auditory events).
>This often means that not all of the information is given, or that it is
>given much earlier than the actual event.
>
>>In short: what does the author actually provide as the text equivalent,
>>and how should the UA or the media player create the audio description
>>from that?
>
>The checkpoint says, "until user agents..." Therefore TODAY, the author
>has to provide the prerecorded auditory description in an additional
>audio track. In the future, the author should either provide a secondary
>audio track or a text transcript with time codes. A user agent could take
>this information and send it to a speech synthesizer on cue. This would
>be similar to how SMIL presentations can currently show captions on cue.
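(For comparison, this is roughly how a SMIL presentation shows captions on
cue today; again just a sketch with invented names. The time codes live in
the timed text file, e.g. a RealText file, so the SMIL markup itself stays
simple, and a user agent could presumably send the same timed text to a
speech synthesizer instead of rendering it on screen:

  <smil>
    <head>
      <layout>
        <root-layout width="320" height="260"/>
        <region id="v" width="320" height="240"/>
        <region id="cc" top="240" width="320" height="20"/>
      </layout>
    </head>
    <body>
      <par>
        <video src="movie.rm" region="v"/>
        <!-- captions.rt carries its own time codes; a user agent could
             speak this text on cue instead of displaying it -->
        <textstream src="captions.rt" region="cc"/>
      </par>
    </body>
  </smil>

The same mechanism would seem to work for descriptions synthesized on cue.)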
>Does this help?
>--wendy
>
>[1] http://www.wgbh.org/wgbh/pages/ncam/webaccess/captionedmovies.html
><>
>wendy a chisholm (wac)
>world wide web consortium (w3c)
>web accessibility initiative (wai)
>madison, wisconsin (madcity, wi)
>united states of america (usa)
>tel: +1 608 663 6346
></>

Received on Monday, 29 November 1999 10:38:55 UTC