- From: Wendy A Chisholm <wendy@w3.org>
- Date: Mon, 29 Nov 1999 10:03:51 -0500
- To: Marja-Riitta Koivunen <marja@w3.org>, w3c-wai-gl@w3.org
- Cc: w3c-wai-ua@w3.org
Hello Marja,

>Checkpoint:
>1.3 Until user agents can automatically read aloud the text equivalent of a
>visual track, provide an auditory description of the important information
>of the visual track of a multimedia presentation. [Priority 1] Synchronize
>the auditory description with the audio track as per checkpoint 1.4. Refer
>to checkpoint 1.1 for information about textual equivalents for visual
>information. Techniques for checkpoint 1.3
>
>Questions:
>I was trying to think how this checkpoint could be implemented
>in the user agent.
>The first question is what the author actually provides when
>he provides the text equivalent of a visual track. It seems that it is
>something that can be used to create an auditory description. So it needs to
>be a continuous text stream that is synchronized to the video, as it is
>describing the contents of the video.

The text equivalent of a visual track is what WGBH/NCAM call "Descriptive
Video." This is an auditory track that describes visual information during
breaks in dialog and other auditory events. Examples are available from the
NCAM web site [1]. If it is recorded by a human, then it can be a secondary
audio track, or it can be broken up into multiple files and played on cue
(as with SMIL; see examples from NCAM's beta MAGpie tool, and the rough
sketch further down in this message). If it is synthesized by a machine,
then the text can be synthesized to speech on cue.

>Why would someone create such a text stream? A collated text transcript
>that can be read independently from the video would make more sense to me.

I think you are assuming that the text stream appears as text and is not
spoken. If spoken, the auditory description can help a person who can't see
the video stream by giving them the visual cues they are missing.

>If a user can see text, why not look at the video rather than the description?

The text should be spoken because they can't see the video.

>Are there users that have a hard time interpreting the video and that's why?

That will be another beneficial use, but primarily it is to be used by
people who cannot see the video.

>When the device does not have a screen on which to show the video, it seems
>that collated text makes more sense.

Yes, there are cases where the collated text transcript does make sense. The
future ideal is for there to be one text document from which the appropriate
pieces are synchronized or synthesized on cue.

>A text stream needs to be synchronized so that there is enough time to read
>it. The synchronization of text that is read visually might be different
>from the synchronization of automatically created audio. So automatically
>creating an audio description from a text stream that is synchronized in
>the right way with the audio might be difficult, especially as the
>synchronization might change the timing of the original video if the
>natural pauses are not long enough to include the audio descriptions. Are
>there any ideas how the synchronization could be created automatically from
>the text stream?

I'm not sure how much work has been done on this, but I know it has been
talked about. One of the exciting things about digital presentations is that
the visual presentation could be paused during a long auditory description.
Currently, auditory descriptions are recorded by humans to fit during the
appropriate pauses in dialog (and other auditory events). This often means
that not all of the information is given, or that it is given much earlier
than the actual event.
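To make the "played on cue" idea more concrete, here is a minimal,
hand-made SMIL 1.0 sketch (not actual MAGpie output); the file names, cue
times, and region sizes are invented for illustration. The video and a
timed caption text stream play in parallel, and pre-recorded description
clips are started at fixed offsets chosen to fall in pauses in the dialog.
A future user agent could do the same with a time-coded text transcript,
sending each piece to a speech synthesizer on cue instead of playing a
recorded clip.

  <smil>
    <head>
      <layout>
        <root-layout width="320" height="260"/>
        <region id="videoregion"   top="0"   width="320" height="240"/>
        <region id="captionregion" top="240" width="320" height="20"/>
      </layout>
    </head>
    <body>
      <par>
        <!-- the main visual track (with its dialog audio) -->
        <video src="movie.rm" region="videoregion"/>
        <!-- timed caption text, as MAGpie generates -->
        <textstream src="captions.rt" region="captionregion"/>
        <!-- pre-recorded description clips cued into pauses in the dialog;
             file names and begin offsets are invented for illustration -->
        <audio src="desc-opening.wav" begin="12s"/>
        <audio src="desc-scene2.wav"  begin="47s"/>
      </par>
    </body>
  </smil>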
>In short: What does the author actually provide as the text equivalent, and
>how should the UA or the media player create the audio description from that?

The checkpoint says, "until user agents..." Therefore, TODAY the author has
to provide the prerecorded auditory description in an additional audio
track. In the future, the author should either provide a secondary audio
track or a text transcript with time codes. A user agent could take this
information and send it to a speech synthesizer on cue. This would be
similar to how SMIL presentations can currently show captions on cue.

Does this help?
--wendy

[1] http://www.wgbh.org/wgbh/pages/ncam/webaccess/captionedmovies.html

<>
wendy a chisholm (wac)
world wide web consortium (w3c)
web accessibility initiative (wai)
madison, wisconsin (madcity, wi)
united states of america (usa)
tel: +1 608 663 6346
</>
Received on Monday, 29 November 1999 09:57:17 UTC