Re: Synthesized-speech auditory descriptions from Brad Botkin on 2000-10-27 (www-smil@w3.org from October to December 2000)

From: Brad Botkin <brad_botkin@wgbh.org>
Date: Fri, 27 Oct 2000 08:41:29 -0400
To: geoff freed <geoff_freed@wgbh.org>
CC: "Hansen, Eric" <ehansen@ets.org>, www-smil@w3.org, thierry michel <tmichel@w3.org>, www-smil-request@w3.org
Message-ID: <39F977F9.43C6224D@wgbh.org>
Geoff,
True but incomplete.  It sounds like Eric is asking for a tag which identifies text as a transcription of the underlying
audio.   Something like:

<par>
.....
    <audio systemAudioDesc="on"
           AudioDescText="The lady in the pink sweater picks up the pearl necklace from the table and walks to the door."
           src="snippet8043.wav"/>
.....
</par>
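
For illustration only, here is a minimal Python sketch of how a player might pull that text out and hand it to a local
speech synthesizer.  AudioDescText is the made-up attribute from the example above, not part of SMIL 2.0, and
description_texts is a hypothetical helper name.  The fragment is the same example with the elided parts dropped so
that it parses.

```python
import xml.etree.ElementTree as ET

# The example fragment, minus the "....." placeholders so it is well-formed XML.
# AudioDescText is a proposed (hypothetical) attribute, not defined by SMIL 2.0.
SMIL_FRAGMENT = """
<par>
    <audio systemAudioDesc="on"
           AudioDescText="The lady in the pink sweater picks up the pearl necklace from the table and walks to the door."
           src="snippet8043.wav"/>
</par>
"""


def description_texts(smil_source):
    """Return (src, text) pairs for audio elements that carry description text."""
    root = ET.fromstring(smil_source)
    return [
        (audio.get("src"), audio.get("AudioDescText"))
        for audio in root.iter("audio")
        if audio.get("systemAudioDesc") == "on" and audio.get("AudioDescText")
    ]


pairs = description_texts(SMIL_FRAGMENT)
for src, text in pairs:
    # The text is what would travel over the wire in place of the .wav file.
    print(f"{src}: {len(text.encode('utf-8'))} bytes of text for the synthesizer")
```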

It's a great idea, since the text is super-thin, making it appropriate for transmission in narrow pipes with local
text-to-speech synthesis for playback.  Note that the volume of snippets in a longer piece, like a movie, is huge, just
as it is for closed captions.  Including 1000 audio description snippets and 2000 closed captions, each in 3 languages,
each with its own timecode, all in the same SMIL file will make for some *very* unfriendly files.  Better would be to
provide a mechanism which allows the SMIL file to point to separate files, each containing either the timecoded AD
snippets (with transcriptions per the above) or the timecoded captions.  That requires the SMIL player to overlay each
external timeline onto the intrinsic timeline of the SMIL file.  Without this, SMIL won't be used for interchange of
caption and description data for anything longer than a minute or two.  A translation house shouldn't have to unwind a
bazillion audio descriptions and captions in umpteen other languages to insert its French translation.
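
To make the external-file idea concrete, a rough Python sketch follows.  Everything in it is an assumption for the
sake of illustration: the tab-separated "begin, end, text" cue format, the names Cue, parse_cue_file, overlay and
active_at, and the sample caption text.  SMIL 2.0 defines none of this; it only shows the kind of merge a player would
have to do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cue:
    begin: float  # seconds on the presentation timeline
    end: float
    kind: str     # "caption" or "description"
    lang: str
    text: str


def parse_cue_file(lines, kind, lang):
    """Parse lines of the hypothetical form 'begin<TAB>end<TAB>text'."""
    cues = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        begin, end, text = line.split("\t", 2)
        cues.append(Cue(float(begin), float(end), kind, lang, text))
    return cues


def overlay(tracks, offset=0.0):
    """Merge external tracks into one list ordered by begin time.

    offset shifts every cue, for the case where the external timeline starts
    at some point other than zero on the presentation's own timeline.
    """
    merged = [
        Cue(c.begin + offset, c.end + offset, c.kind, c.lang, c.text)
        for track in tracks
        for c in track
    ]
    return sorted(merged, key=lambda c: (c.begin, c.kind, c.lang))


def active_at(timeline, t, lang):
    """Return the cues in one language that are active at time t."""
    return [c for c in timeline if c.lang == lang and c.begin <= t < c.end]


# Each list stands in for one external file; the caption text is invented.
captions_en = parse_cue_file(["1.0\t3.5\tThe necklace is on the table."], "caption", "en")
captions_fr = parse_cue_file(["1.0\t3.5\tLe collier est sur la table."], "caption", "fr")
descriptions_en = parse_cue_file(
    ["4.0\t8.0\tThe lady in the pink sweater picks up the pearl necklace "
     "from the table and walks to the door."],
    "description",
    "en",
)

timeline = overlay([captions_en, captions_fr, descriptions_en])
for cue in active_at(timeline, 5.0, "en"):
    print(cue.kind, cue.text)
```

With this shape, adding French means delivering one more cue file; nothing already in the presentation gets touched.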

Regards,
--Brad
___________
Brad_Botkin@wgbh.org   Director, Technology & Systems Development
(v/f) 617.300.3902               NCAM/WGBH - National Center for
125 Western Ave Boston MA 02134              Accessible Media
___________


geoff freed wrote:

> Hi, Eric:
>
> SMIL 2.0 provides support for audio descriptions via a test attribute, systemAudioDesc.  The author can record audio
>  descriptions digitally and synchronize them into a SMIL presentation using this attribute, similar to how captions are
>  synchronized into SMIL presentations using systemCaptions (or system-captions, as it is called in SMIL 1.0).
>
> Additionally, using SMIL 2.0's <excl> and <priorityClass> elements, the author may pause a video track
>  automatically, play an extended audio description and, when the description is finished, resume playing the video
>  track.  This will be a boon for situations where the natural pauses in the program audio aren't sufficient for audio
>  descriptions.
>
> Geoff Freed
> CPB/WGBH National Center for Accessible Media (NCAM)
> WGBH Educational Foundation
> geoff_freed@wgbh.org
>
> On Wednesday, October 25, 2000, thierry michel <tmichel@w3.org> wrote:
> >
> >> My questions concern the use of SMIL for developing auditory descriptions
> >> for multimedia presentations.
> >>
> >> The Web Content Accessibility Guidelines (WCAG) version 1.0 of W3C/WAI
> >> indicates the possibility of using speech synthesis for providing auditory
> >> descriptions for multimedia presentations. Specifically, checkpoint 1.3 of
> >> WCAG 1.0 reads:
> >>
> >> "1.3 Until user agents can automatically read aloud the text equivalent of a
> >> visual track, provide an auditory description of the important information
> >> of the visual track of a multimedia presentation. [Priority 1]
> >> Synchronize the auditory description with the audio track as per checkpoint
> >> 1.4. Refer to checkpoint 1.1 for information about textual equivalents for
> >> visual information." (WCAG 1.0, checkpoint 1.3).
> >>
> >> In the same document in the definition of "Equivalent", we read:
> >>
> >> "One example of a non-text equivalent is an auditory description of the key
> >> visual elements of a presentation. The description is either a prerecorded
> >> human voice or a synthesized voice (recorded or generated on the fly). The
> >> auditory description is synchronized with the audio track of the
> >> presentation, usually during natural pauses in the audio track. Auditory
> >> descriptions include information about actions, body language, graphics, and
> >> scene changes."
> >>
> >> My questions are as follows:
> >>
> >> 1. Does SMIL 2.0 support the development of synthesized speech auditory
> >> descriptions?
> >>
> >> 2. If the answer to question #1 is "Yes", then briefly describe the support
> >> that is provided.
> >>
> >> 3. If the answer to question #1 is "No", then please describe any plans for
> >> providing such support in the future.
> >>
> >> Thanks very much for your consideration.
> >>
> >> - Eric G. Hansen
> >> Development Scientist
> >> Educational Testing Service (ETS)
> >> Princeton, NJ 08541
> >> ehansen@ets.org
> >> Co-Editor, W3C/WAI User Agent Accessibility Guidelines
> >>
> >
Received on Friday, 27 October 2000 08:42:28 UTC