
Re: Synthesized-speech auditory descriptions

From: Brad Botkin <Brad_Botkin@wgbh.org>
Date: Sun, 29 Oct 2000 07:32:44 -0500
Message-ID: <39FC18EC.E64961A4@wgbh.org>
To: "Cohen, Aaron M" <aaron.m.cohen@intel.com>
CC: "'symm@w3.org'" <symm@w3.org>, geoff freed <geoff_freed@wgbh.org>, "Hansen, Eric" <ehansen@ets.org>, www-smil@w3.org, thierry michel <tmichel@w3.org>, www-smil-request@w3.org
Aaron,
What seems to be missing from

> > <par>
> >         <audio src="snippet8043.wav">
> >                 <description xml:lang="en">
> >                         The lady in the pink sweater picks up the pearl
> >                         necklace from the table and walks to the door.
> >                 </description>
> >                 <description xml:lang="fr">
> >                         Oui.
> >                 </description>
> >         </audio>
> > </par>

is a way to uniquely and unambiguously identify the text above as the
audio description (unless the <description> element is meant to be
exactly that, but I assume "<description xml....>" here is a generic
mechanism unrelated to "audio description" in the sense we've been
using it).

The systemAudioDesc attribute is a way to signal a player that some
particular content should be played for some users.  But the specific
rendering device has the job of deciding which media element to play:
the audio (uniquely identified by its "src" attribute) or the
transcription of that element (not yet uniquely identified).

The point is that there may be more than one text string associated
with an audio element, only one of which is the transcription of that
audio.  systemAudioDesc *almost* spoke to this need, except that it
only takes an "on/off" value, which seems insufficient for letting
rendering engines handle accessibility issues adequately.  Since
accessibility is being legislated in the TV and multimedia arena as we
speak, it seems prudent to create a set of extensible accessibility
tags which would allow those industries to adopt SMIL easily in their
workflows.  It's true that these elements would not be general,
reusable ones, and I sympathize with your reluctance to generate more
special-case markup. Nonetheless....
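
To make the need concrete, something like the following would do it.
This is purely hypothetical markup: it just extends your proposed
<description> child with an invented "role" attribute.

<par>
        <audio src="snippet8043.wav">
                <description role="transcription" xml:lang="en">
                        The lady in the pink sweater picks up the pearl
                        necklace from the table and walks to the door.
                </description>
                <description role="note" xml:lang="en">
                        Production metadata about the snippet, not to
                        be voiced.
                </description>
        </audio>
</par>

A player honoring systemAudioDesc could then find the transcription
unambiguously and hand it to a speech synthesizer.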

In another vein, how about the issue of managing the grouping of
synched accessibility objects (captions and descriptions, for example)
in separate text files?  I'm sure this is thorny, but the existing
formats (RealText, SAMI, QuickTime qtText) all offer a way to group
these related elements (for captioning).  Current thoughts?
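
For reference, this is roughly how a RealText caption stream is pointed
to from a SMIL presentation today (a sketch; the file and region names
are invented):

<par>
        <video src="movie.rm" region="video"/>
        <textstream src="captions.rt" region="captions"
                system-captions="on"/>
</par>

All the per-caption timecodes live inside captions.rt, not in the SMIL
file; the open question is how to do the same, player-neutrally, for
description snippets.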

--Brad
___________
Brad_Botkin@wgbh.org   Director, Technology & Systems Development
(v/f) 617.300.3902               NCAM/WGBH - National Center for 
125 Western Ave Boston MA 02134              Accessible Media
___________



"Cohen, Aaron M" wrote:
> 
> Brad:
> That specific use of verbatim text is what systemAudioDesc is for. It can
> be used on text media elements that contain the verbatim text. The pair
> of audio and text elements can be wrapped in a par and given a specific
> title, and the unit can then be used in a presentation just like an
> individual media element.
> 
> Why would it be better to have special case markup when the generalized
> capabilities that we have cover the use cases?
> 
> Your example confuses me, since it doesn't seem to give any more capability
> than we already have with XHTML+SMIL:
> 
> <par>
>         <audio src="snippet8043.wav"/>
>         <p systemAudioDesc="on">The lady in the pink sweater picks up
>         the pearl necklace from the table and walks to the door.</p>
> </par>
> 
> Even less, since you can't hang an xml:lang off the attribute, necessitating
> duplication of the media object reference for each language of the text
> description.
> 
> With SMIL 2.0, you have to put the text in alt or another file, because SMIL
> does not itself define media:
> <par>
>         <audio src="snippet8043.wav"/>
>         <text systemAudioDesc="on" src="lady.txt"/>
> </par>
> 
> If you are saying that there should be some general scalable mechanism to
> make this easier to maintain, I agree with you, with the additional
> stipulation that this is not just a SMIL issue, but an issue for all XML
> languages that have non-text content.
> 
> For the next version of SMIL, we plan to adopt SVG's description element,
> which would allow you to do something like this in SMIL:
> 
> <par>
>         <audio src="snippet8043.wav">
>                 <description xml:lang="en">
>                         The lady in the pink sweater picks up the pearl
>                         necklace from the table and walks to the door.
>                 </description>
>                 <description xml:lang="fr">
>                         Oui.
>                 </description>
>         </audio>
> </par>
> 
> Having an attribute carry what is meant to be a literal text
> transcription of (possibly long) media does not scale well. The
> sub-elements make more sense.
> 
> I think that this is the beginning of a discussion about the need to
> create a set of reusable markup elements that fit the identified
> needs. I can
> imagine <description>, <transcription>, and <title> child elements, all
> enclosing text.
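> 
> Roughly (a sketch only, nothing designed yet):
> 
> <audio src="snippet8043.wav">
>         <title xml:lang="en">Necklace scene</title>
>         <transcription xml:lang="en">
>                 The lady in the pink sweater picks up the pearl
>                 necklace from the table and walks to the door.
>         </transcription>
>         <description xml:lang="en">
>                 Timing notes and other metadata, not voiced.
>         </description>
> </audio>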
> 
> My point is that these are real problems that need solutions, but the
> solutions need to be general, reusable and thought out in detail. This will
> require some dedicated people and some time. It is way too late in the
> SMIL 2.0 process to start integrating this kind of thing into the language,
> but it is something that should be done for re-use by everyone and
> integrated into SMIL (and XHTML 2.0?, SVG?) in the future.
> 
> -Aaron
> 
> > -----Original Message-----
> > From: Brad Botkin [mailto:brad_botkin@wgbh.org]
> > Sent: Friday, October 27, 2000 12:30 PM
> > To: Cohen, Aaron M
> > Cc: geoff freed; Hansen, Eric; www-smil@w3.org; thierry michel;
> > www-smil-request@w3.org
> > Subject: Re: Synthesized-speech auditory descriptions
> >
> >
> > Aaron,
> > I think the actual transcription of the audio deserves its own tag,
> > since it's so specific, for the same reason that you created a
> > systemAudioDesc attribute and didn't just use the alt attribute: you
> > need a place to look that's consistent.  I believe longdesc is
> > intended to be used simply as a longer text description of the
> > underlying graphic or media file, and in the case of audio
> > description snippets, the longdesc could be used to hold timing or
> > other metadata related to the snippet but not specifically voiced. I
> > think that verbatim text will prove invaluable in the future, for
> > searching, etc., and you should consider creating a specific tag for
> > this.
> > --Brad
> > __________
> > Brad_Botkin@wgbh.org   Director, Technology & Systems Development
> > 617.300.3902 (v/f)               NCAM/WGBH - National Center for
> > 125 Western Ave Boston MA 02134              Accessible Media
> > __________
> >
> >
> > "Cohen, Aaron M" wrote:
> > >
> > > Brad:
> > >
> > > We also have alt and longdesc, either of which could be used by a
> > > player to provide accessory or alternative text content. These can
> > > be combined with the systemLanguage and other test attributes to
> > > provide many combinations of accessibility and internationalization.
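> > >
> > > For example (a sketch; the file names are invented):
> > >
> > > <switch>
> > >         <audio src="desc-en.wav" systemAudioDesc="on"
> > >                 systemLanguage="en"/>
> > >         <audio src="desc-fr.wav" systemAudioDesc="on"
> > >                 systemLanguage="fr"/>
> > > </switch>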
> > > -Aaron
> > >
> > > > -----Original Message-----
> > > > From: Brad Botkin [mailto:brad_botkin@wgbh.org]
> > > > Sent: Friday, October 27, 2000 5:41 AM
> > > > To: geoff freed
> > > > Cc: Hansen, Eric; www-smil@w3.org; thierry michel;
> > > > www-smil-request@w3.org
> > > > Subject: Re: Synthesized-speech auditory descriptions
> > > >
> > > >
> > > > Geoff,
> > > > True but incomplete.  It sounds like Eric is asking for a tag
> > > > which identifies text as a transcription of the underlying
> > > > audio.   Something like:
> > > >
> > > > <par>
> > > > .....
> > > >     <audio systemAudioDesc="on"
> > > >            AudioDescText="The lady in the pink sweater picks up
> > > >            the pearl necklace from the table and walks to the
> > > >            door."
> > > >            src="snippet8043.wav"/>
> > > > .....
> > > > </par>
> > > >
> > > > It's a great idea, since the text is super-thin, making it
> > > > appropriate for transmission in narrow pipes with local
> > > > text-to-speech synthesis for playback.  Note that the volume
> > > > of snippets in a longer piece, like a movie, is huge, just
> > > > like closed captions.  Inclusion of 1000 audio description
> > > > snippets and 2000 closed captions, each in 3 languages, each
> > > > with its own timecode, all in the same SMIL file will make
> > > > for some *very* unfriendly files.  Better would be to provide a
> > > > mechanism which allows the SMIL file to gracefully point to
> > > > separate files each containing the timecoded AD snippets (with
> > > > transcriptions per the above) and timecoded captions.  It
> > > > requires the SMIL player to gracefully overlay the external
> > > > timeline onto the intrinsic timeline of the SMIL file.
> > > > Without this, SMIL won't be used for interchange of caption and
> > > > description data for anything longer than a minute or two.  A
> > > > translation house shouldn't have to unwind a bazillion audio
> > > > descriptions and captions in umpteen other languages to
> > > > insert its French translation.
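> > > >
> > > > Something like this hypothetical markup (a "descstream" element
> > > > doesn't exist anywhere today) is the shape of what I mean:
> > > >
> > > > <par>
> > > >     <video src="movie.mpg"/>
> > > >     <descstream src="descriptions-en.xml" systemAudioDesc="on"/>
> > > >     <textstream src="captions-fr.xml" systemCaptions="on"/>
> > > > </par>
> > > >
> > > > with all the timecodes and transcriptions carried in the
> > > > external files.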
> > > >
> > > > Regards,
> > > > --Brad
> > > > ___________
> > > > Brad_Botkin@wgbh.org   Director, Technology & Systems Development
> > > > (v/f) 617.300.3902               NCAM/WGBH - National Center for
> > > > 125 Western Ave Boston MA 02134              Accessible Media
> > > > ___________
> > > >
> > > >
> > > > geoff freed wrote:
> > > >
> > > > > Hi, Eric:
> > > > >
> > > > > SMIL 2.0 provides support for audio descriptions via a test
> > > > > attribute, systemAudioDesc.  The author can record audio
> > > > > descriptions digitally and synchronize them into a SMIL
> > > > > presentation using this attribute, similar to how captions are
> > > > > synchronized into SMIL presentations using systemCaptions
> > > > > (or system-captions, as it is called in SMIL 1.0).
> > > > >
> > > > > Additionally, using SMIL 2.0's <excl> and <priorityClass>
> > > > > elements, the author may pause a video track automatically,
> > > > > play an extended audio description and, when the description
> > > > > is finished, resume playing the video track.  This will be a
> > > > > boon for situations where the natural pauses in the program
> > > > > audio aren't sufficient for audio descriptions.
> > > > >
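> > > > > For instance, roughly (element and attribute names per the
> > > > > SMIL 2.0 drafts; the timing values are invented):
> > > > >
> > > > > <excl>
> > > > >     <priorityClass peers="pause">
> > > > >         <video src="movie.mpg"/>
> > > > >         <audio src="extended-desc.wav" begin="45s"
> > > > >                 systemAudioDesc="on"/>
> > > > >     </priorityClass>
> > > > > </excl>
> > > > >
> > > > > When the description audio begins, it pauses the video, which
> > > > > resumes once the description finishes.
> > > > >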
> > > > > Geoff Freed
> > > > > CPB/WGBH National Center for Accessible Media (NCAM)
> > > > > WGBH Educational Foundation
> > > > > geoff_freed@wgbh.org
> > > > >
> > > > > On Wednesday, October 25, 2000, thierry michel
> > > > > <tmichel@w3.org> wrote:
> > > > > >
> > > > > >> My questions concern the use of SMIL for developing
> > > > > >> auditory descriptions for multimedia presentations.
> > > > > >>
> > > > > >> The Web Content Accessibility Guidelines (WCAG) version 1.0
> > > > > >> of W3C/WAI indicates the possibility of using speech
> > > > > >> synthesis for providing auditory descriptions for
> > > > > >> multimedia presentations. Specifically, checkpoint 1.3 of
> > > > > >> WCAG 1.0 reads:
> > > > > >>
> > > > > >> "1.3 Until user agents can automatically read aloud the
> > > > text equivalent of
> > > > > >a
> > > > > >> visual track, provide an auditory description of the
> > > > important information
> > > > > >> of the visual track of a multimedia presentation.
> > [Priority 1]
> > > > > >> Synchronize the auditory description with the audio
> > track as per
> > > > > >checkpoint
> > > > > >> 1.4. Refer to checkpoint 1.1 for information about
> > > > textual equivalents for
> > > > > >> visual information." (WCAG 1.0, checkpoint 1.3).
> > > > > >>
> > > > > >> In the same document, in the definition of "Equivalent",
> > > > > >> we read:
> > > > > >>
> > > > > >> "One example of a non-text equivalent is an auditory
> > > > > >> description of the key visual elements of a presentation.
> > > > > >> The description is either a prerecorded human voice or a
> > > > > >> synthesized voice (recorded or generated on the fly). The
> > > > > >> auditory description is synchronized with the audio track
> > > > > >> of the presentation, usually during natural pauses in the
> > > > > >> audio track. Auditory descriptions include information
> > > > > >> about actions, body language, graphics, and scene changes."
> > > > > >>
> > > > > >> My questions are as follows:
> > > > > >>
> > > > > >> 1. Does SMIL 2.0 support the development of synthesized
> > > > > >> speech auditory descriptions?
> > > > > >>
> > > > > >> 2. If the answer to question #1 is "Yes", then briefly
> > > > > >> describe the support that is provided.
> > > > > >>
> > > > > >> 3. If the answer to question #1 is "No", then please
> > > > > >> describe any plans for providing such support in the
> > > > > >> future.
> > > > > >>
> > > > > >> Thanks very much for your consideration.
> > > > > >>
> > > > > >> - Eric G. Hansen
> > > > > >> Development Scientist
> > > > > >> Educational Testing Service (ETS)
> > > > > >> Princeton, NJ 08541
> > > > > >> ehansen@ets.org
> > > > > >> Co-Editor, W3C/WAI User Agent Accessibility Guidelines
> > > > > >>
> > > > > >
> > > >
> > > >
> >
> >