RE: Synthesized-speech auditory descriptions

Brad:
That specific use of verbatim text is what systemAudioDesc is for. It can be
used on text media elements that contain the verbatim text. The pair of
audio and text elements can be wrapped in a par and given a specific title,
and the resulting unit used in a presentation just like an individual media
element.

Why would it be better to have special case markup when the generalized
capabilities that we have cover the use cases?

Your example confuses me, since it doesn't seem to give any more capability
than we already have with XHTML+SMIL:

<par>
	<audio src="snippet8043.wav"/>
	<p systemAudioDesc="on">The lady in the pink sweater picks up the
pearl necklace from the table and walks to the door.</p>
</par>

Even less, since you can't hang an xml:lang off the attribute, which
necessitates duplicating the media object reference for each language of the
text description.
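To sketch that duplication concretely (AudioDescText here is the proposed
attribute from your example, not part of any spec, and the French value just
echoes the placeholder below): because the attribute can't carry xml:lang,
each language needs its own copy of the audio element, selected with
systemLanguage:

```xml
<switch>
	<audio src="snippet8043.wav" systemAudioDesc="on" systemLanguage="en"
	       AudioDescText="The lady in the pink sweater picks up the
	       pearl necklace from the table and walks to the door."/>
	<audio src="snippet8043.wav" systemAudioDesc="on" systemLanguage="fr"
	       AudioDescText="Oui."/>
</switch>
```

Every added language repeats the src, the test attributes, and everything
else on the element.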

With SMIL 2.0, you have to put the text in alt or another file, because SMIL
does not itself define media:
<par>
	<audio src="snippet8043.wav"/>
	<text systemAudioDesc="on" src="lady.txt"/>
</par>

If you are saying that there should be some general scalable mechanism to
make this easier to maintain, I agree with you, with the additional
stipulation that this is not just a SMIL issue, but an issue for all XML
languages that have non-text content.

For the next version of SMIL, we plan to adopt SVG's description element,
which would allow you to do something like this in SMIL:

<par>
	<audio src="snippet8043.wav">
		<description xml:lang="en">
			The lady in the pink sweater picks up the pearl
			necklace from the table and walks to the door.
		</description>
		<description xml:lang="fr">
			Oui.
		</description>
	</audio>
</par>

Having an attribute hold what is meant to be a literal text transcription of
(possibly long) media does not scale well. The sub-elements make more sense.

I think that this is the beginning of discussion about the need to create a
set of reusable markup elements that fit the identified needs. I can
imagine <description>, <transcription>, and <title> child elements, all
enclosing text.
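To sketch how those children might combine on one media element (the element
names are only proposals here, not part of any spec, and the title and
description wording is invented for illustration):

```xml
<audio src="snippet8043.wav">
	<title xml:lang="en">Audio description, scene 12</title>
	<description xml:lang="en">
		Pre-recorded audio description snippet.
	</description>
	<transcription xml:lang="en">
		The lady in the pink sweater picks up the pearl
		necklace from the table and walks to the door.
	</transcription>
</audio>
```

The transcription carries the verbatim spoken text for searching or
synthesis, while the description can hold metadata about the snippet itself.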

My point is that these are real problems that need solutions, but the
solutions need to be general, reusable and thought out in detail. This will
require some dedicated people and some time. This is way too late in the
SMIL 2.0 process to start integrating this kind of thing into the language,
but it is something that should be done for re-use by everyone and
integrated into SMIL (and XHTML 2.0?, SVG?) in the future.


-Aaron

> -----Original Message-----
> From: Brad Botkin [mailto:brad_botkin@wgbh.org]
> Sent: Friday, October 27, 2000 12:30 PM
> To: Cohen, Aaron M
> Cc: geoff freed; Hansen, Eric; www-smil@w3.org; thierry michel;
> www-smil-request@w3.org
> Subject: Re: Synthesized-speech auditory descriptions
> 
> 
> Aaron,
> I think the actual transcription of the audio deserves its own tag,
> since it's so specific. For the same reason that you created a
> systemAudioDesc tag and didn't just use the alt tag.  You need a place
> to look that's consistent.  I believe the longdesc is intended to be
> used as simply a longer text description of the underlying graphic or
> media file. And in the case of audio description snippets, 
> the longdesc
> could be used to hold timing or other metadata related to the snippet
> but not specifically voiced. I think that verbatim text will prove
> invaluable in the future, for searching, etc., and you should consider
> creating a specific tag for this.
> --Brad
> __________
> Brad_Botkin@wgbh.org   Director, Technology & Systems Development
> 617.300.3902 (v/f)               NCAM/WGBH - National Center for 
> 125 Western Ave Boston MA 02134              Accessible Media
> __________
> 
> 
> "Cohen, Aaron M" wrote:
> > 
> > Brad:
> > 
> > We also have alt and longdesc, either of which could be 
> used by a player to
> > provide accessory or alternative text content. These can be 
> combined with
> > the systemLanguage and other test attributes to provide 
> many combinations of
> > accessibility and internationalization.
> > -Aaron
> > 
> > > -----Original Message-----
> > > From: Brad Botkin [mailto:brad_botkin@wgbh.org]
> > > Sent: Friday, October 27, 2000 5:41 AM
> > > To: geoff freed
> > > Cc: Hansen, Eric; www-smil@w3.org; thierry michel;
> > > www-smil-request@w3.org
> > > Subject: Re: Synthesized-speech auditory descriptions
> > >
> > >
> > > Geoff,
> > > True but incomplete.  It sounds like Eric is asking for a tag
> > > which identifies text as a transcription of the underlying
> > > audio.   Something like:
> > >
> > > <par>
> > > .....
> > >     <audio    systemAudioDesc="on"
> > >                     AudioDescText="The lady in the pink
> > > sweater picks up the pearl necklace from the table and 
> walks to the
> > > door."
> > >                     src="snippet8043.wav"/>
> > > .....
> > > </par>
> > >
> > > It's a great idea, since the text is super-thin, making it
> > > appropriate for transmission in narrow pipes with local
> > > text-to-speech synthesis for playback.  Note that the volume
> > > of snippets in a longer piece, like a movie, is huge, just
> > > like closed captions.  Inclusion of 1000 audio description
> > > snippets and 2000 closed captions, each in 3 languages, each
> > > with its own timecode, all in the same SMIL file will make
> > > for some *very* unfriendly  files.  Better would be to provide a
> > > mechanism which allows the SMIL file to gracefully point to
> > > separate files each containing the timecoded AD snippets (with
> > > transcriptions per the above) and timecoded captions.  It
> > > requires the SMIL player to gracefully overlay the external
> > > timeline onto the intrinsic timeline of the SMIL file.
> > > Without this, SMIL won't be used for interchange of caption and
> > > description data for anything longer than a minute or two.  A
> > > translation house shouldn't have to unwind a bazillion audio
> > > descriptions and captions in umpteen other languages to
> > > insert its French translation.
> > >
> > > Regards,
> > > --Brad
> > > ___________
> > > Brad_Botkin@wgbh.org   Director, Technology & Systems Development
> > > (v/f) 617.300.3902               NCAM/WGBH - National Center for
> > > 125 Western Ave Boston MA 02134              Accessible Media
> > > ___________
> > >
> > >
> > > geoff freed wrote:
> > >
> > > > Hi, Eric:
> > > >
> > > > SMIL 2.0 provides support for audio descriptions via a test
> > > attribute, systemAudioDesc.  The author can record audio
> > > >  descriptions digitally and synchronize them into a SMIL
> > > presentation using this attribute, similar to how captions are
> > > >  synchronized into SMIL presentations using systemCaptions
> > > (or system-captions, as it is called in SMIL 1.0).
> > > >
> > > > Additionally, using SMIL2.0's <excl> and <priorityClass>
> > > elements, the author may pause a video track
> > > >  automatically, play an extended audio description and,
> > > when the description is finished, resume playing the video
> > > >  track.  This will be a boon for situations  where the
> > > natural pauses in the program audio aren't sufficient for audio
> > > >  descriptions.
> > > >
> > > > Geoff Freed
> > > > CPB/WGBH National Center for Accessible Media (NCAM)
> > > > WGBH Educational Foundation
> > > > geoff_freed@wgbh.org
> > > >
> > > > On Wednesday, October 25, 2000, thierry michel
> > > <tmichel@w3.org> wrote:
> > > > >
> > > > >> My questions concern the use of SMIL for developing
> > > auditory descriptions
> > > > >> for multimedia presentations.
> > > > >>
> > > > >> The Web Content Accessibility Guidelines (WCAG) version
> > > 1.0 of W3C/WAI
> > > > >> indicates the possibility of using speech synthesis for
> > > providing auditory
> > > > >> descriptions for multimedia presentations. Specifically,
> > > checkpoint 1.3 of
> > > > >> WCAG 1.0 reads:
> > > > >>
> > > > >> "1.3 Until user agents can automatically read aloud the
> > > text equivalent of
> > > > >a
> > > > >> visual track, provide an auditory description of the
> > > important information
> > > > >> of the visual track of a multimedia presentation. 
> [Priority 1]
> > > > >> Synchronize the auditory description with the audio 
> track as per
> > > > >checkpoint
> > > > >> 1.4. Refer to checkpoint 1.1 for information about
> > > textual equivalents for
> > > > >> visual information." (WCAG 1.0, checkpoint 1.3).
> > > > >>
> > > > >> In the same document in the definition of 
> "Equivalent", we read:
> > > > >>
> > > > >> "One example of a non-text equivalent is an auditory
> > > description of the
> > > > >key
> > > > >> visual elements of a presentation. The description is
> > > either a prerecorded
> > > > >> human voice or a synthesized voice (recorded or
> > > generated on the fly). The
> > > > >> auditory description is synchronized with the audio 
> track of the
> > > > >> presentation, usually during natural pauses in the audio
> > > track. Auditory
> > > > >> descriptions include information about actions, body
> > > language, graphics,
> > > > >and
> > > > >> scene changes."
> > > > >>
> > > > >> My questions are as follows:
> > > > >>
> > > > >> 1. Does SMIL 2.0 support the development of synthesized
> > > speech auditory
> > > > >> descriptions?
> > > > >>
> > > > >> 2. If the answer to question #1 is "Yes", then briefly
> > > describe the
> > > > >support
> > > > >> that is provided.
> > > > >>
> > > > >> 3. If the answer to question #1 is "No", then please
> > > describe any plans
> > > > >for
> > > > >> providing such support in the future.
> > > > >>
> > > > >> Thanks very much for your consideration.
> > > > >>
> > > > >> - Eric G. Hansen
> > > > >> Development Scientist
> > > > >> Educational Testing Service (ETS)
> > > > >> Princeton, NJ 08541
> > > > >> ehansen@ets.org
> > > > >> Co-Editor, W3C/WAI User Agent Accessibility Guidelines
> > > > >>
> > > > >
> > >
> > >
> 
> 

Received on Friday, 27 October 2000 16:25:28 UTC