W3C home > Mailing lists > Public > www-smil@w3.org > October to December 2000

Re: Synthesized-speech auditory descriptions

From: Brad Botkin <brad_botkin@wgbh.org>
Date: Tue, 07 Nov 2000 15:12:04 -0500
Message-ID: <3A086214.DDDA122A@wgbh.org>
To: "Cohen, Aaron M" <aaron.m.cohen@intel.com>
CC: geoff freed <geoff_freed@wgbh.org>, "Hansen, Eric" <ehansen@ets.org>, "'symm@w3.org'" <symm@w3.org>, www-smil@w3.org, thierry michel <tmichel@w3.org>, www-smil-request@w3.org
Aaron,
Fair enough. Let's table it until the next round's requirements start. 
I agree that there's simply a hole to fill, and I didn't mean to single
out SMIL.  But SMIL may be the best place to fill *all* of the holes left
by other languages.
--Brad
__________
Brad_Botkin@wgbh.org   Director, Technology & Systems Development
617.300.3902 (v/f)               NCAM/WGBH - National Center for
125 Western Ave Boston MA 02134              Accessible Media
__________

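The mechanism at the center of this exchange — pairing a prerecorded audio description with a text equivalent that a player may render through speech synthesis — can be sketched in SMIL 2.0 markup roughly as follows. This is an editorial sketch adapted from Aaron's example in the quoted discussion, not an excerpt from the specification; the filenames are illustrative only.

```xml
<par>
        <!-- the prerecorded audio description clip -->
        <audio src="snippet8043.wav"/>
        <!-- verbatim text of the same description, kept in a separate
             file because SMIL integrates but does not define media;
             a player honoring systemAudioDesc="on" could render this
             text with local speech synthesis -->
        <text systemAudioDesc="on" src="lady.txt"/>
</par>
```

The thread's open question is precisely whether such a `<text>` element is a *guaranteed* location for the description, or merely one place it might live.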

"Cohen, Aaron M" wrote:
> 
> Brad:
> 
> I'm not trying to keep going round and round about this, and I do understand
> what you are saying, but I think that you are missing what I am saying.
> Simply (but at length, please forgive me):
> 
> 1. SMIL 2.0 supports the current standard set of accessibility attributes on
> _all_ elements. These are the attributes recommended by the WAI team and
> used in other W3C recommendations.
> 
> 2. SMIL 2.0, like SMIL 1.0, is otherwise media agnostic. It is an author's
> responsibility to include the media that they make available to users, and
> alternatives to that media. This is a different means of creating accessible
> presentations than fixed attributes, but it is more flexible and integrates
> well with SMIL.
> 
> 3. SMIL 1.0's accessibility was considered an improvement by WAI at the
> time, and SMIL 2.0 provides even more accessibility. We have responded to,
> and included, most of the features that WAI requested prior to last call.
> 
> 4. The issue that you bring up, that synthesized-speech text equivalents
> need to live in a specific, guaranteed place, is not specifically a SMIL
> issue. SMIL does support this, but not in the particular manner that you
> request. This is where I suggested that some general markup be developed to
> be reused in multiple languages. Much like alt and longdesc are now.
> 
> 5. As you imply, requiring specific media support such as suggested in #3
> and beyond #1 and #2 is not an appropriate new topic at this stage. It _is_
> appropriate during a requirements-gathering phase or as comments when
> working drafts are released. Certainly, this is a valid issue to consider
> for the next version.
> 
> 6. Granted that synthesized-speech text equivalents are worth considering,
> it's not absolutely clear to me that these require/deserve their own
> special-use attributes and elements, as opposed to alt and longdesc.
> Synthesized-speech can be considered just a rendering method. Selecting
> exactly what content needs to be rendered may best be decided by authors
> providing the media to a user, and the user agent providing the user with
> options. Such as an option to "render text media with synthesized-speech",
> or "render the alt text with synthesized-speech". It's not clear to me why
> the synthesized speech text data should be at the same basic level as the
> alt and longdesc attributes are, but I do think that the topic warrants
> discussion and some conscientious design, not just a quick fix.
> 
> 7. Saying that SMIL 2.0 accessibility is not ready for primetime using an
> argument that applies to all W3C languages (XHTML, SVG, etc.) means that the
> W3C has a hole to fill, not that SMIL is lacking the current standard of
> support or that the SYMM working group did not take into account design
> requirements. These other languages are not going to be delayed for
> synthesized speech, and neither should SMIL.
> 
> What I suggest is that the WAI team and other interested parties come to
> some consensus on what kind of support for synthesized-speech is necessary,
> and SYMM will strongly consider this recommendation for the next version.
> Other WGs should be informed as well, since the same issues apply to them.
> 
> -Aaron
> 
> > -----Original Message-----
> > From: Brad Botkin [mailto:brad_botkin@wgbh.org]
> > Sent: Tuesday, November 07, 2000 4:13 AM
> > To: Cohen, Aaron M
> > Cc: Hansen, Eric; 'symm@w3.org'; geoff freed; www-smil@w3.org; thierry
> > michel; www-smil-request@w3.org
> > Subject: Re: Synthesized-speech auditory descriptions
> >
> >
> > Aaron,
> > I apologize for being such a pest about this.  The problem with
> >
> >       > > ... an implementation can provide rendering this
> >       > > (<text> element) text via voice synthesis
> >
> > is that in order for player and browser developers to
> > incorporate accessibility features such as
> > captioning, audio description, tts-audio desc, etc., they
> > need to be *GUARANTEED* that the data
> > lives in a particular place. That is, it must be unambiguous
> > to the parser that it's picking up the
> > desired source data.  The access data is very specific
> > metadata, which can't *MAYBE* live in the
> > <text> element, *MAYBE* live in the <alt> element, *MAYBE*
> > live in the <longdesc> element.  Most
> > metadata is simply embellishment.  Access metadata *IS* the
> > data, just in another format.  You
> > supply a videoregion and a src=..., precisely so that the
> > display engine knows what to play.  You
> > could just as easily say
> >
> >       "maybe the media filename can live in the <alt> tag.
> >       Sometimes it will, sometimes it won't, good luck."
> >
> > Accessibility in SMIL is not about creating spots in SMIL for
> > just another pretty presentation
> > element. It's about allowing the essence of the presentation
> > to be found and rendered.
> >
> > I understand that it may be late in the SMIL 2.0 game to be
> > talking about any additional
> > accessibility-specific markup, but it's my opinion that the
> > need is immediate, accessibility in SMIL
> > is not ready for primetime, and SMIL 2.0 can go forward
> > without it but it will need to be raised
> > immediately in the next round, with additional specific
> > markup.  That SMIL is media-agnostic is
> > necessary but not sufficient for rational implementation of
> > accessibility.
> >
> > --Brad
> >
> >
> > "Cohen, Aaron M" wrote:
> > >
> > > Eric:
> > > I don't interpret the guidelines the way that you do. It
> > seems that you
> > > assume that alt and longdesc cannot be rendered by
> > synthesized speech. Also,
> > > we have a <text> element, and an implementation can provide
> > rendering this
> > > text via voice synthesis.
> > >
> > > Where we seem to differ is that it seems that your preference is for
> > > specialized synthesized speech markup, where I think that
> > much of what we
> > > already have can be used.
> > >
> > > The exploratory comments that I made were in relation to
> > specialized support
> > > for synthesized speech, not to say that there is no way to
> > incorporate
> > > synthetic speech into a smil presentation.
> > >
> > > Here is how I answer these specific questions:
> > > > 1. Does SMIL 2.0 support the development of synthesized
> > > > speech auditory
> > > > descriptions?
> > > Yes. SMIL 2.0, like SMIL 1.0, is media agnostic. Any type
> > of media can be
> > > supported in SMIL. It is up to the implementation to
> > provide rendering for
> > > the supported media types, and alternative rendering
> > methods to enhance
> > > accessibility.
> > >
> > > > SMIL does not currently support synthesized speech auditory
> > > > descriptions. It
> > > > does support prerecorded auditory descriptions.
> > > This is not so. SMIL has exactly the same support for
> > synthesized speech
> > > auditory descriptions as it does for pre-recorded auditory
> > descriptions.
> > > SMIL is a media integration language, and does not define
> > media itself.
> > >
> > > The text that you quote does not call out synthetic speech
> > specifically, but
> > > it is not excluded.
> > >
> > > > 2. If the answer to question #1 is "Yes", then briefly
> > > > describe the support
> > > > that is provided.
> > > 1. A user agent can render alt/longdesc as synthesized speech.
> > > 2. A user agent can provide a synthetic speech renderer for
> > <text> media
> > > elements.
> > > 3. A user can control the rendered media via system
> > preferences which map to
> > > system test attributes. This allows the author to set the
> > synthesized speech
> > > up as captions or overdub or audio descriptions.
> > > 4. SMIL 2.0 has author defined customTest attributes, to
> > allow turning
> > > on/off media based on document and user specific criteria.
> > >
> > > -Aaron
> > >
> > > > -----Original Message-----
> > > > From: Hansen, Eric [mailto:ehansen@ets.org]
> > > > Sent: Wednesday, November 01, 2000 8:14 AM
> > > > To: 'Cohen, Aaron M'; 'Brad Botkin'
> > > > Cc: 'symm@w3.org'; geoff freed; Hansen, Eric;
> > www-smil@w3.org; thierry
> > > > michel; www-smil-request@w3.org
> > > > Subject: RE: Synthesized-speech auditory descriptions
> > > >
> > > >
> > > > I have an additional comment and then I will summarize.
> > > >
> > > > SOME BASIC REQUIREMENTS FOR MULTIMEDIA PRESENTATIONS
> > > >
> > > > From the glossary entry for the term "Equivalent" in the W3C
> > > > Web Content
> > > > Accessibility Guidelines (WCAG) 1.0 [3], we see that for
> > > > multimedia presentations there are three major forms of
> > > > equivalent: captions, auditory descriptions, and collated text
> > > > transcripts.
> > > >
> > > > "A caption is a text transcript for the audio track of a
> > > > video presentation
> > > > that is synchronized with the video and audio tracks.
> > > > Captions are generally
> > > > rendered visually by being superimposed over the video,
> > which benefits
> > > > people who are deaf and hard-of-hearing, and anyone who
> > > > cannot hear the
> > > > audio (e.g., when in a crowded room). A collated text
> > > > transcript combines
> > > > (collates) captions with text descriptions of video information
> > > > (descriptions of the actions, body language, graphics, and
> > > > scene changes of
> > > > the video track). These text equivalents make presentations
> > > > accessible to
> > > > people who are deaf-blind and to people who cannot play
> > > > movies, animations,
> > > > etc. It also makes the information available to search engines.
> > > >
> > > > "One example of a non-text equivalent is an auditory
> > > > description of the key
> > > > visual elements of a presentation. The description is either
> > > > a prerecorded
> > > > human voice or a synthesized voice (recorded or generated on
> > > > the fly). The
> > > > auditory description is synchronized with the audio track of the
> > > > presentation, usually during natural pauses in the audio
> > > > track. Auditory
> > > > descriptions include information about actions, body
> > > > language, graphics, and
> > > > scene changes."
> > > >
> > > > See
> > > > It appears that SMIL 2.0 provides support for captions
> > and prerecorded
> > > > auditory descriptions but not for synthesized speech auditory
> > > > descriptions
> > > > or collated text transcripts. I have already pointed out the
> > > > importance of
> > > > synthesized speech auditory descriptions (see the WCAG 1.0
> > > > checkpoints quoted below):
> > > >
> > > > 1.1 Provide a text equivalent for every non-text element
> > > > (e.g., via "alt",
> > > > "longdesc", or in element content). This includes:
> > images, graphical
> > > > representations of text (including symbols), image map
> > > > regions, animations
> > > > (e.g., animated GIFs), applets and programmatic objects,
> > > > ascii art, frames,
> > > > scripts, images used as list bullets, spacers, graphical
> > > > buttons, sounds
> > > > (played with or without user interaction), stand-alone audio
> > > > files, audio
> > > > tracks of video, and video. [Priority 1]
> > > > For example, in HTML:
> > > > Use "alt" for the IMG, INPUT, and APPLET elements, or
> > provide a text
> > > > equivalent in the content of the OBJECT and APPLET elements.
> > > > For complex content (e.g., a chart) where the "alt" text does
> > > > not provide a
> > > > complete text equivalent, provide an additional
> > description using, for
> > > > example, "longdesc" with IMG or FRAME, a link inside an
> > > > OBJECT element, or a
> > > > description link.
> > > > For image maps, either use the "alt" attribute with AREA, or
> > > > use the MAP
> > > > element with A elements (and other text) as content.
> > > > Refer also to checkpoint 9.1 and checkpoint 13.10.
> > > >
> > > > Techniques for checkpoint 1.1
> > > > 1.3 Until user agents can automatically read aloud the text
> > > > equivalent of a
> > > > visual track, provide an auditory description of the
> > > > important information
> > > > of the visual track of a multimedia presentation. [Priority 1]
> > > > Synchronize the auditory description with the audio track as
> > > > per checkpoint
> > > > 1.4. Refer to checkpoint 1.1 for information about textual
> > > > equivalents for
> > > > visual information.
> > > > Techniques for checkpoint 1.3
> > > > 1.4 For any time-based multimedia presentation (e.g., a movie
> > > > or animation),
> > > > synchronize equivalent alternatives (e.g., captions or
> > > > auditory descriptions
> > > > of the visual track) with the presentation. [Priority 1]
> > > >
> > > >
> > > > I am trying to summarize what has been said to this point on
> > > > this thread
> > > > that responds to my earlier questions [1]
> > > >
> > > > SUMMARY
> > > >
> > > > 1. Does SMIL 2.0 support the development of synthesized
> > > > speech auditory
> > > > descriptions?
> > > >
> > > > SMIL does not currently support synthesized speech auditory
> > > > descriptions. It
> > > > does support prerecorded auditory descriptions.
> > > >
> > > > 2. If the answer to question #1 is "Yes", then briefly
> > > > describe the support
> > > > that is provided.
> > > >
> > > > N/A
> > > >
> > > > 3. If the answer to question #1 is "No", then please describe
> > > > any plans for
> > > > providing such support in the future.
> > > >
> > > > There are currently no plans for including this in SMIL.
> > Aaron Cohen
> > > > suggests that "Probably what is needed is a general
> > > > accessible markup that
> > > > can be used in SMIL, XHTML, SVG, etc. SMIL would just
> > adopt this as a
> > > > content type. This new content type could be designed to
> > > > reuse a lot of
> > > > SMIL content control, and it could have additional
> > > > indirection mechanisms to
> > > > enable the kind of structured grouping that you mention. But
> > > > that's another
> > > > spec, and for now the vendors are doing their own thing." [2]
> > > >
> > > > ====
> > > >
> > > > COMMENT
> > > >
> > > > It seems to me that if SMIL 2.0 proceeds to Recommendation
> > > > status, it would
> > > > be good to have done several things.
> > > >
> > > > 1. Affirm W3C's commitment to supporting Web accessibility,
> > > > particularly the
> > > > multimedia-related requirements of the Web Content
> > > > Accessibility Guidelines
> > > > (WCAG), User Agent Accessibility Guidelines (UAAG), and the
> > > > Authoring Tool Accessibility Guidelines (ATAG). Captions, auditory
> > descriptions, and
> > > > collated text transcripts stand out in my mind in this
> > > > regard. (See WCAG 1.0
> > > > [3]).
> > > >
> > > > 2. Explain why synthesized speech auditory descriptions are
> > > > not or cannot be
> > > > part of the SMIL 2.0 specification.
> > > >
> > > > 3. Suggest a plan for supporting synthesized speech auditory
> > > > descriptions. I
> > > > personally would like to see some kind of commitment from the
> > > > W3C to support
> > > > this, either as part of the next version of SMIL or perhaps
> > > > as Aaron has
> > > > suggested, another specification that could be reused by
> > > > SMIL, XHTML, SVG,
> > > > etc.
> > > >
> > > > 4. Suggest techniques for providing such auditory
> > > > descriptions and collated
> > > > text transcripts  until they are fully integrated into W3C
> > > > specifications.
> > > >
> > > > I think that it would be appropriate to have at least a
> > > > summary of such
> > > > information as part of the Recommendation. I am concerned
> > > > that without such
> > > > information within the document, people may doubt the W3C's
> > > > commitment to
> > > > accessible media.
> > > >
> > > >
> > > >
> > > > [1]
> > http://lists.w3.org/Archives/Public/www-smil/2000OctDec/0050.html
> > > > [2]
> > http://lists.w3.org/Archives/Public/www-smil/2000OctDec/0062.html
> > > > [3] http://www.w3.org/TR/WAI-WEBCONTENT/
> > > >
> > > > -----Original Message-----
> > > > From: Cohen, Aaron M [mailto:aaron.m.cohen@intel.com]
> > > > Sent: Monday, October 30, 2000 12:40 PM
> > > > To: 'Brad Botkin'
> > > > Cc: 'symm@w3.org'; geoff freed; Hansen, Eric;
> > www-smil@w3.org; thierry
> > > > michel; www-smil-request@w3.org
> > > > Subject: RE: Synthesized-speech auditory descriptions
> > > >
> > > >
> > > > Brad:
> > > > As far as the systemAudioDesc only taking on/off, that's
> > > > true, but you can
> > > > combine it with the other test attributes, such as
> > > > systemLanguage, and get
> > > > many, many combinations. Geoff Freed and the WAI people are
> > > > reviewing those
> > > > combinations for completeness, so if you think that we
> > are missing a
> > > > specific use case, please let us know.
> > > >
> > > > As far as separate text files for accessibility documents,
> > > > you are right,
> > > > that's a thorny issue for SMIL, which has left the definition
> > > > of media (as
> > > > opposed to the integration) to the player/content developers.
> > > >
> > > > Probably what is needed is a general accessible markup that
> > > > can be used in
> > > > SMIL, XHTML, SVG, etc. SMIL would just adopt this as a
> > > > content type. This
> > > > new content type could be designed to reuse a lot of SMIL
> > > > content control,
> > > > and it could have additional indirection mechanisms to enable
> > > > the kind of
> > > > structured grouping that you mention. But that's another
> > > > spec, and for now
> > > > the vendors are doing their own thing.
> > > >
> > > > -Aaron
> > > >
> > > > > -----Original Message-----
> > > > > From: Brad Botkin [mailto:Brad_Botkin@wgbh.org]
> > > > > Sent: Sunday, October 29, 2000 4:33 AM
> > > > > To: Cohen, Aaron M
> > > > > Cc: 'symm@w3.org'; geoff freed; Hansen, Eric;
> > > > www-smil@w3.org; thierry
> > > > > michel; www-smil-request@w3.org
> > > > > Subject: Re: Synthesized-speech auditory descriptions
> > > > >
> > > > >
> > > > > Aaron,
> > > > > What seems to be missing from
> > > > >
> > > > > > > <par>
> > > > > > >         <audio src="snippet8043.wav">
> > > > > > >                 <description xml:lang="en">
> > > > > > >                         The lady in the pink sweater
> > > > > picks up the pearl
> > > > > > > necklace from the table and walks to the door.
> > > > > > >                 </description>
> > > > > > >                 <description xml:lang="fr">
> > > > > > >                         Oui.
> > > > > > >                 </description>
> > > > > > >         </audio>
> > > > > > > </par>
> > > > >
> > > > > is a way to uniquely and unambiguously identify the text
> > > > above as the
> > > > > audio description (unless the <description> tag is just
> > that, but I
> > > > > assume "<description xml....>" here is a generic term
> > unrelated to
> > > > > "audio description" as we're talking about it).
> > > > >
> > > > > The systemAudioDesc attribute is a way to signal a player that some
> > > > > particular content should be played for some users.  But
> > > > the specific
> > > > > rendering device has the job of deciding which media
> > > > element to play,
> > > > > the audio (uniquely identified by the "src" attribute) or the
> > > > > transcription of that element (not yet uniquely identified).
> > > > >
> > > > > The point is that there may be more than just one text string
> > > > > associated
> > > > > with an audio element, only one of which is the
> > > > transcription of that
> > > > > audio.  systemAudioDesc *almost* spoke to this need,
> > > > except that it
> > > > > only takes an "on/off" value, which seems insufficient to
> > > > the task of
> > > > > allowing rendering engines to adequately handle
> > > > accessibility issues.
> > > > > Since accessibility is being legislated in the tv and
> > > > multimedia arena
> > > > > as we speak, it seems prudent to create a set of extensible
> > > > > accessibility tags which will allow those industries to
> > > > easily utilize
> > > > > SMIL in their workflow.  It's true that these elements
> > would not be
> > > > > general, reusable ones, and I sympathize with your reticence to
> > > > > generate more special-case markup. Nonetheless....
> > > > >
> > > > > In another vein, how about the issue of how to manage the
> > > > grouping of
> > > > > synched accessibility objects (captions and descriptions,
> > > > for example)
> > > > > in separate text files?  I'm sure this is thorny, but
> > the current
> > > > > existing formats (RealText, SAMI, QuickTime qtText) all
> > > > offer a way to
> > > > > group these related elements (for captioning).  Current
> > thoughts?
> > > > >
> > > > > --Brad
> > > > >
> > > > >
> > > > >
> > > > > "Cohen, Aaron M" wrote:
> > > > > >
> > > > > > Brad:
> > > > > > That specific use of verbatim text is what systemAudioDesc
> > > > > is for. It can be
> > > > > > used on text media elements that can contain the verbatim
> > > > > text. The pair of
> > > > > > audio and text elements can be wrapped in a par and given a
> > > > > specific title,
> > > > > > and the unit used in a presentation just like an individual
> > > > > media element.
> > > > > >
> > > > > > Why would it be better to have special case markup when the
> > > > > generalized
> > > > > > capabilities that we have cover the use cases?
> > > > > >
> > > > > > Your example confuses me, since it doesn't seem to give any
> > > > > more capability
> > > > > > than we already have with XHTML+SMIL:
> > > > > >
> > > > > > <par>
> > > > > >         <audio src="snippet8043.wav"/>
> > > > > >         <p systemAudioDesc="on">The lady in the pink
> > > > > sweater picks up the
> > > > > > pearl necklace from the table and walks to the door.</p>
> > > > > > </par>
> > > > > >
> > > > > > Even less, since you can't hang an xml:lang off the
> > > > > attribute, necessitating
> > > > > duplication of the media object reference for each language
> > > > > of the text
> > > > > > description.
> > > > > >
> > > > > > With SMIL 2.0, you have to put the text in alt or another
> > > > > file, because SMIL
> > > > > > does not itself define media:
> > > > > > <par>
> > > > > >         <audio src="snippet8043.wav"/>
> > > > > >         <text systemAudioDesc="on" src="lady.txt"/>
> > > > > > </par>
> > > > > >
> > > > > > If you are saying that there should be some general
> > > > > scalable mechanism to
> > > > > > make this easier to maintain, I agree with you, with the
> > > > additional
> > > > > stipulation that this is not just a SMIL issue, but an
> > > > > issue for all XML
> > > > > > languages that have non-text content.
> > > > > >
> > > > > > For the next version of SMIL, we plan to adopt SVG's
> > > > > description element,
> > > > > > which would allow you to do something like this in SMIL:
> > > > > >
> > > > > > <par>
> > > > > >         <audio src="snippet8043.wav">
> > > > > >                 <description xml:lang="en">
> > > > > >                         The lady in the pink sweater picks
> > > > > up the pearl
> > > > > > necklace from the table and walks to the door.
> > > > > >                 </description>
> > > > > >                 <description xml:lang="fr">
> > > > > >                         Oui.
> > > > > >                 </description>
> > > > > >         </audio>
> > > > > > </par>
> > > > > >
> > > > > > Having an attribute on elements that are specially meant to
> > > > > be a literal
> > > > > > text translation of (possibly long) media does not scale
> > > > > well. The sub
> > > > > > elements make more sense.
> > > > > >
> > > > > > I think that this is the beginning of discussion about the
> > > > > need to create a
> > > > > > set of reusable markup elements that fit the identified
> > > > > needs. I can
> > > > > > imagine <description>, <transcription>, and <title> child
> > > > > elements, all
> > > > > > enclosing text.
> > > > > >
> > > > > > My point is that these are real problems that need
> > > > > solutions, but the
> > > > > > solutions need to be general, reusable and thought out in
> > > > > detail. This will
> > > > > > require some dedicated people and some time. This is way
> > > > > too late in the
> > > > > > SMIL 2.0 process to start integrating this kind of thing
> > > > > into the language,
> > > > > > but it is something that should be done for re-use by
> > everyone and
> > > > > > integrated into SMIL (and XHTML 2.0?, SVG?) in the future.
> > > > > >
> > > > > > -Aaron
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Brad Botkin [mailto:brad_botkin@wgbh.org]
> > > > > > > Sent: Friday, October 27, 2000 12:30 PM
> > > > > > > To: Cohen, Aaron M
> > > > > > > Cc: geoff freed; Hansen, Eric; www-smil@w3.org;
> > thierry michel;
> > > > > > > www-smil-request@w3.org
> > > > > > > Subject: Re: Synthesized-speech auditory descriptions
> > > > > > >
> > > > > > >
> > > > > > > Aaron,
> > > > > > > I think the actual transcription of the audio deserves
> > > > > its own tag,
> > > > > > > since it's so specific. For the same reason that
> > you created a
> > > > > > > systemAudioDesc attribute and didn't just use the alt attribute.  You
> > > > > need a place
> > > > > > > to look that's consistent.  I believe the longdesc is
> > > > > intended to be
> > > > > > > used as simply a longer text description of the
> > > > > underlying graphic or
> > > > > > > media file. And in the case of audio description snippets,
> > > > > > > the longdesc
> > > > > > > could be used to hold timing or other metadata related to
> > > > > the snippet
> > > > > > > but not specifically voiced. I think that verbatim text
> > > > will prove
> > > > > > > invaluable in the future, for searching, etc., and you
> > > > > should consider
> > > > > > > creating a specific tag for this.
> > > > > > > --Brad
> > > > > > >
> > > > > > >
> > > > > > > "Cohen, Aaron M" wrote:
> > > > > > > >
> > > > > > > > Brad:
> > > > > > > >
> > > > > > > > We also have alt and longdesc, either of which could be
> > > > > > > used by a player to
> > > > > > > > provide accessory or alternative text content.
> > These can be
> > > > > > > combined with
> > > > > > > > the systemLanguage and other test attributes to provide
> > > > > > > many combinations of
> > > > > > > > accessibility and internationalization.
> > > > > > > > -Aaron
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Brad Botkin [mailto:brad_botkin@wgbh.org]
> > > > > > > > > Sent: Friday, October 27, 2000 5:41 AM
> > > > > > > > > To: geoff freed
> > > > > > > > > Cc: Hansen, Eric; www-smil@w3.org; thierry michel;
> > > > > > > > > www-smil-request@w3.org
> > > > > > > > > Subject: Re: Synthesized-speech auditory descriptions
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Geoff,
> > > > > > > > > True but incomplete.  It sounds like Eric is asking
> > > > for a tag
> > > > > > > > > which identifies text as a transcription of the
> > underlying
> > > > > > > > > audio.   Something like:
> > > > > > > > >
> > > > > > > > > <par>
> > > > > > > > > .....
> > > > > > > > >     <audio    systemAudioDesc="on"
> > > > > > > > >                     AudioDescText="The lady in the pink
> > > > > > > > > sweater picks up the pearl necklace from the table and
> > > > > > > walks to the
> > > > > > > > > door."
> > > > > > > > >                     src="snippet8043.wav"/>
> > > > > > > > > .....
> > > > > > > > > </par>
> > > > > > > > >
> > > > > > > > > It's a great idea, since the text is
> > super-thin, making it
> > > > > > > > > appropriate for transmission in narrow pipes with local
> > > > > > > > > text-to-speech synthesis for playback.  Note
> > that the volume
> > > > > > > > > of snippets in a longer piece, like a movie, is
> > huge, just
> > > > > > > > > like closed captions.  Inclusion of 1000 audio
> > description
> > > > > > > > > snippets and 2000 closed captions, each in 3
> > languages, each
> > > > > > > > > with its own timecode, all in the same SMIL
> > file will make
> > > > > > > > > for some *very* unfriendly  files.  Better would be
> > > > > to provide a
> > > > > > > > > mechanism which allows the SMIL file to
> > gracefully point to
> > > > > > > > > separate files each containing the timecoded AD
> > > > snippets (with
> > > > > > > > > transcriptions per the above) and timecoded
> > captions.  It
> > > > > > > > > requires the SMIL player to gracefully overlay
> > the external
> > > > > > > > > timeline onto the intrinsic timeline of the SMIL file.
> > > > > > > > > Without this, SMIL won't be used for interchange of
> > > > > caption and
> > > > > > > > > description data for anything longer than a minute
> > > > or two.  A
> > > > > > > > > translation house shouldn't have to unwind a
> > bazillion audio
> > > > > > > > > descriptions and captions in umpteen other languages to
> > > > > > > > > insert its French translation.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > --Brad
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > geoff freed wrote:
> > > > > > > > >
> > > > > > > > > > Hi, Eric:
> > > > > > > > > >
> > > > > > > > > > SMIL 2.0 provides support for audio descriptions
> > > > via a test
> > > > > > > > > attribute, systemAudioDesc.  The author can record audio
> > > > > > > > > >  descriptions digitally and synchronize them
> > into a SMIL
> > > > > > > > > presentation using this attribute, similar to how
> > > > captions are
> > > > > > > > > >  synchronized into SMIL presentations using
> > systemCaptions
> > > > > > > > > (or system-captions, as it is called in SMIL 1.0).
> > > > > > > > > >
> > > > > > > > > > Additionally, using SMIL2.0's <excl> and
> > <priorityClass>
> > > > > > > > > elements, the author may pause a video track
> > > > > > > > > >  automatically, play an extended audio
> > description and,
> > > > > > > > > when the description is finished, resume
> > playing the video
> > > > > > > > > >  track.  This will be a boon for situations  where the
> > > > > > > > > natural pauses in the program audio aren't sufficient
> > > > > for audio
> > > > > > > > > >  descriptions.
> > > > > > > > > >
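[A minimal sketch of the <excl>/<priorityClass> pattern Geoff outlines: when the description begins, it pauses its peer (the video), which resumes when the description ends. The begin times, the peers value, and the file names are illustrative assumptions.]

```xml
<!-- Sketch only: peers="pause" makes the description pause the video
     while it plays; the video resumes afterward. Timing and file
     names are illustrative. -->
<excl>
  <priorityClass peers="pause">
    <video src="movie.mpg" begin="0s"/>
    <audio src="extended-desc.wav" begin="30s" systemAudioDesc="on"/>
  </priorityClass>
</excl>
```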
> > > > > > > > > > Geoff Freed
> > > > > > > > > > CPB/WGBH National Center for Accessible Media (NCAM)
> > > > > > > > > > WGBH Educational Foundation
> > > > > > > > > > geoff_freed@wgbh.org
> > > > > > > > > >
> > > > > > > > > > On Wednesday, October 25, 2000, thierry michel
> > > > > > > > > > <tmichel@w3.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > >> My questions concern the use of SMIL for developing
> > > > > > > > > > >> auditory descriptions for multimedia presentations.
> > > > > > > > > > >>
> > > > > > > > > > >> The Web Content Accessibility Guidelines (WCAG) version
> > > > > > > > > > >> 1.0 of W3C/WAI indicates the possibility of using
> > > > > > > > > > >> speech synthesis for providing auditory descriptions
> > > > > > > > > > >> for multimedia presentations.  Specifically, checkpoint
> > > > > > > > > > >> 1.3 of WCAG 1.0 reads:
> > > > > > > > > > >>
> > > > > > > > > > >> "1.3 Until user agents can automatically read aloud the
> > > > > > > > > > >> text equivalent of a visual track, provide an auditory
> > > > > > > > > > >> description of the important information of the visual
> > > > > > > > > > >> track of a multimedia presentation. [Priority 1]
> > > > > > > > > > >> Synchronize the auditory description with the audio
> > > > > > > > > > >> track as per checkpoint 1.4. Refer to checkpoint 1.1
> > > > > > > > > > >> for information about textual equivalents for visual
> > > > > > > > > > >> information." (WCAG 1.0, checkpoint 1.3).
> > > > > > > > > > >>
> > > > > > > > > > >> In the same document, in the definition of
> > > > > > > > > > >> "Equivalent", we read:
> > > > > > > > > > >>
> > > > > > > > > > >> "One example of a non-text equivalent is an auditory
> > > > > > > > > > >> description of the key visual elements of a
> > > > > > > > > > >> presentation. The description is either a prerecorded
> > > > > > > > > > >> human voice or a synthesized voice (recorded or
> > > > > > > > > > >> generated on the fly). The auditory description is
> > > > > > > > > > >> synchronized with the audio track of the presentation,
> > > > > > > > > > >> usually during natural pauses in the audio track.
> > > > > > > > > > >> Auditory descriptions include information about
> > > > > > > > > > >> actions, body language, graphics, and scene changes."
> > > > > > > > > > >>
> > > > > > > > > > >> My questions are as follows:
> > > > > > > > > > >>
> > > > > > > > > > >> 1. Does SMIL 2.0 support the development of synthesized
> > > > > > > > > > >> speech auditory descriptions?
> > > > > > > > > > >>
> > > > > > > > > > >> 2. If the answer to question #1 is "Yes", then briefly
> > > > > > > > > > >> describe the support that is provided.
> > > > > > > > > > >>
> > > > > > > > > > >> 3. If the answer to question #1 is "No", then please
> > > > > > > > > > >> describe any plans for providing such support in the
> > > > > > > > > > >> future.
> > > > > > > > > > >>
> > > > > > > > > > >> Thanks very much for your consideration.
> > > > > > > > > > >>
> > > > > > > > > > >> - Eric G. Hansen
> > > > > > > > > > >> Development Scientist
> > > > > > > > > > >> Educational Testing Service (ETS)
> > > > > > > > > > >> Princeton, NJ 08541
> > > > > > > > > > >> ehansen@ets.org
> > > > > > > > > > >> Co-Editor, W3C/WAI User Agent Accessibility Guidelines
> > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > >
> > > >
> >
Received on Tuesday, 7 November 2000 15:13:52 UTC

This archive was generated by hypermail 2.3.1 : Tuesday, 6 January 2015 20:34:23 UTC