Comments on Multimedia and Audio

Date: 28 December 1999, 00:48 hrs
To: CG, AU, UA Lists
Subject: Comments on Multimedia and Audio

This memo includes Eric Hansen's responses to comments by Wendy Chisholm and 
Madeleine Rothberg regarding a thread entitled "Revised Checkpoints: 
WCAG(1.4/1.3) and UAAG(2.5)" (see 
http://lists.w3.org/Archives/Public/w3c-wai-gl/1999OctDec/0165.html). Related 
threads include "[CG] Captions for audio clips" and "[UA] Issue 138".

SECTION 1 - Eric Hansen's comments on Wendy Chisholm's comments on Eric 
Hansen's comments.

Key: 

EH1: Original memo by Eric Hansen
WC: Wendy Chisholm (16 Dec 1999)
EH2: Eric Hansen (28 Dec 1999)


EH1::
>7. WAI should develop one or more specification documents (W3C Notes or 
>Recommendations) for:
>
>a. auditory descriptions, including (1) synthesized-speech auditory 
>descriptions and (2) prerecorded auditory descriptions (including 
>"prerecorded auditory description tracks" and "prerecorded auditory 
>description supplement tracks", the latter being explained later in this 
>document)
>b. captions
>c. synchronization of collated text transcripts
>d. synchronization of audio clips with their text transcripts
>
>I see document "c" as possibly encompassing "a" and "b". Even better, 
>perhaps all four items could be addressed together. (I am not 
>sure whether all these are within the charter of SMIL.)
>
>I would not expect the task of retrofitting existing "prerecorded auditory 
>description tracks" to the specifications to be difficult. Content and 
>data in existing captions could, I expect, be almost entirely reused in 
>new captions conforming to the captions specification.
WC::
This sounds like something that would be in the Techniques document, 
particularly documented in a SMIL-specific section/chapter.

EH1::
>====
>2. Avoid the use of "synchronized alternative equivalents" in WCAG.
>
>The term seems redundant.
WC:: ok

EH2:: ok

EH1::

>3. Avoid the use of "synchronized equivalents" in both WCAG and UAAG.
>
>This is important because often the components that are presented 
>together are not equivalent to each other. The term seems misleading.
>====
>
>4. Use the term "synchronized alternatives".
>
>Implies the idea that it is alternative content, which is essentially 
>true. This is my preferred term, I think.
>====
WC::
In some cases it is an alternative, but in others it is an equivalent (for 
example captions of speech, alt-text of bitmap image).  Also, I thought 
that we had decided to replace "alternative" with "equivalent" if the 
"alternative" was providing the _functional_ equivalent.  I suggest we 
stick with the term "synchronized equivalent."

EH2:: I have changed my mind and agree with you about using the term 
"synchronized equivalent". See definition in "Terminology, Etc." memo.

====

EH1::
>5. Use "visual track" and "auditory track"

>
>Use "visual track" and "auditory track" rather than video track and audio 
>track when referring to multimedia presentations.

WC:: ok

EH2:: Thanks…

====

EH1::
>
>6. Avoid the term "continuous alternatives".
>
>Not sure that this is a great term. It is probably best just to name the 
>specific things.

WC:: I did not see this in WCAG (neither the guidelines nor the Techniques); 
this must be a UAAG issue?

EH2:: It was a UA issue and I think that it was taken care of. We ought not 
to introduce terms that are not necessary.
====
EH1::
>7. Add synchronization to the glossary.
>
>"Synchronization, Synchronize, Synchronization Data, Synchronized 
>Alternatives"
>
>"Synchronization refers to sensible time-coordination of two or more 
>presentation components, particularly where at least one of the components 
>is a multimedia presentation (e.g., movie or animation) or _audio clip_ or 
>a portion of the presentation or audio clip."
>
>"For Web content developers, the requirement to synchronize means to 
>provide the data that will permit sensible time-coordinated presentation 
>by a user agent. For example, Web content developer can ensure that the 
>segments of caption text are neither too long nor too short and that they 
>mapped to segments of the visual track that are appropriate in length.

WC:: I can see adding this part to WCAG glossary.

EH2:: Based on recent discussion, I suggest the following words:

"Synchronization, Synchronized Equivalents"

"_Synchronization_ refers to sensible time-coordination of two or more 
presentation components, particularly where at least one of the components is 
a multimedia presentation (e.g., movie or animation) or an _audio 
presentation_."

"For Web content developers, the requirement to synchronize means to provide 
the data that will permit sensible time-coordinated presentation of content 
by a user agent. For example, Web content developer can ensure that the 
segments of caption text are neither too long nor too short and that they 
mapped to segments of the visual track that are appropriate in length."

A _synchronized equivalent_ is an equivalent that is synchronized with some 
other component, particularly the visual track or auditory track of a 
multimedia presentation or an audio-only presentation. The most prominent 
synchronized equivalents are auditory descriptions and captions. An 'auditory 
description' is considered a synchronized equivalent because it is 
synchronized with the auditory and visual tracks of a multimedia 
presentation. 'Captions' are also considered synchronized equivalents because 
they are synchronized with the auditory track or audio presentation. A 
collated text transcript to which synchronization information has been added 
may be similarly presented in synchronization with the auditory and visual 
tracks of a multimedia presentation.
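
For concreteness, here is a rough sketch (not a normative technique) of how 
such synchronized equivalents might be expressed in SMIL 1.0. The file names 
and the region layout are hypothetical:

   <smil>
     <head>
       <layout>
         <root-layout width="320" height="260"/>
         <region id="movie-region" width="320" height="240"/>
         <region id="caption-region" top="240" width="320" height="20"/>
       </layout>
     </head>
     <body>
       <par>
         <!-- visual and auditory tracks of the multimedia presentation -->
         <video src="movie.rm" region="movie-region"/>
         <!-- caption text, synchronized with the auditory track and
              rendered only when the user has requested captions -->
         <textstream src="captions.rt" region="caption-region"
                     system-captions="on"/>
         <!-- prerecorded auditory description, timed to natural pauses -->
         <audio src="description.rm"/>
       </par>
     </body>
   </smil>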

======

>"The idea of "sensible time-coordination" of components centers of the 
>idea of simultaneity of presentation, but also encompasses strategies for 
>handling deviations from simultaneity resulting from a variety of causes.
>
>Consider how certain deviations in simultaneity might be handled in 
>auditory descriptions. Auditory descriptions are considered synchronized, 
>since each segment of description audio is presented at the same time as a 
>segment of the auditory track, e.g., a natural pause in the spoken 
>dialogue. Yet a deviation can arise when a segment of the auditory 
>description is lengthy enough that it cannot be entirely spoken within the 
>natural pause. In this case there must be a strategy for dealing with the 
>mismatch between the description and the pause in the auditory track. The 
>two major types of auditory descriptions lend themselves to different 
>strategies. Prerecorded auditory descriptions usually deal with such 
>mismatches by spreading the lengthy auditory description over more than 
>one natural pause. When expertly done, this strategy does not ordinarily 
>weaken the effectiveness of the overall presentation. On the other hand, a 
>synthesized-speech auditory description lends itself to other strategies. 
>Since synthesize… {12/17/99 Note by Eric Hansen: I think that this 
scrambled material was simply not finished in the original. It is a loose 
end…}
>
>Let us briefly consider how deviations might be handled for captions.
>
>Captions consist of a text equivalent of the auditory track that is 
>synchronized with the visual track. Captions are essential for individuals 
>who require an alternative way of accessing the meaning of audio, such as 
>individuals who are deaf. Typically, a segment of the caption text appears 
>visually near the video for several second while the person reads the 
>text. As the visual track continues, a new segment of the caption text is 
>presented.
>
>One problem arises if the caption text is longer than can fit in the 
>display space. This can be particularly difficult if, due to a visual 
>disability, the font size has been enlarged, thus reducing the amount of 
>caption text that can be presented. The user agent must respond sensibly 
>to such problems, such as by ensuring that the user has the opportunity to 
>navigate (e.g., scroll down or page down) through the caption segment 
>before proceeding with the visual presentation and presenting the next 
>segment. Some means must be provided to allow the user to signal that the 
>presentation may resume.
>
>=====
WC::
some of this seems appropriate for the Techniques document, other pieces 
are obviously intended for the User Agent Guidelines glossary. They could 
be reworked for discussion in WCAG Techniques, or could be linked to from 
WCAG Techniques.

EH2:: You are welcome to use as you see fit.

>PART 3 -- CHANGES TO WCAG DOCUMENT
>
>1. Merge checkpoint 1.3 into checkpoint 1.4 and then break 1.4 into several 
>checkpoints.

WC::

I am deleting much of your text and commenting on certain pieces of it.  In 
general, I feel that much of what is being incorporated into checkpoint 
text is more appropriate in the Techniques document.

I propose one new checkpoint and a reworking of 1.3 and 1.4 to cover the six 
that Eric proposed.  1.3 is discussed here; 1.4 and 1.x are discussed later.

<checkpoint-proposal>
1.3 Provide a synchronized auditory description for each multimedia 
presentation (e.g., movie or animation).  [Priority 1 for important 
information, Priority 2 otherwise.]
</checkpoint-proposal>

The techniques for satisfying this checkpoint will be discussed in the 
Techniques document:
1. synchronizing a prerecorded human-voice auditory track.
2. synchronizing a recorded synthesized-speech auditory track.
3. synchronizing a text file on the fly.
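
As a rough illustration of technique 1, segments of a prerecorded description 
might be scheduled into natural pauses using SMIL 1.0 begin offsets (the 
timings and file names below are hypothetical):

   <par>
     <video src="movie.rm" region="movie-region"/>
     <!-- each description segment begins during a natural pause in the
          dialogue; offsets are measured from the start of the par -->
     <audio src="desc-scene1.rm" begin="12s"/>
     <audio src="desc-scene2.rm" begin="47s"/>
   </par>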

I believe your proposed checkpoints 1.4.A and 1.4.B are techniques for 
checkpoint 1.3.

EH2::

As a reminder, here is the current checkpoint 1.3:

WCAG 1.0 (5 May 1999) checkpoint 1.3:
"1.3 Until user agents can automatically read aloud the text equivalent of 
a visual track, provide an auditory description of the important 
information of the visual track of a multimedia presentation. [Priority 1] 
Synchronize the auditory description with the audio track as per checkpoint 
1.4. Refer to checkpoint 1.1 for information about textual equivalents for 
visual information."

And here is my 4 December refinement of it:

New WCAG checkpoint 1.4.A (4 December 1999):
"1.4.A Until user agents can produce synthesized-speech auditory 
descriptions, provide an auditory description of _important information_ for 
each multimedia presentation (e.g., movie or animation). [Priority 1]"

And here is your suggestion:

<Wendy's-checkpoint-proposal>
1.3 Provide a synchronized auditory description for each multimedia 
presentation (e.g., movie or animation).  [Priority 1 for important 
information, Priority 2 otherwise.]
</Wendy's-checkpoint-proposal>

EH2:

A few comments about your proposal.
1. The split priority (Priority 2 for "otherwise") gives the checkpoint a 
higher overall priority than it currently enjoys. This may be warranted but 
should be taken to the working group.
2. The absence of the "until user agents" clause makes it a permanent 
checkpoint; otherwise it would eventually expire. This is warranted.
3. The term "synchronized auditory description" is redundant because 
synchronization is already part of the definition of auditory description. 
This needs to be fixed.
4. I have some concern about relegating my checkpoints 1.4.A and 1.4.B to the 
techniques. I might feel differently if I knew how the SMIL capabilities 
related to these (see the sketch below). It seems to me that WAI could do 
more to define specifications for these different kinds of auditory 
descriptions. I would like to hear additional opinions on this.
5. In conclusion, I still like my proposal for checkpoint 1.4.A.
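
For instance, if a future specification let authors supply both kinds of 
auditory description, a SMIL-style switch might offer a prerecorded track 
where available and fall back to description text for speech synthesis. This 
is only a sketch: the test attribute shown is hypothetical (it is not part 
of SMIL 1.0), and the file names are invented:

   <par>
     <video src="movie.rm" region="movie-region"/>
     <switch>
       <!-- preferred: prerecorded auditory description track
            (the test attribute below is hypothetical) -->
       <audio src="description.rm"
              system-prefers-recorded-description="on"/>
       <!-- fallback: timed description text, to be rendered by a
            synthesized-speech engine in the user agent -->
       <textstream src="description-text.rt" region="desc-region"/>
     </switch>
   </par>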

====
For background for the reader of this memo, here is my 4 Dec version of 1.4.B:

WCAG checkpoint 1.4.B (4 December 1999) (id: WC-SSAD):
"1.4.B For each multimedia presentation, provide data that will produce a 
synthesized-speech auditory description. [Priority 1]"
"Note: This checkpoint becomes effective one year after the release of a W3C 
specification for synthesized-speech auditory descriptions."

Here is the new 28 Dec 1999 version of checkpoint 1.4.B:

"1.4.B For each dynamic audio/visual presentation {or movie or animation}, 
provide data that will produce a synthesized auditory description. [Priority 
1]"
"Note: This checkpoint becomes effective one year after the release of a W3C 
specification for synthesized-speech auditory descriptions."

By the way, as far as I know Madeleine's suggestion regarding "synthesized 
auditory equivalent" is probably better than my term "synthesized-speech 
auditory equivalent". It is briefer.

I use the term "dynamic audio/visual presentation" instead of "multimedia 
presentation" since there has been some discussion of changing the term 
"multimedia" to include audio-only presentations. I am not sure what I think 
about that proposal. I think that we ought to be cautious. I realize that 
this information may be somewhat dated, but a 1990 book by Bergman and Moore 
(Managing Interactive Video/Multimedia Projects) says: 

"Even the words 'interactive video' and 'multimedia' can cause confusion. For 
several years, the videodisc was the only source of motion video segments 
that could be accessed rapidly to support effective interactivity. Hence the 
term applied to these applications came to be 'interactive videodisc,' or 
more commonly, 'IVD.' Recently, digital technology has made it possible to 
provide motion video using other devices, especially the small optical discs 
called CD-ROM. Another factor has been the development of image-based 
applications that use graphic pictures and digital audio, and no motion video 
at all. The term 'multimedia' has been adapted as a generic reference to all 
such image-based applications."

Thus, terming audio-only presentations a form of "multimedia" doesn't seem 
to fit this 1990 definition. I'd like to hear other opinions.
 
>====
>New WCAG checkpoint 1.4.C (4 December 1999):
>"1.4.C For each multimedia presentation (e.g., movie or animation), 
>provide captions and a collated text transcript. [Priority 1]"
>
>Rationale: These two pieces are essential (captions for individuals who 
>are deaf; collated text transcript for individuals who are deaf-blind). We 
>know that captions are needed and we have technologies that can handle it. 
>A collated text transcript is relatively straightforward to supply.

WC::
This is a rewording of 1.4. To make it jibe with my proposed rewording of 
1.3, I propose:
<checkpoint-proposal>
1.4 Provide captions and a collated text transcript for each multimedia 
presentation (e.g., movie or animation).  [Priority 1]
</checkpoint-proposal>

EH2:: This looks fine to me, unless you lump audio-only presentations in 
with multimedia presentations, in which case it might become:

1.4 Provide captions and a collated text transcript for each dynamic 
audio/visual presentation (e.g., movie or animation). [Priority 1]

>====
>New WCAG checkpoint 1.4.D (4 December 1999) (id: WC-ACLIP-TT):
>"1.4.D  For each audio clip, provide a text transcript. [Priority 1]"
>
>Rationale: A text transcript is _essential_ for disability access to audio 
>clips, whereas a text transcript is not essential for access to auditory 
>tracks of multimedia presentations (for example, the collated text 
>transcript and caption text include the information found in the text 
>transcript of the auditory track).
>====

WC::
This is covered in the current checkpoint 1.1:
<current-checkpoint>
1.1 Provide a text equivalent for every non-text element (e.g., via "alt", 
"longdesc", or in element content). This includes: images, graphical 
representations of text (including symbols), image map regions, animations 
(e.g., animated GIFs), applets and programmatic objects, ascii art, frames, 
scripts, images used as list bullets, spacers, graphical buttons, sounds 
(played with or without user interaction), stand-alone audio files, audio 
tracks of video, and video. [Priority 1]
</current-checkpoint>

EH2:: OK

>New WCAG checkpoint 1.4.E (4 December 1999) (id: WC-ACLIP-SYNC-TT):
>"1.4.E  Synchronize each audio clip with its text transcript. [Priority 
>1]" {I prefer the brevity of this version.}
>{or}
>"1.4.E  For each audio clip, provide data that will allow user agents to 
>synchronize the audio clip with the text transcript. [Priority 1]"
>"Note: This checkpoint becomes effective one year after the release of a 
>W3C recommendation addressing the synchronization of audio clips with 
>their text transcripts."
WC::
I agree with discussion on the list that "audio" should be included in 
"multimedia."  However, there was consensus that this ought to be a 
Priority 2.  Therefore, I propose:
<checkpoint-proposal>
1.x Provide captions for each stand-alone audio clip or stream, as 
appropriate. [Priority 2]
Note. For short audio clips, providing a text equivalent as discussed in 
checkpoint 1.1 is all that is needed.  This checkpoint is intended to cover 
audio clips of speech such as news broadcasts or a lyrical performance.
</checkpoint-proposal>
the "as appropriate" is supposed to signify that it is not necessary to 
caption all audio clips.  for example, we discussed back in May that we do 
not need to caption an instrumental performance, however it is appropriate 
to caption a musical performance with singing.

EH2:: I am willing to consider a change that would make "audio-only 
presentations" part of "multimedia presentations". The earlier decision 
(spring 1999) was to keep them separate. Regardless of whether the 
definitions are combined, there will be different checkpoints for the two, 
since the priorities are different. I would like to get other views as to 
whether audio-only presentations are really considered "multimedia." I 
suppose, then, that multimedia presentations would include movies, 
animations, and audio-only presentations, but not short sounds. See the 
earlier discussion in this memo regarding this issue.

I would suggest the following wording:

Eric's 28 December suggestion:

"1.x Provide captions for each word-using audio presentation [Priority 2]"
 
Rationale: I think that the term "audio presentation" is better than "audio 
clip": it is easier for us to make our own definition of "audio 
presentation" than of "audio clip", and the definition can make clear the 
intended scope of the checkpoint.

I have added the new term "word-using" to exclude instrumental performances 
from the requirement.
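
For illustration, a stand-alone word-using audio presentation might be 
captioned in SMIL 1.0 roughly as follows (the file names are hypothetical):

   <par>
     <!-- audio-only presentation, e.g., a radio-style news broadcast -->
     <audio src="newscast.rm"/>
     <!-- synchronized caption text for the spoken words, rendered only
          when the user has requested captions -->
     <textstream src="newscast-captions.rt" region="caption-region"
                 system-captions="on"/>
   </par>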

This material relates to a thread on "captions for audio". 

Do you have a URL for notes on the decision not to caption musical 
performance? Did the decision also address text equivalents of musical 
performance?

Here is a possible definition of audio presentation.

"Audio Presentation"
"Examples of audio presentations include a musical performance, a radio-style 
news broadcast, or a book reading. The term "audio presentation" is 
contrasted with "short sounds". A "word-using audio presentation" is one that 
uses words, such as a musical performance with lyrics, in contrast to a 
musical performance that uses only musical instruments."

Assuming that musical scores are not required for instrumental music, WCAG 
checkpoint 1.1 should contain a note such as the following.

"Note. The requirement of a text equivalent for a musical performance does 
not include a requirement for musical scores." 

=====
EH1::

>New WCAG checkpoint 1.4.F

>"For each multimedia presentation for which a synthesized-speech auditory 
>description of _important_ information is likely to be inaccessible, 
>provide a prerecorded auditory description of _important_ information."
>"[Priority 3]"
>{or}
>"For each multimedia presentation, provide a prerecorded auditory 
>description."
>"[Priority 3]"
>{or}
>"For each multimedia presentation, provide a prerecorded auditory 
>description for _important_ information."
>"[Priority 3]"
WC::  If synthesizing auditory descriptions is a technique for 1.3, then 
this proposed checkpoint is not needed.

EH2:: I have heard the opinion expressed that prerecorded auditory 
descriptions are preferred, and felt to be helpful, in some settings, so I 
believe this checkpoint should stay. I would like to hear additional 
opinions; this deserves discussion on the list.
==========================

SECTION 2 - Eric Hansen's comments on Madeleine Rothberg's 21 Dec 1999 
comments on Eric Hansen's comments. (re: Issue #138)

EH1: Eric Hansen's earlier comments
EH2: Eric Hansen's 28 December 1999 comments
MR: Madeleine Rothberg's 21 December 1999 comments


MR::

Here are my comments on issue 138
http://cmos-eng.rehab.uiuc.edu/ua-issues/issues-table.html#138

I do not have any strong opinions on the use of the terms
"synchronized alternative equivalents",  "synchronized 
equivalents", "synchronized alternatives", "continuous 
equivalents."
Some alternatives are synchronized and some are not, but if we 
make clear which are which perhaps we can use the same terms for 
both. I don't see the difference between "alternative" and 
"equivalent," so I am happy to let the editorial types make the 
decision on this part of the issue.

I do have comments on other parts of Eric's proposal. I agree with 
Wendy's comments to the GL list archived at:
http://lists.w3.org/Archives/Public/w3c-wai-gl/1999OctDec/0218.html
Many of Eric's proposals for the WCAG involved splitting a single 
checkpoint into several checkpoints. Wendy commented that she felt 
the material could be incorporated into techniques instead. I 
think we can take a similar approach for the UAAG, and that much 
of Eric's analysis would make excellent techniques information. 
Specifically:

EH1::
I think that there are a huge number of ways in which text, video, 
audio and their equivalents _could be combined_ to make multimedia 
presentations and audio clips accessible to people with 
disabilities, but only a much smaller number of ways are really 
essential or really valuable, and it is up to WAI to more 
specifically identify and describe that smaller number of 
combinations.

MR::
I agree that certain combinations of multimedia tracks are more 
likely than others to be useful, but I think that existing UA 
checkpoints say that all tracks must be able to be turned on and 
off. This gives the user complete control over which tracks are 
rendered, making it unnecessary for the UA to understand the 
combinations.  This would include 2.1 "Ensure that the user has 
access to all content, including alternative equivalents for 
content. [Priority 1] " and also 2.5 "If more than one alternative 
equivalent is available for content, allow the user to choose from 
among the alternatives. This includes the choice of viewing no 
alternatives. [Priority 1]" as well as checkpoints in Guideline 3 
that specify that users be able to turn on and off any track since 
it might cause distraction or otherwise interfere with use. I 
think Eric's excellent description of the uses of different 
combinations of tracks would be helpful techniques material so 
that UA developers see the reason to implement the checkpoints 
listed here.
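
For example, SMIL 1.0's system-captions test attribute already gives user 
agents one such hook: the same par can carry an optional track that is 
rendered only when the user turns captions on. A minimal sketch, with 
hypothetical file names:

   <par>
     <video src="movie.rm" region="movie-region"/>
     <!-- rendered only when the user agent's caption preference is on;
          UA checkpoints 2.1 and 2.5 would also let the user toggle it -->
     <textstream src="captions.rt" region="caption-region"
                 system-captions="on"/>
   </par>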

Eric's analysis includes distinguishing between text transcripts, 
text of audio description, and collated text transcripts (which 
are a combination of the other two).  The use of a collated text 
transcript is a neat idea, but it is not yet a part of any 
specification, so I don't think we can shape our guidelines around 
it. Similarly, both the WCAG and the UAAG would like to support 
the idea of synthesized speech for rendering of audio descriptions 
from a text file, but we do not have a technology that can do 
that. Another possible synchronized equivalent that does not have 
an implementation yet is a sign language track. Though I've argued 
in the past that sign language is an important equivalent (and I 
still feel that it is), I acknowledge that unless SMIL or some 
other specification has a way for authors to indicate that a given 
track is intended as an equivalent track, we can't require UAs to 
allow that track to be turned on and off in the same way that we 
can require for captions and audio descriptions (defined as of the 
latest public draft of SMIL-Boston).

Overall, what I'm trying to say is that we need to craft some 
forward-looking language, probably in the techniques, to promote new 
ideas. This would include synthesized audio descriptions, 
combining captions and audio descriptions into a collated text 
transcript (which can then replace both tracks) and a way to 
indicate that a video track is intended as an alternative 
equivalent, for sign language use. But until then, I think we are 
best off with the current umbrella checkpoints referring to 
various kinds of media, with techniques showing the currently 
recognized ways of implementing them as well as future ideas for 
improved features.

EH2::

I think that the Techniques document is a good place to discuss the value of 
providing a movie or animation (such as of a sign language translation) that 
can be synchronized with any other media type (text, audio presentation, 
visual track, auditory track, or movie or animation).

MR::

I think this approach matches the spirit of our changes in the 
December 7 telecon, where we resolved to merge the checkpoints in 
GL 4 for audio, video, and animation into a single set of 
checkpoints. Whenever possible, I think we are better off with 
fewer checkpoints as long as they are clear. The use of examples 
and Notes helps with that clarity. I don't think we need a series 
of checkpoints on each different aspect of alternative tracks.

<END OF MEMO>

Received on Tuesday, 28 December 1999 01:30:13 UTC