captions and descriptions from Charles McCathieNevile on 2002-12-01 (w3c-wai-gl@w3.org from October to December 2002)

From: Charles McCathieNevile <charles@w3.org>
Date: Sun, 1 Dec 2002 05:24:53 -0500 (EST)
To: Joe Clark <joeclark@joeclark.org>
cc: WAI-GL <w3c-wai-gl@w3.org>
Message-ID: <Pine.LNX.4.30.0212010423021.664-100000@tux.w3.org>
On Mon, 25 Nov 2002, Joe Clark wrote:

>
>The teleconference minutes
><http://www.w3.org/WAI/GL/2002/11/14-minutes.html> say:
>
>>pb i don't need captions to understand video content, but i have
>>watched foreign movies and read the captions.
>
>Subtitles, you mean. Subtitles are not captions. Subtitles are not an
>accessibility feature *for people with disabilities*.

Yes, we need to remember the differences here, as well as the similarities.
In timed-text work that may be undertaken by W3C it might make sense to point
out the need for formats that can handle differentiation between captions and
subtitles, and a nice feature might be the possibility of writing one combined
set of captions/subtitles that can be presented as either. (This isn't going
to be easy for a naive user to do, but then neither subtitling nor captioning
is easy for naive users.)
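
To make the combined-set idea concrete, here is a minimal sketch in Python. It
is not based on any existing timed-text format: the Cue structure, the "kind"
field and the present() function are all hypothetical names, and the sketch
assumes the only difference between the two views is that subtitles drop
non-speech cues and may use a different language.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Cue:
    """One timed entry in a hypothetical combined caption/subtitle set."""
    start: float            # seconds from the start of the presentation
    end: float
    kind: str               # "dialogue" or "sound" (non-speech information)
    text: Dict[str, str]    # language code -> text in that language


def present(cues: List[Cue], mode: str, lang: str) -> List[Tuple[float, float, str]]:
    """Return (start, end, text) entries for the requested view.

    "captions" keeps dialogue and sound cues; "subtitles" keeps dialogue only.
    Cues with no text in the requested language are skipped.
    """
    if mode not in ("captions", "subtitles"):
        raise ValueError("mode must be 'captions' or 'subtitles'")
    selected = []
    for cue in sorted(cues, key=lambda c: c.start):
        if mode == "subtitles" and cue.kind != "dialogue":
            continue
        if lang not in cue.text:
            continue
        selected.append((cue.start, cue.end, cue.text[lang]))
    return selected
```

With one French-language source, present(cues, "captions", "fr") would give a
deaf French speaker everything, while present(cues, "subtitles", "en") would
give a hearing English speaker the translated dialogue only.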

>The teleconference minutes list the participants. Could persons on
>that list who watch captions every day please identify themselves?
>
>How about those who have watched two hours of captions in the past
>week? And have done the same for every week in the past two months?

Oh. From time to time I qualify in that list (although I wasn't on that
teleconference).

>>gv usually, if you click on them you can pause them.
>>
>>gv in a videotape you can pause.
>
>You can't pause the captions independent of the video. Same with DVD.
>Isn't that the airy-fairy goal the Initiative is grasping at?
>
>>pb before getting into live streams, since that is a bit diff
>>concept, is there evidence that it is better to have simultaneous
>>caption and demo, or demo then caption.
>
>You are essentially asking for a new medium to be developed, one that
>brings the 19th-century usage of intertitles into the 21st century.
>The goal here is apparently to force filmmakers to create segmented
>animated slideshows that leapfrog caption tracks or that can be run
>in tandem with captions only if you opt into such a display.

Actually you are asking for people to use the systems developed in the late
20th century such as SMIL and SVG animation, which envisage precisely this
kind of use case.

The examples produced by NCAM allow you to pause the whole presentation, but
could be readily adapted to allow pausing of the video through an interaction
control written in SMIL, and therefore playable on a current-generation
player.

>WAI is pretty much saying that, irrespective of over 20 years of
>day-to-day usage of captions by hundreds of thousands of people,
>present-day captioning does not work. Captioning viewers, unbeknownst
>to themselves, are too stupid to be able to keep up with a picture
>and words presented simultaneously.
>
>Isn't this the typical reaction of caption-naive hearing people when
>presented with captioning for the first time? Especially if they're
>over 40? "Oh, my God! You can't expect me to keep up with all that!"

>Can an advocate of the proposed checkpoint tell me how it would apply
>to the following cinematic forms, all of which I have watched with
>captioning in the last month? I'm just going to assume that every
>single cooking show qualifies. I'm just going to assume that. Let's
>look at some other genres.
>
>* Dramatic feature film
>* Animated comedy
>* Dramatic TV series
>* Newscast
>* Talk show
>* Music video
>* Documentary
>* French-language film with English subtitles (yes, *and* captions)
>* Porn (not the kind I particularly like, but I've channel-surfed
>through it on the Movie Network)
>* Infomercial
>
>Could it possibly be true that the checkpoint addresses a
>hypothetical problem for a hypothetical user base that isn't even the
>primary audience for captioning and can be defended only through the
>use of a hypothetical example?
>
>Should I also mention that few existing online media players let you
>independently pause and run captions and video? In fact, I don't know
>of any-- at all. Perhaps Andrew W.K. knows of one. The point is
>nonetheless made: The function that this checkpoint demands does not
>presently exist-- for the simple reason that it is not needed. It is
>contrary to the way captions are actually used, as anyone who really
>uses captioning will confirm.
>
>
>>
>>Success criteria
>>
>>You will have successfully met Checkpoint 1.2 at the Minimum Level if:
>>
>>     1. an audio description is provided of all visual information in
>>        scenes, actions and events (that can't be perceived from the sound
>>        track).
>
>No. "That can't be *understood* from the soundtrack" (one word, not
>two). An unexplained bump may be audible, but its meaning may not be
>clear. Or a character may talk to person X rather than person Y,
>again unnoticeable through audio alone.

Agreed.
>
>>     2. all significant dialogue and sounds are captioned
>>     exception: if the Web content is real-time audio-only,

I don't understand why this is excluded. It may be one of the most common
cases where people look for an exemption from the requirement due to the
difficulty of meeting it, but I find it hard to understand why real-time
audio is different in what people need from it.

(I have a dream. I dream that one day little children will be able to listen
together to the great orators of their time, even though one relies
on speech recognition and conversion to signed language, plus their ability
to use signed versions of spoken languages as well as their own sign
languages...)

>Audio-only feeds are not a "time-dependent presentation" according to
>the definition:
>
>>>    A time-dependent presentation is a presentation which
>>>      * is composed of synchronized audio and visual tracks (e.g., a movie)
>
It would seem to me that time-dependent things are any that require
synchronisation.

>
>WAI did not quite understand that it was an inch away from requiring
>that every Web audio feed in the world be real-time-captioned. (No,
>radio stations in meatspace aren't captioned. They don't need to be;
>they don't have a visual form. Music on compact discs should not be
>captioned for the same reason; music videos *should* be captioned
>because they *do* have a visual component. Web-based audio feeds
>shouldn't have to be captioned, either. Oh, and has anyone realized
>yet that, just as Napster was an Internet application and not a Web
>application, Web radio is usually not a Web application either? This
>is the Web Accessibility Initiative; please limit your feature creep.)

Napster and Internet Radio are Web applications. Just not HTTP applications.
And while many of them won't get captioned in the near future, this just
keeps them inaccessible.

>>     4. if the Web content is real-time video with audio, real-time
>>        captions are provided unless the content:
>>           + is a music program that is primarily non-vocal
>
>Again, the WAI essentially condemns any online real-time videocaster
>to caption all its material. Is the WAI aware of just how difficult
>that is when using present-day software? It is *not* as easy as
>adding signals to Line 21. There isn't anything remotely resembling a
>standardized and reliable infrastructure set up for this task yet,
>all usages of Java applets, ccIRT, or other software notwithstanding.
>
>If the video feed also appears on TV, how do you propose to reroute
>the TV captions to online format? Or are you actually suggesting that
>each minute of video be separately captioned twice-- once for TV,
>once online?
>
>My previous point remains in place: A standalone video player does
>not necessarily have anything to do with the Web, really. It's an
>Internet application, not a Web application; the WAI has no scope or
>authority over it. Unless of course you'd like to retroactively
>redefine the mandate.
>

You seem to be misunderstanding what WAI does. Explaining how to make
something accessible (even in principle, where the technology is not readily
available) isn't claiming mandate, nor authority in the sense of some control
or power.

>>     5. if the Web content is real-time non-interactive video (e.g. a
>>        Webcam of ambient conditions), an accessible alternative is
>>        provided that achieves the purpose of the video.
>
>Really?
>
>How's that gonna work?
>
>If my Webcam exists to show the weather outside, how am I going to
>provide captions or descriptions that impart the same information?
>
>Or what if my Webcam is pointed at my aquarium, or my pet canaries,
>or the three adorable little kittens I got the other week? If the
>purpose of the Webcam is to let people *see* what's going on with the
>fish, birds, or cats, how do I automatically convert that to an
>accessible form? (Especially if there's a live microphone and the
>canaries sing or the kittens mewl?)
>
>Real-world example: Webcams in day-care centres so snoopy moms (and
>even dads) can watch what caregivers and children are doing. How does
>one automatically convert those ever-changing images to an accessible
>form?
><http://www.parentwatch.com/content/press/display.asp?p=p_0008>
>
>What if the Webcam's purpose is to tell people if I'm in the office
>or not? They look at the picture; if they see me, I'm in, and if they
>don't, I'm not. Are you saying I have to manually set a flag in some
>software that will send along some kind of text equivalent? I bought
>a Webcam to avoid having to do that. Webcams provide real-time
>information; interpretation is left to the viewer, not the author.
>There *is* no author in this scenario; it's an automated feed.
>
>I would say it would be fair to exempt nearly any Webcam that
>attempts to display ambient conditions. Perhaps that is a somewhat
>inadequate definition, but it beats what we've got now ("real-time
>non-interactive video"). An equivalent similar to alt="Webcam of my
>office" is clearly necessary if something like an <img> element is
>used, but beyond that, isn't the Web Accessibility Initiative merely
>engaging in yet more gassy hypothesizing? People with next to no
>lived experience of captioning or description, and not much more
>experience with Webcams, are writing down guidelines that, in one
>imaginable turn of events, tens of thousands of Web sites would have
>to follow, perhaps under government sanction?

If the requirement to explain what a webcam shows is understood by any
organisation to be justification for closing down the large proportion who
cannot afford the technology (visual object recognition is not as complex or
expensive as it was even 3 years ago, but it is still beyond most people with
a $100 webcam) then I would agree that calling their policy application
half-baked is generous. On the other hand it is clear that if your webcam is
to let people know if you are in the office or not, and the text equivalent
provided is just "Schrodinger's office with or without cat" then you haven't
made the function accessible yet.
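
As a minimal sketch of a text equivalent that does serve the function, in
Python: this assumes the presence state is already known from somewhere (a
manually set flag, or some detector), which is of course the hard part and is
not shown. The function name and its parameters are hypothetical.

```python
from datetime import datetime


def office_status_text(name: str, occupied: bool, checked_at: datetime) -> str:
    """Build a text equivalent that answers the question the webcam answers.

    `occupied` is assumed to come from elsewhere (a flag or a detector);
    this function only turns that state into a useful sentence.
    """
    state = "in the office" if occupied else "not in the office"
    return f"{name} is {state} (webcam checked at {checked_at:%H:%M})."
```

The point is only that the text states the answer ("in" or "not in") and how
fresh it is, rather than describing the camera.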

>I believe I've used the term "half-baked" already. Perhaps
>"half-arsed" is more in order.
>
>>     6. if a pure audio or pure video presentation requires a user to
>>        respond interactively at specific times in the presentation, then
>>        a time-synchronized equivalent (audio, visual or text)
>>        presentation is provided.
>
>Such presentations are not covered under the definition, nor,
>arguably, should they be.

I agree with you that the definition relied on may exclude these things, but
I disagree with your conclusions.

>Let's work through this scenario posited above, shall we?
>
>An audio presentation, which deaf people can't hear in the first
>place, tells us something like "Make your selection now." The
>checkpoint seems to require a balloon to pop up saying "Make your
>selection now." Selection about what? What are you talking about? I
>haven't heard anything!
>
>A video presentation, which blind people can't see in the first
>place, tells us something like "Pick a number from 1 to 10." The
>checkpoint seems to require a voice to somehow be made manifest
>saying "Pick a number from 1 to 10." Why pick a number? What are you
>talking about? I haven't seen anything!
>
>How is an all-audio device supposed to display something visually?
>How is an all-video device expected to display something auditorily?
>
>Has anyone spent even half a second thinking through these things?

A Deaf person is unlikely to use an all-audio device. Instead, in a scenario
such as using a VoiceXML interaction, the fact that it is a Web application
(has a URI) enables people to use an alternative browser suited to their
needs, one that provides a text-based representation. (With the use of
something like Annotea it is possible to explain the interaction using visual
symbols for control.)

>>     3. if Web content is an interactive audio-only presentation, the user
>>        is provided with the ability to view only the captions, the
>>        captions with the audio, or both together.
>
>I believe we have covered this in detail. Audio-only presentations
>are not included in the definition. Independent control of picture
>and captions (wait! now we want independent control of *sound*!) is
>presently impossible in practice and is unneeded.

It provides functionality that people are not used to having - more than what
is available with recent technology. It is certainly possible with
current-generation Web technology.

>>        You will have successfully met Checkpoint 1.2 at Level 3 if:
>>
>>     1. a text document (a "script") that includes all audio and visual
>>        information is provided.
>
>I believe the term is "transcript," and it does not "require"
>"quotation marks."
>
>Now, how does one include "all... visual information"? I thought this
>was a *text* document.

Yes, the wording should be something like "all information that is conveyed
by the combination of the audio and the video".

>You do realize that the only possible way to satisfy a surface
>reading of this requirement is to open-caption the video and print
>out every frame of it?

No, that is not true. Collecting the captions and a caption of the audio
description, and doing a merge to interleave them in timing is easy to
envisage with the current XML or XML-like formats used, and is on the surface
an obvious way to meet the requirement.
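
A minimal sketch of that merge in Python, assuming the captions and the text
of the audio description have already been extracted from their files as
(start time in seconds, text) pairs. The parsing step, the function names and
the output layout are assumptions for illustration, not part of any existing
tool.

```python
import heapq
from typing import Iterable, List, Tuple

Entry = Tuple[float, str]  # (start time in seconds, text)


def format_time(seconds: float) -> str:
    """Render a start time as mm:ss, dropping fractions of a second."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def merge_transcript(captions: Iterable[Entry],
                     descriptions: Iterable[Entry]) -> List[str]:
    """Interleave captions and description text into one timed transcript.

    Both inputs are sorted by start time and then merged. At equal start
    times the description comes first, since it usually sets the scene.
    """
    tagged_descriptions = [(t, 0, "DESCRIPTION", text) for t, text in sorted(descriptions)]
    tagged_captions = [(t, 1, "CAPTION", text) for t, text in sorted(captions)]
    lines = []
    for start, _, label, text in heapq.merge(tagged_descriptions, tagged_captions):
        lines.append(f"[{format_time(start)}] {label}: {text}")
    return lines
```

Joining the returned lines with newlines gives a single text document that
carries what the audio and the video convey together.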

>>     2. captions and Audio descriptions are provided for all live
>>        broadcasts which provide the same information.
>
>What does the word "which" have scope over in that sentence? Like so
>many other clauses in WCAG documents, it appears only to have been
>skimmed by the WAI and not actually read, let alone understood or
>reality-checked.

The only sensible reading is to assume that the "which" introduces an
adjectival clause governing the substantive phrase "captions and audio
descriptions". Perhaps wording it as "For all live broadcasts, captions and
audio descriptions which provide the same information are required" would
help. (I still don't understand why this is a special case from the
perspective of accessibility requirements. The obvious difficulty of doing it
with current technology is a policy question of what to require.)

>>    Note: Time-dependent presentations that require dual, simultaneous
>>    attention with a single sense can present significant barriers to some
>>    users.
>
>But they are *inevitable* in accommodating people who have only *one*
>sense to use.
>
>A nondisabled person can watch and listen simultaneously. A deaf
>person can only watch; a blind person can only listen. Whoa, big
>surprise-- to render speech in visible text adds something else to
>look at, and to render action in audible speech adds something else
>to listen to.
>
>Yes? And?
>
>I know that caption- and description-naive nondisabled people have a
>hard time keeping up at first, but really, do we want to codify such
>inadequacies in an official guideline?
>
>>Depending on the nature of the presentation, it may be
>>    possible to avoid scenarios where, for example, a deaf user would be
>>    required to watch an action on the screen and read the captions at the
>>    same time.
>
>Why? That is the nature of captioning.

Well, of captioning as conceived twenty years ago. People who can work with
captioned content and not need to stop one or the other part will be able to
work more efficiently than those who cannot. But those who cannot keep up now
live in a world where the technology can help them better than was
possible a few years ago.

Chaals
Received on Sunday, 1 December 2002 05:24:58 UTC