More absurdity with captions and descriptions

From: Joe Clark <joeclark@joeclark.org>
Date: Mon, 25 Nov 2002 17:44:14 -0500
Message-Id: <a05111a05b9fb31e296e9@[]>
To: WAI-GL <w3c-wai-gl@w3.org>

The teleconference minutes 
<http://www.w3.org/WAI/GL/2002/11/14-minutes.html> say:

>pb i don't need captions to understand video content, but i have 
>watched foreign movies and read the captions.

Subtitles, you mean. Subtitles are not captions. Subtitles are not an 
accessibility feature *for people with disabilities*.

>pb generally, i can read captions and not have loss of 
>understanding. even the cooking demo - is it necessary?
>pb i feel i could read captions and understand what was going on.

Well, here we go with the hypotheticals again.

The teleconference minutes list the participants. Could persons on 
that list who watch captions every day please identify themselves?

How about those who have watched two hours of captions in the past 
week? And have done the same for every week in the past two months?

Ah. I see.

As in so many other cases in the development of the WCAG, 
accessibility advocates errantly behave as if they are equally expert 
on all topics. Expertise can be cultivated-- I didn't know anything 
about Web accessibility five years ago, and now I know enough to have 
written a book about it. But the fact that you have an interest in
accessibility, or work at a university research centre, or have a day 
job working on one specific aspect of accessibility, or are 
self-taught in one or other of the various issues, does not in itself 
qualify you to write guidelines on *a separate topic you don't 
actually know anything about*. (You can advance opinions; they are 
merely underinformed opinions. But writing actual guidelines requires 
a higher threshold.)

The claim above-- "I feel I could read captions and understand what 
was going on"-- is hypothetical. One person, using a mental example 
rather than lived experience, speculates that an example would be OK 
under certain vaguely envisaged conditions. Captioning is all about
details, not vagueness. And from this and other ill-informed 
contributions, an entire Web Content Accessibility Guideline may be 
written with repercussions that will remain in effect for up to a 

Shall we continue?

>gv for cooking shows, etc not usually a problem. only during a 
>detailed training.

I watch a lot of cooking shows-- _Martha Stewart Living_, _New 
Classics_, _Cook Like a Chef_, _Iron Chef_, _Nigella Bites_, but none 
of that Jamie Oliver nonsense. Of those that are captioned, I have no 
trouble watching the captions and following the show. I doubt anyone 
else does, either. Wasn't the first open-captioned show on an 
American network actually _The French Chef_? Do we not, in fact, have 
enormous experience watching cooking shows and captions?

Is this not the worst possible example, and not merely a hypothetical 
one? Does this example not prove the absurdity of the requirement?

>gv usually, if you click on them you can pause them.
>gv in a videotape you can pause.

You can't pause the captions independently of the video. Same with DVD.
Isn't that the airy-fairy goal the Initiative is grasping at?

>pb before getting into live streams, since that is a bit diff 
>concept, is there evidence that it is better to have simultaneous 
>caption and demo, or demo then caption.

You are essentially asking for a new medium to be developed, one that 
brings the 19th-century usage of intertitles into the 21st century. 
The goal here is apparently to force filmmakers to create segmented 
animated slideshows that leapfrog caption tracks or that can be run 
in tandem with captions only if you opt into such a display.

Could somebody out there give me five present-day examples of such a 
cinematic form? Not hypothetical examples. I mean real products I 
could buy and watch, or URLs I could surf to and watch. Five examples 
of an actual film, video, TV show, or something else cinematic that 
either (a) actually shows pausable captions and video, and is 
expressly filmed and created for both of those, or (b) could be very 
well adapted to such a style?



>action pb do some follow-up research to answer question about 
>producing captions (check with ncam, gallaudet, etc.) re: checkpoint 

NCAM are not the only experts, and neither is Gallaudet. I would be 
careful about calling the usual suspects in such a case.

WAI is pretty much saying that, irrespective of over 20 years of 
day-to-day usage of captions by hundreds of thousands of people, 
present-day captioning does not work. Captioning viewers, unbeknownst 
to themselves, are too stupid to be able to keep up with a picture 
and words presented simultaneously.

Users of this accessibility technique have actually been inaccessible 
all along! Fools! When will the world *learn*?

Now, could this be projection?

Isn't this the typical reaction of caption-naive hearing people when 
presented with captioning for the first time? Especially if they're 
over 40? "Oh, my God! You can't expect me to keep up with all that!"

Or is there a kind of overcompensation at work? Since Web
accessibility *mostly* means accessibility for the blind and 
visually-impaired (the group that needs the biggest accommodation), 
are people so conditioned to the needs of people who cannot see well 
that they forget that other people can see just fine? that they have 
perfectly adequate capacities for visual processing? that they have 
many years of experience watching a moving picture and captions 
simultaneously, with no significant problems whatsoever?

I believe there is an undercurrent here of accommodating people with 
learning disabilities. As we saw before with the (now discredited) 
plan to force all Web authors everywhere to add illustrations 
("non-text content") to every single page, allegedly because a few 
hypothetical dyslexics might be slightly less confused, there seems 
to be a goal of destroying an existing medium (that would be cinema) 
and one of its existing accessibility provisions (captioning) because 
it is claimed that some learning-disabled people cannot keep up.

Where's the evidence of the problem?

Where is the evidence that the proposed solution will actually cure 
the disease and not kill the patient?

Who is the primary audience for captioning? Is it deaf and 
hard-of-hearing people or is it learning-disabled people?

Is the Web Accessibility Initiative really sure of itself here? Are 
you very sure indeed that you want to destroy existing captioning 
methods that *work* for the main disability group for which they are 
intended in the half-baked desire to accommodate some hypothesized 
subsection of an unrelated disability group?

Now maybe the claim will be made that the people we're trying to 
accommodate are both deaf and learning-disabled. ("Whoops! Did we 
forget to mention that?") Well, I'm gonna ask again: Where's the 
evidence of the problem? Where is the evidence that the proposed 
solution will actually fix the problem?

>pb to address my own concern, i will do some research. perhaps more 
>appropriate to change the example.
>gv the example is definitely a problem.
>pb along w/the changed example, supply an example where this would not apply.

Let's go all the way. Let's cook us up a whole raft of examples.

Can an advocate of the proposed checkpoint tell me how it would apply 
to the following cinematic forms, all of which I have watched with 
captioning in the last month? I'm just going to assume that every 
single cooking show qualifies. I'm just going to assume that. Let's 
look at some other genres.

* Dramatic feature film
* Animated comedy
* Dramatic TV series
* Newscast
* Talk show
* Music video
* Documentary
* French-language film with English subtitles (yes, *and* captions)
* Porn (not the kind I particularly like, but I've channel-surfed 
through it on the Movie Network)
* Infomercial

Could it possibly be true that the checkpoint addresses a 
hypothetical problem for a hypothetical user base that isn't even the 
primary audience for captioning and can be defended only through the 
use of a hypothetical example?

Should I also mention that few existing online media players let you 
independently pause and run captions and video? In fact, I don't know 
of any-- at all. Perhaps Andrew W.K. knows of one. The point is 
nonetheless made: The function that this checkpoint demands does not 
presently exist-- for the simple reason that it is not needed. It is 
contrary to the way captions are actually used, as anyone who really 
uses captioning will confirm.

>db last week i did some research about the cognitive load. i found 
>some that spoke to the amount of time someone's eyes might spend 
>reading the captions.
>db i think it pretty clearly showed that captions take the majority 
>of the time.

Yes? And?

You put captions on a video piece and people spend time reading them. 
Astounding, isn't it?

But to the WAI, that isn't an ineluctable fact of the captioning 
medium. It's a problem that needs fixing.

Really, requiring video to be playable with paused captions or 
vice versa is a lot like some producer calling up a captioning house
and saying "I need to have this program captioned, but can you make 
it so there aren't any words on the picture? Because I don't really 
like that."

I thought I'd take a look at the actual WCAG 2.0 recommendation 
<http://www.w3.org/TR/WCAG20/>. It's rather appalling, frankly.

>Checkpoint 1.2 Provide synchronized media equivalents for time-dependent
>Success criteria
>You will have successfully met Checkpoint 1.2 at the Minimum Level if:
>     1. an audio description is provided of all visual information in
>        scenes, actions and events (that can't be perceived from the sound
>        track).

No. "That can't be *understood* from the soundtrack" (one word, not 
two). An unexplained bump may be audible, but its meaning may not be 
clear. Or a character may talk to person X rather than person Y, 
again unnoticeable through audio alone.

>           + The audio description should include all significant visual
>             information in scenes, actions and events (that can't be
>             perceived from the sound track) to the extent possible given
>             the constraints posed by the existing audio track (and
>             constraints on freezing the audio/visual program to insert
>             additional auditory description).

I don't get the last part. If you are referring to the clumsily named
E-description concept <http://ncam.wgbh.org/edescription/>, where the 
viewer can pause the video to hear a very lengthy ("extended," or 
"E") description that provides much more information than an 
"interlude" description added during a pause in dialogue or other 
appropriate real-time moment, then the last bit doesn't make any 
sense. What's the "constraint"? Pausing the video to listen to the 
extended description *removes* constraints.

>     2. all significant dialogue and sounds are captioned
>     exception: if the Web content is real-time audio-only,

Audio-only feeds are not a "time-dependent presentation" according to 
the definition:

>>    A time-dependent presentation is a presentation which
>>      * is composed of synchronized audio and visual tracks (e.g., a movie)

Thank heavens for this.

WAI did not quite understand that it was an inch away from requiring 
that every Web audio feed in the world be real-time-captioned. (No, 
radio stations in meatspace aren't captioned. They don't need to be; 
they don't have a visual form. Music on compact discs should not be 
captioned for the same reason; music videos *should* be captioned 
because they *do* have a visual component. Web-based audio feeds 
shouldn't have to be captioned, either. Oh, and has anyone realized 
yet that, just as Napster was an Internet application and not a Web 
application, Web radio is usually not a Web application either? This 
is the Web Accessibility Initiative; please limit your feature creep.)

>     4. if the Web content is real-time video with audio, real-time
>        captions are provided unless the content:
>           + is a music program that is primarily non-vocal

Again, the WAI essentially condemns any online real-time videocaster 
to caption all its material. Is the WAI aware of just how difficult 
that is when using present-day software? It is *not* as easy as 
adding signals to Line 21. There isn't anything remotely resembling a 
standardized and reliable infrastructure set up for this task yet, 
all usages of Java applets, ccIRT, or other software notwithstanding.

If the video feed also appears on TV, how do you propose to reroute 
the TV captions to online format? Or are you actually suggesting that 
each minute of video be separately captioned twice-- once for TV, 
once online?

My previous point remains in place: A standalone video player does 
not necessarily have anything to do with the Web, really. It's an 
Internet application, not a Web application; the WAI has no scope or 
authority over it. Unless of course you'd like to retroactively 
redefine the mandate.

>     5. if the Web content is real-time non-interactive video (e.g. a
>        Webcam of ambient conditions), an accessible alternative is
>        provided that achieves the purpose of the video.


How's that gonna work?

If my Webcam exists to show the weather outside, how am I going to 
provide captions or descriptions that impart the same information?

Or what if my Webcam is pointed at my aquarium, or my pet canaries, 
or the three adorable little kittens I got the other week? If the 
purpose of the Webcam is to let people *see* what's going on with the 
fish, birds, or cats, how do I automatically convert that to an 
accessible form? (Especially if there's a live microphone and the 
canaries sing or the kittens mewl?)

Real-world example: Webcams in day-care centres so snoopy moms (and 
even dads) can watch what caregivers and children are doing. How does 
one automatically convert those ever-changing images to an accessible form?

What if the Webcam's purpose is to tell people if I'm in the office 
or not? They look at the picture; if they see me, I'm in, and if they 
don't, I'm not. Are you saying I have to manually set a flag in some 
software that will send along some kind of text equivalent? I bought 
a Webcam to avoid having to do that. Webcams provide real-time 
information; interpretation is left to the viewer, not the author. 
There *is* no author in this scenario; it's an automated feed.

I would say it would be fair to exempt nearly any Webcam that 
attempts to display ambient conditions. Perhaps that is a somewhat 
inadequate definition, but it beats what we've got now ("real-time 
non-interactive video"). An equivalent similar to alt="Webcam of my 
office" is clearly necessary if something like an <img> element is 
used, but beyond that, isn't the Web Accessibility Initiative merely
engaging in yet more gassy hypothesizing? People with next to no 
lived experience of captioning or description, and not much more 
experience with Webcams, are writing down guidelines that, in one 
imaginable turn of events, tens of thousands of Web sites would have 
to follow, perhaps under government sanction?

I believe I've used the term "half-baked" already. Perhaps 
"half-arsed" is more in order.

>     6. if a pure audio or pure video presentation requires a user to
>        respond interactively at specific times in the presentation, then
>        a time-synchronized equivalent (audio, visual or text)
>        presentation is provided.

Such presentations are not covered under the definition, nor, 
arguably, should they be.

Let's work through this scenario posited above, shall we?

An audio presentation, which deaf people can't hear in the first 
place, tells us something like "Make your selection now." The 
checkpoint seems to require a balloon to pop up saying "Make your 
selection now." Selection about what? What are you talking about? I 
haven't heard anything!

A video presentation, which blind people can't see in the first 
place, tells us something like "Pick a number from 1 to 10." The 
checkpoint seems to require a voice to somehow be made manifest 
saying "Pick a number from 1 to 10." Why pick a number? What are you 
talking about? I haven't seen anything!

How is an all-audio device supposed to display something visually? 
How is an all-video device expected to display something auditorily?

Has anyone spent even half a second thinking through these things?

Can proponents of this requirement-- indeed, of every single 
requirement everywhere in WCAG 2.0-- provide five real-world examples 
available right now to which the requirements would apply? And also 
explain how to make it work, using today's tools?

>You will have successfully met Checkpoint 1.2 at Level 2 if:
>     1. the site has a statement asserting that the audio description has
>        been reviewed and it is believed to include all significant visual
>        information in scenes, actions and events (that can't be perceived
>        from the sound track) is provided to the extent possible given the
>        constraints posed by the existing audio track (and constraints on
>        freezing the audio/visual program to insert additional auditory
>        description).

Check the grammar there: "Is provided."

Now, what does this mean? Can somebody name five examples of audio 
description that are not reviewed in some way?

Does this checkpoint, and the others like it elsewhere in WCAG 2.0, 
not in fact countenance some kind of accessibility classification 
board that would pre-screen descriptions, text equivalents, etc. 
before the site would be permitted to go online?

Does it not assume that every page on the Web is lovingly crafted and 
overseen by a conscientious, caring, sharing human being? Has no one 
ever heard of dynamically generated pages, like the tens of thousands
of mailing-list-archive pages at W3.org? Who's gonna manage all those?

>     2. captions and Audio descriptions are provided for all live
>        broadcasts that are professionally produced.

What does professional production have to do with anything? How are 
you defining that?

Why is "Audio descriptions" capitalized, and why has this error gone 
unfixed through three or four WCAG 2.0 drafts? (I was wondering if 
anyone would ever bother to read their own guidelines closely enough 
to notice. Apparently not. Bit of a recurring problem there.)

>     3. if Web content is an interactive audio-only presentation, the user
>        is provided with the ability to view only the captions, the
>        captions with the audio, or both together.

I believe we have covered this in detail. Audio-only presentations 
are not included in the definition. Independent control of picture 
and captions (wait! now we want independent control of *sound*!) is 
presently impossible in practice and is unneeded.

>        You will have successfully met Checkpoint 1.2 at Level 3 if:
>     1. a text document (a "script") that includes all audio and visual
>        information is provided.

I believe the term is "transcript," and it does not "require" 
"quotation marks."

Now, how does one include "all... visual information"? I thought this 
was a *text* document.

You do realize that the only possible way to satisfy a surface 
reading of this requirement is to open-caption the video and print 
out every frame of it?

>     2. captions and Audio descriptions are provided for all live
>        broadcasts which provide the same information.

What does the word "which" have scope over in that sentence? Like so 
many other clauses in WCAG documents, it appears only to have been 
skimmed by the WAI and not actually read, let alone understood or 

>The following are additional ideas for enhancing a site along this 
>particular dimension:

Please don't.

>Definitions (informative)
>      * captions are text equivalents of auditory information from speech,
>        sound effects, and ambient sounds that are synchronized with the
>        multimedia presentation.

Text equivalents?

You mean like alt texts, where the text equivalent must equate with 
the *function* of the graphic?

So the captions must have the same *function* as the audio?

Don't you mean captions are *written words* that *transcribe* speech 
and notate and render significant non-speech information?

>      * audio descriptions are equivalents of visual information from
>        actions, body language, graphics, and scene changes that are
>        voiced (either by a human or a speech synthesizer) and
>        synchronized with the multimedia presentation.

What is an "equivalent of visual information"?

Isn't a description a *description*, not an equivalent?

Why is "body language" specifically enumerated? In practice, body 
language is difficult to describe and circumstances rarely make it 
possible to give a description of it *first*. "Stands tensely at 
attention" is less important than "stands at attention."

>Benefits (informative)
>      * People who are deaf or have a hearing loss can access the auditory
>        information through the captions.
>      * People who are blind or have low vision as well as those with
>        cognitive disabilities who have difficulty interpreting visually
>        what is happening benefit from the audio descriptions of the
>        visual information.

More horrid, strangulated grammar. WAI really needs to get a handle 
on that; I've sat in meetings where people debated the surface 
meaning of WCAG 1.0 provisions because they were so badly written.

There is *extremely* limited evidence that people with learning 
disabilities genuinely benefit from descriptions.

>    Note: Time-dependent presentations that require dual, simultaneous
>    attention with a single sense can present significant barriers to some
>    users.

But they are *inevitable* in accommodating people who have only *one* 
sense to use.

A nondisabled person can watch and listen simultaneously. A deaf 
person can only watch; a blind person can only listen. Whoa, big 
surprise-- to render speech in visible text adds something else to 
look at, and to render action in audible speech adds something else 
to listen to.

Yes? And?

I know that caption- and description-naive nondisabled people have a 
hard time keeping up at first, but really, do we want to codify such 
inadequacies in an official guideline?

>Depending on the nature of the of presentation, it may be
>    possible to avoid scenarios where, for example, a deaf user would be
>    required to watch an action on the screen and read the captions at the
>    same time.

Why? That is the nature of captioning.

Why is the Web Accessibility Initiative trying to define twenty years 
of captioning practice out of existence?

     Joe Clark | joeclark@joeclark.org
     Accessibility <http://joeclark.org/access/>
     Weblogs and articles <http://joeclark.org/weblogs/>
     <http://joeclark.org/writing/> | <http://fawny.org/>
Received on Monday, 25 November 2002 17:44:27 UTC
