
Re: Acessibility of <audio> and <video>

From: Lachlan Hunt <lachlan.hunt@lachy.id.au>
Date: Thu, 04 Sep 2008 13:22:37 +0200
Message-ID: <48BFC4FD.7020406@lachy.id.au>
To: Dave Singer <singer@apple.com>
Cc: public-html@w3.org, W3C WAI-XTECH <wai-xtech@w3.org>, www-style@w3.org

Dave Singer wrote:
> 2.2 Associated with the media
> 2.2.1 Introduction
> There are also needs to associate data with the media, rather than embed it 
> within the media. The Web Content Accessibility Guidelines, for 
> example, request that it be possible to associate a text transcript with timed 
> media. Sometimes even, for very short media elements, alternative text may be 
> enough (e.g. "a dog barks").
> Finally, we need to consider what should happen if source selection fails: none 
> of the media file sources are considered suitable for this user-agent and user. 
> What is the fallback in this case?

It should pick the closest match available, even if not all conditions 
were met.
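
One way to read "closest match" is to count how many of the user's needs each source satisfies and take the highest count. The following is only a minimal sketch under that assumption; the function name, the file names and the idea of modelling each source as a set of needs are all hypothetical and not part of any proposal in this thread.

```python
def closest_source(sources, needs):
    """Return the source that satisfies the most user needs.

    `sources` maps a (hypothetical) source URL to the set of needs it
    meets. Ties go to the source listed first.
    """
    if not sources:
        return None
    # max() returns the first maximal item, so document order breaks ties.
    return max(sources, key=lambda url: len(sources[url] & needs))


sources = {
    "plain.ogv": set(),
    "captioned.ogv": {"captions"},
}
# Neither source offers audio description, but one at least has captions.
best = closest_source(sources, {"captions", "audio-description"})
```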

> The first two following are taken from the current state of IMG tagging in HTML5 
>       2.2.2 alt
> It's probably much more rarely useful than on images, but as noted above, there 
> may be some small media files which are semantically significant which can be 
> described with a short text string (e.g. "a dog barks"), which could be placed 
> in an alt attribute.

OK, for that use case, it seems reasonable to be able to provide a short 
description in some way.  I'm not necessarily agreeing that it should be 
the alt attribute, that's just one possible solution to consider.  I 
think we need to find and document examples of the kind of videos for 
which such a short alternative text would be appropriate.

However, it needs to be clear that it is to be an alternative for the 
video, not, as Leif tried to suggest earlier in this thread, an 
alternative for just the poster frame.

> 2.2.3 longdesc
> The longdesc attribute, when used, takes a URI as value, and links to a 'long 
> description'. It is probably the attribute to use to link to such things as a 
> transcript (though a transcript is more of a fulltext alternative than a 
> description).

The longdesc attribute is not included for the img element.  It has been 
clearly demonstrated in past discussions that it is a complete failure 
in practice and pursuing it as a solution for video is, IMO, a waste of 
time.  Plus, I have already explained why any sort of long description, 
whether it be a transcript, full text alternative, or whatever else, is 
useful to more people than just those with accessibility needs.  Any 
links to a long description should be done using ordinary, visible links 
from within the surrounding content.

> 2.2.4 fallback content (video not supported vs. no source is suitable)
> As noted above, the proposal that we add to the criteria to select a source 
> element further highlights the open question about today's specification: the 
> fallback content within media elements is designed for browsers not implementing 
> audio/video. It is probably inappropriate to overload that use with the case 
> when the browser does implement media elements, but no source is appropriate.

I think the right approach here is for the browser to allow the user to 
either save or launch the video in an external media player.

> 3. In-media Selecting/Configuring
> 3.1 Introduction
> We propose considering the accessibility needs as a set of independent 'axes', 
> for which the user can express a clear need, and for which a media element can 
> express a clear ability to support, inability to support, or lack of awareness.
> The user preferences are two-state: 'I need accessibility X', 'I have no 
> specific need for accessibility X'. For an unstated preference 'no specific 
> need' is assumed.
> The tagging is however tri-state — in some sense yes/no/dont-know. The media 
> needs to be able to be tagged: 'I can or do meet a need for accessibility X'; 'I 
> cannot meet a need for accessibility X'; 'I do not know about accessibility X'. 
> For an unstated tag, 'I do not know' is assumed.
> Clearly we can now define when a media source matches user needs. A source 
> *fails* to match if and only if either of the following are true; otherwise, the 
> source matches:
>    1. The user indicates a need for an axis, and the source is tagged as
>       explicitly /not/ meeting that need;
>    2. The user does /not/ indicate a need, and the file is tagged as being
>       explicitly targeted to that need. 

I disagree with #2 being considered a failure.  A video may contain 
features intended for accessibility, such as captions, but if they are 
closed captions, then they don't need to be turned on.  If they are open 
captions, then it's not too much of a problem.  However, at least for me, a 
video with open captions should be given a lower priority than one 
without.  Obviously, other people will have different priorities.
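
To make the difference concrete, here is a rough sketch of the two failure rules quoted above, next to the alternative of lowering priority instead of rejecting under rule 2. All names are hypothetical, and treating a "yes" tag as "explicitly targeted to that need" is an assumption on my part, since the proposal does not spell that out.

```python
from enum import Enum


class Tag(Enum):
    YES = "yes"
    NO = "no"
    DONT_KNOW = "dont-know"


def matches(user_needs, source_tags):
    """Apply the two quoted failure rules to one source.

    `user_needs` is the set of axes the user needs; `source_tags` maps an
    axis name to a Tag. Untagged axes default to DONT_KNOW.
    """
    for axis in user_needs | set(source_tags):
        tag = source_tags.get(axis, Tag.DONT_KNOW)
        needed = axis in user_needs
        if needed and tag is Tag.NO:       # rule 1
            return False
        if not needed and tag is Tag.YES:  # rule 2
            return False
    return True


def rank(user_needs, source_tags):
    """Alternative to rule 2: lower the priority rather than reject.

    Returns None if rule 1 fails, otherwise a score where higher is
    better (0 means no unneeded features are present).
    """
    if any(source_tags.get(a, Tag.DONT_KNOW) is Tag.NO for a in user_needs):
        return None
    unneeded = sum(
        1 for axis, tag in source_tags.items()
        if tag is Tag.YES and axis not in user_needs
    )
    return -unneeded
```

With rank(), a video with open captions is still offered to a user who didn't ask for captions, just after any source without them.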

> We believe that the source tagging should be done as Media Queries

I don't think we should be jumping to solutions just yet.  Media queries 
is one possibility.  Another is to provide a different attribute or 
several attributes to indicate each axis, and there may be others to 
consider as well.  In fact, I don't think media queries is appropriate 
for this at all, since it's designed for indicating features describing 
the target device, not user preferences.

> 3.2 Method of selection
> We suggest that we add a media query, usable on the audio and video elements, 
> which is parameterized by a list of axes and an indication of whether the media 
> does, or can, meet the need expressed by that axis. The name of the query is 
> TBD; here we use 'accessibility'. An example might be:
> |accessibility(captions:yes, audio-description:no, epilepsy-avoidance:dont-know)|

That doesn't seem to fit the syntax of media queries, where each feature 
is supposed to be given within parentheses, e.g.

<source ... media="screen and (min-height:240px) and (min-width:320px)">

Also, instead of providing boolean values for each property, we should 
be able to indicate other information about them.

Captions, if available, may be open or closed, and only available in 
particular languages.  Subtitles, if available, may be open or closed 
and be available in one or more languages.  It's even possible to have 
open subtitles in one language, yet have alternative closed subtitles 
shown over the top if turned on.  Audio descriptions may not be 
available in all of the languages that the video is available in.

For example, take a look at the features of the 101 Dalmatians DVD in 


It has English and Dutch audio languages, but only has Audio Description 
available in English (listed as "English - AD").  It also has English, 
Dutch and Hindi subtitles, but only English captions (listed under 
subtitles as "English - HI", where "HI" means Hearing Impaired).

As another example, English-language TV programmes are broadcast in Norway 
with open Norwegian subtitles.  But it is also possible to turn on 
closed subtitles (using teletext) for some other European languages 
which are then rendered over the top. (I'm not sure which languages they 
are).  Personally, I think the open subtitles are annoying, especially 
since most people here seem to speak English anyway, but it's what they do.

> Note that the second matching rule above means that sources can be ordered in 
> the usual intuitive way — from most specific to most general — but that it also 
> means a source might need to be repeated. For example, if the only available 
> source has open captions (burned in), it could be in a single <source> element 
> without mentioning captions, but it is better in two <source> elements, the 
> first of which explicitly says that captions are supported, and the second is 
> general and un-tagged. This indicates to the user needing captions that their 
> need is consciously being met.

I think we should avoid repetition of source elements pointing to the 
same media, and instead provide ways of accurately describing what each 
has available.

>     3.4 Axes
> We think that the set of axes should be based on a documented set, but that 
> adding a new axis should be easier than producing a new revision of the 
> specification. IANA registration may be a way to go.
> Some of the more obvious axes include:
>    1. Captions
>    2. Subtitles
>    3. Audio description of video
>    4. Sign language 
> Notes:
>    1. The USA and Canada differentiate between captions (a replacement for
>       hearing the audio) and subtitles (a  replacement for audio content that
>       is unintelligible, usually because it's  in a foreign language). Other
>       locales do not make this distinction;  nomenclature will need careful
>       choice if confusion is to be avoided.

This is true in Australia too.  According to Joe Clark, it's only the 
British that get the terminology wrong.


>    2. Subtitles (in the USA and Canada sense) are not strictly an accessibility
>       issue, but can probably be handled here.

Henri Sivonen wrote in a separate mail:
> I would caution against treating subtitles (in the US/Canada sense) as an 
> instance of the same selection mechanism engineering problem as captions (in 
> the US/Canada sense) just because they are the same engineering problem as far 
> as encoding timed text goes.
> Not hearing audio is (for practical modeling purposes) a single dimension: One 
> can hear, one can't hear well, one is deaf. I don't know if "can't hear well"
> maps simply to "captions on"

Sometimes, turning on same-language subtitles as opposed to captions is 
useful for people who can't hear well.  For example, my dad has trouble 
hearing the higher frequencies and has difficulty understanding some 
speech because of that.  (e.g. He can't hear the difference between a 
hard C (as in cat) and T sound very well)  So he'll often turn on the 
English subtitles on a DVD so he can read them, but he doesn't need the 
extra information that the English captions provide for people who can't 
hear at all.  I'll even do the same myself sometimes when I need to 
keep the volume down low.

You make a reasonable case against using them for automatic selection 
purposes.  However, consider the case where subtitles are provided in 
one language, but captions are not.  A hearing impaired person is better 
off knowing the subtitles are available and having them turned on than 
not knowing.  Therefore, it might be better to declare the availability 
of subtitles anyway.

> I would guess that content providers would opt for alternative files in 
> this case, because additional audio tracks show up on the bandwidth bill 
> if served even when not needed.
> ...
> Language skills are multidimensional: A person whose language skills 
> cover a non-English native language and English already has four 
> dimensions: skill level in both reading and listening in both languages. 
> This makes automatic selection mechanism hard to engineer.

Agreed.  But this argues against linking to multiple videos using 
<source>, each with a different audio language.  There are 2 options for 
dealing with this situation:

1. Include all alternative languages within the same video file, which
    increases file size and adds to the bandwidth bill.  This allows
    manual audio selection after the video has downloaded.
2. Use individual videos, but provide manual language selection
    prior to loading the video.  This could also be based on the choice
    the user made when they accessed the website, if the site itself is
    available in multiple languages too.

Dave Singer wrote:
>    3. Sign language has a number of variants, not easily identified; not only
>       does American sign language differ from British, but the dialects that
>       form around schools that use sign language also diverge significantly.
>       This problem of identifying what sign language is present or desired is
>       exacerbated by ISO 639-2, which has only one code for sign-language
>       ('sgn'). The user preference for which kind of sign language is needed may
>       need storing, as well as their need for sign language in general. We're
>       hoping that the user's general language preferences can be used, for a
>       first pass.

I've not seen many programmes use sign language.  The one show that I 
know of that did some of the time was a children's early morning cartoon 
show in Australia called Cheez TV, which sometimes had a sign language 
interpreter in the bottom right of the screen interpreting what the 
presenters were saying in the breaks between the cartoons.  Although, I 
believe they must have used closed captions other times because they 
didn't always have the interpreter.

We also need to consider whether or not sign language would be used for 
video on the web, and whether or not it's worth finding a solution to 
declare their availability.  Also, I'm not sure how they would be 
implemented from a technical POV.  Can they be implemented as a separate 
video stream using Picture-in-Picture to overlay the normal video 
stream, or would it need to be a complete alternative video stream? 
This might depend on the container format used.

We would need to find and document some real world cases of online video 
using sign language, so we can investigate how it has been done, if at 
all.  In fact, we really need to find evidence of all forms of 
accessibility features, so we can work out what is and isn't used on the 
web, and what we should prioritise and optimise for.

For example, whether we should optimise for serving a single video file 
with multiple streams, or individual video files, each with a specific 
set of streams.

The requirements for the chosen solution include the following:

1. Provide ways to indicate:
    * Language of open captions
    * Languages of available closed captions
    * Languages of available audio descriptions
    * Languages of available non-descriptive audio streams

    If it is also deemed appropriate to declare subtitles, then:
    * Language of open subtitles
    * Languages of available closed subtitles

    Any or all of those could also be either none or unknown.

2. An easy to use and understand syntax that is not too verbose.
3. Have reasonable default values.
4. Possibly be extensible to allow for other axes to be defined and
    expressed in the future.
5. Avoid unnecessary repetition
6. Support multiple tracks per video file, or multiple videos, each with
    a specific set of streams.

This could be done with attributes.  For example:

<video ... captions="open:en; closed:fr,de">

Or perhaps a single accessibility attribute:

<video access="(captions=open:en;closed:fr,de)
            and (subtitles=closed:nl)
            and (audiolang=en,fr,de)
            and (audiodesc=en)">

The syntax of both of those might be a little complex though, and I 
would prefer to simplify them if possible.  One issue is that while this 
does correctly distinguish between captions and subtitles, educating 
authors to use them correctly rather than interchangeably may be a 
problem, especially given that they incorrectly use the term subtitles 
for both in the UK.
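
To get a feel for how complex that syntax is to process, here is a minimal parsing sketch for the single-attribute form above. The attribute and its grammar are only the hypothetical proposal from this mail, not anything specified; the function name is made up, and the sketch does not validate the "and" separators.

```python
import re


def split_langs(text):
    """Split a comma-separated language list, ignoring stray whitespace."""
    return [lang.strip() for lang in text.split(",") if lang.strip()]


def parse_access(value):
    """Parse the hypothetical access="..." syntax into dicts and lists."""
    result = {}
    # Each feature is one parenthesised group, e.g. "(audiodesc=en)".
    for group in re.findall(r"\(([^()]*)\)", value):
        name, _, body = group.partition("=")
        if ":" in body:
            # "open:en;closed:fr,de" -> {"open": ["en"], "closed": ["fr", "de"]}
            modes = {}
            for part in body.split(";"):
                mode, _, langs = part.partition(":")
                modes[mode.strip()] = split_langs(langs)
            result[name.strip()] = modes
        else:
            # Plain language list, e.g. "en,fr,de".
            result[name.strip()] = split_langs(body)
    return result


parsed = parse_access(
    "(captions=open:en;closed:fr,de) and (subtitles=closed:nl) "
    "and (audiolang=en,fr,de) and (audiodesc=en)"
)
```

Even this small sketch needs three levels of delimiters (";", ":" and ","), which is part of why I'd like something simpler.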

Another problem to consider with automatic selection mechanisms is 
that, AIUI, common video container formats don't provide a way to 
programmatically distinguish between subtitle tracks and caption tracks, 
since both are just text tracks.  I think they just provide the ability 
to declare the language of the track, and some also provide the ability 
to include human readable descriptions.  Text tracks can also be used 
for other information besides subtitles and captions.  For example, I've 
seen DVDs provide commentary using a text track without an accompanying 
audio track.

Note that I didn't use the lang or xml:lang attributes to express the 
language of the audio streams because it's limited to declaring a single 
language.  However, in the absence of an explicit audio language 
declaration, assuming it's the same as the element's language is a 
reasonable default.
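
That default could be sketched roughly as follows. The function and parameter names are hypothetical; `declared` stands for whatever explicit audio language list the author provided, and `element_lang` for the element's own language.

```python
def audio_languages(declared, element_lang):
    """Return the declared audio languages, else the element's language."""
    if declared:
        return list(declared)
    # No explicit declaration: assume the audio matches the element language.
    return [element_lang] if element_lang else []
```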

Lachlan Hunt - Opera Software
Received on Thursday, 4 September 2008 11:23:27 UTC
