Accessibility for the Media Elements in HTML5


DW Singer et al., Apple, September 2008

1. Introduction

This document is intended to serve as an introduction to a discussion of the requirements and needs for accessibility support for media elements in HTML5, along with some proposed solutions. It is deliberately written not as a specification, which would give the illusion that it was final and would also be rather long, but as an introductory document. There are some loose ends, and the entire proposal is up for discussion. Also, this is about the framework of how accessibility needs are met; it does not cover technical details at the media level (such as the choice of a caption format).

However, there is some urgency: the HTML5 video and audio elements are being implemented. If there are necessary changes to the specification framework that are not merely additions, we need to find them soon. It is not acceptable to find ourselves having to settle for less-than-best simply because of a failure to think in time.

2. Needs

2.1 Within the media

2.1.1 Selecting

Sometimes one must choose between two variant forms of a media file in order to satisfy a need. Two obvious examples:
  1. if captions are burned in (so-called 'open captions') to one version of a video, and not to another, then the user-agent needs to select the appropriate one for the user's needs.
  2. if audio description of video is provided, the video is often visually frozen at intervals in order to allow time for the description; the overall presentation is not paused, of course (the audio continues to play), so the result is a longer (alternative) presentation.

Hence we need the ability to select (automatically) the appropriate media file based on user preferences. But note that we do not want this selection completely to deny the experience of the media either:

  1. to users who express a need that the site they are looking at was unaware of, or did not cater to; or
  2. to users who did not express a need, accessing a site carefully configured to be explicit about the sources that do or do not meet that need.

Note that web page authors are always at liberty to author pages which delineate explicitly the choices available ("click here for the captioned version or here for the non-captioned one").

Note that the HTML5 specification already has the ability to choose the first suitable source from a list of source elements.

2.1.2 Configuring

Sometimes, similarly, the media format itself can carry optional features. An example might be the 3GPP file format (or any file format from that family, such as MP4) with a text track in 3GPP Timed Text format. Enabling this track (and thereby causing it to be presented) may be a way to satisfy a need within a single media file.

(In some cases, the media format may also need to disable a track; for example, a track providing audio description of video may incorporate the standard audio within it, and the normal audio track would be disabled if the audio description were enabled.)

We therefore also need the ability to apply the same preferences used for selection to the configuration of the file. Note that not all media sub-systems will offer the user-agent such an API; that is acceptable: for media files associated with those systems, the files are not configurable and selection must be used instead.

2.2 Associated with the media

2.2.1 Introduction

There are also needs to associate data with the media, rather than embed it within the media. The Web Content Accessibility Guidelines, for example, request that it be possible to associate a text transcript with timed media. Sometimes, for very short media elements, alternative text may even be enough (e.g. "a dog barks").

Finally, we need to consider what should happen if source selection fails: none of the media file sources are considered suitable for this user-agent and user. What is the fallback in this case?

The first two items below (alt and longdesc) are taken from the current state of IMG tagging in HTML5.

2.2.2 alt

An alt attribute is probably useful much more rarely on media than on images but, as noted above, there may be some small, semantically significant media files that can be described with a short text string (e.g. "a dog barks"), which could be placed in an alt attribute.

2.2.3 longdesc

The longdesc attribute, when used, takes a URI as value, and links to a 'long description'. It is probably the attribute to use to link to such things as a transcript (though a transcript is more of a full-text alternative than a description).

2.2.4 fallback content (video not supported vs. no source is suitable)

As noted above, the proposal to add accessibility to the criteria for selecting a source element further highlights an open question about today's specification: the fallback content within media elements is designed for browsers that do not implement audio/video. It is probably inappropriate to overload that use with the case where the browser does implement media elements, but no source is appropriate.

This is an open question.

3. In-media Selecting/Configuring

3.1 Introduction

We propose considering the accessibility needs as a set of independent 'axes', for which the user can express a clear need, and for which a media element can express a clear ability to support, inability to support, or lack of awareness.

The user preferences are two-state: 'I need accessibility X', 'I have no specific need for accessibility X'. For an unstated preference, 'no specific need' is assumed.

The tagging, however, is tri-state (in some sense yes/no/dont-know). The media needs to be able to be tagged: 'I can or do meet a need for accessibility X'; 'I cannot meet a need for accessibility X'; 'I do not know about accessibility X'. For an unstated tag, 'I do not know' is assumed.

Clearly we can now define when a media source matches user needs. A source fails to match if and only if either of the following is true; otherwise, the source matches:

  1. The user indicates a need for an axis, and the source is tagged as explicitly not meeting that need;
  2. The user does not indicate a need for an axis, and the source is tagged as being explicitly targeted to that need.
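
The two failure rules can be sketched in a few lines of Python. This is only an illustration, not proposed specification text: the names Tag and source_matches are hypothetical, and the sketch assumes that a 'yes' tag is what rule 2 means by a source being explicitly targeted to a need.

```python
from enum import Enum


class Tag(Enum):
    """Tri-state source tag for one accessibility axis."""
    YES = "yes"              # the source can or does meet the need
    NO = "no"                # the source cannot meet the need
    DONT_KNOW = "dont-know"  # no statement is made (assumed when unstated)


def source_matches(user_needs, source_tags):
    """Return True unless one of the two failure rules applies.

    user_needs:  set of axis names for which the user has stated a need.
    source_tags: dict mapping axis name -> Tag; absent axes mean DONT_KNOW.
    """
    # Rule 1: the user states a need and the source explicitly cannot meet it.
    for axis in user_needs:
        if source_tags.get(axis, Tag.DONT_KNOW) is Tag.NO:
            return False
    # Rule 2: the source is explicitly targeted at a need the user did not state.
    for axis, tag in source_tags.items():
        if tag is Tag.YES and axis not in user_needs:
            return False
    return True
```

Under these assumptions an un-tagged source matches every user, which is what keeps selection from denying the media to users whose needs the site did not anticipate.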

We believe that the source tagging should be done as Media Queries.

There is work ongoing at Dublin Core and IMS on the ways to state user preferences for accessibility, which may be relevant.

3.2 Method of selection

We suggest that we add a media query, usable on the audio and video elements, which is parameterized by a list of axes and, for each axis, an indication of whether the media does, or can, meet the need expressed by that axis. The name of the query is TBD; here we use 'accessibility'. An example might be:

accessibility(captions:yes, audio-description:no, epilepsy-avoidance:dont-know)
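
To make the shape of such a query concrete, here is a rough Python sketch that parses the example string into a mapping from axis to state. The grammar it accepts (comma-separated axis:state pairs inside accessibility(...)) is inferred from the single example above and is an assumption, as is the function name parse_accessibility_query; the query name itself remains TBD.

```python
import re

# The three states used in the example; assumed to be the complete set.
VALID_STATES = {"yes", "no", "dont-know"}


def parse_accessibility_query(text):
    """Parse 'accessibility(axis:state, ...)' into a dict of axis -> state."""
    match = re.fullmatch(r"\s*accessibility\((.*)\)\s*", text)
    if match is None:
        raise ValueError("not an accessibility() query: %r" % text)
    tags = {}
    for item in match.group(1).split(","):
        item = item.strip()
        if not item:
            continue  # tolerate an empty list or a trailing comma
        axis, sep, state = item.partition(":")
        axis, state = axis.strip(), state.strip()
        if not sep or not axis or state not in VALID_STATES:
            raise ValueError("malformed axis entry: %r" % item)
        tags[axis] = state
    return tags
```

Axes that do not appear in the query are simply absent from the result, which corresponds to the 'I do not know' default for unstated tags.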

Note that the second matching rule above means that sources can be ordered in the usual intuitive way, from most specific to most general, but it also means that a source might need to be repeated. For example, if the only available source has open captions (burned in), it could be listed in a single <source> element that does not mention captions, but it is better listed in two <source> elements: the first explicitly says that captions are supported, and the second is general and un-tagged. This indicates to the user needing captions that their need is consciously being met.
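
The repeated-source case can be walked through with a small Python sketch of first-match selection over an ordered list. Everything named here is illustrative: the file name open-captioned.mp4, the helper names, and the representation of each <source> as a (src, tags) pair are assumptions, and the tags use the yes/no/dont-know strings from the example query.

```python
def matches(user_needs, tags):
    """Apply the two failure rules; tags maps axis -> 'yes'/'no'/'dont-know'."""
    if any(tags.get(axis) == "no" for axis in user_needs):
        return False  # stated need, and the source explicitly cannot meet it
    if any(state == "yes" and axis not in user_needs
           for axis, state in tags.items()):
        return False  # source targeted at a need the user did not state
    return True


def select_source(user_needs, sources):
    """Return (index, src) of the first matching source, or None if none match."""
    for index, (src, tags) in enumerate(sources):
        if matches(user_needs, tags):
            return index, src
    return None


# The only available video has burned-in captions, so the same file is
# listed twice, most specific entry first.
SOURCES = [
    ("open-captioned.mp4", {"captions": "yes"}),  # explicitly meets the need
    ("open-captioned.mp4", {}),                   # general, un-tagged
]

captioned_choice = select_source({"captions"}, SOURCES)  # picks entry 0
default_choice = select_source(set(), SOURCES)           # picks entry 1
```

Both users receive the same file, but the user who needs captions is served by the entry that explicitly claims to support them.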

3.3 Configuration

The same user preferences as were used to match the selection of a source element must also be provided, if possible, to the media sub-system, to configure the media file. Just as the way that (for example) play/pause requests are relayed to the media sub-system is out of scope of the HTML5 specification, so also is the way in which the media sub-system becomes aware of user preferences.
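
As a purely hypothetical illustration of what a media sub-system might do with those preferences, the Python sketch below enables optional tracks that serve a stated need and disables any track that an enabled track replaces (as in the audio description case). The Track fields, the track names, and configure_tracks are all invented for this sketch; no real media API is implied.

```python
from dataclasses import dataclass


@dataclass
class Track:
    name: str
    serves_axis: str = ""  # accessibility axis this optional track satisfies
    replaces: str = ""     # name of a track to disable when this one is enabled


def configure_tracks(user_needs, tracks):
    """Return the set of track names to enable for the given user needs."""
    enabled = set()
    for track in tracks:
        if not track.serves_axis:
            enabled.add(track.name)  # ordinary track: on by default
        elif track.serves_axis in user_needs:
            enabled.add(track.name)  # optional track: on only when needed
    # Disable any track superseded by an enabled accessibility track.
    for track in tracks:
        if track.name in enabled and track.replaces:
            enabled.discard(track.replaces)
    return enabled


# Hypothetical single file carrying both optional features.
TRACKS = [
    Track("video"),
    Track("audio"),
    Track("timed-text", serves_axis="captions"),
    Track("described-audio", serves_axis="audio-description", replaces="audio"),
]
```

For a user who needs audio description, this sketch enables the described-audio track and drops the normal audio track; for a user with no stated needs, only the ordinary video and audio tracks play.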

3.4 Axes

We think that the set of axes should be based on a documented set, but that adding a new axis should be easier than producing a new revision of the specification. IANA registration may be a way to go.

Some of the more obvious axes include:

  1. Captions
  2. Subtitles
  3. Audio description of video
  4. Sign language


Notes:

  1. The USA and Canada differentiate between captions (a replacement for hearing the audio) and subtitles (a replacement for audio content that is unintelligible, usually because it's in a foreign language). Other locales do not make this distinction; nomenclature will need careful choice if confusion is to be avoided.
  2. Subtitles (in the USA and Canada sense) are not strictly an accessibility issue, but can probably be handled here.
  3. Sign language has a number of variants, not easily identified; not only does American sign language differ from British, but the dialects that form around schools that use sign language also diverge significantly. This problem of identifying what sign language is present or desired is exacerbated by ISO 639-2, which has only one code for sign-language ('sgn'). The user preference for which kind of sign language is needed may need storing, as well as their need for sign language in general. We're hoping that the user's general language preferences can be used, for a first pass.

The following axes might be thought of as more abstruse, but are plausible:

  1. High-contrast video: video where the significant content stands out well;
  2. High-clarity audio: essentially the same for audio: filler and background noise are reduced;
  3. Epilepsy-avoidance video: some people are susceptible to video that flashes at certain frequencies, an effect that is sometimes used in, for example, music videos.

Note that since it's not possible to express a concrete anti-preference ('I absolutely must avoid captions'), all accessibility axes have to be expressed in terms of something positively needed ('I need video that avoids inducing epileptic fits') rather than something to be avoided ('I must not be presented with video that might induce epileptic fits').

There are probably other 'accessibility axes' that will appear over time as we gain experience.

4. Associated with the media

4.1 alt

The alt attribute should be permitted, but not required, on the video and audio elements. The usual 'special case' of alt="" should perhaps also be documented, to indicate content that is 'decorative' and for which it is not important if the user cannot see or hear it.

4.2 longdesc

Similarly, the longdesc attribute, on the video and audio elements, could perhaps be used to link to transcripts or other longer descriptions of the media. Perhaps one of alt or longdesc should be required; but we rather hope that in-media accessibility will provide the desired experience for those who need it, and making one of alt or longdesc required rather implies that in-media accessibility is expected to fail.

4.3 Source fallback

As noted above, the addition of 'meets accessibility needs' to the source element selection raises the possibility that no source will be acceptable, and currently the HTML5 specification provides no fallback for this case (possibly because it's not obvious where to put it). This is an open question: is it needed, and if so, where should it go?
