Accessibility for the Media
Elements in HTML5
DW Singer et al., Apple,
1. Introduction
This document is intended to serve as an introduction to a discussion of the requirements and needs for accessibility support for media elements in HTML5, along with some proposed solutions. It is deliberately written not as a specification, which would give the illusion that it was final, and would also be rather long, but as an introductory document.
There are some loose ends, and the entire proposal is up for discussion. Also, this is about the framework of how accessibility needs are met; it does not cover technical details at the media level (such as the choice of a caption format, for example).
However, there is some urgency: the HTML5 video and audio elements are being implemented. If there are necessary changes to the specification framework that are not merely additions, we need to find them soon. It is not acceptable to find ourselves having to settle for less-than-best simply because of a failure to think in time.
2.1 Within the media
Sometimes one must choose between two different variant forms of a media file in order to satisfy a need. Two examples:
- if captions are burned in (so-called 'open
captions') to one version of a video, and not to another, then the
user-agent needs to select the appropriate one for the user's needs.
- if audio description of video is provided, the video is often visually frozen at intervals in order to allow time for the audio; the overall presentation is not paused, of course (the audio continues to play); the result is a longer overall presentation.
Hence we need the ability to select
(automatically) the appropriate media file based on user preferences.
But note that we do not want this selection to completely deny the experience of the media either:
- to users who express a need that
the site they are looking at was unaware of, or did not cater to; or
- to users who did not express a
need, accessing a site carefully configured to be explicit about the
sources that do or do not meet that need.
Note that web page authors are always at liberty
to author pages which delineate explicitly the choices available
("click here for the captioned version or here for the uncaptioned version").
Note that the HTML5 specification already has the ability to choose the first suitable source from a list of source elements.
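For instance, the existing machinery lets an author list several candidate sources, from which the user agent picks the first one it can play (file names here are illustrative):

```html
<video controls>
  <!-- the user agent tries each source in order and plays the first it supports -->
  <source src="movie.mp4" type="video/mp4">
  <source src="movie.ogv" type="video/ogg">
  Your browser does not support the video element.
</video>
```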
Sometimes, similarly, the media format itself can
carry optional features. An example might be the 3GPP file format
(or any file format from that family, such as MP4) with a text track
in 3GPP Timed Text format. Enabling this
track (and thereby causing it to be presented) may be a way to
satisfy a need within a single media file.
(In some cases, the media format may also need to disable a track; for example, a track providing audio description of video may incorporate the standard audio within it, and the normal audio track would be disabled if the audio description were enabled.)
We therefore also need the ability to apply the same preferences used for selection to configuring the file. Note that not all media sub-systems will offer the user-agent such an API; that is acceptable. For media files associated with those sub-systems, the files are not configurable and selection must be used instead.
2.2 Associated with the media
2.2.1 Introduction
There are also needs to associate data with the media, rather than embed it within the media.
The Web Content Accessibility Guidelines,
for example, request that it be possible to associate a text
transcript with timed media. Sometimes, for very short media elements, alternative text may even be enough (e.g. "a dog barks").
Finally, we need to consider what should happen if source selection fails: none of the media file sources are considered suitable for this user-agent and user. What is the fallback in this case?
The first two of the following are taken from the current state of IMG tagging in HTML5.
2.2.2 alt
It's probably much more rarely useful than on images, but as noted above, there may be some small media files which are semantically significant and which can be described with a short text string (e.g. "a dog barks"); this could be placed in an alt attribute.
2.2.3 longdesc
The longdesc attribute, when used, takes a URI as value, and links to a 'long description'. It is probably the attribute to use to link to such things as a transcript (though a transcript is more of a full-text alternative than a description).
2.2.4 fallback content (video not supported vs. no source is suitable)
As noted above, the proposal that we add to the criteria used to select a source element further highlights an open question about today's specification: the fallback content within media elements is designed for browsers that do not implement audio/video. It is probably inappropriate to overload that use with the case when the browser does implement the media elements, but no source is suitable. This is an open question.
3. In-media Selecting/Configuring
We propose considering the accessibility needs as
a set of independent 'axes', for which the user can express a clear
need, and for which a media element can express a clear ability to
support, inability to support, or lack of awareness.
The user preferences are two-state: 'I need accessibility X' or 'I have no specific need for accessibility X'. For an unstated preference, 'no specific need' is assumed.
The tagging, however, is tri-state, in the sense of yes/no/don't-know. The media needs to be able to be tagged: 'I can or do meet a need for accessibility X'; 'I cannot meet a need for accessibility X'; or 'I do not know about accessibility X'. For an unstated tag, 'I do not know' is assumed.
Clearly we can now define when a media source matches user needs. A source fails to match if and only if either of the following is true; otherwise, the source matches:
- The user indicates a need for an axis, and the source is tagged as explicitly not meeting that need; or
- The user does not indicate a need, and the file is tagged as being explicitly targeted to that need.
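The matching rule above can be sketched as follows (the axis names are illustrative; the tri-state tagging is modelled by a dictionary in which an absent axis means 'I do not know'):

```python
def source_matches(user_needs, source_tags):
    """Decide whether a media source matches the user's accessibility needs.

    user_needs: set of axis names the user has declared a need for
        (an unstated preference means 'no specific need').
    source_tags: dict mapping axis name -> True ('I can or do meet the need')
        or False ('I cannot meet the need'); an absent axis means
        'I do not know about this axis'.
    """
    for axis, meets in source_tags.items():
        if axis in user_needs and meets is False:
            # The user needs this axis, and the source explicitly cannot meet it.
            return False
        if axis not in user_needs and meets is True:
            # The source is explicitly targeted at a need the user did not state.
            return False
    return True

# A captioned-only source fails for a user with no stated needs:
print(source_matches(set(), {"captioned": True}))          # False
# ...but matches a user who needs captions:
print(source_matches({"captioned"}, {"captioned": True}))  # True
# An un-tagged ('don't know') source matches everyone:
print(source_matches({"captioned"}, {}))                   # True
```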
We believe that the source tagging should be done
as Media Queries.
There is work ongoing at Dublin Core and IMS on ways to state user preferences for accessibility, which may be relevant here.
3.2 Method of selection
We suggest that we add a media query, usable on the audio and video elements, which is parameterized by a list of axes and an indication of whether the media does, or can, meet the need expressed by each axis. The name of the query is TBD; here we use 'accessibility'.
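An illustrative sketch of what such a query might look like (the query name 'accessibility', its syntax, the axis name, and the file names are all placeholders, since this document leaves them to be determined):

```html
<video controls>
  <!-- hypothetical syntax: first source declares it meets the captioning need -->
  <source src="movie-captioned.mp4" type="video/mp4"
          media="accessibility(captioned: yes)">
  <source src="movie.mp4" type="video/mp4">
</video>
```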
Note that the second matching rule above means that sources can be ordered in the usual intuitive way, from most specific to most general, but it also means a source might need to be repeated. For example, if the only available source has open captions (burned in), it could be in a single <source> element without mentioning captions, but it is better in two <source> elements, the first of which explicitly says that captions are supported, and the second of which is general and un-tagged. This indicates to the user needing captions that their need is consciously met.
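The repeated-source pattern just described might look like the following sketch (again, the 'accessibility' query name and its syntax are placeholders, and the file name is illustrative):

```html
<video controls>
  <!-- the same open-captioned file, listed twice: once explicitly tagged
       for users needing captions, once general and un-tagged -->
  <source src="movie-open-captions.mp4" type="video/mp4"
          media="accessibility(captioned: yes)">
  <source src="movie-open-captions.mp4" type="video/mp4">
</video>
```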
The same user preferences as were used to match the
selection of a source element must also be provided, if possible, to
the media sub-system, to configure the media file. Just as the way
that (for example) play/pause requests are relayed to the media
sub-system is out of scope of the HTML5 specification, so also is the
way in which the media sub-system becomes aware of user preferences.
We think that the set of axes should be based on a
documented set, but that adding a new axis should be easier than
producing a new revision of the specification. IANA registration may be a way to go.
Some of the more obvious axes include:
- Captions
- Audio description of video
- Sign language
- The USA and Canada differentiate between captions (a replacement for hearing the audio) and subtitles (a replacement for audio content that is unintelligible, usually because it's in a foreign language). Other locales do not make this distinction; nomenclature will need careful choice if confusion is to be avoided.
- Subtitles (in the USA and Canada sense) are not
strictly an accessibility issue, but can probably be handled here.
- Sign language has a number of variants, not easily
identified; not only does American sign language differ from British,
but the dialects that form around schools that use sign language also
diverge significantly. This problem of identifying what sign language
is present or desired is exacerbated by ISO 639-2, which has only one
code for sign-language ('sgn'). The user preference for which kind
of sign language is needed may need storing, as well as their need
for sign language in general. We're hoping that the user's general
language preferences can be used, for a first pass.
The following axes might be thought of as more
abstruse, but are plausible:
- High-contrast video: video where
the significant content stands out well;
- High-clarity audio: essentially
the same for audio: filler and background noise is reduced;
- Epilepsy-avoidance video: some people are susceptible to video that flashes at certain frequencies, and such flashing is sometimes used in, for example, music videos.
Note that since it's not possible to express a
concrete anti-preference ('I absolutely must avoid captions'), all
accessibility axes have to be expressed in terms of something
positively needed ('I need video that avoids inducing epileptic
fits') rather than avoided (you cannot say 'I must not be presented
with video that might induce epileptic fits').
There are probably other 'accessibility axes' that
will appear over time as we gain experience.
4. Associated with the media
4.1 alt
The alt attribute should be permitted, but not required, on the video and audio elements. The usual 'special case' of alt="" should perhaps also be documented to indicate content that is 'decorative', such that it is not important if the user cannot see/hear it.
4.2 longdesc
Similarly, the longdesc attribute, on the video and audio elements, could perhaps be used to link to transcripts or other longer descriptions of the media.
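A sketch of what this proposed markup might look like (note that alt and longdesc on media elements are this document's proposal, not features of the HTML5 specification; the file names are illustrative):

```html
<video src="dog-barking.mp4" alt="a dog barks"
       longdesc="dog-barking-transcript.html"></video>
```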
Perhaps one of alt or longdesc should be required; but we rather hope that in-media accessibility will provide the desired experience for those needing it, and making one of alt or longdesc required rather implies that in-media accessibility is expected to fail.
4.3 Source fallback
As noted above, the addition of 'meets accessibility needs' to the
source element selection raises the possibility that no source will
be acceptable, and currently the HTML5 specification provides no fallback for
this case (possibly because it's not obvious where to put it). This
is an open question: is it needed, and if so, where should it go?
5. References
- HTML5 Media elements: http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#video
- 3GPP Timed Text: http://www.3gpp.org/ftp/Specs/html-info/26245.htm
- Wikipedia on captions: http://en.wikipedia.org/wiki/Closed_captioning
- Wikipedia on subtitles: http://en.wikipedia.org/wiki/Subtitles
- Wikipedia on Audio Description: http://en.wikipedia.org/wiki/Audio_description
- Media queries in CSS3: http://www.w3.org/TR/css3-mediaqueries/
- WCAG on audio/video: http://www.w3.org/WAI/GL/WCAG20/#media-equiv
- Dublin Core on User prefs:
- IMS work related to user prefs: http://www.imsglobal.org/accessibility.html
- HTML5 on alt (and mention of longdesc):
- Internet Assigned Numbers Authority: http://www.iana.org/