Accessibility of Media Elements in HTML 5


Friends and Colleagues,

David Singer and I plan to hold an informal workshop or two on the subject
of Accessibility of Media Elements in HTML 5.  The media elements are
audio and video, and their supporting elements such as source.

This will be an informal workshop, as we wish to hold it before the
November 2009 TPAC in Santa Clara, CA. This informal workshop will last
one day, and the first one will be held on November 1st at Stanford
University.  To attend the workshop, you must first Register (there is no
financial charge associated) and then also come prepared to present on one
of the questions below, or a suitable other question, drawing from your
experience or expertise to help inform the discussion and make progress on
proposing solutions.  The Registration Form is at: 
This gathering is done under the auspices of the W3C HyperText
Coordination Group.


The current specification of Timed Media elements in HTML5 takes a fairly
hard-nosed approach to what is presented as timed media:  it is inside the
timed media files that are selected from the sources.
There is currently no provision for linking or synchronizing other
material, and there is no discussion of how to manage the media so it's
accessible.  This needs addressing.

We would like to understand the 'landscape' and put in place good
architectural support in general, as well as making sure that specific
solutions exist to the more pressing problems.  We anticipate working, in
public, to develop proposals for any changes to specifications that might
be suggested by the work, and also to develop a cohesive 'best practices'
document that shows how those provisions can be used, by authors, by user
agents (browsers), and users, to address the issues we identify.

We are aware that good accessibility rests on four legs (at least):

  1. Proper provision in the specifications and documentation of those
provisions and how to use them; 

  2. Willingness and ability to use those provisions effectively on the
part of authors; 

  3. Provision of the right preferences, tools, and user interfaces in
user agents to enable access to the provisions, perhaps automatically; and

  4. The ability of those who need the provisions to find, enable or
access them, and understand what they get.

It's easy to fail on one of these, and good accessibility is not then
achieved.

Accessibility provisions for Timed Media might themselves be timed (e.g.
captions) or un-timed (e.g. a readable screen-play or transcript).  We
wish to consider both categories.

The questions we would like to address include, but are not limited to, the
following:

# What accessibility issues exist, and what are the 'classic' provisions for
them, in timed media?

We are all aware of captioning for those who cannot hear the audio; less
common is audio description of video, for those who cannot see. 
The BBC recently had some content that had optional sign-language
overlays.  Issues can also arise with susceptibility (e.g. flashing videos
and epilepsy, color vision issues, and so on).

# What solution frameworks already exist that would be relevant?

We are all aware of the existence, for example, of screen readers and
perhaps even Braille output devices.  We've seen tags in other parts of
HTML that are there to support accessibility, and frameworks such as ARIA.
Are there existing good practices that naturally extend to Timed Media?

# Are there solutions that will benefit, be tested and seen by, and more
likely authored by, the wider community?

There have been ongoing debates about whether 'unique' provisions for
accessibility (functions with no other purpose) are desirable.  We do not
intend to have this philosophical debate, but it would be useful to hear
of related problems and opportunities that help make the debate
irrelevant.  For example, the provision of a transcript or separately
accessible captions, in text form, makes indexing and searching content
much easier.  Are there problems like this that we can address that will
make it more likely that authors build accessible timed media?

# What new problems and new opportunities arise when we use digital media
embedded in the world-wide-web?

Much of the work and research in this area has been done for isolated,
analog systems (classic television).  Instead, we have digital content
presented in a rich context (web content).  What new opportunities and
solutions are opened up by this?

# What technologies and solutions exist that we should notice?

The work of the W3C on a common Timed Text format, and the existence of
general frameworks such as ARIA (Accessible Rich Internet Applications),
suggest that there are pieces of the solution space we should consider.
What are they?

# What can be done today, given the structures we have? What experiments
and proof-of-concept work should we notice?

We are aware that there are a number of pioneering organizations in this
area.  The BBC's work with sign-language has already been noted; workflows
for captioning content have been developed in a number of places.  There
have been script-based experiments on captioning. 
What are some of these systems and experiments, and what can we learn from
them?

We expect the workshop to spend perhaps two-thirds of the time on these
presentations, with short Q&A for each.  Then we may have a panel session
or two, or moderated discussion, to address focused questions.  As stated
in the introduction, we are looking for a framework and solutions with
good 'longevity', simplicity, and efficacy, that will be embraced by the
standards community, content authors, user agent developers, and end
users.  This is ambitious but achievable, we believe, and opportunities
such as this to 'get it right from the start' come up all too rarely.

We think that at least the following communities and groups might be
interested:

* HTML 5, the place where the Timed Media tags are specified, and the
integration therefore must occur;
* PFWG, where much thought has gone into this general problem space;
* Media Annotations, who are concerned with metadata for Timed Media;
* Timed Text, owners of DFXP, one of the likely text formats;
* CSS, who define the styling of text, and also the nature of 'rendering
surfaces' (and a presentation where a provision is needed, such as
captions, might be seen as a rendering surface of a specific kind).

If you feel prepared to attend, present, and work cooperatively on this
problem, please contact the workshop organizers as soon as possible.
David Singer
Multimedia Standards, Apple Inc. 

John Foliot
Stanford University Online Accessibility Program 

Received on Wednesday, 14 October 2009 23:53:31 UTC