[whatwg] Thoughts on video accessibility

From: Silvia Pfeiffer <silviapfeiffer1@gmail.com> · Date: Tue, 9 Dec 2008 12:27:08 +1100

Hi everybody,

For the last 2 months, I have been investigating means of satisfying
video accessibility needs through Ogg in Mozilla/Firefox for HTML5.

You will find a lot of information about our work at
https://wiki.mozilla.org/Accessibility/Video_Accessibility and in the
archives of the Ogg accessibility mailing list at
http://lists.xiph.org/mailman/listinfo/accessibility .

I wanted to give some feedback here on our findings, since some of
them will have an impact on the HTML5 specification.

What are we talking about
-----------------------------------
When I say "video accessibility", I'm actually only talking about
time-aligned text formats and not e.g. captions as bitmaps or audio
annotations as wave files.
Since we analysed how to attach time-aligned text formats to video
in a Web browser, we also did not want to restrict ourselves to only
closed captions and subtitles.
It made sense to extend this to any type of time-aligned text one can
think of, including textual audio annotations (to be consumed by
the blind through a screenreader or braille output), karaoke, speech
bubbles, hyperlinked text annotations, and others. There is a list at
http://wiki.xiph.org/index.php/OggText#Categories_of_Text_Codecs which
gives you a more complete picture.

How is it currently done
-------------------------------
When looking at the existing situation around time-aligned text for
video, I found a very diverse set of formats and means of doing it.

First of all, most media players allow you to load a video file and a
caption/subtitle file for it in two separate steps. The reason is that
most subtitles are produced by people other than the creators of the
original content, and this allows the player to synchronise the two. This is
particularly the case with the vast majority of SRT and SUB subtitle
files, but is also the case for SMIL- and DFXP-based subtitle files.
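
To make the synchronisation step concrete, here is a minimal sketch of
what a player does with a separately loaded SRT file: parse it into cues
and look up the cue that is active at the current playback time. The names
Cue, parse_srt and active_cue are hypothetical, and the parser only
handles the common "index / timing line / text" layout; it is not a
complete SRT implementation.

```python
import re
from typing import List, NamedTuple, Optional


class Cue(NamedTuple):
    """One time-aligned piece of text (hypothetical helper type)."""
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str


# Timing line such as "00:00:01,000 --> 00:00:04,000".
_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*"
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def parse_srt(data: str) -> List[Cue]:
    """Parse SRT text into cues; blocks without a timing line are skipped."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", data.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = _TIMING.search(line)
            if match:
                g = match.groups()
                text = "\n".join(lines[i + 1:])
                cues.append(Cue(_seconds(*g[:4]), _seconds(*g[4:]), text))
                break
    return cues


def active_cue(cues: List[Cue], t: float) -> Optional[Cue]:
    """Return the first cue whose interval contains playback time t."""
    for cue in cues:
        if cue.start <= t < cue.end:
            return cue
    return None
```

A player would call active_cue() on every time update of the video and
render the returned text, or clear the display when it returns None.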

From a media file format POV, some formats have a means of
multiplexing time-aligned text into the format, e.g. QuickTime has
QTText and Flash has cuepoints. Others prefer to use external
references, e.g. WindowsMedia and SAMI or SMIL files, RealMedia and
SMIL files.

For mobile applications, a subset of DFXP has been defined in 3GPP
TimedText, which is actually being encapsulated into QuickTime QTText
using some extensions, and can be encapsulated into MP4 using the
MPEG-4 TTXT specification.

As can be seen, the current situation is such that time-aligned text
is being handled both in-stream and out-of-band and there are indeed
requirements for both situations.

Requirements
-------------------
Without going into much detail here: I have seen extensive arguments
made both for and against in-stream text tracks.
One particular argument for in-stream text is that of downloading the
video from some place and keeping all its information together in one
file such that when it is distributed again, it retains that
information.
One particular argument for out-of-band text is the ability to add
text tracks at a later stage, from another site, and even from a web
service (e.g. a translation web service that uses an existing caption
file and translates it into another language).
In view of these requirements, I strongly believe we need to enable
people to do both: provide time-aligned text through
external/out-of-band resources and through in-stream, where the
container format allows this.
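
As an illustration of the out-of-band case, the following sketch shows
roughly what a translation web service could do with an existing caption
file: keep the timing of every cue and replace only the text, then
serialise the result as SRT again. The function names and the translate
callable are assumptions made for this example; a real service would plug
in an actual translation backend.

```python
from typing import Callable, List, Tuple

# A cue as (start seconds, end seconds, text); hypothetical representation.
Cue = Tuple[float, float, str]


def translate_cues(cues: List[Cue], translate: Callable[[str], str]) -> List[Cue]:
    """Return new cues with identical timing and translated text."""
    return [(start, end, translate(text)) for start, end, text in cues]


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues: List[Cue]) -> str:
    """Serialise cues as SRT text, numbering them from 1."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(blocks)


# Stand-in for a real translation backend (assumption for the demo only).
demo_dictionary = {"Hello": "Bonjour"}
french = translate_cues([(1.0, 4.0, "Hello")],
                        lambda s: demo_dictionary.get(s, s))
print(to_srt(french))
```

Because the timing is untouched, the translated file stays synchronised
with the same video without any change to the video resource itself.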

Proposal for out-of-band approach
----------------------------------------------
I'd like to stimulate a discussion here about how we can support
out-of-band time-aligned text for video in HTML5.
I have seen previous proposals, such as the "track" element at
http://esw.w3.org/topic/HTML/MultimediaAccessibilty#head-a83ba3666e7a437bf966c6bb210cec392dc6ca53
and would like to propose the following specification.

Take this as an example:

<video src="http://example.com/video.ogv" controls>
 <text category="CC" lang="en" type="text/x-srt" src="caption.srt"></text>
 <text category="SUB" lang="de" type="application/ttaf+xml"
src="german.dfxp"></text>
 <text category="SUB" lang="ja" type="application/smil"
src="japanese.smil"></text>
 <text category="SUB" lang="fr" type="text/x-srt"
src="translation_webservice/fr/caption.srt"></text>
</video>

* "text" elements are subelements of the "video" element and therefore
clearly related to one video (even if it comes in different formats).
[BTW: I'm happy to rename this to textarea or whatever else people
prefer to call it].

* the "category" attribute (could also be renamed "role" if we prefer)
allows us to specify what text category we are dealing with and allows
the web browser to determine how to display it (there would be a default
display for each category, and CSS would allow these defaults to be
overridden).

* the "lang" attribute would allow the specification of alternative
resources based on language, which allows the browser to select one by
default based on browser preferences, and also to turn those tracks on
by default that a particular user requires (e.g. because they are
blind and have preset the browser accordingly)

* the "type" attribute allows specification of what actual time-aligned
text format is being used in this instance; again, it will allow the
browser to determine whether it is able to decode the file and thus
make it available through an interface or not.

* the "src" attribute obviously points to the time-aligned text
resource. This could be a file, a script that extracts data from a
database, or even a web service that dynamically creates the data
based on some input.

This provides for a lot of flexibility and is somewhat independent of
the media file format, while still enabling the Web browser to deal
with the text (as long as it can decode it).
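
The selection behaviour described in the list above can be sketched as
follows. This is only an illustration under stated assumptions, not
specified behaviour: TextTrack and select_track are hypothetical names,
and the matching rules (exact language match, first preferred language
wins) are one simple choice among several possible ones.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Set


@dataclass(frozen=True)
class TextTrack:
    """Mirrors the attributes of one proposed "text" element."""
    category: str  # e.g. "CC" or "SUB"
    lang: str      # language code, e.g. "en"
    type: str      # MIME type of the time-aligned text format
    src: str       # location of the resource


def select_track(
    tracks: List[TextTrack],
    preferred_langs: Sequence[str],
    wanted_categories: Set[str],
    supported_types: Set[str],
) -> Optional[TextTrack]:
    """Pick a default track, or None if nothing suitable is available."""
    # Drop tracks the browser cannot decode or the user has not asked for.
    usable = [
        t for t in tracks
        if t.type in supported_types and t.category in wanted_categories
    ]
    # Honour the user's language preferences in order.
    for lang in preferred_langs:
        for track in usable:
            if track.lang == lang:
                return track
    return None


# Three of the tracks from the markup example above.
tracks = [
    TextTrack("CC", "en", "text/x-srt", "caption.srt"),
    TextTrack("SUB", "de", "application/ttaf+xml", "german.dfxp"),
    TextTrack("SUB", "fr", "text/x-srt",
              "translation_webservice/fr/caption.srt"),
]

# A browser that only decodes SRT, for a user who prefers German, then French.
choice = select_track(tracks, ["de", "fr"], {"SUB"}, {"text/x-srt"})
print(choice.src if choice else "no usable text track")
```

Here the German track is skipped because its format is not supported, so
the French SRT track is chosen; tracks that are not selected by default
could still be offered through the player interface.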

What do people think?

Regards,
Silvia.

BTW: In parallel, we are working on getting time-aligned text support
into Ogg - see the spec at http://wiki.xiph.org/index.php/OggText . It
will provide a similarly flexible approach for any kind of text format
as this element does. This means that mapping into the DOM would work
in a similar way from within Ogg as it would from a "text" element as
defined above.