This document presents a number of reflections on the accessibility issues that arise from the current design of the HTML5 audio and video elements. My reflections are based on experience with the various generations of the W3C SMIL specification, as member (and most recently co-chair) of the Synchronized Multimedia Working Group (SYMM). This note is written as a part of my participation in the HTML Accessibility Task Force.
The goal of this document is to suggest the addition of several control attributes and a repartitioning of the synchronization mechanisms in HTML5 to better serve the needs of the accessibility community. The current HTML5 specification (as of 4 March 2010) overloads synchronization concepts onto the audio and video elements: these become a combination of timing and content containers. By explicitly separating timing semantics from media content specification, a more robust and future-proof architecture can be designed. Note that since most of the synchronization issues touched on in this note are fundamental consequences of coordinating the set of existing media objects already present in HTML5 (such as multiple source encodings and captions/subtitles), the explicit consideration of synchronization structure does not impose a new burden on implementors: it actually makes the existing implementation task more manageable by isolating specific classes of functionality.
One of the important benefits of HTML5 is the introduction of the <audio> and <video> elements as first-class citizens of the HTML language. While the addition of these elements may at first glance appear to be a simple extension of the existing media types offered by HTML (such as <img> and embedded text), they have a fundamentally new property for HTML: they define a temporal scope. Where text and images are rendered in their entirety instantaneously at the moment a page is loaded, the presentation of audio and video is done incrementally, based on the temporal properties of the media element and the effect of any interaction control done via the browser's UI.
Supporting reliable audio and video is not easy. The special requirements of these media -- each supported by a mix of highly open and highly licensed codecs -- have motivated non-trivial extensions to the <audio> and <video> elements: extensions that allow selection among multiple media encodings and multiple tracks within single encodings (for example, for embedded or external captions). This introduces an inherent need to synchronize multiple content streams within the context of the temporal scope of an <audio> or <video> element. The manner in which this is proposed in the current HTML5 specification has a number of serious limitations that could be avoided by a more structured approach.
This note discusses the strengths and weaknesses of the current HTML5 specification from the perspective of SMIL-based structured temporal control. A proposal is made to include selected W3C temporal structural semantics from SMIL into the HTML5 specification. The specific motivation of this note is to suggest relatively minor improvements to HTML5 that will better support the needs of the accessibility community (and, by extension, all potential HTML5 users). The implementation implications of these suggestions are considered in the document.
This section provides a brief review of the declarative aspects of HTML5's support for temporal media from the perspective of existing support within SMIL. It aligns the points of view of the writer and the readers of this document. The discussion focuses on synchronization issues in terms of the <video> element; where specific audio-related issues arise, these will be treated separately.
The goal of the <video> element is to provide a simple framework in which video content can be integrated into a web page without requiring the use of an external plug-in component.
... <div id="A"> ... <video id="V" src="media.fmt" poster="media.jpg" /> ... </div> ...
In this example, the video element 'V' is contained within the structure container 'A'. There is no temporal relationship defined between 'V' and 'A', or with other instances of the video element in the document. In other words, the video element operates as a temporal island within the document.
A video element becomes active when the page is loaded, but the nature of video means that the content is not always available for immediate playback. The current HTML5 specification allows both declarative and scripted control over the moment at which video rendering begins. When using declarative syntax, the boolean autoplay attribute allows a video to start as soon as the browser determines that enough content is available for rendering. A video without an autoplay attribute and without a control pane (discussed in the next section) only becomes active if it is activated by an associated script.
The poster attribute allows an image to be defined that is displayed before any content is rendered. (The specification requires the poster image to be representative of the video content, but it is not clear how this will be enforced.) If the poster attribute is not present, or if the target of the poster is not found, the initial frame of the video is used.
After the rendering of the video is completed, the specification dictates that the last frame of the video be 'frozen' on the page unless the boolean loop attribute has been specified. The loop attribute causes the content to loop indefinitely -- there is no support for a loop counter, or a loop duration.
Unless modified explicitly or implicitly by user control, a video is displayed for its entire intrinsic duration, plus any (pre)buffering delay associated with fetching media content. The specification dictates that the video rendering must start at the beginning of the video. No support for playing a portion of a video is provided, and no support is provided for indexing past the start of a video.
Potential bugs: Boolean semantics
Although the autoplay and loop attributes are defined as booleans, the specification does not seem to define boolean behavior. For example, all of the following forms cause autoplay or looping behavior, respectively:
<video autoplay ... /> <video autoplay="true" ... /> <video autoplay="false" ... /> <video loop ... /> <video loop="true" ... /> <video loop="false" ... />
This behavior will inhibit any future inclusion of attribute value animation (and it is pretty counter-intuitive).
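As a minimal sketch of why all three autoplay forms above behave identically: an HTML boolean attribute is true whenever the attribute is present, regardless of its value. The `booleanAttribute` helper and the attribute maps below are illustrative stand-ins for DOM parsing, not part of any specification.

```javascript
// Sketch of HTML boolean-attribute semantics: presence alone makes the
// attribute true; the value (even "false") is never consulted.
// `attrs` is a hypothetical map of attribute name -> value.
function booleanAttribute(attrs, name) {
  return Object.prototype.hasOwnProperty.call(attrs, name);
}

const clip1 = { autoplay: "" };        // <video autoplay>
const clip2 = { autoplay: "false" };   // <video autoplay="false">
const clip3 = {};                      // <video>

console.log(booleanAttribute(clip1, "autoplay")); // true
console.log(booleanAttribute(clip2, "autoplay")); // true -- the counter-intuitive case
console.log(booleanAttribute(clip3, "autoplay")); // false
```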
One of the useful features of the video architecture in HTML5 is the ability to attach a standard or custom video playback control object to the rendering area for the video content. The standard controller is activated by the controls attribute:
... <div id="A"> ... <video id="V" src="media.fmt" controls /> ... </div> ...
The exact semantics of 'standard' control pane behavior are not specified with the video element. The behavior of Safari under Mac OS X 10.6 is that the control pane is visible when the video is not playing, or when the video is playing AND the mouse is over the video rendering area. When visible, the control pane is placed within the rendering space of the video object.
The introduction of the controls attribute adds a fundamental behavioral extension to the element: it mixes the definition of a content object with a control object. Rather than simply rendering video frames, the video element is extended to regulate its presentation context. As we will see, this has a number of implications for accessibility support.
Potential bugs: Boolean semantics
Although the controls attribute is defined as a boolean, the specification does not define boolean behavior. For example, all of the following forms cause controls to appear:
<video controls ... /> <video controls="true" ... /> <video controls="false" ... /> <video controls="speechControl" ... />
In general, the choice of a boolean value is unfortunate since it limits the ability of user agents to specify a control pane style that may meet the requirements of users with special needs.
One of the initial problems encountered when supporting video data was that, unlike image or text data, it was not practical to assume that all browsers would support a single video codec. (This is true for audio as well, but the audio alternatives are less contentious than those for video.) The proposed solution in the current draft is to extend the video element with source elements as children:
... <div id="A"> ... <video controls autoplay> <source src='movieHD.mp4' type='video/mp4; codecs="..."'/> <source src='movieQCIF.mp4' type='video/mp4; codecs="..."'/> <source src='movie.ogv' type='video/ogg; codecs="..."' /> ... </video> ... </div> ...
The source encodings are only evaluated if there is no src attribute on the parent video element. In the current draft, HTML5 makes the following assumptions about alternate source encodings:
From a language perspective, the <source>-based selection mechanism is very similar to the SMIL <switch>, with the restriction that only codec selection is considered.
Potential bugs: Limited semantics
The current HTML5 architecture means that users on low-bandwidth connections or on devices with limited processing capabilities (such as mobile devices) may be presented with video content that is not suitable for their device. For example, a mobile device may be sent full HD content, even if a more suitable encoding was available lexically later in the list of alternatives. This could lead to significant (and unexpected) delays and the potential for high use charges.
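To make the concern concrete, here is a sketch of what source selection could look like if device constraints participated in the match instead of codec alone. The `pickSource` function and the maxWidth/minBandwidth fields are hypothetical and appear in no HTML5 draft; they illustrate the kind of selection the current architecture cannot express.

```javascript
// Pick the first (lexically earliest) source whose codec the device supports
// and whose hypothetical device constraints are satisfied.
function pickSource(sources, device) {
  return sources.find(s =>
    device.codecs.includes(s.codec) &&
    (s.maxWidth === undefined || device.screenWidth <= s.maxWidth) &&
    (s.minBandwidth === undefined || device.bandwidth >= s.minBandwidth)
  ) || null;
}

const sources = [
  { src: "movieHD.mp4",   codec: "mp4", minBandwidth: 5000 }, // full HD
  { src: "movieQCIF.mp4", codec: "mp4", maxWidth: 320 },      // small-screen encoding
  { src: "movie.ogv",     codec: "ogg" }                      // unconstrained fallback
];

// A mobile device gets the QCIF encoding instead of full HD content.
const mobile = { codecs: ["mp4"], screenWidth: 320, bandwidth: 400 };
console.log(pickSource(sources, mobile).src); // "movieQCIF.mp4"
```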
In addition to supporting multiple mutually exclusive media encodings within a video tag, a current A11Y Task Force proposal defines the <track> and <trackgroup> elements to provide support for the selective inclusion of timed text tracks:
... <div id="A"> ... <video controls autoplay> <source src='video.mp4' type='video/mp4; codecs="..."'/> <source src='video.ogv' type='video/ogg; codecs="..."' /> ... <track src="video_cc.dfxp" type="application/ttaf+xml" language="en" role="caption"/> <track src="video_tad.srt" type="text/srt" language="en" role="textaudesc"/> <trackgroup role="subtitle"> <track src="video_sub_en.srt" type="text/srt; charset='Windows-1252'" language="en"/> <track src="video_sub_de.srt" type="text/srt; charset='ISO-8859-1'" language="de"/> <track src="video_sub_ja.srt" type="text/srt; charset='EUC-JP'" language="ja"/> </trackgroup> </video> ... </div> ...
There are a number of implicit assumptions made in the <track> and <trackgroup> proposal. These are:
As with the <source> element, the <track> and <trackgroup> elements are similar to SMIL's <switch>, but limited to processing based on role and language.
Potential bugs: Limited semantics
The interaction between the control pane and the text rendering area is not fully specified: what happens if the video height is relatively small and controls+subtitles are active? On a related note, can a UI resize the subtitles to meet the needs of the viewer?
There are many good things to be said about the general architecture of the video element: the design of the presentation for a single video object is compact, and the integration of standard controls is a welcome addition to the repertoire of declarative media constructs. Unfortunately, the proposal has several weaknesses, especially for supporting accessibility needs.
The weaknesses of HTML5's audio/video architecture can be grouped into 5 categories:
The subsections below discuss each of these issues in greater detail. Specific extensions are proposed for extending the semantics of the current specification.
It was clear early in the development of HTML5 that individual implementations of the video architecture and individual viewer needs would require a method to specify content alternatives. At present, each new constraint -- the selection of multiple sources or the specification of multiple tracks -- has brought with it special-purpose solutions (the <source> and <track>/<trackgroup> elements). If this model continues, new elements will be required to select on browser family, implementation architectures, target screen sizes, connection bandwidth and the support for different streaming architectures (to name a few). This is a non-sustainable approach.
Instead of this piecemeal approach, a more general architecture is needed to handle open-ended selection needs. From this perspective, SMIL already addresses the general selection problem with its content control architecture. This architecture has been used by Daisy for over a decade to address content selection needs in accessibility contexts.
The SMIL content control architecture consists of a method for conditionally activating a single element in a document, or for selecting one element out of a set of mutually exclusive alternatives. The architecture allows natural sub-structuring of content alternatives.
In its simplest form, conditional content inclusion is controlled by one or more predicates defined as attributes. In SMIL, these are called system test predicates. For example, the following content container would only be included within rendering if the language selection in the user agent was English:
... <div systemLanguage="en"> ... </div> ...
Note: a <div> has been used as a general example; the control predicate could be attached to any statement, including the audio and video tags.
A set of mutually exclusive alternatives is wrapped within a <switch> element:
... <switch> <div systemLanguage="en"> ... </div> <div systemLanguage="fr"> ... </div> <div systemLanguage="ja"> ... </div> <div> ... </div> </switch> ...
The various alternatives within a <switch> are evaluated in lexical order. The first test predicate that evaluates to true will cause the associated element to become active. Once an active candidate is determined, the evaluation stops. Note that, in this example, the final <div> element within the <switch> contains no test predicate -- such statements always evaluate to true, and provide a useful means of specifying a default condition. If no (final) default statement is provided, no alternative is selected within the <switch>. Note also that the switch statements may be nested, with each internal switch containing a test predicate; this provides a structured way of including multiple sets of alternatives.
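The first-match evaluation rule described above can be sketched in a few lines. The `selectAlternative` function and the `env` profile object below are hypothetical names used for illustration; a real agent would evaluate SMIL system test predicates against user-agent state.

```javascript
// Minimal sketch of SMIL <switch> evaluation: alternatives are scanned in
// lexical order, and the first one whose predicates all hold becomes active.
function selectAlternative(alternatives, env) {
  for (const alt of alternatives) {
    const preds = alt.predicates || {};  // no predicates => always true (default branch)
    const ok = Object.entries(preds).every(([name, want]) => env[name] === want);
    if (ok) return alt;                  // evaluation stops at the first match
  }
  return null;                           // no default branch: nothing is selected
}

const alts = [
  { id: "en", predicates: { systemLanguage: "en" } },
  { id: "fr", predicates: { systemLanguage: "fr" } },
  { id: "default" }                      // predicate-free default, always true
];

console.log(selectAlternative(alts, { systemLanguage: "fr" }).id); // "fr"
console.log(selectAlternative(alts, { systemLanguage: "ja" }).id); // "default"
```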
SMIL defines a basic set of test predicates. These may or may not all be relevant for HTML5. HTML5 may wish to define its own predicates, either as part of the language or as part of an implementation-specific set of test attributes. This provides an extensible architecture at no extra implementation effort.
The SMIL content control architecture is a-temporal: that is, it has no timing semantics. It is used purely to specify content selection alternatives. Note that the support for SMIL-style content control does not remove the need for scripting-based solutions: for complex content control specifications, scripting may provide a more efficient solution. At the same time, while a script may meet the needs of a portion of the HTML5 user community, it limits the ability of grass-roots accessibility support. Often, accessibility is added locally by (or for) users with special needs. A simple, declarative structure is a valuable addition to the technology toolkit available for HTML5 content creators.
Why is SMIL content control relevant for HTML5? It is relevant because its generalized architecture provides a systematic means of specifying alternatives, rather than the ad hoc method currently used. It is important to realize that adding the switch to HTML5 does not increase language complexity -- it simply provides a common means of managing the selection complexity that already exists.
The switch element can be integrated as the child of an audio/video element, or as a container in which various content elements can be placed. Here are examples of both approaches:
... <video ... > <switch> <source src='myVideo.mp4' type='video/mp4; codecs="..."'/> <source src='myVideo.ogv' type='video/ogg; codecs="..."' /> <p> No codec available for this browser. </p> </switch> <switch systemCaptions="true"> <track src='myCaptions-en.dfxp' language="en"/> <track src='myCaptions-de.dfxp' language="de"/> </switch> </video> ...
This use of the switch element closely models the current HTML5 draft's use of the source element and the track elements as children of the video element. The difference is that the selection is structured as a collection of switch objects, each of which meets a particular need. The first switch is always evaluated, and activates one of two video encodings or displays a message that no usable video encoding was supported. The second switch is only evaluated if captions have been requested -- and it will display either English or German captions. If any other language is selected, nothing is displayed.
... <switch> <video src='myVideo.mp4' type='video/mp4; codecs="..."'/> <video src='myVideo.ogv' type='video/ogg; codecs="..."' /> <p> No codec available for this browser. </p> </switch> <switch systemCaptions="true"> <text src='myCaptions-en.dfxp' language="en"/> <text src='myCaptions-de.dfxp' language="de"/> </switch> ...
This second use of the switch element shows a much clearer and cleaner structuring of content alternatives. The artificial <source> and <track> elements can be discarded. Instead, one of two mutually-exclusive video objects may be selected, with a text message available if neither is valid. Similarly, the two candidate captions objects can be wrapped in a <switch>.
The advantage of this approach is that useless extra elements are removed from the language, and that selection is promoted to a first-class object.
... <div systemComponent="video"> <video controls src="myVideos.ogv" > <text src='myCaptions.dfxp' systemCaptions="true"/> </video> </div> ...
In many practical situations, there is no set of video encodings available, and there is no list of caption alternatives: either you can play the video you have (or you can't), and you either want captions or you don't. The SMIL content control architecture allows this simple case to be supported as shown above. The parent <div> element contains a selection predicate that instructs the agent to only evaluate this element if video is supported. If so, we render the video. In addition, if captions are desired, the captions file is shown. No video? You also don't see the captions. Of course, this is not always what you want: sometimes you'd like more composition possibilities. This is addressed in the next section.
The implementation of the SMIL content control infrastructure requires providing support for test predicates and the <switch> element. The switch element may be used as a child of the video (or audio) elements, but this is probably not a very clean solution since it would still require use of the artificial <source> element. A more structured approach is to use the a-temporal switch to wrap peer-level alternative video and text content elements.
When SMIL defined its version of the test attribute predicates, a fixed collection was defined as part of the language specification. For HTML5, it may be better to define the test predicate architecture with a very limited set of common test predicates. Each implementation could then publish its own -- hopefully inter-vendor -- list of extension predicates. This allows the entire architecture to adapt to the needs of new devices and features, and it provides the suppliers of assistive technology an easy way to profile the use of new, accessible technology.
The basic selection architecture required to support SMIL content control is already available in HTML5 implementations. The main issue to be addressed is: how dynamic can the selection algorithm be? (Should the document be re-evaluated for bandwidth or user language preferences?) This issue is not new for HTML5, since it also must be addressed for <track> and <source>.
The advantage of the SMIL content selection architecture is that it has been in use for years. There is no need or justification for inventing new approaches to solve this problem that provide no new functionality.
The current HTML5 specification provides a well-defined activation framework for a video object. From a SMIL perspective, it has one major disadvantage: it assumes that you always want to play a complete video (or audio) object, unless interrupted by UI control. This has the disadvantage that even simple temporal indexing becomes a user interface issue rather than a basic property of the specification of a video object. For many users with special needs (emotional, physical, intellectual), it is often useful to focus attention on a particular part of a video object. This is also useful, of course, for all video users.
Since its earliest versions, SMIL has provided support for document-driven indexing via the clipBegin attribute. This attribute allows a document author to specify a temporal index within the audio/video object as the starting point for rendering. An example of the use of this attribute in an HTML5 context is:
... <div id="A"> ... <video id="V" src="media.fmt" controls clipBegin="22s" /> ... </div> ...
The SMIL version of the clipBegin attribute allows the attribute value to be a simple time value (such as the 22 seconds in the example) or a full SMPTE time code. (Other alternatives are also possible.)
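The two value forms just mentioned can be parsed with a few lines of script; this sketch covers only the simple seconds form ("22s") and the full clock form ("hh:mm:ss.fff"), while SMIL allows additional forms that are omitted here. `parseClipValue` is an illustrative name, not from any specification.

```javascript
// Convert a clip value string to seconds.
function parseClipValue(v) {
  let m = /^(\d+(?:\.\d+)?)s$/.exec(v);                  // simple form: "22s"
  if (m) return parseFloat(m[1]);
  m = /^(\d+):([0-5]\d):([0-5]\d(?:\.\d+)?)$/.exec(v);   // clock form: "hh:mm:ss.fff"
  if (m) return (+m[1]) * 3600 + (+m[2]) * 60 + parseFloat(m[3]);
  throw new Error("unrecognized clip value: " + v);
}

console.log(parseClipValue("22s"));        // 22
console.log(parseClipValue("00:01:23.0")); // 83
console.log(parseClipValue("01:28:31.0")); // 5311
```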
Although the support of a clipBegin attribute is similar to the use of a temporal fragment identifier in a URL, there are several differences that make clipBegin especially useful for HTML5. These include the fact that no server-side processing is required, that no video object modification is required, and that clipBegin fits easily into the existing HTML5 video/audio architecture.
Note that SMIL also supports a clipEnd attribute and an explicit duration attribute (dur). Both of these additional attributes have compelling use cases, but my experience is that clipBegin gives the 'biggest bang for the buck'.
The clipBegin attribute can be integrated with any media object specification. It can be used with the <video>/<audio>/<source>/<track> elements, or any other temporal object (such as an SVG animation). Here are examples of the use of clipBegin, and also clipEnd and dur:
... <video src="myVideo.mp4" clipBegin="23s" poster="poster.jpg" controls /> ...
... <video src="myVideo.mp4" clipBegin="23s" clipEnd="00:01:23.0" poster="poster.jpg" controls /> ...
... <video src="myVideo.mp4" clipBegin="23s" dur="58s" poster="poster.jpg" controls /> ...
... <div> <video src="myVideo.mp4" clipBegin="0" poster="chapter1.jpg" controls /> <video src="myVideo.mp4" clipBegin="23s" poster="chapter1.jpg" controls /> <video src="myVideo.mp4" clipBegin="00:12:42.125" poster="chapter2.jpg" controls /> ... <video src="myVideo.mp4" clipBegin="01:28:31.0" poster="chapter26.jpg" controls /> </div>
In this example, a set of video elements are defined, each with their own poster. Each represents a chapter index into a video. Activating any one of the chapters will play the remainder of the video from that point. Although not used in this example, it is also possible to bound the duration of each chapter using clipEnd or dur.
The implementation burden for clipBegin (as well as clipEnd and dur) is trivial: the algorithm for determining the active moment needs to be modified to allow a media element to start at a non-zero index into the file. (It may also extend the algorithm to allow the media to end at either the clipEnd point or the sum of clipBegin and the duration defined by the dur attribute, whichever comes first.) The media codec implementations will then need to be able to determine the in/out points based on a mapping of the time specification in the encoding.
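The end-point rule just described -- stop at clipEnd or at clipBegin plus dur, whichever comes first, bounded by the intrinsic media duration -- reduces to a one-line computation. `effectiveEnd` is a hypothetical helper; times are in seconds, with Infinity standing in for an unset attribute.

```javascript
// Compute where playback should stop for a clipped media element.
function effectiveEnd(clipBegin, clipEnd, dur, intrinsic) {
  return Math.min(clipEnd, clipBegin + dur, intrinsic);
}

console.log(effectiveEnd(23, 83, Infinity, 120));       // 83  (clipEnd comes first)
console.log(effectiveEnd(23, Infinity, 58, 120));       // 81  (clipBegin + dur comes first)
console.log(effectiveEnd(23, Infinity, Infinity, 120)); // 120 (play to the intrinsic end)
```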
One of the immediate temporal aspects of integrating video in a page model is determining when the video should actually start. The options are: on page load, at a specified interval after page load, or as the consequence of an event trigger. In the current HTML5 specification, there is an extensive algorithm that describes how a browser is to provide internal activation support for video. This is an embedded procedure over which UI agents have no control. The start moment of a video can be controlled by scripting, but the only declarative control given is the autoplay attribute. This is a limitation for many accessibility scenarios, where the end-user may require more time (or special devices) before activation should begin.
Since the introduction of event-based timing in SMIL (in version 2.0), the interactive begin attribute on a media object has provided a declarative means of specifying the start moment of that object:
... <div id="A"> ... <video id="V" src="media.fmt" controls clipBegin="22s" poster="img.jpg" begin="onClick" /> ... </div> ...
(The name of the value of the on-click event is not important; the ability to specify this declaratively is.) In this example, a mouse click (or equivalent) on the poster image would start the video. The begin attribute does not determine when the video element (and the poster) become active, only the moment at which the video content will play if it is available. Note the use of the clipBegin in this example: once started, the video will index to the 22-second point. The specification of the content range is separated from the activation of the event.
In SMIL, we found that a useful extension of the interactive begin semantic was to allow an element identifier to be added to the begin condition:
... <div id="A"> ... <img id="bazinga" src="myDog.jpg" alt="My dog Bazinga"/> <video id="V" src="media.fmt" controls poster="img.jpg" begin="bazinga.onClick" /> ... </div> ...
Here, the video is started when a viewer clicks on the dog image. This can obviously also be done via scripting, but the simple declarative extension makes the functionality more accessible for all authors.
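For comparison, the scripted equivalent of begin="bazinga.onClick" is a click handler on one element that starts another. The stub objects below stand in for DOM nodes so the sketch is self-contained; in a real page they would be the <img> and <video> elements and the real addEventListener/play APIs.

```javascript
// Minimal event-target stub, standing in for a DOM element.
function makeElement() {
  const handlers = {};
  return {
    addEventListener(type, fn) {
      if (!handlers[type]) handlers[type] = [];
      handlers[type].push(fn);
    },
    dispatchEvent(type) {
      (handlers[type] || []).forEach(fn => fn());
    },
  };
}

const bazinga = makeElement();          // stands in for <img id="bazinga">
const video = Object.assign(makeElement(), {
  playing: false,
  play() { this.playing = true; },      // stands in for HTMLMediaElement.play()
});

// The declarative begin="bazinga.onClick" expressed as script:
bazinga.addEventListener("click", () => video.play());

bazinga.dispatchEvent("click");         // viewer clicks the dog image
console.log(video.playing);             // true
```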
SMIL also supports the interactive end attribute: this attribute allows the playing of the video to stop on a click (or other event). Both interactive begin/end have use cases, although the interactive begin (coupled with the controls attribute) would probably be most useful for the initial integration into HTML5. The main advantage of this syntax is that a user can start activation interactively by clicking on either the video or another named object; this is not possible with the controls attribute alone.
A related concern to interactive begin behavior is the restart behavior of the video: does it play only once, can it be restarted after (or during) rendering, and is the restart controlled by an external event -- such as user interaction or a script trigger -- or as the consequence of looping behavior? In the current HTML5 specification, the loop attribute indicates that content should loop continuously. The UI may also be used to start a media object, although control over this behavior is not explicitly defined as a media object attribute. SMIL supports the restart attribute for declarative control of restart behavior. It is a useful extension to the declarative media toolset, but is probably not a first-generation video/audio control imperative.
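As a sketch of how SMIL-style restart control could be modeled: whether a begin event may (re)start an element depends on the restart value and the element's current state. The `mayRestart` helper is illustrative, assuming the SMIL 2.0 values always, whenNotActive, and never.

```javascript
// Decide whether a begin event may (re)start an element.
function mayRestart(restart, isActive, hasPlayed) {
  switch (restart) {
    case "always":        return true;                    // any begin event restarts
    case "whenNotActive": return !isActive;               // only between plays
    case "never":         return !isActive && !hasPlayed; // first activation only
    default:              return true;                    // fall back to "always" behavior
  }
}

console.log(mayRestart("always", true, true));         // true
console.log(mayRestart("whenNotActive", true, false)); // false -- still playing
console.log(mayRestart("never", false, true));         // false -- already played once
```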
HTML already supports scripted starting of videos. Adding the begin and end attributes simply provides a declarative interface that makes interactive activation and termination easier for a broader set of content authors.
At present, each video element is architected as if it operates on a temporal island. In reality, each such element already exhibits composition-based synchronization behavior: it combines the presentation of video data (in multiple encodings) with poster images and video controls. It is expected that captions/subtitles will also be selectively composed. This composition is both temporal and spatial in nature. Given that composition already takes place within the video element (and, to a lesser degree, the audio element), it is appropriate to take a step back and ask: is the architecture used for video composition in HTML5 clean and extensible, and does it really meet the needs of the accessibility community? From a SMIL perspective, the answer is: no.
Before giving examples of a more structured approach to content synchronization within HTML5, let's start by recognizing that the integration of SMIL timing semantics into HTML5 is a potentially contentious issue. The perception exists that adding temporal semantics significantly increases the complexity of an HTML5 rendering agent. The perception also exists that SMIL is a bloated standard full of elements and attributes that are irrelevant for HTML. These perceptions are unfortunate, but they are also dangerous: by ignoring existing approaches to unified synchronization support, HTML5 runs the risk of painting itself into a temporal corner. By adopting fairly simple extensions to the language now, a foundation can be built for future growth of these capabilities while at the same time solving real problems that already exist with the temporal aspects of the HTML5 specification. It would be useful if the following sections could be evaluated on their local merit.
This section proposes adding the following set of temporal composition constructs to HTML5:
Each form of composition meets a specific accessibility need, and all can be integrated into the existing HTML5 document model with only incremental implementation effort. (In each instance, the use cases are also broader than accessibility alone.) All forms of composition have been used to address specific accessibility issues in SMIL, all of which transfer directly to HTML5.
To be supplied
To be supplied
To be supplied
To be supplied
To be supplied
To be supplied
To be supplied
To be supplied
To be supplied