

From: Dick Bulterman <Dick.Bulterman@cwi.nl>
Date: Tue, 04 May 2010 00:12:12 +0200
Message-ID: <4BDF4A3C.4040400@cwi.nl>
To: HTML Accessibility Task Force <public-html-a11y@w3.org>
After last week's media subgroup phone call (and especially in light of 
Silvia's rather direct personal attacks on me), I've decided that my 
further participation in this group would not be productive. While I am, 
of course, pleased not to have been accused of harboring weapons of 
mass destruction (although this could come next), it is clear that 
Silvia feels that I've been wasting the group's time without producing 
any constructive feedback. While this posturing may be par for the 
course in HTML5 circles, it is not the way of resolving differences of 
insight to which I've become accustomed in my 15 years of participation 
in W3C working groups. I gladly yield my position to the new generation 
of cowboys (and cowgirls) in the field.

Here are a few closing suggestions regarding Silvia's proposal for 
temporal composition and content control for captions support.

1. The name 'track' for identifying a text object within a video element 
is misleading. It may lead people to think that any arbitrary data type 
could be specified (such as an audio track, an animation track or even a 
secondary video track). Since this proposal is purportedly intended to 
allow the activation of external text tracks only, a more reasonable 
name would be 'textTrack' or 'textStream'.
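
To illustrate the point (the element names and file names below are 
hypothetical suggestions, not part of any specification), compare:

    <video src="clip.ogv">
      <track src="captions-en.srt" .../>      <!-- any kind of track? -->
    </video>

    <video src="clip.ogv">
      <textStream src="captions-en.srt" .../> <!-- unambiguously text -->
    </video>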

2. The name 'trackGroup' is equally misleading. In other languages, a 
'group' element is used to aggregate child elements; here, it is used 
to select among child elements. As with 'track', it also gives the 
impression that a choice can be made among a set of general tracks, 
which is not true. A name such as 'textSelect' or 'captionSelect' might 
be more useful. (The name 'switch' would only be appropriate if all the 
semantics of the SMIL switch were adapted.)
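
A sketch of the distinction (all names here are hypothetical): a 
'group' suggests children that are aggregated and all rendered, while 
the proposed element in fact selects exactly one child.

    <!-- reads as aggregation, behaves as selection -->
    <trackGroup>
      <track src="captions-nl.srt" .../>
      <track src="captions-fr.srt" .../>
    </trackGroup>

    <!-- a selection-oriented name states the semantics up front -->
    <textSelect>
      <textStream src="captions-nl.srt" systemLanguage="nl" .../>
      <textStream src="captions-fr.srt" systemLanguage="fr" .../>
    </textSelect>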

3. The semantics defined by Silvia for managing selection based on 
lexical ordering are not clear to me. It seems that the children are 
first processed to resolve 'source' elements, then 'track' elements, 
and then 'trackGroup' elements. What happens when things appear out of 
order, such as 'source' elements interspersed among 'track' elements?
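
For instance (file names hypothetical), it is unclear how the 
following out-of-order markup should be processed:

    <video>
      <source src="movie.mp4" .../>
      <track src="captions-en.srt" .../>
      <source src="movie.ogv" .../>  <!-- still a candidate for source selection? -->
      <trackGroup>
        <track src="captions-nl.srt" .../>
      </trackGroup>
    </video>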

4. The assumption that there are no synchronization conflicts between a 
video master and the text children strikes me as overly simplistic: it 
is not practical to simply buffer a full set of captions in all cases. 
Consider mobile phone use: if a given video had captions in 
French/Dutch/English, would all three sets be downloaded before the 
first video frame is displayed? What happens if someone turns on 
captions while the video is active: does the video pause while the 
captions are loaded? If the SRT files are large, significant data 
charges could be incurred, even if the video itself were not played.

I continue to be concerned that overloading text selection and temporal 
alignment within the <video>/<audio> elements is, architecturally, a bad 
idea. By adding explicit temporal structuring (as is already done in 
HTML+Time and in scores of playlist formats), the syntax for selecting 
and integrating text captions would not have to be a special-purpose 
hack. An example (based on HTML+Time syntax available within IE for over 
10 years) is:
    <div timeContainer="par" controls ... >
      <video ...>
        <source .../>
        <source .../>
      </video>
      <switch systemCaptions="true">
        <textstream src="xxx" systemLanguage="nl" ... />
        <textstream src="yyy" systemLanguage="fr" ... />
        <textstream src="zzz" ... /> <!-- default -->
      </switch>
    </div>

There is nothing complex about these solutions -- they simply make 
temporal synchronization explicit. This allows easy extensibility for 
including image slideshows as alternatives to video, or for providing 
different choices for particular screen sizes or connection speeds. 
Naturally, this is not an accessibility-only issue, but history has 
shown that the community of users with special needs is best served 
when a consistent framework exists for managing multiple content 
alternatives.
I first wrote a position paper on this (with concrete suggestions) four 
years ago and submitted it to the HTML lists, but it never got on the 
HTML5 agenda. Since then, I've been told several times that there is no 
time to develop a comprehensive model for inter-object synchronization 
before HTML5 goes to last call. (I've been hearing this for about two 
years.) Yet, there is 
time to come up with non-extensible, non-scalable solutions.  There is 
even time to develop yet another timed text model. In this light, I 
think that it is indefensible to ignore structured time within HTML5.

But this is simply my opinion. I realize that it is especially 
appropriate within this group to note that there are none so blind as 
those who will not see (and none so deaf as those who will not hear). I 
will charitably assume that I am one who is blind and deaf, and blocking 
progress to boot. For this reason, my departure is as productive as it 
is timely.

I wish you all well in the process of wrapping up this important work.

Kind regards,
Dick Bulterman
Received on Monday, 3 May 2010 22:12:54 UTC
