- From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
- Date: Wed, 5 May 2010 08:44:19 +1000
- To: Dick Bulterman <Dick.Bulterman@cwi.nl>, HTML Accessibility Task Force <public-html-a11y@w3.org>
Dear Dick, all,

In light of yesterday's developments with the HTML5 draft [1][2], all the proposals that were made in this group have contributed to progress but are now superseded, so a new discussion needs to be had.

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#the-track-element
[2] http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#timed-tracks

However, I would hate to give the impression that Dick's technical concerns are not being considered by this group, so I've decided to formulate a reply.

On Tue, May 4, 2010 at 8:12 AM, Dick Bulterman <Dick.Bulterman@cwi.nl> wrote:
>
> 1. The name 'track' for identifying a text object within a video element is
> misleading. It may lead people to think that any arbitrary data type could
> be specified (such as an audio track, an animation track or even a secondary
> video track). Since this proposal is purportedly intended to allow the
> activation of external text tracks only, a more reasonable name would be
> 'textTrack' or 'textStream'.

This was discussed before, see e.g.

[3] http://lists.w3.org/Archives/Public/public-html-a11y/2010Mar/0163.html
[4] http://lists.w3.org/Archives/Public/public-html-a11y/2010Apr/0175.html

In particular, [4] states that the <track> element is explicitly designed by the group (see the thread at [5]) to allow its use for externally associated, dependent audio or video tracks (in particular an audio description or a sign language track).

[5] http://lists.w3.org/Archives/Public/public-html-a11y/2010Feb/0226.html

> 2. The name 'trackGroup' is equally misleading. In other languages, a
> 'group' element is used to aggregate child elements; here, it is used to
> select child elements. As with 'track', it also gives the impression that a
> choice can be made within a set of general tracks, which is not true. A
> name such as 'textSelect' or 'captionSelect' might be more useful.
> (The 'switch' would only be appropriate if all semantics of the SMIL switch
> were adapted.)

This was discussed before, e.g.

[6] http://lists.w3.org/Archives/Public/public-html-a11y/2010Apr/0085.html
[7] http://lists.w3.org/Archives/Public/public-html-a11y/2010Apr/0086.html
[8] http://lists.w3.org/Archives/Public/public-html-a11y/2010Mar/0133.html

In particular, [8] states that <trackgroup> was chosen because it is already used in MPEG files for the identical purpose, so it made sense to reuse a term that industry would already understand. It also discusses why the SMIL <switch> element does not meet all the requirements for this element.

> 3. The semantics defined by Silvia for managing selection based on lexical
> ordering is not clear to me. It seems that the children are first processed
> to resolve 'source' elements, then 'track' elements (and then trackGroups)?
> What happens when things appear out of order (such as having 'source'
> elements interspersed among track elements)?

This has indeed not been raised before, but it is actually not a problem, since <source> and <track> do not interfere with each other. The <source> elements are evaluated according to the current specification of HTML5 [9], ignoring any <track> elements. Similarly, our proposal is to evaluate the <track> elements in tree order as well [10], which would not interfere with <source>. A <trackgroup> element simply functions like another <track> element, since only one of the <track>s inside it would ever be active.

[9] http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#concept-media-load-algorithm
[10] http://lists.w3.org/Archives/Public/public-html-a11y/2010Mar/0133.html

> 4. The assumption that there are no synchronization conflicts between a
> video master and the text children strikes me as overly simplistic: it is
> not practical to simply buffer a full set of captions in all cases.
> Consider mobile phone use: if a given video had captions in
> French/Dutch/English, would all three sets be downloaded before the first
> video frame is displayed? What happens if someone turns on captions while
> the video is active: does the video pause while the captions are loaded? If
> the SRT files are large, significant data charges could be incurred, even
> if the video itself were not played.

Yes, these are good concerns to discuss. We could have had this discussion here and probably still should. Ian also had this discussion with me and several others when putting his requirements document together [11]. The current state of thinking as expressed in [11] is that captions that are active (i.e. have an @active attribute) will be loaded with the video, and <video> will not go into the METADATA_LOADED state unless all of this data has been received. Also, if somebody turns on captions during playback, the video will pause until the captions are loaded. I'm not sure I personally agree with the latter - I would prefer "best effort" rather than pausing, but we should discuss this.

[11] http://wiki.whatwg.org/wiki/Timed_tracks

> I continue to be concerned that overloading text selection and temporal
> alignment within the <video>/<audio> elements is, architecturally, a bad
> idea. By adding explicit temporal structuring (as is already done in
> HTML+Time and in scores of playlist formats), the syntax for selecting and
> integrating text captions would not have to be a special-purpose hack. An
> example (based on HTML+Time syntax available within IE for over 10 years)
> is:
>
>   <div timeContainer="par" controls ... >
>     <video ...>
>       <source .../>
>       ...
>       <source .../>
>     </video>
>     <switch systemCaptions="true">
>       <textstream src="xxx" systemLanguage="nl" ... />
>       <textstream src="yyy" systemLanguage="fr" ... />
>       <textstream src="zzz" ... /> <!-- default -->
>     </switch>
>   </div>
>
> There is nothing complex about these solutions -- it simply means making
> temporal synchronization explicit. It allows easy extensibility for
> including image slideshows as alternatives to video, or for providing
> different choices in the case of particular screen sizes or connection
> speeds. Naturally, this is not an accessibility-only issue, but history has
> shown that the community of users with special needs is best served when a
> consistent framework exists for managing multiple content alternatives.

This changes the meaning of the <div> element in HTML and thus has wide-ranging implications. It will not be possible to solve it in this way.

> I first wrote a position paper on this (with concrete suggestions) four
> years ago and submitted it to the HTML lists, but it never got on the HTML5
> agenda. Since then, I've been told several times that there is no time to
> come up with an appropriate solution for developing a comprehensive model
> for inter-object synchronization before HTML5 goes to last call. (I've been
> hearing this for about 2 years.) Yet, there is time to come up with
> non-extensible, non-scalable solutions. There is even time to develop yet
> another timed text model. In this light, I think that it is indefensible to
> ignore structured time within HTML5.

Can you prove that what we are pursuing is non-extensible and non-scalable? I have not seen a single use case that would be inhibited by the current approach, and I would be curious to see and address one. On the contrary, I believe that inter-object synchronization, i.e. the creation of multimedia experiences that include multiple timelines, images, and user interaction as SMIL does, is at a higher level than what we are currently concerned with. We are focused solely on making the existing <audio> and <video> elements accessible.
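To make that concrete, here is a sketch of the kind of markup this group has been discussing (the attribute names @role, @srclang and @active are illustrative only and may not match the wording of the proposal threads or the new draft):

```html
<video controls>
  <source src="video.webm" type="video/webm">
  <source src="video.mp4" type="video/mp4">
  <!-- a single dependent track, e.g. an audio description -->
  <track src="audiodesc.ogg" role="auddesc">
  <!-- only one <track> inside a <trackgroup> is ever active -->
  <trackgroup role="caption">
    <track src="captions-en.srt" srclang="en" active>
    <track src="captions-fr.srt" srclang="fr">
    <track src="captions-nl.srt" srclang="nl">
  </trackgroup>
</video>
```

Note that <source> selection proceeds exactly as in the current HTML5 resource selection algorithm, no matter where the <track> elements appear in the tree.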
Once this is solved, it is entirely possible to introduce a new element that allows the composition of <audio>, <video>, <img> and other elements into a multimedia experience of SMIL dimensions. It is not clear to me whether Canvas might already solve this need, or whether a SMIL-type element is indeed necessary. This is the larger picture that I keep referring to and that would be very interesting to analyse with you. But I cannot see that what we are currently pursuing would interfere with or prohibit a solution to this larger picture. If you have an example, please do contribute.

Best Regards,
Silvia.
Received on Tuesday, 4 May 2010 22:45:12 UTC