[media] A technical reply to Dick's concerns (was Re: farewell)

Dear Dick, all,

In light of yesterday's developments in the HTML5 draft [1][2], the
proposals made in this group have all contributed to progress in some
way, but they are now superseded, so a new discussion needs to be had.

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#the-track-element
[2] http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#timed-tracks


However, I would hate to give the impression that Dick's technical
concerns are not being considered by this group, so I have decided to
formulate a reply.


On Tue, May 4, 2010 at 8:12 AM, Dick Bulterman <Dick.Bulterman@cwi.nl> wrote:
>
> 1. The name 'track' for identifying a text object within a video element is
> misleading. It may lead people to think that any arbitrary data type could
> be specified (such as an audio track, an animation track or even a secondary
> video track). Since this proposal is purportedly intended to allow the
> activation of external text tracks only, a more reasonable name would be
> 'textTrack' or 'textStream'.

This was discussed before, see e.g.
[3] http://lists.w3.org/Archives/Public/public-html-a11y/2010Mar/0163.html
[4] http://lists.w3.org/Archives/Public/public-html-a11y/2010Apr/0175.html

In particular, [4] states that the <track> element is explicitly
designed by the group (see the thread at [5]) to also cover externally
associated, dependent audio or video tracks (in particular an audio
description or a sign language track).

[5] http://lists.w3.org/Archives/Public/public-html-a11y/2010Feb/0226.html
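
To make the intent concrete, such markup might look roughly as
follows; the attribute names and kind values here are purely
illustrative and not agreed syntax:

  <video src="lecture.webm" controls>
    <track kind="captions" srclang="en" src="lecture.en.srt">
    <track kind="descriptions" srclang="en" src="lecture.audiodesc.ogg">
    <track kind="signlanguage" srclang="en" src="lecture.asl.webm">
  </video>

The last two tracks reference an external, dependent audio and video
resource respectively, which is exactly the use case discussed in [5].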


> 2. The name 'trackGroup' is equally misleading. In other languages, a
> 'group' element is used to aggregate child elements; here, it is used to
> select child elements. As with 'track', it also gives the impression that a
> choice can be made within a set of general tracks, which is not true. A
> name such as 'textSelect' or 'captionSelect' might be more useful. (The
> 'switch' would only be appropriate if all semantics of the SMIL switch were
> adapted.)

This was discussed before, e.g.
[6] http://lists.w3.org/Archives/Public/public-html-a11y/2010Apr/0085.html
[7] http://lists.w3.org/Archives/Public/public-html-a11y/2010Apr/0086.html
[8] http://lists.w3.org/Archives/Public/public-html-a11y/2010Mar/0133.html

In particular, [8] states that <trackgroup> was chosen because the
term is already in use in MPEG files for the identical purpose, so it
made sense to reuse it, since industry would already understand it.
It also discusses why the SMIL <switch> element does not meet all the
requirements for this element.
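
For concreteness, a <trackgroup> simply groups a set of mutually
exclusive alternatives, of which only one is ever active at a time.
Again, the attribute names here are illustrative only:

  <video src="movie.webm" controls>
    <trackgroup>
      <track srclang="en" src="captions.en.srt">
      <track srclang="fr" src="captions.fr.srt">
      <track srclang="nl" src="captions.nl.srt">
    </trackgroup>
  </video>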


> 3. The semantics defined by Silvia for managing selection based on lexical
> ordering is not clear to me. It seems that the children are first processed
> to resolve 'source' elements, then 'track' elements (and then trackGroups)?
> What happens when things appear out of order (such as having 'source'
> elements interspersed among track elements)?

This has indeed not been raised before, but it is actually not a
problem, since <source> and <track> do not interfere with each other.
The <source> elements are evaluated according to the current HTML5
media load algorithm [9], which ignores any <track> elements.
Similarly, our proposal is to evaluate the <track> elements in tree
order [10], which does not interfere with <source>. A <trackgroup>
element simply counts as another <track> element in this ordering,
since only one of the <track>s inside it would ever be active.

[9] http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#concept-media-load-algorithm
[10] http://lists.w3.org/Archives/Public/public-html-a11y/2010Mar/0133.html
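
So even a mixed ordering such as the following is unambiguous, because
resource selection only looks at the <source> children in tree order,
and track selection only looks at the <track> and <trackgroup>
children in tree order (attribute names again illustrative):

  <video controls>
    <source src="movie.webm" type="video/webm">
    <track srclang="en" src="captions.en.srt">
    <source src="movie.mp4" type="video/mp4">
    <trackgroup>
      <track srclang="fr" src="captions.fr.srt">
      <track srclang="nl" src="captions.nl.srt">
    </trackgroup>
  </video>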


> 4. The assumption that there are no synchronization conflicts between a
> video master and the text children strikes me as overly simplistic: it is
> not practical to simply buffer a full set of captions in all cases. Consider
> mobile phone use: if a given video had captions in French/Dutch/English,
> would all three sets be downloaded before the first video frame is
> displayed? What happens if someone turns on captions while the video is
> active: does the video pause while the captions are loaded? If the SRT files
> are large, significant data charges could be incurred, even if the video
> itself were not played.

Yes, these are good concerns to discuss. We could have had this
discussion here and probably still should. Ian also had this
discussion with me and several others when putting his requirements
document together [11].

The current state of thinking, as expressed in [11], is that captions
that are active (i.e. carry an @active attribute) will be loaded with
the video, and <video> will not reach the METADATA_LOADED state until
all of this data has been received. Also, if somebody turns on
captions during playback, the video will pause until the captions are
loaded.

I'm not sure I personally agree with the latter - I would prefer "best
effort" rather than pausing, but we should discuss this.

[11] http://wiki.whatwg.org/wiki/Timed_tracks
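
In markup terms, that thinking amounts to something like the following
sketch (using the @active attribute mentioned above; everything else
is illustrative):

  <video src="movie.webm" controls>
    <!-- loaded together with the video; metadata is not reported as
         loaded until this file has arrived -->
    <track srclang="en" src="captions.en.srt" active>
    <!-- only loaded once the user switches it on; under the current
         thinking, playback would pause while it loads -->
    <track srclang="fr" src="captions.fr.srt">
  </video>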


> I continue to be concerned that overloading text selection and temporal
> alignment within the <video>/<audio> elements is, architecturally, a bad
> idea. By adding explicit temporal structuring (as is already done in
> HTML+Time and in scores of playlist formats), the syntax for selecting and
> integrating text captions would not have to be a special-purpose hack. An
> example (based on HTML+Time syntax available within IE for over 10 years)
> is:
>   <div timeContainer="par" controls ... >
>     <video ...>
>       <source .../>
>       ...
>       <source .../>
>     </video>
>     <switch systemCaptions="true">
>       <textstream src="xxx" systemLanguage="nl" ... />
>       <textstream src="yyy" systemLanguage="fr" ... />
>       <textstream src="zzz" ... /> <!-- default -->
>     </switch>
>   </div>
>
> There is nothing complex about these solutions -- it simply means making
> temporal synchronization explicit. It allows easy extensibility for
> including image slideshows as alternatives to video, or for providing
> different choices in the case of particular screen sizes or connection
> speeds. Naturally, this is not an accessibility-only issue, but history has
> shown that the community of users with special needs is best served when a
> consistent framework exists for managing multiple content alternatives.

This changes the meaning of the <div> element in HTML and thus has
wide-ranging implications; it will not be possible to solve the
problem in this way.
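
For comparison, the same set of caption alternatives can be expressed
without a timing container by keeping everything inside the existing
<video> element (attribute names illustrative, following the proposal
discussed above):

  <video controls>
    <source ...>
    ...
    <source ...>
    <trackgroup>
      <track srclang="nl" src="xxx">
      <track srclang="fr" src="yyy">
      <track src="zzz"> <!-- default -->
    </trackgroup>
  </video>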


> I first wrote a position paper on this (with concrete suggestions) four
> years ago and submitted it to the HTML lists, but it never got on the HTML5
> agenda. Since then, I've been told several times that there is no time to
> come up with an appropriate solution for developing a comprehensive model
> for inter-object synchronization before HTML5 goes to last call. (I've been
> hearing this for about 2 years.) Yet, there is time to come up with
> non-extensible, non-scalable solutions. There is even time to develop yet
> another timed text model. In this light, I think that it is indefensible to
> ignore structured time within HTML5.

Can you prove that what we are pursuing is non-extensible and
non-scalable? I have not seen a single use case that would be
inhibited by the current approach, and I would be curious to see and
address one.

On the contrary, I believe that inter-object synchronization, i.e. the
creation of multimedia experiences that include multiple timelines,
images, and user interaction as SMIL does, sits at a higher level than
what we are currently concerned with. We are focused solely on making
the existing <audio> and <video> elements accessible. Once this is
solved, it is entirely possible to introduce a new element that allows
the composition of <audio>, <video>, <img> and other elements into a
multimedia experience of SMIL dimensions. It is not clear to me
whether Canvas might already solve this need, or whether a SMIL-type
element is indeed necessary. This is the larger picture that I keep
referring to and that would be very interesting to analyse with you.
But I cannot see how what we are currently pursuing would interfere
with or prohibit a solution to this larger picture. If you have an
example, please do contribute it.


Best Regards,
Silvia.
