Re: [media] A technical reply to Dick's concerns (was Re: farewell)

On Wed, 05 May 2010 06:44:19 +0800, Silvia Pfeiffer  
<silviapfeiffer1@gmail.com> wrote:

> Dear Dick, all,
>
> In light of yesterday's developments with the HTML5 draft [1][2], all
> the proposals that were made in this group have contributed to
> progress in some way, but are now superseded, so a new discussion is
> needed.
>
> [1]  
> http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#the-track-element
> [2]  
> http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#timed-tracks
>
>
> However, I would hate to give the impression that Dick's technical
> concerns are not being considered by this group, so I've decided to
> formulate a reply.
>
>
> On Tue, May 4, 2010 at 8:12 AM, Dick Bulterman <Dick.Bulterman@cwi.nl>  
> wrote:
>>
>> 1. The name 'track' for identifying a text object within a video
>> element is misleading. It may lead people to think that any arbitrary
>> data type could be specified (such as an audio track, an animation
>> track or even a secondary video track). Since this proposal is
>> purportedly intended to allow the activation of external text tracks
>> only, a more reasonable name would be 'textTrack' or 'textStream'.
>
> This was discussed before, see e.g.
> [3]  
> http://lists.w3.org/Archives/Public/public-html-a11y/2010Mar/0163.html
> [4]  
> http://lists.w3.org/Archives/Public/public-html-a11y/2010Apr/0175.html
>
> In particular in [4] it is stated that the <track> element is
> explicitly designed by the group (see the thread at [5]) to allow
> using it for externally associated, dependent audio or video tracks
> (in particular with an audio description or a sign language track).
>
> [5]  
> http://lists.w3.org/Archives/Public/public-html-a11y/2010Feb/0226.html
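To make the point in [5] concrete: under the proposal, a <track> element could reference a dependent audio or video resource in the same way it references a text resource. A rough sketch follows — the file names and the non-text @kind values are purely illustrative, since the exact attribute vocabulary was still under discussion at the time:

```html
<video controls>
  <source src="lecture.ogv" type="video/ogg">
  <!-- a text track -->
  <track kind="captions" src="lecture.en.srt" srclang="en" label="English">
  <!-- hypothetical non-text alternatives, as discussed in [5] -->
  <track kind="descriptions" src="lecture.audesc.ogg" srclang="en">
  <track kind="sign" src="lecture.asl.ogv" srclang="sgn-US">
</video>
```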
>
>
>> 2. The name 'trackGroup' is equally misleading. In other languages, a
>> 'group' element is used to aggregate child elements; here, it is used
>> to select child elements. As with 'track', it also gives the
>> impression that a choice can be made within a set of general tracks,
>> which is not true. A name such as 'textSelect' or 'captionSelect'
>> might be more useful. (A 'switch' would only be appropriate if all
>> semantics of the SMIL switch were adapted.)
>
> This was discussed before, e.g.
> [6]  
> http://lists.w3.org/Archives/Public/public-html-a11y/2010Apr/0085.html
> [7]  
> http://lists.w3.org/Archives/Public/public-html-a11y/2010Apr/0086.html
> [8]  
> http://lists.w3.org/Archives/Public/public-html-a11y/2010Mar/0133.html
>
> In particular, in [8] it is stated that <trackgroup> was chosen
> because it is already in use in MPEG files for the same purpose, and
> it thus made sense to reuse a term the industry would already
> understand. It is also discussed there that the SMIL <switch> element
> does not meet all the requirements for this element.
>
>
>> 3. The semantics defined by Silvia for managing selection based on
>> lexical ordering are not clear to me. It seems that the children are
>> first processed to resolve 'source' elements, then 'track' elements
>> (and then trackGroups)? What happens when things appear out of order
>> (such as having 'source' elements interspersed among track elements)?
>
> This has indeed not been raised before, but it is actually not a
> problem, since <source> and <track> do not interfere with each other.
> The <source> elements are evaluated according to the current
> specification of HTML5 [9] ignoring any <track> elements. Similarly,
> our proposal is to evaluate the <track> elements also in tree order
> [10], which would not interfere with <source>. A <trackgroup> element
> simply functions like another <track> element, since only one of the
> <track>s inside it would ever be active.
>
> [9]  
> http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#concept-media-load-algorithm
> [10]  
> http://lists.w3.org/Archives/Public/public-html-a11y/2010Mar/0133.html
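A sketch may make the tree-order point clearer: resource selection walks the children considering only <source>, while track evaluation considers only <track>/<trackgroup>, so interleaved markup like the following remains well-defined. File names and attributes are illustrative only:

```html
<video controls>
  <source src="clip.ogv" type="video/ogg">
  <!-- skipped by the resource-selection algorithm -->
  <track src="captions.nl.srt" srclang="nl">
  <!-- still reached as the next <source> candidate -->
  <source src="clip.mp4" type="video/mp4">
  <trackgroup>
    <!-- at most one of these is ever active, so the group behaves
         like a single <track> for evaluation purposes -->
    <track src="captions.fr.srt" srclang="fr">
    <track src="captions.en.srt" srclang="en">
  </trackgroup>
</video>
```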
>
>
>> 4. The assumption that there are no synchronization conflicts between
>> a video master and the text children strikes me as overly simplistic:
>> it is not practical to simply buffer a full set of captions in all
>> cases. Consider mobile phone use: if a given video had captions in
>> French/Dutch/English, would all three sets be downloaded before the
>> first video frame is displayed? What happens if someone turns on
>> captions while the video is active: does the video pause while the
>> captions are loaded? If the SRT files are large, significant data
>> charges could be incurred, even if the video itself were not played.
>
> Yes, these are good concerns to discuss. We could have had this
> discussion here and probably still should. Ian had this discussion
> also with me and several others when putting his requirements document
> together [11].
>
> Currently, the state of thinking as expressed in [11] is that captions
> that are active (i.e. have an @active attribute) will be loaded with
> the video, and <video> will not go into the METADATA_LOADED state
> unless all of this data has been received. Also, if somebody turns on
> captions during playback, the video will pause until the captions are
> loaded.
>
> I'm not sure I personally agree with the latter - I would prefer "best
> effort" rather than pausing, but we should discuss this.
>
> [11] http://wiki.whatwg.org/wiki/Timed_tracks
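To illustrate the "best effort" alternative Silvia mentions: a script can fetch the caption file asynchronously and start rendering cues as soon as they arrive, without ever pausing playback. The sketch below uses today's TextTrack API purely for illustration — no such API existed when this thread was written — and the SRT handling is deliberately minimal:

```javascript
// Hypothetical "best effort" caption loading: the video keeps playing
// while the SRT file downloads; cues appear once parsing finishes.
function srtTimeToSeconds(t) {
  // Convert an SRT timestamp like "00:01:02,500" to seconds (62.5).
  const [h, m, rest] = t.split(':');
  const [s, ms] = rest.split(',');
  return (+h) * 3600 + (+m) * 60 + (+s) + (+ms) / 1000;
}

async function enableCaptions(video, url) {
  const track = video.addTextTrack('captions', 'English', 'en');
  track.mode = 'showing';
  const srt = await (await fetch(url)).text(); // playback is never paused
  for (const block of srt.trim().split(/\r?\n\s*\r?\n/)) {
    const lines = block.split(/\r?\n/);
    const [start, end] = lines[1].split(' --> ');
    track.addCue(new VTTCue(srtTimeToSeconds(start),
                            srtTimeToSeconds(end),
                            lines.slice(2).join('\n')));
  }
}
```

Nothing in the sketch requires pausing; the open policy question is only whether the UA should guarantee that cues from the current position onward are shown.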
>
>
>> I continue to be concerned that overloading text selection and
>> temporal alignment within the <video>/<audio> elements is,
>> architecturally, a bad idea. By adding explicit temporal structuring
>> (as is already done in HTML+Time and in scores of playlist formats),
>> the syntax for selecting and integrating text captions would not have
>> to be a special-purpose hack. An example (based on HTML+Time syntax
>> available within IE for over 10 years) is:
>>   <div timeContainer="par" controls ... >
>>     <video ...>
>>       <source .../>
>>       ...
>>       <source .../>
>>     </video>
>>     <switch systemCaptions="true">
>>       <textstream src="xxx" systemLanguage="nl" ... />
>>       <textstream src="yyy" systemLanguage="fr" ... />
>>       <textstream src="zzz" ... /> <!-- default -->
>>     </switch>
>>   </div>
>>
>> There is nothing complex about these solutions -- it simply means
>> making temporal synchronization explicit. It allows easy
>> extensibility for including image slideshows as alternatives to
>> video, or for providing different choices in the case of particular
>> screen sizes or connection speeds. Naturally, this is not an
>> accessibility-only issue, but history has shown that the community of
>> users with special needs is best served when a consistent framework
>> exists for managing multiple content alternatives.
>
> This changes the meaning of the <div> element in HTML and thus has
> wide-ranging implications. It will not be possible to solve it in this
> way.
>
>
>> I first wrote a position paper on this (with concrete suggestions)
>> four years ago and submitted it to the HTML lists, but it never got
>> on the HTML5 agenda. Since then, I've been told several times that
>> there is no time to come up with an appropriate solution for
>> developing a comprehensive model for inter-object synchronization
>> before HTML5 goes to last call. (I've been hearing this for about 2
>> years.) Yet, there is time to come up with non-extensible,
>> non-scalable solutions. There is even time to develop yet another
>> timed text model. In this light, I think that it is indefensible to
>> ignore structured time within HTML5.
>
> Can you prove that what we are pursuing is non-extensible and
> non-scalable? I have not seen a single use case that would be
> inhibited by the current approach and would be curious to see and
> address it.
>
> On the contrary, I believe that inter-object synchronization, i.e. the
> creation of multimedia experiences that include multiple timelines,
> images, and user interaction as SMIL does, is at a higher level than
> what we are currently concerned with. We are focused solely on making
> the existing <audio> and <video> elements accessible. Once this is
> solved, it is entirely possible to introduce a new element that allows
> the composition of <audio>, <video>, <img> and other elements into a
> multimedia experience of SMIL dimensions. It is not clear to me
> whether Canvas might already solve this need, or whether indeed a
> SMIL-type element is necessary. This is the larger picture that I keep
> referring to and that would be very interesting to analyse with you.
> But I cannot see how what we are currently pursuing would interfere
> with or prohibit a solution to this larger picture. If you have an
> example, please do contribute.
>
>
> Best Regards,
> Silvia.
>

For the record, I agree with Silvia's assessment of the above issues. I  
have taken part in the mailing list discussions and haven't seen anything  
but dispassionate technical replies to Dick's concerns, from Silvia or  
anyone else. I hope Dick might reconsider and continue contributing use  
cases, suggestions and criticism on the existing and emerging specs.

-- 
Philip Jägenstedt
Core Developer
Opera Software

Received on Wednesday, 5 May 2010 04:02:36 UTC