- From: Philip Jägenstedt <philipj@opera.com>
- Date: Sat, 13 Mar 2010 12:20:14 +0800
- To: "Dick Bulterman" <Dick.Bulterman@cwi.nl>, "Silvia Pfeiffer" <silviapfeiffer1@gmail.com>
- Cc: "Michael(tm) Smith" <mike@w3.org>, "HTML Accessibility Task Force" <public-html-a11y@w3.org>
On Sat, 13 Mar 2010 07:25:59 +0800, Silvia Pfeiffer
<silviapfeiffer1@gmail.com> wrote:

> On Sat, Mar 13, 2010 at 12:32 AM, Dick Bulterman <Dick.Bulterman@cwi.nl>
> wrote:
>> Hi Silvia,
>>
>> You wrote:
>>>
>>> If we really wanted to have a possibility of compositing multimedia
>>> presentations from multiple resources in a flexible manner, we should
>>> not be using <audio> or <video> for it, but rather introduce SMIL -
>>> just like we are not compositing image resources in the <img> element,
>>> but in other elements, such as <canvas> or through JavaScript.
>>
>> I think you are under-estimating what you are trying to achieve (at
>> least as far as I can follow the discussions so far).
>>
>> If 'all' you were trying to do was to turn on one of the two or three
>> pre-recorded, pre-packaged, pre-composed optional tracks within an Ogg
>> or mp4 container, then I'd say: fine, this is a local media issue that
>> can be handled within the context of a single <video> or <audio>
>> element. But this is NOT what you are doing: you are referencing
>> external text files (srt, smilText, DFXP, whatever) -- these need to
>> be composed temporally and spatially with the content in the video or
>> audio object. Once you head down this path, you are no longer looking
>> at local manipulations WITHIN a media object, but activating objects
>> from multiple sources.
>
> There is no difference whether these tracks come from within a file or
> from an external one, except that with external tracks we have all the
> data available at once, while with internal tracks we get it in chunks
> together with the media data. In either case you have a pipeline for
> decoding the video, a pipeline for decoding the audio and a "pipeline"
> for decoding the text, and your media system synchronises them. The
> need for temporal and spatial composition is no different when the
> text comes from within the file than when it comes from outside it. In
> fact, the external case is easier to deal with, because its data is
> fully available from the start, so temporal synchronisation, seeking
> etc. are easier to solve. Applications such as VLC, mplayer and many
> others that are able to synchronise external captions with media
> resources have shown this for years. They do not need a sophisticated
> framework for activating objects from multiple sources.
>
>> This is why I really believe that you need to look at a more scalable
>> solution to this problem -- not because I want to impose SMIL on you,
>> but because you are imposing temporal and spatial problems on
>> yourself in the context of composing separate media objects.
>
> Believe me, I know what I am talking about - I have done this many
> times before. Going down your line of thought will introduce a
> complexity for the media elements that they are not built for and not
> intended for. An <img> element is not built for spatial combination of
> multiple images and text objects either, but it does have a solution
> for accessibility. This is our focus here, too.
>
> Your line of thought really has to be pursued by introducing a
> different element, and it may indeed be an idea to consider something
> like a mediacanvas for composing separate media objects - the current
> canvas is not built to deal with the temporal synchronisation issues
> that such a solution requires, and neither are the current media
> elements.
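To make that point concrete: with the media element as the single
master clock, displaying an external text track is roughly the sketch
below. The cue data and element ids are made up for illustration; a
real implementation would parse them from an srt or DFXP file.

<video id="v" src="video.ogv" controls></video>
<div id="captions"></div>
<script>
  // Invented cue list standing in for a parsed external caption file.
  var cues = [
    { start: 0, end: 4, text: "Hello world" },
    { start: 5, end: 9, text: "A second caption" }
  ];
  var video = document.getElementById("v");
  var captions = document.getElementById("captions");
  // The media element is the master timeline; the text track simply
  // follows currentTime. No second clock is needed.
  video.addEventListener("timeupdate", function () {
    var now = video.currentTime;
    var text = "";
    for (var i = 0; i < cues.length; i++) {
      if (now >= cues[i].start && now < cues[i].end) text = cues[i].text;
    }
    captions.textContent = text;
  }, false);
</script>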
>> As an aside: One of the benefits of this approach is that it means
>> that you get broader selectivity for free on other objects -- this
>> increases accessibility options at no extra cost. (Note that whether
>> you view the controlling syntax in declarative or scripted terms is
>> an orthogonal concern.)
>
> Now, I think you are under-estimating the complexity. Marking this up
> is the easy bit. But implementing it is the hard bit.
>
> There is indeed a massive "extra cost" for introducing <seq>, <par>,
> <excl> etc. You need to introduce a whole media and network
> synchronisation sub-system that tracks, for example, what time we are
> up to in every single resource, whether they are all buffered
> correctly, whether any network connection is stalling and whether
> enough data is available from each; it has to determine which
> resources to activate based on several dimensions of conditions, make
> sure the overlapping is done correctly, and decide what to do with
> gaps. As it stands, the number of events on a single media resource is
> already enormous (see
> http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#mediaevents).
> With multiple dynamically composed resources and no main resource this
> will explode exponentially. It is a complexity that comes at a high
> price and is overkill for the current media elements.
>
> In contrast, the current media elements have one master, and it
> provides the temporal timeline. Everything else synchronises to it. It
> provides the main audio or video and there are no gaps. Activation is
> simple and needs to be calculated only once at the beginning of
> playback, changing only when the user interacts. It's massively less
> complex.
>
> Honestly, keeping the composition of several media elements separate
> from dealing with what is basically a single media resource, as we do
> right now, is the right way to go. It follows the divide and conquer
> principle: once this is solved and solid, it will be easier to develop
> an approach to solving your requirements.

Thanks Silvia for explaining this better than I could have. Here is my
view on the two issues mentioned:

1. Temporal composition

As Silvia says, there is a main timeline to which everything is synced;
no extra syntax is needed to express this. Without any examples or use
cases I can only guess that there might be two other things of interest
(in theory):

1.1. Scaling or offsetting the <track> relative to the original
resource, e.g. to fix broken caption timing or to include a
sign-language video track for a small part of a long video. I don't
suggest including this now, but if we need it later the solution is
simple:

<track offset="40" rate="1.5">

1.2. Concatenating several media resources back-to-back with gapless
playback. This would be a big implementation burden. If implementations
aren't gapless, then it will be no better than doing it via script,
like this:

<video id="video1" src="video1.ogv"
       onended="document.getElementById('video2').play()"></video>
<video id="video2" src="video2.ogv"></video>

In the very unlikely case that gapless playback becomes a must-have
feature and implementors are actually willing to support it, we can
just let <audio> and <video> sync to another timeline. The mediacanvas
Silvia suggests can be the HTML document itself:

<video id="video1" src="video1.ogv"></video>
<video id="video2" src="video2.ogv" timeline="video1" offset="end:0"></video>

I just made this up. We could copy whatever terminology SMIL uses if it
is better. The point is, and I quote Silvia:

> Marking this up is the easy bit. But implementing it is the hard bit.
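As a throwaway illustration of just how easy the markup side is: the
invented timeline attribute above could be faked today with a few lines
of script. It is not gapless, of course - that is exactly the hard bit.

<script>
  // Rough shim for the invented timeline/offset markup above. Only the
  // offset="end:0" case is handled: when the master ends, the follower
  // plays. No attempt is made at gapless playback - that is the part
  // that would need real implementation support.
  var followers = document.querySelectorAll("video[timeline]");
  for (var i = 0; i < followers.length; i++) {
    (function (follower) {
      var master = document.getElementById(follower.getAttribute("timeline"));
      if (!master) return;
      master.addEventListener("ended", function () {
        follower.play();
      }, false);
    })(followers[i]);
  }
</script>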
2. Spatial composition

CSS, always!
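For example, rendering a caption area over the video needs nothing
beyond ordinary CSS positioning. The class names and styling here are
just one way to do it:

<style>
  /* Plain CSS positioning: the caption area overlays the video. */
  .player { position: relative; width: 320px; }
  .player .captions {
    position: absolute;
    bottom: 0;
    width: 100%;
    text-align: center;
    color: white;
    background: rgba(0, 0, 0, 0.5);
  }
</style>
<div class="player">
  <video src="video.ogv" width="320" controls></video>
  <div class="captions">Caption text goes here</div>
</div>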
-- 
Philip Jägenstedt
Core Developer
Opera Software

Received on Saturday, 13 March 2010 04:21:00 UTC