Re: timing model of the media resource in HTML5

On Wed, 25 Nov 2009 03:52:26 +0100, Silvia Pfeiffer  
<silviapfeiffer1@gmail.com> wrote:

> Thought I should share the first feedback that I got:
>
> A first comment I got from actual HTML5 video element implementers at
> two browser vendors: browsers aren't designed to work with and
> synchronise multiple separate audio and video files as one resource.
>
> For example, dealing with network latency, buffering, the server going
> offline, seeking, not to forget cross-domain security issues,
> specifying behaviour when resources aren't available or aren't the
> same duration, events for things like stalling on individual files,
> etc. - these are all too hard to solve with current browser technology.
>
> So, it seems to me we have to restrict ourselves, for this version of
> HTML, to multi-track audio-visual resources that are provided in a
> single file, since otherwise we may create a specification that nobody
> will implement, and nobody wins with that.
>
>
> A second comment was that the source elements are currently regarded
> as mutually exclusive, so the approach of regarding them as tracks
> that potentially complement each other won't work. I might need to
> make a new specification where the manifest is inside the <source>
> elements.
>
>
> Cheers,
> Silvia.
>
> On Wed, Nov 25, 2009 at 12:31 PM, Silvia Pfeiffer
> <silviapfeiffer1@gmail.com> wrote:
>> Hi all,
>>
>> Just to follow up on this already very intensive reading material, I
>> have now done a more concrete post that takes the ideas from the first
>> post and applies them to the video element.
>>
>> See  
>> http://blog.gingertech.net/2009/11/25/manifests-exposing-structure-of-a-composite-media-resource/
>> .
>>
>> It is a long post, so you will need to have some patience, but I would
>> appreciate feedback.
>>
>> The idea this time is to extend the usefulness of the existing
>> "source" elements rather than introducing new elements such as "itext"
>> to provide us with the functionality of multi-track media resources
>> (even if they are virtual media resources in the sense defined in
>> the first blog post). An example would be:
>>
>>  <video>
>>    <source src='video.ogv' type='video/ogg' media='desktop' lang='en'
>>                     role='media' >
>>    <source src='video.ogv?track=auddesc[en]' type='audio/ogg' lang='en'
>>                     role='auddesc' >
>>    <source src='audiodesc_de.oga' type='audio/ogg' lang='de'
>>                     role='auddesc' >
>>    <source src='video.mp4?track=caption[en]' type='application/ttaf+xml'
>>                     lang='en' role='caption' >
>>    <source src='video.ogv?track=caption[de]'
>>                     type='text/srt; charset="ISO-8859-1"'
>>                     lang='de' role='caption' >
>>    <source src='caption_ja.ttaf' type='application/ttaf+xml' lang='ja'
>>                     role='caption' >
>>    <source src='signvid_ase.ogv' type='video/ogg; codecs="theora"'
>>                     media='desktop' lang='ase' role='sign' >
>>    <source src='signvid_gsg.ogv' type='video/ogg; codecs="theora"'
>>                     media='desktop' lang='gsg' role='sign' >
>>    <source src='signvid_sfs.ogv' type='video/ogg; codecs="theora"'
>>                     media='desktop' lang='sfs' role='sign' >
>>  </video>
>>
>> which is a composite virtual media resource with two audio description
>> tracks, three caption tracks and three sign language video tracks.
>>
>> The new post raises some issues with that approach, and I am looking
>> for feedback on how to potentially solve them.
>>
>> Best Regards,
>> Silvia.
>>
>>
>> On Mon, Nov 23, 2009 at 1:02 PM, Silvia Pfeiffer
>> <silviapfeiffer1@gmail.com> wrote:
>>> Hi all,
>>>
>>> I'd like to start discussions about accessibility in media elements
>>> for HTML5 by going all the way back and answering the fundamental
>>> question that Dick Bulterman posed at the recent (well, not so recent
>>> any more) Video Accessibility workshop. He stated that HTML5 hasn't
>>> got a timing model for the media elements, and that a discussion
>>> about the timing model is needed.
>>>
>>> To start off this discussion, I have written a blog post that explains
>>> where I think things are at. It has turned out to be a rather long
>>> blog post, so I'd rather not copy and paste it into the discussion
>>> here. You can read it at
>>> http://blog.gingertech.net/2009/11/23/model-of-a-time-linear-media-resource/
>>> .
>>>
>>> If you disagree/agree/want to discuss any of the things I stated
>>> there, please copy the relevant paragraph and quote it into this
>>> thread, so we can all know what we are discussing. (I guess Google
>>> Wave would come in handy here...)
>>>
>>> As a three sentence summary:
>>> Basically, I believe that the 90% use case for the Web is that of a
>>> time-linear media resource. Any other, more complex needs that
>>> require multiple timelines can be realised using JavaScript and the
>>> APIs to audio and video that we still need to define, which will
>>> expose companion tracks to the Web page and therefore to JavaScript. I
>>> don't believe there will be many use cases that such a combination
>>> cannot satisfy, but if there are, one can always use the "object" tag
>>> with external plugins to render an Adobe Flash, Silverlight or SMIL
>>> experience.
>>>
>>> BTW: talking about SMIL - I would be very curious to find out if
>>> somebody has tried implementing SMIL in HTML5 and JavaScript yet. I
>>> think much of what a SMIL file defines should now be presentable
>>> in a Web browser using existing HTML5 and JavaScript
>>> constructs. It would be an interesting exercise and I'd be curious to
>>> hear if somebody has tried and where they found limitations.
>>>
>>> Best Regards,
>>> Silvia.
>>>
>>
>

I agree that syncing separate video and audio files is a big challenge.  
I'd prefer leaving this kind of complexity either to scripting or to an  
external manifest like SMIL. Below I focus on the HTML-specific parts:

Captions/subtitles... The main problem with reusing <source> is that it  
doesn't work with the resource selection algorithm.[1] However, that  
algorithm only considers direct children of the media element, so adding a  
wrapping element would solve this problem and allow us to spec different  
rules for selecting timed-text sources. Example:

<video>
   <source src="video.ogg" type="video/ogg">
   <source src="video.mp4" type="video/mp4">
   <overlay>
     <source src="en.srt" lang="en-US">
     <source src="hans.srt" lang="zh-CN">
   </overlay>
</video>

We could possibly allow <overlay src="english.srt"></overlay> as a  
shorthand when there is only one caption file, just like the  
<video src=""></video> shorthand.

I'm suggesting <overlay> instead of e.g. <itext> because I have some  
special behavior in mind: when no (usable) source is found in <overlay>,  
the content of the element should be displayed overlaid on top of the  
video, as if it were inside a CSS box of the same size as the video. This  
gives authors a simple way to make overlay content such as custom  
controls and complex "subtitles" like animated karaoke work the same in  
both normal rendering and fullscreen mode. (I don't know what kind of CSS  
spec magic would be needed to allow such rendering, but I don't believe  
overlaying the content is very difficult implementation-wise.)
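
To sketch that fallback behavior (the markup inside <overlay> here is  
purely illustrative, not settled syntax):

<video src="video.ogg">
   <overlay>
     <source src="en.srt" lang="en-US">
     <!-- displayed on top of the video when no usable source is found,
          e.g. custom controls or animated karaoke -->
     <div class="karaoke">Lyrics go here...</div>
   </overlay>
</video>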

Naturally, CSS is used to style the captions:

<video src="video.ogg">
   <overlay src="en.srt"  
style="font-size:2em;padding:1em;text-align:center"></overlay>
</video>

If there is a use case, displaying several captions/subtitles at once  
could be allowed, like so:

<video src="video.ogg">
   <overlay src="en.srt" class="centerTop"></overlay>
   <overlay src="hans.srt" class="centerBottom"></overlay>
</video>

centerTop/centerBottom are appropriately defined in CSS.
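For instance (a sketch only, assuming <overlay> generates an absolutely  
positioned box within the video's box):

overlay { position: absolute; left: 0; width: 100%; text-align: center; }
overlay.centerTop { top: 0; }
overlay.centerBottom { bottom: 0; }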

For what it's worth, it's easy to get this behavior (sans fullscreen)  
using scripting today, simply by cloning/moving the overlay elements  
outside of <video> and positioning them on top using CSS. Even SRT  
retrieval (XHR), decoding (RegExp) and syncing (timeupdate event) are  
easy enough to do, as sketched below.
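
A minimal sketch of that scripted approach (the file name en.srt and the  
#overlay element are placeholders of mine, and the SRT parsing is  
deliberately naive):

// Assumed markup: <video src="video.ogg"></video> plus an absolutely
// positioned <div id="overlay"></div> on top of it.
var video = document.getElementsByTagName('video')[0];
var overlay = document.getElementById('overlay');
var cues = [];

// SRT retrieval with XHR.
var xhr = new XMLHttpRequest();
xhr.open('GET', 'en.srt', true);
xhr.onreadystatechange = function() {
  if (xhr.readyState == 4 && xhr.status == 200)
    cues = parseSRT(xhr.responseText);
};
xhr.send(null);

// SRT decoding with a RegExp: index, "hh:mm:ss,mmm --> hh:mm:ss,mmm",
// then the cue text up to a blank line.
function parseSRT(data) {
  var re = /\d+\s+(\d+):(\d+):(\d+)[,.](\d+)\s+-->\s+(\d+):(\d+):(\d+)[,.](\d+)\s+([\s\S]*?)(?:\r?\n\r?\n|$)/g;
  var result = [], m;
  while ((m = re.exec(data))) {
    result.push({
      start: m[1]*3600 + m[2]*60 + m[3]*1 + m[4]/1000,
      end:   m[5]*3600 + m[6]*60 + m[7]*1 + m[8]/1000,
      text:  m[9]
    });
  }
  return result;
}

// Syncing with the timeupdate event: show the cue (if any) that
// contains the current playback position.
video.addEventListener('timeupdate', function() {
  var t = video.currentTime, text = '';
  for (var i = 0; i < cues.length; i++) {
    if (t >= cues[i].start && t <= cues[i].end) {
      text = cues[i].text;
      break;
    }
  }
  overlay.textContent = text;
}, false);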

Comments?

[1]  
http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#concept-media-load-algorithm

-- 
Philip Jägenstedt
Core Developer
Opera Software
