Re: timing model of the media resource in HTML5 from Geoff Freed on 2009-11-25 (public-html-a11y@w3.org from November 2009)

From: Geoff Freed <geoff_freed@wgbh.org>
Date: Wed, 25 Nov 2009 11:40:45 -0500
To: Eric Carlson <eric.carlson@apple.com>, Silvia Pfeiffer <silviapfeiffer1@gmail.com>
CC: HTML Accessibility Task Force <public-html-a11y@w3.org>
Message-ID: <C732C83D.737A%geoff_freed@wgbh.org>
Just a few comments inline.

Geoff/NCAM


On 11/25/09 11:01 AM, "Eric Carlson" <eric.carlson@apple.com> wrote:


On Mon, Nov 23, 2009 at 1:02 PM, Silvia Pfeiffer wrote:

As a three-sentence summary:
Basically, I believe that the 90% use case for the Web is that of a
time-linear media resource. Any more complex needs that
require multiple timelines can be realised using JavaScript and the
APIs to audio and video that we still need to define and that will
expose companion tracks to the Web page and therefore to JavaScript. I
don't believe that there will be many use cases that such a
combination cannot satisfy, but if there are, one can always use the
"object" tag and use external plugins to render the Adobe Flash,
Silverlight or SMIL experience to produce this.

BTW: talking about SMIL - I would be very curious to find out if
somebody has tried implementing SMIL in HTML5 and JavaScript yet. I
think much of what a SMIL file defines should now be
presentable in a Web browser using existing HTML5 and JavaScript
constructs. It would be an interesting exercise and I'd be curious to
hear if somebody has tried and where they found limitations.

 While it is possible to implement some (many?) of the simplest SMIL constructs using HTML, CSS, and JavaScript, things will fall apart as soon as you try to do anything that requires synchronization of media elements (e.g. a <par> containing <audio> and <video>). The latency between the JavaScript context and the media engine is just too high to synchronize starting and stopping multiple elements, never mind the clipping and sync behavior attributes.
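To make the limitation concrete, here is a minimal sketch of the kind of script-level sync a page could attempt. The helper name correctDrift and the 50 ms tolerance are assumptions for illustration, not anything proposed in this thread, and plain objects stand in for the <video> and <audio> elements so the sketch runs outside a browser.

```javascript
// Hypothetical helper: nudge a follower media element back toward a leader.
// `leader` and `follower` only need a numeric `currentTime` property, as
// HTMLMediaElement objects have; `toleranceSeconds` is an assumed threshold.
function correctDrift(leader, follower, toleranceSeconds = 0.05) {
  const drift = follower.currentTime - leader.currentTime;
  if (Math.abs(drift) > toleranceSeconds) {
    // Seeking is the only correction available to script, and the seek
    // itself takes an unknown amount of time inside the media engine.
    follower.currentTime = leader.currentTime;
    return { corrected: true, drift };
  }
  return { corrected: false, drift };
}

// Stand-ins for <video> and <audio> elements (assumed values).
const video = { currentTime: 10.0 };
const audio = { currentTime: 10.2 };

const result = correctDrift(video, audio);
console.log(result.corrected, audio.currentTime);
```

In a page, something like this would presumably run from a timer or a timeupdate handler. It can only react after drift has already happened, which is the latency problem described above.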


On Nov 25, 2009, at 5:29 AM, Silvia Pfeiffer wrote:


On Wed, Nov 25, 2009 at 11:24 PM, Philip Jägenstedt <philipj@opera.com> wrote:

I agree that syncing separate video and audio files is a big challenge. I'd
prefer leaving this kind of complexity either to scripting or an external
manifest like SMIL.

We have to at minimum deal with multi-track video and audio files
inside HTML, since they can potentially expose accessibility data:
audio descriptions (read by a human), sign language (signed by a
person), and captions are the particular tracks I am concerned about.

 I agree that we must support multi-track audio and video files. If a container format permits references to external media files (as QuickTime does), it is the job of the media engine to keep them in sync, so we don't need to worry about it.

GF:  I agree, and also want to reiterate that maintaining external files for things like captions or descriptions is a lot easier than dealing with embedded files.  Syncing with external resources is, in my view, the best way to go.


The Google accessibility experts wanted at least the in-line caption
tracks exposed in declarative language. This is because otherwise you
cannot build a menu of all available tracks without having to start
downloading and decoding the file.  With this in mind, I think we have
to expose all of the tracks available in a file in declarative
language.

GF:  Agreed.


Yes, this works for external additional tracks. Maybe then we can add
the internal tracks inside the source elements, something like this:

 <video>
  <source src="video.ogg" type="video/ogg">
    <track id='v' role='video' ref='serialno:1505760010'>
    <track id='a' role='audio' lang='en' ref='serialno:0821695999'>
    <track id='ad' role='auddesc' lang='en' ref='serialno:1421614520'>
    <track id='s' role='sign' lang='ase' ref='serialno:1413244634'>
    <track id='cc' role='caption' lang='en' ref='serialno:1421849818'>
  </source>
  <source src="video.mp4" type="video/mp4">
    <track id='v' role='video' ref='trackid:1'>
    <track id='a' role='audio' lang='en' ref='trackid:2'>
  </source>
  <overlay>
    <source src="en.srt" lang="en-US">
    <source src="hans.srt" lang="zh-CN">
  </overlay>
 </video>

Note I have made the track reference explicit through introducing a
new "ref" attribute which uses encapsulation format specific
references to track identifiers.
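
As a rough sketch of why declared tracks help, a script could build the track menu from the declarations alone, before any media data is fetched. The array below copies the Ogg track declarations from the markup above; the helper name buildTrackMenu and the label format are assumptions for illustration.

```javascript
// Track declarations copied from the Ogg <source> in the proposed markup.
const declaredTracks = [
  { id: "v", role: "video", ref: "serialno:1505760010" },
  { id: "a", role: "audio", lang: "en", ref: "serialno:0821695999" },
  { id: "ad", role: "auddesc", lang: "en", ref: "serialno:1421614520" },
  { id: "s", role: "sign", lang: "ase", ref: "serialno:1413244634" },
  { id: "cc", role: "caption", lang: "en", ref: "serialno:1421849818" },
];

// Hypothetical helper: list the accessibility-related tracks as menu entries.
function buildTrackMenu(tracks, roles = ["auddesc", "sign", "caption"]) {
  return tracks
    .filter((t) => roles.includes(t.role))
    .map((t) => ({ id: t.id, label: `${t.role} (${t.lang})` }));
}

const menu = buildTrackMenu(declaredTracks);
console.log(menu.map((m) => m.label));
// prints [ 'auddesc (en)', 'sign (ase)', 'caption (en)' ]
```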

 I *really* don't like the idea of requiring page authors to declare the track structure in the markup. It seems to me that because it will require new specialized tools to get the information, and because it will be really difficult to do correctly (ten-digit serial numbers?), people are likely to just skip it completely. We need to create a specification that makes it as simple as possible for people to do the right thing.

GF:  Right, the idea of having to discover and include a complex structural element sounds off-putting for authors.  Identifying the resource as a caption/description/subtitle, and then pointing directly to it, would be simple and not unlike what authors are used to with SMIL.

  If we do allow this, what happens when the structure declared in the markup differs from the structure of the media file?
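
One way to picture the mismatch question is as a set difference between what the markup declares and what the media engine finds in the file. This is only a sketch: the helper name diffTracks is hypothetical, the declared refs come from the MP4 <source> in the proposed markup, and the "actual" list is invented for illustration.

```javascript
// Hypothetical helper: compare track refs declared in markup with the refs
// the media engine reports after it has parsed the file.
function diffTracks(declaredRefs, actualRefs) {
  const actual = new Set(actualRefs);
  const declared = new Set(declaredRefs);
  return {
    // Declared by the author but not present in the file.
    missingFromFile: declaredRefs.filter((r) => !actual.has(r)),
    // Present in the file but never declared by the author.
    undeclared: actualRefs.filter((r) => !declared.has(r)),
  };
}

const declaredRefs = ["trackid:1", "trackid:2"]; // from the markup
const actualRefs = ["trackid:1", "trackid:3"]; // invented example data
const mismatch = diffTracks(declaredRefs, actualRefs);
console.log(mismatch);
```

Either list being non-empty is a case a specification would have to give a defined behavior for, which is the open question here.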


On Wed, Nov 25, 2009 at 11:24 PM, Philip Jägenstedt <philipj@opera.com> wrote:

centerTop/centerBottom are appropriately defined in CSS.

Those are almost like the default styling approaches I suggested for
itextlist / itext.[2] There, I also assumed there was a display area
as large as the video or actually just a little larger available to
render the time-aligned text into. It's larger since sometimes it is
better not to overlay stuff but to place it right next to the video,
e.g. just above it (title-like) or just below it but visually part of
the video window.

 We shouldn't make assumptions about the size of an overlay. If someone wants to display something outside of the media element's bounds, they can use CSS.

GF:  This would also take into account the possibility that someone would want to create a caption region that is actually wider than the video region, which is useful when the video region itself is too small to contain a useful amount of overlaid caption or subtitle text.

eric
Received on Wednesday, 25 November 2009 16:41:26 UTC