
Re: timing model of the media resource in HTML5

From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Date: Wed, 27 Jan 2010 22:57:51 +1100
Message-ID: <2c0e02831001270357u47829059w553dafc57cf4e70@mail.gmail.com>
To: Philip Jägenstedt <philipj@opera.com>, Eric Carlson <eric.carlson@apple.com>
Cc: HTML Accessibility Task Force <public-html-a11y@w3.org>, Ken Harrenstien <klh@google.com>
Hi,

I've spent the last day and a bit trying to catch up on this whole
conversation and it's going to be a bit difficult to give good
feedback without replying to several emails. I will try Ian's approach
of cutting and pasting relevant bits from different emails to get
something of a consistent discussion together again. Sorry if that's
confusing.

The first part of this email will focus on how to expose the track
composition to the UA & javascript.
The second part will focus on the <overlay> proposal.


On Thu, Nov 26, 2009 at 3:26 AM, Philip Jägenstedt <philipj@opera.com> wrote:
> On Wed, 25 Nov 2009 14:29:37 +0100, Silvia Pfeiffer
>> On Wed, Nov 25, 2009 at 11:24 PM, Philip Jägenstedt <philipj@opera.com>
>> wrote:
>>>
>>> I agree that syncing separate video and audio files is a big challenge.
>>> I'd
>>> prefer leaving this kind of complexity either to scripting or an external
>>> manifest like SMIL.
>>
>> We have to at minimum deal with multi-track video and audio files
>> inside HTML, since they can potentially expose accessibility data:
>> audio descriptions (read by a human), sign language (signed by a
>> person), and captions are the particular tracks I am concerned about.
>
> I agree and think that the tracks of the resource should be exposed via a
> DOM API. From a scripts point of view it should look the same whether the
> resource is Ogg, MPEG-4 or SMIL linking several tracks together.

I agree with this, which is why I tried to create markup that has
elements inside the <source> element - the idea is to express the
contained tracks of a media resource explicitly in declarative
markup, such that the DOM API is obvious and javascript can deal with
it.

Let me cite my proposal again and clarify some things / ask some questions:

<video>
  <source src="video.ogv" type="video/ogg">
    <track id='ogg_v' role='video' ref='serialno:1505760010'>
    <track id='ogg_a' role='audio' lang='en' ref='serialno:0821695999'>
    <track id='ogg_ad' role='auddesc' lang='en' ref='serialno:1421614520'>
    <track id='ogg_s' role='sign' lang='ase' ref='serialno:1413244634'>
    <track id='ogg_cc' role='caption' lang='en' ref='serialno:1421849818'>
  </source>
  <source src="video.mp4" type="video/mp4">
    <track id='mp4_v' role='video' ref='trackid:1'>
    <track id='mp4_a' role='audio' lang='en' ref='trackid:2'>
  </source>
  <overlay>
    <source src="en.srt" lang="en-US">
    <source src="hans.srt" lang="zh-CN">
  </overlay>
</video>

Eric said:
> I *really* don't like the idea of requiring page authors to declare the
> track structure in the markup.

And Philip added to this:
> I really don't see a problem with waiting until
> metadataloaded for the menu to be available. Picking a language in the < 1
> sec before that seems like a fringe use case which can be solved by sending
> the information in an site-specific format using data-* attributes or
> similar.

Ken Harrenstien from Google wrote this to me (and allowed me to quote
him, which is why I cc-ed him):
> The principal reason for wanting to allow explicit markup is latency
> and infrastructure overhead.
>
> Without the markup, the only way to know what's in-band is to start
> streaming the video.  How long will it take to find out what kinds of
> captions it contains and whether they are supported?  How much
> bandwidth and setup is wasted in the process?  At Google we care very
> deeply about those things.
>
> I think this information is very, if not exactly, analogous to the
> other markup provided for <video>. I need it to tell immediately if the video is even
> playable/watchable for me (as a hearing-impaired person).

I believe he has a strong case.

Further, if the media elements do indeed change from using
@autobuffer to using @preload, where it is possible that no media
data is prefetched at all, then the UA has to be told in some other
way what the resource composition is. After all, the UA should
display to the user what accessibility tracks are available and allow
the user to turn them on/off (suggested to happen through a menu that
the UA builds and adds to the video transport bar).

Also, it is really important to expose the role (and the language)
that a track takes on within a multitrack media file, so that a UA
can decide whether to display a track and where to display it. I do
believe that the control of which tracks are displayed should stay
with the UA and not be forced by the file or the media framework. I
cannot see a better way of exposing this functionality uniformly
across multiple media file types than explicit markup.

If we buried the track information in a javascript API, we would
introduce an additional dependency and we would remove the ability to
simply parse the Web page to get at such information. For example, a
crawler would not be able to find out that there is a resource with
captions and would probably not bother requesting the resource for its
captions (or other text tracks).
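
To illustrate the crawler point: with declarative markup, a crawler
can discover caption tracks from the page alone, without ever
fetching the media resource. A rough sketch in javascript - the
regex-based scan is an illustration only, not a robust HTML parser:

```javascript
// Scan page markup for <track> tags and collect the caption tracks.
// This never touches the media file itself - that is the whole point.
function findCaptionTracks(html) {
  const tracks = [];
  // Match opening <track ...> tags (but not <tracks> or </track>).
  const re = /<track\b[^>]*>/g;
  for (const tag of html.match(re) || []) {
    const role = (tag.match(/role=['"]([^'"]+)['"]/) || [])[1];
    const lang = (tag.match(/lang=['"]([^'"]+)['"]/) || [])[1];
    if (role === 'caption') tracks.push({ role, lang });
  }
  return tracks;
}

const page = `
  <video>
   <source src="video.ogv" type="video/ogg">
   <tracks>
    <track id='ogg_a' role='audio' lang='en' ref='serialno:0821695999'></track>
    <track id='ogg_cc' role='caption' lang='en' ref='serialno:1421849818'></track>
   </tracks>
  </video>`;
console.log(findCaptionTracks(page));  // one caption track, lang "en"
```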

Eric further said:
> It seems to me that because it will require
> new specialized tools get the information, and because it will be really
> difficult to do correctly (ten digit serial numbers?), people are likely to
> just skip it completely.

There is a need for addressing the track in a unique way, i.e.
javascript needs to be able to tell the media framework exactly which
track it is talking about (e.g. to turn it on or off).

In Ogg, the serial number of each track is a ten-digit number. They
can easily be obtained using ogginfo or oggz-info and can easily be
exposed by players. There is currently no other way of uniquely
identifying a specific track in Ogg. (On a side note: we are working
with Xiph to require encoders to also give each track in an Ogg file
a unique name so it can be addressed through that, but this is not
currently the case.)

For MPEG, I believe the tracks are numbered sequentially, so it is
easier to identify them (though also easier to make mistakes).

Eric further stated:
> We need to create a specification that makes it as
> simple as possible for people to do the right thing.

Mostly this information will be created by tools anyway (typically a
CMS), such that it's not up to the user to do this.

Also, there is no need for a user to do this - it's optional and the
ordinary user will most likely not produce in-band captions and audio
descriptions for their video files anyway. This is for power users of
videos.

But when a power user wants to make use of all the functionality
their media files offer and has no standard way of exposing it, we
will create a lot of frustration and incompatible implementations as
people try to build this with javascript.

Eric further wrote:
>   If we do allow this, what happens when the structure declared in the
> markup differs from the structure of the media file?

The same as what happens when other markup is wrong or points to
something that doesn't exist: we 404 or deal with the error. HTML is
well known for its ability to handle errors gracefully.

Incidentally, we do need to develop the javascript API for exposing
the video's tracks no matter whether we do it in declarative syntax or
not. Here's a start at a proposal for this (obviously inspired by the
markup):

  video.numberTracks(); -> returns the number of available tracks
  video.firstTrack();   -> returns the first track ("first" to be
                           defined - e.g. there is no inherent order in Ogg)
  video.lastTrack();    -> returns the last track ("last" to be defined)
  track.next();         -> returns the next track in the list
  track has the following attributes: type, ref, lang, role, media
                           (and the usual contenders, e.g. id, style)
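
To make the intent concrete, here is a rough sketch of how such a
track list might be walked, e.g. to build the UA's accessibility
menu. Plain javascript objects stand in for the real API; all names
and shapes here are assumptions from my proposal, not spec text:

```javascript
// Build a track list exposing the proposed numberTracks/firstTrack/
// lastTrack accessors, and link each track to the next via next().
function makeTrackList(tracks) {
  tracks.forEach((t, i) => {
    t.next = () => tracks[i + 1] || null;
  });
  return {
    numberTracks: () => tracks.length,
    firstTrack: () => tracks[0] || null,
    lastTrack: () => tracks[tracks.length - 1] || null,
  };
}

// Walk the list and collect the accessibility tracks a UA menu would offer.
function accessibilityMenu(video) {
  const entries = [];
  for (let t = video.firstTrack(); t; t = t.next()) {
    if (t.role === 'caption' || t.role === 'auddesc' || t.role === 'sign') {
      entries.push(`${t.role} (${t.lang})`);
    }
  }
  return entries;
}

const video = makeTrackList([
  { id: 'ogg_v', role: 'video', ref: 'serialno:1505760010' },
  { id: 'ogg_a', role: 'audio', lang: 'en', ref: 'serialno:0821695999' },
  { id: 'ogg_ad', role: 'auddesc', lang: 'en', ref: 'serialno:1421614520' },
  { id: 'ogg_s', role: 'sign', lang: 'ase', ref: 'serialno:1413244634' },
  { id: 'ogg_cc', role: 'caption', lang: 'en', ref: 'serialno:1421849818' },
]);
console.log(accessibilityMenu(video));
// three entries: auddesc (en), sign (ase), caption (en)
```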


Philip said:
> <source> is a void element, so this markup does not degrade nicely in any
> shipped <video>-capable browsers. Try
> <http://software.hixie.ch/utilities/js/live-dom-viewer/saved/318>. Firefox
> puts the second <source> element inside nested <track> elements and Safari
> just drops it.

That is disappointing. This means we have to try and find a different
way of marking it up. Maybe we can just throw a <tracks> element
underneath each <source> element, as in this:

<video>
  <source src="video.ogv" type="video/ogg">
  <tracks>
   <track id='ogg_v' role='video' ref='serialno:1505760010'></track>
   <track id='ogg_a' role='audio' lang='en' ref='serialno:0821695999'></track>
   <track id='ogg_ad' role='auddesc' lang='en' ref='serialno:1421614520'></track>
   <track id='ogg_s' role='sign' lang='ase' ref='serialno:1413244634'></track>
   <track id='ogg_cc' role='caption' lang='en' ref='serialno:1421849818'></track>
  </tracks>
  <source src="video.mp4" type="video/mp4">
  <tracks>
   <track id='mp4_v' role='video' ref='trackid:1'></track>
   <track id='mp4_a' role='audio' lang='en' ref='trackid:2'></track>
  </tracks>
  <overlay>
   <source src="en.srt" lang="en-US">
   <source src="hans.srt" lang="zh-CN">
  </overlay>
</video>

Is it guaranteed that document order is retained, so that we can
reliably associate each <tracks> element with the preceding <source>
element?
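
If document order is indeed retained, the association is a simple
forward scan over the children of <video>. A sketch in javascript,
with plain objects standing in for the DOM child nodes (the pairing
rule is my assumption about how this would be specified):

```javascript
// Associate each <tracks> element with the nearest preceding <source>,
// walking the children of <video> in document order.
function pairTracksWithSources(children) {
  const pairs = [];
  let current = null;
  for (const child of children) {
    if (child.tag === 'source') current = child;
    else if (child.tag === 'tracks' && current) {
      pairs.push({ source: current.src, tracks: child.tracks });
    }
  }
  return pairs;
}

const children = [
  { tag: 'source', src: 'video.ogv' },
  { tag: 'tracks', tracks: ['ogg_v', 'ogg_a', 'ogg_cc'] },
  { tag: 'source', src: 'video.mp4' },
  { tag: 'tracks', tracks: ['mp4_v', 'mp4_a'] },
];
console.log(pairTracksWithSources(children));
// two pairs: video.ogv with the ogg tracks, video.mp4 with the mp4 tracks
```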

An alternative would be to have such resource composition stored in a
separate file - a resource composition xml file (?) - on the server
and to link to it in the <source> element (or the <video> element if
there's only one). Then, it's not polluting the html markup and the UA
doesn't have to parse a lengthy media file but rather only has to
parse a separately retrieved xml file. For example:

<video>
 <source src="video.ogv" type="video/ogg" rcf="video.ogv.rcf">
 <source src="video.mp4" type="video/mp4" rcf="video.mp4.rcf">
 <overlay>
   <source src="en.srt" lang="en-US">
   <source src="hans.srt" lang="zh-CN">
 </overlay>
</video>
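
For illustration, such a resource composition file might look
something like this - the element and attribute names are entirely
hypothetical, since no such format exists yet:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical resource composition file for video.ogv;
     element and attribute names are illustrative only. -->
<composition>
  <track id="ogg_v" role="video" ref="serialno:1505760010"/>
  <track id="ogg_a" role="audio" lang="en" ref="serialno:0821695999"/>
  <track id="ogg_cc" role="caption" lang="en" ref="serialno:1421849818"/>
</composition>
```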




Now, let's talk about the <overlay> element.

I am not too fussed about renaming <itextlist> to <overlay>. I can see
why you would go for this name - most text will generally be rendered
on top of or next to the video. It essentially provides a "div" into
which the data can be rendered, rather than an abstract structure like
my "itextlist". My intention was to keep structure and presentation
separate from each other. But if there's general agreement that
"overlay" is a better name, I'm happy to go with it. (Also, I'm happy
to rename "itext" to "source", since that is what I had already
started doing in
http://blog.gingertech.net/2009/11/25/manifests-exposing-structure-of-a-composite-media-resource/
, where I've also renamed "category" to "role").

I'm assuming that in an example like the one below (no matter in
which way the tracks are exposed), the caption track of the Ogg file
would turn up as another source in the <overlay> element if the UA
chose the video.ogv file over the video.mp4 file?

<video>
  <source src="video.ogv" type="video/ogg">
  <tracks>
   <track id='ogg_v' role='video' ref='serialno:1505760010'></track>
   <track id='ogg_a' role='audio' lang='en' ref='serialno:0821695999'></track>
   <track id='ogg_ad' role='auddesc' lang='en' ref='serialno:1421614520'></track>
   <track id='ogg_s' role='sign' lang='ase' ref='serialno:1413244634'></track>
   <track id='ogg_cc' role='caption' lang='en' ref='serialno:1421849818'></track>
  </tracks>
  <source src="video.mp4" type="video/mp4">
  <tracks>
   <track id='mp4_v' role='video' ref='trackid:1'></track>
   <track id='mp4_a' role='audio' lang='en' ref='trackid:2'></track>
  </tracks>
  <overlay>
   <source src="en.srt" lang="en-US">
   <source src="hans.srt" lang="zh-CN">
  </overlay>
</video>

I.e. it would be parsed to something like:

<video>
  <source src="video.ogv" type="video/ogg">
  <overlay>
   <source src="en.srt" lang="en-US">
   <source src="hans.srt" lang="zh-CN">
   <source ref='serialno:1421849818' lang="en">
 </overlay>
</video>

This makes it an additional caption track to display. Is this right?
There are no alternative choices between tracks?
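
A sketch of the merge I am describing above: in-band caption tracks
from the chosen <source> are simply appended to the <overlay> source
list as further selectable tracks. Plain objects stand in for the
real DOM and API here, and the merge rule itself is the question I am
asking, not settled behaviour:

```javascript
// Append the in-band caption tracks of the chosen media resource to
// the list of external <overlay> sources.
function mergeOverlaySources(overlaySources, inBandTracks) {
  const captions = inBandTracks
    .filter(t => t.role === 'caption')
    .map(t => ({ ref: t.ref, lang: t.lang }));
  return overlaySources.concat(captions);
}

const overlaySources = [
  { src: 'en.srt', lang: 'en-US' },
  { src: 'hans.srt', lang: 'zh-CN' },
];
const oggTracks = [
  { id: 'ogg_v', role: 'video', ref: 'serialno:1505760010' },
  { id: 'ogg_cc', role: 'caption', lang: 'en', ref: 'serialno:1421849818' },
];
const merged = mergeOverlaySources(overlaySources, oggTracks);
console.log(merged.length);  // 3 - the two .srt files plus the in-band track
```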


I would actually suggest that if we want to go with <overlay>, we
need to specify different overlays for different types of text. In
this way we can accommodate textual audio descriptions, captions,
subtitles etc. Then, I would suggest that for every type of text
there should only ever be one <source> displayed. It is not often
that you want more than one subtitle track displayed. You most
certainly never want more than one caption track displayed, and never
more than one textual audio description track. But you do want each
one of them displayed in addition to the others.

For example:

<video src="video.ogg">
  <overlay role="caption" style="font-size:2em; padding:1em; text-align:center; display:block;">
    <source src="en-us.srt" lang="en-US">
    <source src="en.srt" lang="en">
  </overlay>
  <overlay role="tad" style="z-index:-100; display:block;" aria-live="assertive">
    <source src="tad-en.srt" lang="en">
    <source src="tad-de.srt" lang="de">
  </overlay>
  <overlay role="subtitle" style="font-size:2em; padding:1em; text-align:center; display:block;">
    <source src="de.srt" lang="de">
    <source src="sv.srt" lang="sv">
    <source src="fi.srt" lang="fi">
  </overlay>
</video>
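
The "one source per overlay role" rule could then be a simple
selection over each overlay's sources, preferring the user's language
list. A sketch - the selection logic below is my assumption about
sensible UA behaviour, not spec text, and plain objects stand in for
the <source> elements:

```javascript
// Pick at most one source from an overlay, preferring the user's
// languages; a bare language tag matches a regional variant too.
function pickSource(sources, preferredLangs) {
  for (const lang of preferredLangs) {
    const hit = sources.find(
      s => s.lang === lang || s.lang.split('-')[0] === lang
    );
    if (hit) return hit;
  }
  return sources[0] || null;  // fall back to the first listed source
}

const subtitles = [
  { src: 'de.srt', lang: 'de' },
  { src: 'sv.srt', lang: 'sv' },
  { src: 'fi.srt', lang: 'fi' },
];
console.log(pickSource(subtitles, ['sv', 'en']).src);  // sv.srt

const captions = [
  { src: 'en-us.srt', lang: 'en-US' },
  { src: 'en.srt', lang: 'en' },
];
console.log(pickSource(captions, ['en']).src);  // en-us.srt
```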


BTW: somewhere along the discussion between Philip and Maciej you lost
me, so no comments on those.

Cheers,
Silvia.
Received on Wednesday, 27 January 2010 11:58:43 GMT
