Re: timing model of the media resource in HTML5

On Fri, Jan 29, 2010 at 12:39 AM, Philip Jägenstedt <philipj@opera.com> wrote:
> Deep breath...

Same here. :-)


> On Wed, 27 Jan 2010 12:57:51 +0100, Silvia Pfeiffer
> <silviapfeiffer1@gmail.com> wrote:
>
>> Ken Harrenstien from Google wrote this to me (and allowed me to quote
>> him, which is why I cc-ed him):
>>>
>>> The principal reason for wanting to allow explicit markup is latency
>>> and infrastructure overhead.
>>>
>>> Without the markup, the only way to know what's in-band is to start
>>> streaming the video.  How long will it take to find out what kinds of
>>> captions it contains and whether they are supported?  How much
>>> bandwidth and setup is wasted in the process?  At Google we care very
>>> deeply about those things.
>>>
>>> I think this information is very, if not exactly, analogous to the
>>> other markup provided for <video>. I need it to tell immediately
>>> whether the video is even playable/watchable for me (as a
>>> hearing-impaired person).
>>
>> I believe he has a strong case.
>
> I don't agree that this is an important enough use case to add extra HTML
> markup for. The issue seems to be that perhaps finding the tracks in the
> resources is slow. If that's the case and they want to immediately present a
> menu of all available tracks, I suggest embedding the information using the
> data-* attributes or similar. Since the information can easily get out of
> sync with the actual resource when markup is copied around, I wouldn't be
> willing to rely on any such markup for the browser native controls or
> context menus. Native controls that get their information directly from the
> resource and an API to expose track information for scripted controls is the
> way to go, in my opinion. Getting the information would be done as part of
> reaching HAVE_METADATA and I really don't think this will add noticeably to
> load time.

This is a bit dismissive of the actual issues. I hope we will make an
informed decision if we say we don't want to expose the track
structure in markup, but only have a javascript API to the track
information.

We need to be aware of the delays and bandwidth implications that we
would introduce.

We had the discussion over on public-html that for pages with many
videos on them, we will want to avoid preloading anything (not even
going to HAVE_METADATA). Assume a page with hundreds of videos which
all have @preload="none", and now you're trying to build the menu for
displaying which subtitles are available. Without explicit markup, we
either cannot display that menu, or we have to actually preload the
metadata for each video and thus ignore the attribute.
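
To make the scenario concrete, here is a sketch of such a page (the
<tracks>/<track> markup is of course only the proposal under
discussion, not existing HTML):

  <video src="talk1.ogv" preload="none" controls>
    <tracks>
      <track role="caption" lang="en" ref="serialno:1421849818"></track>
    </tracks>
  </video>
  <video src="talk2.ogv" preload="none" controls></video>
  <!-- ... hundreds more ... -->

With the markup present, a caption menu for talk1.ogv can be built
without fetching a single byte of media; without it, every one of
these videos has to be brought to HAVE_METADATA first.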

I think *not* displaying a menu is unacceptable, since it is a huge
loss to accessibility. How would a deaf person find out whether it is
worth their while clicking on a video for playback without a menu?
(Note that "menu" is here used in the YouTube sense where there is an
explicit menu button on the transport bar if captions exist, as e.g.
in http://www.youtube.com/watch?v=-OPik35LpN8 .)

So, do we force each and every person to preload their videos just to
be able to display the menu? Or do we force that only upon the people
who need accessibility? Force the preloading of video content onto
them, even though they may not be able to watch the videos usefully?
Make them spend bandwidth that may be of no use to them at all?
Introduce delays that are really not necessary?

If we now just step back and say "leave it to YouTube to solve if
they see it as a problem - let them introduce data-* attributes and
other stuff", then we are just asking somebody else to do the work for
us. We are laying the basis for incompatible solutions across
websites, and for incompatible browser extensions introduced by those
browsers that see the need, and we simply avoid solving the problem
properly.

I really haven't heard a useful argument against such markup yet.

The argument that the markup may be wrong is a general argument
against any element on an HTML page. What if our image element links
to a non-existent image file? What if the javascript is broken? What
if the markup of the Web page is non-conformant? There are countless
ways in which a Web page can have errors in its markup, and still
browsers try to render each and every such page and tolerate the
errors. They deal with failure. Why is that not possible with video,
too?


>> If we buried the track information in a javascript API, we would
>> introduce an additional dependency and we would remove the ability to
>> simply parse the Web page to get at such information. For example, a
>> crawler would not be able to find out that there is a resource with
>> captions and would probably not bother requesting the resource for its
>> captions (or other text tracks).
>
> Surely, robots would just index the resources themselves?

Why download binary data of indeterminate length when you can already
get the information out of the text of the Web page? Surely, robots
would prefer to get it directly from the markup rather than having to
download gazillions of binary media files that they then have to
decode just to learn what is in them.

Right now, everybody who sees a video element in an HTML5 page simply
assumes that it consists of one video and one audio track and holds no
other information. This is fine as a default, and in the default case
no extra resource description is probably necessary. But when we
actually do have a richer resource, we need to expose that.


>> Eric further said:
>>>
>>> It seems to me that because it will require
>>> new specialized tools to get the information, and because it will be really
>>> difficult to do correctly (ten digit serial numbers?), people are likely
>>> to
>>> just skip it completely.
>>
>> There is a need for addressing the track in a unique way, i.e.
>> javascript needs to be able to tell the media framework exactly which
>> track it is talking about (e.g. to turn it on or off).
>
> The API for exposing tracks should simply have something like .enable() for
> each track object (or similar), there's no need to expose unique IDs to do
> this.

How would you identify a track if you do not have a unique ID? What
would .enable() work on if not a track identifier?
video.firstTrack().enable()? Even that implies that there is an order
and that we know what is in each track.
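
To illustrate the difference, here is a sketch in javascript (all of
these names are made up for the discussion - none of this exists in
any UA):

  // (a) order-based: fragile, breaks if the track order changes
  video.firstTrack().enable();

  // (b) ID-based: stable across reordering, but requires that every
  // track carries a unique identifier (hypothetical getTrackById())
  video.getTrackById("serialno:1421849818").enable();

Option (b) is only possible if every track has an identifier that
survives edits to the file - which is exactly what the serial numbers
give us.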


> It might be needed if we want to have e.g. <video tracks="video0,audio3"> or
> similar. Media Fragments URI is supposed to provide a syntax for addressing
> individual tracks, perhaps we can hook into that at some level?

Indeed, the only means of uniquely addressing tracks in Media
Fragments URIs for Ogg files has so far been the serial number of the
track. This is why I am trying to introduce track names into Ogg, so
that we have a more readable means of addressing tracks. But from a
machine-processing POV, only the serial numbers are guaranteed to be
unique.
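
To make that concrete, compare the two kinds of track addresses in
Media Fragment URI syntax (the exact syntax is still being worked out
in the Media Fragments WG, so treat this as a sketch only):

  video.ogv#track=1421849818   (serial number: unique, but unreadable)
  video.ogv#track=caption_en   (track name: readable, but requires
                                that Ogg supports track names first)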

There was a suggestion that it would be possible to number the tracks
in an Ogg file based on the order given by their serial numbers. This
might be possible, but it introduces an additional source of errors:
if a track is later added to the Ogg file, that track (since its
serial number is random) can overthrow the ordering of the existing
tracks. That will cause a lot more errors than if we just used the
serial numbers directly.

I read that MPEG numbers its tracks sequentially, but if somebody
knows exactly how track addressing works with MPEG, that would be an
interesting contribution. From what I can tell, the tracks are just
kept in a sequence, so if you remove a track in the middle or add one,
the addressing doesn't stay persistent either. But I may be wrong and
would really like some input here. Is it possible to address something
like video[0] and captions[0]?


>> Incidentally, we do need to develop the javascript API for exposing
>> the video's tracks no matter whether we do it in declarative syntax or
>> not. Here's a start at a proposal for this (obviously inspired by the
>> markup):
>>
>>  video.numberTracks(); -> return number of available tracks
>>  video.firstTrack(); -> returns first track ("first" to be defined -
>> e.g. there is no inherent order in Ogg)
>>  video.lastTrack(); -> returns last track ("last" to be defined)
>>  track.next(); -> returns next track in list
>>  track has the following attributes: type, ref, lang, role, media
>> (and the usual contenders, e.g. id, style)
>
> Yes, we need something like this.

OK, so if we cannot right now agree on actual declarative syntax for
it, could we for the moment focus on developing that API? While
implementing it, we will at least find out its flaws, and we will also
be able to measure exactly how much time and bandwidth it costs in
comparison to having declarative syntax provide this information.
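
To give that experiment a concrete starting point, here is how a
script might build a caption menu with the proposed calls (a sketch
only - these names are taken from the proposal above and exist in no
UA yet; I'm assuming next() returns null after the last track):

  video.addEventListener("loadedmetadata", function() {
    var menuEntries = [];
    for (var track = video.firstTrack(); track; track = track.next()) {
      if (track.role == "caption") {
        menuEntries.push({ label: track.lang, ref: track.ref });
      }
    }
    buildCaptionMenu(menuEntries); // hypothetical page function
  }, false);

Note that such a script cannot run before HAVE_METADATA is reached,
which is exactly the delay and bandwidth cost we would be measuring
against a declarative alternative.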




>> Philip said:
>> An alternative would be to have such resource composition stored in a
>> separate file - a resource composition xml file (?) - on the server
>> and to link to it in the <source> element (or the <video> element if
>> there's only one). Then, it's not polluting the html markup and the UA
>> doesn't have to parse a lengthy media file but rather only has to
>> parse a separately retrieved xml file. For example:
>>
>> <video>
>>  <source src="video.ogg" type="video/ogg" rcf="video.ogg.rcf">
>>  <source src="video.mp4" type="video/mp4" rcf="video.mp4.rcf">
>>  <overlay>
>>   <source src="en.srt" lang="en-US">
>>   <source src="hans.srt" lang="zh-CN">
>>  </overlay>
>> </video>
>
> Would this be any different from linking to the resource directly, which
> quite certainly knows its composition best?

It can be extended with the data in the overlay, and then it's easier
to get at the full (virtual) media resource composition by just
retrieving the rcf (or whatever we call it) file. It can also be
embedded elsewhere and will always be up to date, rather than being
dependent on changes in the markup and the video file. But this is
advanced functionality, which I don't think we need to discuss at this
stage. Let's sort out the basic functionality first.


>> Now, let's talk about the <overlay> element.
>>
>> I am not too fussed about renaming <itextlist> to <overlay>. I can see
>> why you would go for this name - because most text will be rendered on
>> top of or next to the video generally. It essentially provides a "div"
>> into which the data can be rendered, rather than an abstract structure
>> like my "itextlist". My intention was to keep the structure and the
>> presentation separate from each other. But if it's general agreement
>> that "overlay" is a better name, I'm happy to go with it. (Also, I'm
>> happy to rename "itext" to "source", since that was already what I had
>> started doing in
>>
>> http://blog.gingertech.net/2009/11/25/manifests-exposing-structure-of-a-composite-media-resource/
>> , where I've also renamed "category" to "role").
>>
>> I'm assuming that in an example like this one below (no matter in
>> which way the tracks are exposed), the caption track of the ogg file
>> would be another track in the <source> element if the UA chose that
>> video.ogv file over the video.mp4 file?
>>
>> <video>
>>  <source src="video.ogv" type="video/ogg">
>>  <tracks>
>>   <track id='ogg_v' role='video' ref='serialno:1505760010'></track>
>>   <track id='ogg_a' role='audio' lang='en'
>> ref='serialno:0821695999'></track>
>>   <track id='ogg_ad' role='auddesc' lang='en'
>> ref='serialno:1421614520'></track>
>>   <track id='ogg_s' role='sign' lang='ase'
>> ref='serialno:1413244634'></track>
>>   <track id='ogg_cc' role='caption' lang='en'
>> ref='serialno:1421849818'></track>
>>  </tracks>
>>  <source src="video.mp4" type="video/mp4">
>>  <tracks>
>>   <track id='mp4_v' role='video' ref='trackid:1'></track>
>>   <track id='mp4_a' role='audio' lang='en' ref='trackid:2'></track>
>>  </tracks>
>>  <overlay>
>>   <source src="en.srt" lang="en-US">
>>   <source src="hans.srt" lang="zh-CN">
>>  </overlay>
>> </video>
>>
>> I.e. it would be parsed to something like:
>>
>> <video>
>>  <source src="video.ogv" type="video/ogg">
>>  <overlay>
>>   <source src="en.srt" lang="en-US">
>>   <source src="hans.srt" lang="zh-CN">
>>   <source ref='serialno:1421849818' lang="en">
>>  </overlay>
>> </video>
>>
>> This makes it an additional caption track to display. Is this right?
>> There are no alternative choices between tracks?
>>
>>
>> I would actually suggest that if we want to go with <overlay>, we need
>> to specify different overlays for different types of text. In this way
>> we can accommodate textual audio descriptions, captions, subtitles
>> etc. Then, I would suggest that for every type of text there should
>> only ever be one <source> displayed. It is not often that you want
>> more than one subtitle track displayed. You most certainly never want
>> to have more than one caption track displayed and never more than one
>> textual audio description track. But you do want each one of them
>> displayed in addition to the other.
>>
>> For example:
>>
>> <video src="video.ogg">
>>  <overlay role="caption"
>> style="font-size:2em;padding:1em;text-align:center; display: block;">
>>    <source src="en-us.srt" lang="en-US">
>>    <source src="en.srt" lang="en">
>>  </overlay>
>>  <overlay role="tad" style="z-index: -100; display: block;"
>> aria-live="assertive">
>>    <source src="tad-en.srt" lang="en">
>>    <source src="tad-de.srt" lang="de">
>>  </overlay>
>>  <overlay role="subtitle"
>> style="font-size:2em;padding:1em;text-align:center; display: block;">
>>    <source src="de.srt" lang="de">
>>    <source src="sv.srt" lang="sv">
>>    <source src="fi.srt" lang="fi">
>>  </overlay>
>> </video>
>>
>>
>
> I agree on adding something like role="". On the naming, Maciej pointed out
> and I now agree that <overlay> is presentational and not really a brilliant
> choice. I think this should be controlled by CSS in some way or another.

I agree there needs to be a default CSS. I would even suggest, if
possible, making it dependent on the role.

For example:

* captions:
 color: white;
 background-color: #333333;
 opacity: 0.8;
 text-align: center;
 bottom: 0;
 position: absolute;

* textual audio descriptions (if we can agree to use them):
 visibility: hidden; (unless this makes screen readers not read them out)
 aria-live: assertive; (not CSS strictly speaking, but it belongs with
 these defaults)
 position: absolute;
 z-index: -100; (or lower - it shouldn't be visible)


> What we agree on so far seems to be:
>
> <video src="video">
>  <sourcelist role="subtitle">
>    <source src="subtitles.en.srt" lang="en">
>  </sourcelist>
> </video>
>
> Where <sourcelist> is whatever name we can agree on. Maybe something that
> sounds like it has to do with timed text, I don't know.

How about the @type and @charset attributes? I think we need the
@type attribute (for the same reason we use it on the <source>
elements). We could make the @charset attribute go away by requiring
the charset to be included in @type where necessary.
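
For example (assuming "text/srt" as the type - SRT has no registered
MIME type, so this is illustrative only):

  <source src="en.srt" lang="en" type="text/srt; charset=ISO-8859-1">

That way a UA can reject a track it cannot decode without ever
fetching it.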

Also: are subtitles all you can agree to? They are not really an
accessibility measure, but rather an internationalisation measure that
happens to be handled in the same way as captions. So, could I suggest
we add at least the role of "caption"?

Also, I would like people to start experimenting a bit more with
"textual audio descriptions" and would thus suggest adding them to the
draft specification. I know they work, but I am sure others would like
to run their own experiments with them.

I'd be happy for now if we could start with this as a proposal for
handling externally associated text files. Maybe rename <sourcelist>
to <textlist>, since these are all text files we are talking about.
Interestingly, though, if we leave it at <sourcelist>, we can also use
it to associate external audio and video files, as sketched below.
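
For example, something like this (a sketch reusing the element names
from above; the "auddesc" role and the exact attributes are
assumptions, not agreed syntax):

  <video src="video.ogv">
    <sourcelist role="caption">
      <source src="captions.en.srt" lang="en" type="text/srt">
    </sourcelist>
    <sourcelist role="auddesc">
      <source src="auddesc.en.ogg" lang="en" type="audio/ogg">
    </sourcelist>
  </video>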

Maybe we can have a more in-depth discussion about this proposal,
which is similar to both
https://wiki.mozilla.org/Accessibility/HTML5_captions_v2 and
http://wiki.whatwg.org/wiki/Video_Overlay. If we agree on this list,
we could then file a bug in the public-html tracker with the proposed
change. It seems there has not been much of a challenge to this
proposal in principle so far.


Going forward, we need to solve the other issues.

How would you suggest we solve the problems of in-stream text tracks,
and those of audio description sound files and sign language videos?

Cheers,
Silvia.
