Re: timing model of the media resource in HTML5 from Philip Jägenstedt on 2010-02-01 (public-html-a11y@w3.org from February 2010)

From: Philip Jägenstedt <philipj@opera.com>
Date: Mon, 01 Feb 2010 18:06:17 +0100
To: "Silvia Pfeiffer" <silviapfeiffer1@gmail.com>
Cc: "Eric Carlson" <eric.carlson@apple.com>, "HTML Accessibility Task Force" <public-html-a11y@w3.org>, "Ken Harrenstien" <klh@google.com>
Message-ID: <op.u7gkgrahatwj1d@sisko.linkoping.osa>
On Mon, 01 Feb 2010 13:19:59 +0100, Silvia Pfeiffer  
<silviapfeiffer1@gmail.com> wrote:

> On Fri, Jan 29, 2010 at 12:39 AM, Philip Jägenstedt <philipj@opera.com> wrote:
>> Deep breath...
>
> Same here. :-)
>
>
>> On Wed, 27 Jan 2010 12:57:51 +0100, Silvia Pfeiffer
>> <silviapfeiffer1@gmail.com> wrote:
>>
>>> Ken Harrenstien from Google wrote this to me (and allowed me to quote
>>> him, which is why I cc-ed him):
>>>>
>>>> The principal reason for wanting to allow explicit markup is latency
>>>> and infrastructure overhead.
>>>>
>>>> Without the markup, the only way to know what's in-band is to start
>>>> streaming the video.  How long will it take to find out what kinds of
>>>> captions it contains and whether they are supported?  How much
>>>> bandwidth and setup is wasted in the process?  At Google we care very
>>>> deeply about those things.
>>>>
>>>> I think this information is very, if not exactly, analogous to the
>>>> other markup provided for <video>. I need it to tell immediately if
>>>> the video is even playable/watchable for me (as a hearing-impaired
>>>> person).
>>>
>>> I believe he has a strong case.
>>
>> I don't agree that this is an important enough use case to add extra
>> HTML markup for. The issue seems to be that perhaps finding the tracks
>> in the resources is slow. If that's the case and they want to
>> immediately present a menu of all available tracks, I suggest embedding
>> the information using the data-* attributes or similar. Since the
>> information can easily get out of sync with the actual resource when
>> markup is copied around, I wouldn't be willing to rely on any such
>> markup for the browser native controls or context menus. Native controls
>> that get their information directly from the resource and an API to
>> expose track information for scripted controls is the way to go, in my
>> opinion. Getting the information would be done as a part of reaching
>> HAVE_METADATA and I really don't think this will add noticeably to load
>> time.
>
> This is a bit dismissive of the actual issues. I hope we will make an
> informed decision when we say we don't want to expose the track
> structure in the browser, but only have a javascript API to the track
> information.
>
> We need to be aware of the delays and bandwidth implications that we
> introduce.

Indeed we should make an informed decision. One of the factors to consider  
is added complexity and cost of implementation for any new markup we add.

> We had the discussion over on public-html that for pages that have
> many videos on them, we will want to avoid having to preload anything
> (not even go to HAVE_METADATA). Assume a page with hundreds of videos
> which all have @preload="none" and now you're trying to build the menu
> for displaying which subtitles are available. If we do not have
> explicit markup, we either cannot display that menu or we have to
> actually preload the metadata for each video and thus ignore the
> attribute.
>
> I think *not* displaying a menu is unacceptable, since it is a huge
> loss to accessibility. How would a deaf person find out whether it is
> worth their while clicking on a video for playback without a menu?
> (Note that "menu" is here used in the YouTube sense where there is an
> explicit menu button on the transport bar if captions exist, as e.g.
> in http://www.youtube.com/watch?v=-OPik35LpN8 .)

Before HAVE_METADATA the dimensions and duration of the video are also  
unknown. These are also important for considering whether a video is  
relevant or not. Users of preload="none" have to bear the costs of that  
choice by including any metadata out-of-band, likely in the markup. Where  
we disagree is whether this should be in new attributes/elements (e.g. <video  
duration="34.2">) or in e.g. data-* attributes or something else that user  
agents ignore.
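
To make the data-* option concrete, here is a minimal sketch of how a page
script might read such hints. The data-tracks attribute name and its
"role:lang" format are made up for illustration and come from no spec; the
parser takes a plain string so it can be tried outside a browser.

```javascript
// Hypothetical convention: the page author lists track hints in a
// data-tracks attribute, for example
//   <video preload="none" data-tracks="caption:en, subtitle:de, audio:en">
// Neither the attribute name nor the "role:lang" format is standardized.
function parseTrackHints(dataTracks) {
  if (!dataTracks) return [];
  return dataTracks
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map((entry) => {
      // A missing language is represented as an empty string.
      const [role, lang = ""] = entry.split(":");
      return { role, lang };
    });
}

// In a page the argument would be video.dataset.tracks.
const hints = parseTrackHints("caption:en, subtitle:de, audio:en");
const hasCaptions = hints.some((track) => track.role === "caption");
```

A scripted menu could use hasCaptions to show a captions button without
fetching anything, while user agents simply ignore the attribute.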

> So, we force each and every person to preload their videos just to be
> able to display the menu? Or do we just force that upon the people
> that are in need of accessibility? Force preloading of video content
> onto them, even if they may not be able to watch the videos usefully?
> Make them use bandwidth that may not be useful at all? Introduce
> delays that are really not necessary?
>
> I think if we now just step back and say: leave it to YouTube to solve
> it if they see it as a problem - let them introduce data-* attributes
> and other stuff - I think we are just asking for somebody else to do
> the work for us, and we are building the basis for incompatible
> solutions across Websites, incompatible extra browser extensions that
> those browsers who see the need will introduce, and we simply avoid
> solving the problem properly.
>
> I really haven't heard a useful argument against such markup yet.

New markup like this will only benefit users if sites actually go to the  
trouble of adding it. If sites are willing to do that work, then they can  
also expose the information in other ways, e.g. in a scripted menu, with  
icons, or any number of ways in-page. That different sites would solve  
this differently doesn't seem like a problem. It sounds like the worst  
case browser-incompatibility is that some browsers populate the context  
menus from markup when readyState < HAVE_METADATA and some do not. Even  
so, this shouldn't happen if all interested parties take part in the  
discussion.

> The argument that the markup may be wrong is a general argument
> against any element on an HTML page. What if our image element links to
> a non-existent image file? What if the javascript is broken? What if
> the markup of the Web page is non-conformant? I think there are a
> billion things in which a Web page can have errors in its markup and
> still browsers try to render each and every one of them and even
> tolerate some errors. They deal with failure. Why is that not possible
> with video, too?

I think this is quite similar to an unloaded <img>. For such an image you  
don't know the size and you won't find any (useful) properties in the  
context menu.

As a side note, it seems that identifying tracks (uniquely, or at all) is  
actually unnecessary for the use case. You just need a list of available  
tracks that browsers can show in the context menu until HAVE_METADATA, at  
which point the actual tracks will be shown (hopefully the same).
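
Roughly, the fallback I have in mind could look like the sketch below. The
function and parameter names are hypothetical; only the HAVE_METADATA
value (1) is taken from the media element's readyState constants.

```javascript
// readyState constant as defined for HTML media elements.
const HAVE_METADATA = 1;

// hintedTracks: whatever the page supplied out-of-band (hypothetical).
// actualTracks: what the user agent found in the resource itself.
function tracksForMenu(readyState, hintedTracks, actualTracks) {
  if (readyState >= HAVE_METADATA) {
    // Once metadata is loaded, the resource is the only authority.
    return actualTracks;
  }
  // Before that, the hints are the best information available.
  return hintedTracks;
}
```

Note that nothing here needs track identifiers: the hinted list is thrown
away wholesale as soon as the real one exists.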

>>> If we buried the track information in a javascript API, we would
>>> introduce an additional dependency and we would remove the ability to
>>> simply parse the Web page to get at such information. For example, a
>>> crawler would not be able to find out that there is a resource with
>>> captions and would probably not bother requesting the resource for its
>>> captions (or other text tracks).
>>
>> Surely, robots would just index the resources themselves?
>
> Why download binary data of indeterminate length when you can already
> get it out of the text of the Web page? Surely, robots would prefer to
> get that information directly out of the Webpage and not have to go
> and download gazillions of binary media files that they have to decode
> to get information about them.

Because in the general case you cannot get it from the text of the Web  
page, because authors won't put track information in the markup except in  
quite special cases (preload="none" and multiple audio/video tracks). It  
seems an unlikely optimization that robots should trust the markup in the  
rare case it is there instead of going to the source.

> Right now, everybody who sees a video element in an HTML5 page simply
> assumes that it consists of a video and an audio track and has no other
> information in it. This is fine in the default case and in the default
> case no extra resource description is probably necessary. But when we
> actually do have a richer source, we need to expose that.

I agree it should be exposed, but through a DOM API that will always be  
up-to-date and requires no special markup.

>>> Eric further said:
>>>>
>>>> It seems to me that because it will require new specialized tools to
>>>> get the information, and because it will be really difficult to do
>>>> correctly (ten digit serial numbers?), people are likely to just skip
>>>> it completely.
>>>
>>> There is a need for addressing the track in a unique way, i.e.
>>> javascript needs to be able to tell the media framework exactly which
>>> track it is talking about (e.g. to turn it on or off).
>>
>> The API for exposing tracks should simply have something like .enable()
>> for each track object (or similar), there's no need to expose unique IDs
>> to do this.
>
> How would you identify a track if you do not have a unique ID? What
> would .enable() work on if not a track identifier?
> video.firstTrack().enable() ? Even this implies that there is an order
> and that we know what is in the track.

Presumably the tracks could have language-information and names. But sure,  
id should also be exposed as track.id or similar. Still, the order of  
tracks would have to be well-defined or you'll get cross-browser  
incompatibilities in the blink of an eye.
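
As a sketch only, track objects with .enable() and one possible
well-defined order might look like this. The Track class, the orderTracks
helper and the role/lang/id sort rule are all assumptions for illustration,
not a proposal for the actual API; the ids are borrowed from the markup
example quoted further down.

```javascript
// Hypothetical track object: id, role and lang are exposed, and the
// track is switched on through the object itself.
class Track {
  constructor(id, role, lang) {
    this.id = id;
    this.role = role;
    this.lang = lang;
    this.enabled = false;
  }
  enable() {
    this.enabled = true;
  }
}

// One possible ordering rule (an assumption): compare role, then
// language, then id, using plain code-unit comparison so the result
// does not depend on the user's locale.
function compareTracks(a, b) {
  for (const key of ["role", "lang", "id"]) {
    if (a[key] < b[key]) return -1;
    if (a[key] > b[key]) return 1;
  }
  return 0;
}

function orderTracks(tracks) {
  return [...tracks].sort(compareTracks);
}

const ordered = orderTracks([
  new Track("ogg_cc", "caption", "en"),
  new Track("ogg_a", "audio", "en"),
  new Track("ogg_ad", "auddesc", "en"),
]);

// Select a track by its properties rather than by position or id.
const captions = ordered.find((t) => t.role === "caption" && t.lang === "en");
if (captions) captions.enable();
```

Whatever rule is picked, the point is that every browser has to apply the
same one.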

[snip]

>>> Incidentally, we do need to develop the javascript API for exposing
>>> the video's tracks no matter whether we do it in declarative syntax or
>>> not. Here's a start at a proposal for this (obviously inspired by the
>>> markup):
>>>
>>>  video.numberTracks(); -> return number of available tracks
>>>  video.firstTrack(); -> returns first track ("first" to be defined -
>>> e.g. there is no inherent order in Ogg)
>>>  video.lastTrack(); -> returns last track ("last" to be defined)
>>>  track.next(); -> returns next track in list
>>>  track has the following attributes: type, ref, lang, role, media
>>> (and the usual contenders, e.g. id, style)
>>
>> Yes, we need something like this.
>
> OK, so if we cannot right now agree to have actual declarative syntax
> for it, could we for the moment focus on developing that API? While
> implementing this API, we will at least find out its flaws and we will
> also be able to exactly measure how much time and bandwidth is used in
> comparison to having declarative syntax provide this information.

Yes, let's do that. It's worth taking a look at  
http://www.w3.org/TR/mediaont-api-1.0/#webidl-for-api, but it crucially  
lacks the single most important thing we need -- a way to distinguish  
between different tracks.

[snip]

>>> Now, let's talk about the <overlay> element.
>>>
>>> I am not too fussed about renaming <itextlist> to <overlay>. I can see
>>> why you would go for this name - because most text will be rendered on
>>> top of or next to the video generally. It essentially provides a "div"
>>> into which the data can be rendered, rather than an abstract structure
>>> like my "itextlist". My intention was to keep the structure and the
>>> presentation separate from each other. But if it's general agreement
>>> that "overlay" is a better name, I'm happy to go with it. (Also, I'm
>>> happy to rename "itext" to "source", since that was already what I had
>>> started doing in
>>>
>>> http://blog.gingertech.net/2009/11/25/manifests-exposing-structure-of-a-composite-media-resource/
>>> , where I've also renamed "category" to "role").
>>>
>>> I'm assuming that in an example like this one below (no matter in
>>> which way the tracks are exposed), the caption track of the ogg file
>>> would be another track in the <source> element if the UA chose that
>>> video.ogv file over the video.mp4 file?
>>>
>>> <video>
>>>  <source src="video.ogv" type="video/ogg">
>>>  <tracks>
>>>   <track id='ogg_v' role='video' ref='serialno:1505760010'></track>
>>>   <track id='ogg_a' role='audio' lang='en'
>>> ref='serialno:0821695999'></track>
>>>   <track id='ogg_ad' role='auddesc' lang='en'
>>> ref='serialno:1421614520'></track>
>>>   <track id='ogg_s' role='sign' lang='ase'
>>> ref='serialno:1413244634'></track>
>>>   <track id='ogg_cc' role='caption' lang='en'
>>> ref='serialno:1421849818'></track>
>>>  </tracks>
>>>  <source src="video.mp4" type="video/mp4">
>>>  <tracks>
>>>   <track id='mp4_v' role='video' ref='trackid:1'></track>
>>>   <track id='mp4_a' role='audio' lang='en' ref='trackid:2'></track>
>>>  </tracks>
>>>  <overlay>
>>>   <source src="en.srt" lang="en-US">
>>>   <source src="hans.srt" lang="zh-CN">
>>>  </overlay>
>>> </video>
>>>
>>> I.e. it would be parsed to something like:
>>>
>>> <video>
>>>  <source src="video.ogv" type="video/ogg">
>>>  <overlay>
>>>   <source src="en.srt" lang="en-US">
>>>   <source src="hans.srt" lang="zh-CN">
>>>   <source ref='serialno:1421849818' lang="en">
>>>  </overlay>
>>> </video>
>>>
>>> This makes it an additional caption track to display. Is this right?
>>> There are no alternative choices between tracks?
>>>
>>>
>>> I would actually suggest that if we want to go with <overlay>, we need
>>> to specify different overlays for different types of text. In this way
>>> we can accommodate textual audio descriptions, captions, subtitles
>>> etc. Then, I would suggest that for every type of text there should
>>> ever only be one <source> displayed. It is not often that you want
>>> more than one subtitle track displayed. You most certainly never want
>>> to have more than one caption track displayed and never more than one
>>> textual audio description track. But you do want each one of them
>>> displayed in addition to the other.
>>>
>>> For example:
>>>
>>> <video src="video.ogg">
>>>  <overlay role="caption"
>>> style="font-size:2em;padding:1em;text-align:center; display: block;">
>>>    <source src="en-us.srt" lang="en-US">
>>>    <source src="en.srt" lang="en">
>>>  </overlay>
>>>  <overlay role="tad" style="z-index: -100; display: block;"
>>> aria-live="assertive">
>>>    <source src="tad-en.srt" lang="en">
>>>    <source src="tad-de.srt" lang="de">
>>>  </overlay>
>>>  <overlay role="subtitle"
>>> style="font-size:2em;padding:1em;text-align:center; display: block;">
>>>    <source src="de.srt" lang="de">
>>>    <source src="sv.srt" lang="sv">
>>>    <source src="fi.srt" lang="fi">
>>>  </overlay>
>>> </video>
>>>
>>>
>>
>> I agree on adding something like role="". On the naming, Maciej pointed
>> out and I now agree that <overlay> is presentational and not really a
>> brilliant choice. I think this should be controlled by CSS in some way
>> or another.
>
> I agree there needs to be a default CSS. I would even suggest if
> possible to make this dependent on the role.
>
> For example:
> * captions
>  color: white;
>  background-color: #333333;
>  opacity:0.8;
>  text-align: center;
>  bottom: 0;
>  position:absolute;
>
> * textual audio descriptions (if we can agree to use them)
>  visibility: hidden; (unless this makes screen readers not read them out)
>  aria-live: assertive;
>  position: absolute;
>  z-index: -100; (or more - shouldn't be visible)
>
>
>> What we agree on so far seems to be:
>>
>> <video src="video">
>>  <sourcelist role="subtitle">
>>    <source src="subtitles.en.srt" lang="en">
>>  </sourcelist>
>> </video>
>>
>> Where <sourcelist> is whatever name we can agree on. Maybe something
>> that sounds like it has to do with timed text, I don't know.
>
> How about the @type and @charset attributes? I think we need the @type
> attribute (for the same reason as we use it in the <source> elements).
> We could make the @charset attribute go away by enforcing the charset
> to be included in the @type where necessary.

@type is fine, but should be optional of course like on <source>.

About @charset I'd prefer to force user agents to trust the server-sent  
encoding or assume UTF-8 if it isn't given, but reality will tell us what  
makes sense here soon enough.
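
For illustration, the rule could be as small as the sketch below: take the
charset parameter from the server's Content-Type header if there is one,
otherwise assume UTF-8. The function name is hypothetical and the parsing
is deliberately simplistic.

```javascript
// Returns the encoding to use for an external subtitle file, given the
// Content-Type header value sent by the server (may be null or empty).
function subtitleCharset(contentType) {
  // Look for a charset parameter, with or without quotes around the value.
  const match = /;\s*charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || "");
  // No charset given by the server: fall back to UTF-8.
  return match ? match[1].toLowerCase() : "utf-8";
}
```

With this, "text/plain; charset=ISO-8859-1" is decoded as iso-8859-1 and a
bare "text/plain" as utf-8, with no @charset attribute involved.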

> Also: Are subtitles all you can agree with? They are not really an
> accessibility means, but rather an internationalisation means, that
> can just be covered in the same way as captions. So, could I suggest
> to add at least the role of "caption"?

I'm fine with any/all of the roles, as long as it's text. I don't know  
what a user agent should do with it though, if anything.

[snip]

> How would you suggest to solve the problems of in-stream text tracks
> and those of audio description sound files and sign language videos?

For audio/video tracks, by exposing them in the browser context menus and  
providing the DOM APIs to make it possible to do the same with scripted  
controls.

For text it's basically the same except we also need to figure out how to  
render it and how it interacts with CSS (if at all). Because this is a bit  
messy, I'm much more interested in sorting out how to handle external  
subtitles right now.

-- 
Philip Jägenstedt
Core Developer
Opera Software
Received on Monday, 1 February 2010 17:07:13 UTC