Re: timing model of the media resource in HTML5 from Philip Jägenstedt on 2009-11-30 (public-html-a11y@w3.org from November 2009)

From: Philip Jägenstedt <philipj@opera.com>
Date: Mon, 30 Nov 2009 23:08:13 +0100
To: "Maciej Stachowiak" <mjs@apple.com>
Cc: "Eric Carlson" <eric.carlson@apple.com>, "Silvia Pfeiffer" <silviapfeiffer1@gmail.com>, "HTML Accessibility Task Force" <public-html-a11y@w3.org>
Message-ID: <op.u38afzgbsr6mfa@worf>
On Mon, 30 Nov 2009 15:30:03 +0100, Maciej Stachowiak <mjs@apple.com>  
wrote:

>
> On Nov 29, 2009, at 4:11 AM, Philip Jägenstedt wrote:
>
>> On Sat, 28 Nov 2009 19:42:15 +0100, Maciej Stachowiak
>> <mjs@apple.com> wrote:
>>>
>>> This interface works ok for the specific case of popping up some
>>> text, but it seems like it would be awkward for anything more
>>> complicated, since there is only a single event and set of
>>> handlers. What I would suggest is the declarative cue range idea
>>> that was suggested on the whatwg list a while back:
>>>
>>> <video>
>>>     <source type="video/mp4" src="video.m4v">
>>>     <timerange start="10" end="12" onrangeenter="enterRange1()"
>>> onrangeleave="leaveRange1()">
>>> </video>
>>>
>>> This makes it really easy to have different handlers per cue range
>>> without having to express that difference as a string. It also
>>> makes it simpler to use cue ranges for two orthogonal purposes.
>>>
>>> addCueRange() could just be a shortcut for adding such a <range>
>>> element:
>>>
>>> var range = v.addCueRange(10, 12);
>>> range.addEventListener("rangeenter", function(e)
>>> { e.target.querySelector('overlay').textContent = "Hello"; }, false);
>>> range.addEventListener("rangeleave", function(e)
>>> { e.target.querySelector('overlay').textContent = ''; }, false);
>>
>> If addCueRange does nothing but insert elements in the DOM then we
>> don't need it at all, simply let script authors write it themselves
>> if they need a shortcut.
>
> It's not uncommon for HTML elements to have shortcuts in their DOM
> interface for things you could do in theory by DOM manipulation, for
> example, consider the HTMLTableElement API. In this case, the
> convenience method lets you collapse at least 4 lines into 1, and lets
> you ignore the fact that the range is represented by a DOM element.
> (It might make sense to make it addTimeRange() to match the element
> name, or rename the element <cuerange>.)

My (current) thinking is that it should be possible for scripts to  
register callbacks for specific time ranges without adding elements  
to the DOM, because in many cases such elements would be of no use and  
would basically just waste memory.
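
As a rough sketch of what such script-only ranges could look like (this is  
not a proposed API; createCueRanges and its add/update methods are  
hypothetical names made up for illustration), the ranges can live in plain  
objects and be driven from the media element's playback position:

```javascript
// Hypothetical helper, not part of any spec: keeps time ranges in plain
// objects (no DOM elements) and fires callbacks on enter and leave.
function createCueRanges() {
  const ranges = [];
  return {
    // Register a range [start, end) in seconds with optional callbacks.
    add(start, end, onEnter, onLeave) {
      const range = { start, end, onEnter, onLeave, active: false };
      ranges.push(range);
      return range;
    },
    // Feed in the current playback position; fires callbacks on transitions.
    update(time) {
      for (const r of ranges) {
        const inside = time >= r.start && time < r.end;
        if (inside && !r.active) {
          r.active = true;
          if (r.onEnter) r.onEnter(r);
        } else if (!inside && r.active) {
          r.active = false;
          if (r.onLeave) r.onLeave(r);
        }
      }
    }
  };
}
```

In a page this would presumably be wired up by calling  
cues.update(video.currentTime) from a "timeupdate" listener on the <video>  
element. Since timeupdate fires only a few times per second, very short  
ranges could be missed, which is one argument for UA support.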

>> It has been my working assumption that external SRT files should fire
>> the same events, so not all ranges are represented as an element in
>> the DOM. addCueRange would then be a way to add such not-in-DOM
>> ranges.
>
> 1) Why would playback of an SRT file need to fire range enter/leave events?
> Is there a use case for this?

Basically to take the text, manipulate it and display it in some other  
fashion than overlaid, e.g.: drawing the text in <canvas> with effects;  
scrolling the dialog in "IRC log" form (I would actually quite like this  
just for the scrollback); highlighting search keywords; muting the sound  
whenever the word fsck appears (for the UNIX-haters out there). However,  
I'm not certain that the added complexity of exposing external subtitles  
via the DOM is justified.

>      1.a) Does it critically need to be the same events?

Not critically, but "all else equal".

>      1.b) Even if it's the same events, couldn't we dispatch them on
> whatever element embeds the text track?

Sure.

> 2) If this is built in UA behavior, there's no need to have a public
> DOM API to implement it.

See 1)

>> In <https://wiki.mozilla.org/Accessibility/Experiment1_feedback> I
>> suggested a MediaTimeRange interface. Remixing that somewhat:
>>
>> interface MediaTimeRange {
>>  attribute double start;
>>  attribute double end;
>>  //attribute DOMString text;
>>  // FIXME: how to represent the content?
>> }
>>
>> interface MediaTimeRangeList {
>>  // automatically sorted by increasing time
>>  readonly attribute unsigned long length;
>>  getter DOMString item(in unsigned long index);
>>  void add(in MediaTimeRange range);
>>  void remove(in MediaTimeRange range);
>>  // these last two look suspiciously similar to appendChild and
>>  // removeChild
>> }
>>
>> interface HTMLItextlistOverlayWhateverElement : HTMLElement {
>>  attribute MediaTimeRangeList ranges;
>> }
>>
>> The problem I was trying to solve is that of representing the time
>> ranges uniformly regardless of their source. External subtitles can
>> be accessed and modified via MediaTimeRangeList. <timerange> gets
>> mapped into a MediaTimeRangeList. A MediaTimeRangeList can be
>> constructed by scripts.
>
> I don't think it's necessary to make an external timed text resource
> look the same as built in <timerange> elements. What's the use case
> for this? Is there any code that will want to treat the two in the
> same way?

The use case is writing the code for handling timed text in the  
site-specific awesome manner (see above) only once and being able to use  
it regardless of the source of that timed text.

>> However, make note of the FIXME. Because not all external subtitle
>> formats can be represented as plain text, there are basically 3
>> options:
>>
>> 1. Make the content completely opaque. Makes modification
>> impossible, but the same is true of almost any external resource.
>>
>> 2. Reduce the content to plain text. Modification would then destroy
>> what extra style information there was.
>
> Is there a use case for getting the text of the currently displayed
> subtitle? If subtitles are implemented client-side, I could see it,
> but as a built-in UA behavior, I'm not sure why you would need this.

See above.

>> 3. Transcode the content to HTML+CSS. Basically, while parsing
>> external SRT, the UA would construct an equivalent HTML DOM as
>> children of <itextlist-overlay-whatever>. This would actually make
>> the MediaTimeRange idea above redundant because the information
>> would already be in the DOM. All in all though, this would be quite
>> strange and not a serious suggestion.
>
> That seems pretty complicated. Again, what's the use case?

It's what you would have to do to allow run-time manipulation of complex  
subtitle formats which cannot be represented as plain text. In this case I  
think it's very clear that the complexity isn't justified, so let's move  
on.

>> Looking at the above, trying to force time ranges from all sources
>> into a single interface isn't looking good. Perhaps the whole effort
>> is misguided.
>
> I don't think there needs to be a single interface that abstracts
> built-in support for timed text formats, and script-defined time ranges.
>
>> Is there in fact a use case for accessing/modifying the time ranges
>> and contents of external subtitles? For getting callbacks/events
>> when such ranges are entered and left? For styling such content with
>> CSS?
>
> I don't think there is. Maybe for styling with CSS, for formats that
> don't have their own styling, but you could handle that with a single
> pseudo-element, or a style rule for the embedding element.

I realize I'm backtracking somewhat here, but I'm being quite serious when  
I say that I would like to be able to write some UserJS which displays  
captions in scrolling IRC log format below a video regardless of the  
origin of those captions. However, trying to expose arbitrary formats via  
the DOM just isn't going to work, so it's either text-only or nothing at  
all, it seems.
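
To illustrate the text-only case, here is a minimal sketch of the "IRC  
log" formatting. It assumes the captions are already available to script  
as plain { start, end, text } objects, which no current proposal  
guarantees; formatTimestamp and scrollbackLines are made-up names.

```javascript
// Format a time in seconds as mm:ss for the log prefix.
function formatTimestamp(seconds) {
  const m = Math.floor(seconds / 60);
  const s = Math.floor(seconds % 60);
  return String(m).padStart(2, "0") + ":" + String(s).padStart(2, "0");
}

// Return one log line per caption that has started by `time`, oldest first.
// `cues` is an assumed array of { start, end, text } objects; how they are
// obtained (external SRT, <timerange>, script) is deliberately left open.
function scrollbackLines(cues, time) {
  return cues
    .filter((cue) => cue.start <= time)
    .sort((a, b) => a.start - b.start)
    .map((cue) => "[" + formatTimestamp(cue.start) + "] " + cue.text);
}
```

The resulting lines could then be appended to whatever element sits below  
the video, independent of where the captions came from.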

>> By throwing away all interaction between external subtitles and the
>> DOM, cross-origin issues become irrelevant. The only use case I
>> think is actually... useful... is styling it with CSS. For now, I
>> will abandon the working assumption about SRT firing events, etc.
>>
>>> Another possibility is that <timerange> elements have contents
>>> which automatically become visible or hidden depending on whether
>>> content is in the range, so the common use case (make some content
>>> appear during certain time ranges of the video) works without any
>>> script:
>>>
>>> <video>
>>>     <source type="video/mp4" src="video.m4v">
>>>     <timerange start="10" end="12">Hello</timerange>
>>> </video>
>>>
>>> The contents could be arbitrary HTML, which would make it very
>>> simple to sync a slideshow to a video, in addition to handling the
>>> captions use case. CSS styling could be used to position the
>>> currently visible <timerange> over the video.
>>>
>>
>> I quite like the declarative syntax in the last example, but think
>> that <timerange> should have a wrapping element which is the same
>> used to reference external time ranges (a.k.a. subtitles). Mostly
>> this is to group them into "tracks".
>>
>> <video>
>>  <source type="video/mp4" src="video.m4v">
>>  <itextlist-overlay-whatever lang="zh" src="chinese.srt"></itextlist-overlay-whatever>
>>  <itextlist-overlay-whatever lang="en">
>>    <timerange start="10" end="12">Hello</timerange>
>>  </itextlist-overlay-whatever>
>>  <itextlist-overlay-whatever lang="sv">
>>    <timerange start="10" end="12">Hej</timerange>
>>  </itextlist-overlay-whatever>
>> </video>
>
> If the element to embed a timed text file were separate from anything
> related to <timerange>, then we could just call it <timedtext>. That
> would be a nice, semantic name. And perhaps we could even define that
> it could be used outside a <video> or <audio>, in which case it gets
> its own set of controls.

Using <timedtext> for <audio> would certainly be useful, but standalone?  
What timeline should it sync against? What would you achieve that hasn't  
been possible for years with some CSS+JavaScript? But sure, if it turns  
out, once everything else is sorted out, that it can be reused without  
added complexity, I don't see why not.

> I understand your goal here with language selection. If <timedtext>
> needs to have a lang attribute, then perhaps <timerange> could as
> well, and organizing into tracks can be done by the user. However,
> conventionally, language negotiation is not done at the HTML level.
> Usually it is done via the Accept-Language header (HTTP content
> negotiation) or through a site-specific setting stored as a Cookie or
> via guessing based on IP, all server-side. I would be hesitant to make
> the content negotiation work at the HTML level here, unless there
> really is no workaround for some specific use case.

Actually, I'm not suggesting automatic language selection by the UA at  
this point, precisely because I think negotiation via Accept-Language has  
failed. As I said, the main purpose is to group tracks to allow UAs and  
scripts to choose between them. lang="" does nothing more on <source> than  
on any other element (help speech synthesizers and search engines?)

> For purely organizational purposes, you could easily use HTML comments
> or a <div> to group <timerange> elements, if you really care to.

That doesn't help UAs to enable/disable such groups.

>> I suppose that for styling, we would have a CSS pseudo-class
>> :yourtimeisnow? A probable default style would then be
>>
>> timerange { display:none; }
>> timerange:yourtimeisnow { display: block; }
>>
>> If we use some declarative time range syntax, surely the next thing
>> people will want is to be able to use it outside of <video>.
>
> Maybe :current would be a better name.

Definitely.

>> <video id="v0" src="my-video"></video>
>> Subtitles below:
>> <div>
>>  <timerange start="10" end="12" ref="v0">Hello</timerange>
>> </div>
>>
>> Good idea? When people inevitably ask for this, I think we should
>> tell them to do it with scripts instead.
>
> I think we should have some convenient way to put the <timerange>
> content outside the visual area of the video, for the use case of a
> slide show synchronized to a video of the accompanying presentation.
> In that case you definitely want the <timerange> contents below, not
> overlaid. I think perhaps the default presentation should be to put
> the timerange content *after* the video, since it is easier to CSS
> position into a box than out of it. So you could do
>
> video.myVid > timerange {
>       text-align: center;
>       position: absolute;
>       width: 100%;
>       bottom: 15px;
> }
>
> This should position the contents of time ranges at 15px from the
> bottom of the video. (CSS not tested but I bet something similar would
> work.)
>
> To make this convenient, we could introduce a presentational boolean
> attribute "inside" on <timerange> which does something like this
> automatically. Or we could make being inside the default and have an
> "outside" attribute.
>
> I can see how the "ref" functionality is more general, though, so it
> seems like a valid alternative.

Wouldn't @inside and @outside be purely presentational? If by "ref" you  
mean that in order to put content outside of <video> you simply place it  
outside in the markup and ref the video by id, I strongly agree. However,  
if actual styling is all done using a :current pseudo-selector, what  
meaningful semantics does <timerange> have? Apart from styling, the  
content is the same before, during and after the range. How about we  
simply use an attribute or push everything over to CSS?

<div t="10s,12s">Hello</div>

(oops, presentational attribute?)

Here I'm reusing the temporal dimension syntax from media fragments. [1]

<div style="time-range:10s 12s">Hello</div>

Instead of a :current selector, I think we should have 3 of them:

:before-range
:in-range
:after-range

This would enable e.g. a live transcript with different styles for  
past/current/future dialog with transitions in between. Possibly the next  
step would be pseudo-selectors with time range, e.g. :time-range[10s 12s]  
so that the same element can have several ranges with different styles.  
Comments?
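
Until something like this exists in CSS, the three states can be  
approximated with script. The sketch below is only an emulation under  
stated assumptions: the state names are reused as ordinary class names,  
and data-start/data-end attributes (in seconds) are a hypothetical way of  
marking up the ranges.

```javascript
// Classify a playback position against a range [start, end).
function rangeState(time, start, end) {
  if (time < start) return "before-range";
  if (time < end) return "in-range";
  return "after-range";
}

// Apply the state as a class name so ordinary CSS (including transitions)
// can style past, current and future dialog differently.
// Assumes each element carries hypothetical data-start/data-end attributes.
function applyRangeClasses(elements, time) {
  for (const el of elements) {
    const state = rangeState(
      time,
      Number(el.dataset.start),
      Number(el.dataset.end)
    );
    el.classList.remove("before-range", "in-range", "after-range");
    el.classList.add(state);
  }
}
```

A stylesheet would then use .before-range, .in-range and .after-range  
where the proposal above would use the corresponding pseudo-selectors.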

And now for something completely different...

Going back to the core of my <overlay> idea, possibly what we need is a  
CSS display type that, when applied to children of replaced content, causes  
that content to be visible inside/on top of the replaced content element  
with that element as the offset parent (so that position:absolute;bottom:0  
aligns the child content to the bottom of the replaced content).

<style>
video > .controls {
   display: overlay;
   position: absolute;
   left: 0;
   right: 0;
   bottom: 0;
}
</style>
<video>
<div class="controls"><!-- buttons here --></div>
</video>

Crucially, this allows the UA to keep the controls in fullscreen mode.  
Suggestions for better naming are welcome as always.

[1]  
http://www.w3.org/2008/WebVideo/Fragments/WD-media-fragments-spec/#naming-time

-- 
Philip Jägenstedt
Core Developer
Opera Software
Received on Monday, 30 November 2009 23:00:30 UTC