[whatwg] Timed tracks for <video>

On Fri, Jul 23, 2010 at 3:40 PM, Ian Hickson <ian at hixie.ch> wrote:
>
> I recently added to the HTML spec a mechanism by which external subtitles
> and captions can be added to videos in HTML.
>
> In designing this feature I went through hundreds and hundreds of e-mails,
> blogs, proposals, etc, trying to get all the key use cases that needed
> handling. (Replies to the WHATWG e-mails on the topic are included below.)

Let me start by congratulating Ian on this piece of work. It has been
a massive effort, and many good ideas have made their way into the
specification (though not all ;-). While I believe we still have some
discussions ahead of us and several improvements to make before
implementations should be considered, I certainly think it's a huge
step forward.


> The proposal consists of several components:
>
>  - A <track> element for linking to timed tracks from the markup.
>  - A DOM API for manipulating timed tracks dynamically.
>  - A specification for a simple captioning format.
>  - A set of rules and processing models to hold it all together.

I want to give feedback on the captioning format in a different
thread, because this is indeed where I have the most issues and I want
to look at that in a larger context. Here, I want to give feedback on
the other three dimensions, which in my opinion should anyway be
independent of the caption format of choice.


1. The <track> element
==================
http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#the-track-element

Seeing as a lot of previous proposals and contributions have gone into
this part of the specification, there is not much to criticize. I
still have some statements, questions and suggestions.


** One open question is still the one of formats:

> On Thu, 16 Jul 2009, Silvia Pfeiffer wrote:
>> * the "type" attribute is meant to both identify the mime type of the
>> format and the character set used in the file.
>
> It's not clear that the former is useful. The latter may be useful; I
> haven't supported that yet.

If the element is to support only a single format in a single
character set, then there is no need for a MIME type. So we need to be
clear about whether we want to keep the option of multiple formats
open. If we choose a format now that is difficult to extend in the
future with features that we may not be considering yet (e.g. SVG in
caption cues, or even video in caption cues as a sort of
picture-in-picture), we may need to support a second format later and
then introduce the @type attribute along similar lines to the audio
and video elements.
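
To illustrate, here is a minimal sketch of what that could look like
from script. Note that both the @type attribute on <track> and the
MIME type used here are hypothetical - neither is in the current spec:

  // Hypothetical: selecting among multiple caption formats through a
  // @type attribute on <track>, analogous to <source type> on media
  // elements. Attribute and MIME type are illustrative only.
  var track = document.createElement('track');
  track.setAttribute('kind', 'captions');
  track.setAttribute('src', 'captions.dfxp');
  track.setAttribute('type', 'application/ttaf+xml'); // hypothetical
  document.querySelector('video').appendChild(track);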


** Further, the charset question:

>> The character set question is actually a really difficult problem to get
>> right, because srt files are created in an appropriate character set for
>> the language, but there is no means to store in a srt file what
>> character set was used in its creation. That's a really bad situation to
>> be in for the Web server, who can then only take an educated guess. By
>> giving the ability to the HTML author to specify the charset of the srt
>> file with the link, this can be solved.
>
> Yeah, if this is a use case people are concerned about, then I agree that
> a solution at the markup level makes sense.

If we really are to use WebSRT because (amongst other reasons) it
allows reuse of existing srt files, then we need to introduce a means
to provide the charset, since almost none of the srt files in the wild
that I have looked at were in UTF-8 - they came in all sorts of other
character sets. Another solution would be for WebSRT files to declare
their own charset - then we don't need to add such information to the
<track> element. That would still not work for legacy SRT files,
though.
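
As a minimal sketch of the markup-level solution, assuming a
hypothetical @charset attribute on <track>:

  // Hypothetical: declaring the character set of a legacy srt file at
  // the markup level. The 'charset' attribute does not exist in the
  // current spec.
  var track = document.createElement('track');
  track.setAttribute('kind', 'subtitles');
  track.setAttribute('src', 'subtitles-de.srt');
  track.setAttribute('charset', 'ISO-8859-1'); // hypothetical attribute
  document.querySelector('video').appendChild(track);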


** Then the question of default activation:

> On Fri, 31 Jul 2009, Silvia Pfeiffer wrote:
>> * It is unclear, which of the given alternative text tracks in different
>> languages should be displayed by default when loading an <itext>
>> resource. A @default attribute has been added to the <itext> elements to
>> allow for the Web content author to tell the browser which <itext>
>> tracks he/she expects to be displayed by default. If the Web author does
>> not specify such tracks, the display depends on the user agent (UA -
>> generally the Web browser): for accessibility reasons, there should be a
>> field that allows users to always turn display of certain <itext>
>> categories on. Further, the UA is set to a default language and it is
>> this default language that should be used to select which <itext> track
>> should be displayed.
>
> It's not clear to me that we need a way to do this; by default presumably
> tracks would all be off unless the user wants them, in which case the
> user's preferences are paramount. That's what I've specced currently.
> However, it's easy to override this from script.

Web page authors probably want a means to turn on certain tracks by
default rather than just leaving it to the UA to select based on
preferences, with JavaScript as the only override. In fact, the
current design seems to go against the principled approach where the
author suggests, the UA preferences override, and the user has
ultimate control. Where this currently falls down is the "author
suggestion", IMO. The description of the @controls attribute at
http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#user-interface
mentions caption tracks and thus covers interactive selection by the
user. But turning tracks on/off with JavaScript overrules the UA's
preference settings and thus provides not an author suggestion, but an
author override. We could fix this by also turning the mode IDL
attribute into a content attribute on the <track> element.
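
To illustrate the difference, a sketch, assuming the track list is
exposed as media.tracks per the current draft; the "mode" content
attribute in the comment is hypothetical:

  // Today, an author can only *override* from script:
  var track = document.querySelector('video').tracks[0];
  track.mode = track.SHOWING;  // override: turns the track on

  // A hypothetical content attribute would turn this into a mere
  // suggestion that UA preferences and the user can still override:
  //   <track src="captions.srt" kind="captions" mode="showing">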


** Next the feature of fixing "stretch" and "drift":

>> * Another typical feature of time-aligned text files is that they may be
>> out of sync with the actual video file. For that purpose, a @delay
>> attribute was suggested as an addition to the <itext> element. This has
>> not been implemented into the demo. In the feedback to this proposal, a
>> further "stretch" or "drift" attribute was suggested.
>
> I haven't added this yet, but it's an interesting idea (possibly best kept
> until a "v2" release though). One can implement this from script by
> creating a new track that simply copies the previous one cue-for-cue with
> an offset applied, so we'll be able to see if this is something for which
> there is real demand by seeing if anyone does that.

I agree that these are "v2" features and that a real need for them
still has to be demonstrated.


** the list of track kinds:

You mention that karaoke and lyrics are supported by WebSRT, so could
we add them to the track kinds?


** a @media media query attribute:

In the proposal at
http://www.w3.org/WAI/PF/HTML/wiki/Media_TextAssociations a @media
attribute was suggested. The idea is that the @media attribute would
contain a media query describing what user environments, e.g. what
devices, the text track is suitable for. If, for example, subtitles
require a minimum width of 30 characters to be displayed properly, but
certain devices cannot support this, the subtitles would be pretty
useless on such a device. Seeing as the <source> elements on media
elements already have that attribute, it wouldn't be too difficult to
implement the same here.
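
A sketch of how that might look, assuming a hypothetical @media
attribute on <track> analogous to <source media>:

  // Hypothetical: a media query restricting a text track to devices
  // that can render it properly. The 'media' attribute on <track> is
  // the proposal, not part of the current spec.
  var track = document.createElement('track');
  track.setAttribute('kind', 'subtitles');
  track.setAttribute('src', 'subtitles.srt');
  track.setAttribute('media', 'screen and (min-width: 480px)');
  document.querySelector('video').appendChild(track);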

Is this a "v2" feature or is it considered to be added?


2.  The DOM API for manipulating timed tracks dynamically
=============================================
http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#htmltrackelement
and
http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#timed-track-api
and
http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#sourcing-in-band-timed-tracks

This API has been defined by adding a readonly list of TimedTracks to
HTMLMediaElement. A TimedTrack in turn consists of a TimedTrackCueList
with all available cues in that track and a TimedTrackCueList with
only the active cues.

A TimedTrack can be created through three different means:
(1) a <track> element as a child of a media element, which exposes a
TimedTrack interface,
(2) a MutableTimedTrack interface, which extends the TimedTrack
interface with two functions to add and remove cues on the fly, thus
enabling scripted creation of a TimedTrack without a declared <track>
element - the track is created using the addTrack() method of the
media element, and
(3) sourcing a TimedTrack from in-band data of a media resource, which
is also added to the resource fetching algorithm.

(NOTE: there is a typo in section 4.8.10.10.5 when describing
MutableTimedTrack - in the green box, addCue() is repeated, but the
second one should be called removeCue() ).
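
As a sketch of means (2), assuming addTrack() takes kind, label and
language arguments as in the current draft, and assuming a
TimedTrackCue constructor whose exact signature may differ:

  // Scripted creation of a track via the media element's addTrack().
  var video = document.querySelector('video');
  var track = video.addTrack('captions', 'My captions', 'en');
  // Constructor arguments (id, start, end, text) are assumed here.
  var cue = new TimedTrackCue('cue1', 0.0, 5.0, 'Hello world');
  track.addCue(cue);
  // ... and later, per the (corrected) removeCue():
  track.removeCue(cue);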


** in-band tracks

I wonder about the order in which <track> elements, mutable tracks,
and in-band TimedTracks are held. Section 4.8.10.10.1 states the above
order (i.e. <track> first, then mutable, then in-band). That <track>
comes first makes sense, since different browsers may choose different
media resources, which may have different in-band tracks. Thus, at
least the numbering of the TimedTracks from <track> elements stays
consistent. However, the in-band tracks only become available after
the media resource has been parsed, while the mutable tracks are
script-created and could be created dynamically through user
interaction. Does that mean that the index of the in-band tracks can
change over the lifetime of the Web page, depending on how many
mutable tracks exist at a given time?

I would probably also state more explicitly that in-band tracks are
only taken from the media resource that is in @currentSrc.


** TimedTrackCue

I am concerned about the definition of the TimedTrackCue.
http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#timedtrackcue

It has the following IDL attributes:
  readonly attribute DOMString direction;
  readonly attribute boolean snapToLines;
  readonly attribute long linePosition;
  readonly attribute long textPosition;
  readonly attribute long size;
  readonly attribute DOMString alignment;

All of these relate to CSS properties and I wonder how that interacts.
For example, what if @direction says "vertical" but the CSS for the
cue says direction:rtl; ?

I believe we do not need these attributes, since various CSS
properties can cover all of them.

I am also confused about the snapToLines and linePosition attributes:
IIUC, linePosition is meant to be either a percentage of the video
dimensions or a line position relative to the first line of the cue.
Does the latter mean an offset from where the first line of the cue
would theoretically be? What is the purpose of that?
Does the former mean that we can only provide text for video and not
for audio, which has no dimensions? What if we have a lyrics file for
a piece of music? Can that not be rendered?
And what if we wanted to render captions underneath a video rather
than inside the video dimensions? Can that be achieved somehow?


** adding cue ranges

> On Thu, 16 Jul 2009, Philip Jägenstedt wrote:
>> As far as I can tell no browser wants to implement the addCueRange API
>> (removing this should be the topic of a separate mail), so we really
>> need to re-think this part and I think that timed text plays an
>> important part here.
>
> The addCueRange() API has been removed and replaced with a feature based
> on the subtitle mechanism.

IIRC, the use cases for addCueRange() as a JavaScript function were:
* the alignment of text cues with time ranges as in captions, subtitles etc
* enabling the time-accurate activation and deactivation of certain
activities such as showing a slide and moving to the next

The first one is easily covered with the MutableTimedTrack interface
and the addCue(in TimedTrackCue cue) function.

The second one should be met through the onenter and onexit events on
the TimedTrackCue:
http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#cue-events
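
For example, a slide show could be driven like this (a sketch;
showSlide()/hideSlide() are hypothetical page functions, and I'm
assuming the track list is exposed as media.tracks per the draft):

  // Time-accurate activation/deactivation through cue events.
  var track = document.querySelector('video').tracks[0];
  var cue = track.cues[0];
  cue.onenter = function () { showSlide(); }; // cue becomes active
  cue.onexit = function () { hideSlide(); };  // cue stops being active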

I look forward to experimenting with these!


** Linking into and out of a cue-range

In http://www.mail-archive.com/whatwg at lists.whatwg.org/msg10395.html
Dave Singer wrote:
> Linking into a cue-range would be using its beginning or end as a seek point, or its duration as a restricted view of the media
> ("only show me cue-range called InTheBathroom"). Linking out of a cue-range would be establishing a click-through URL that
> would be dispatched directly if the user clicked on the media during that range (dispatched without script).

I believe in these use cases, too.

It is possible to jump to a cue using JavaScript by looking it up in
the media element's track list and setting @currentTime to that cue's
start time. However, it has not yet been defined whether there is a
relationship between media fragment URIs and timed tracks. The media
fragment URI specification has such URIs defined as
e.g. http://example.com/video.ogv#id="InTheBathroom", and cues have a
textual identifier, so we can put these two together to enable this.
Such URIs could then be used in the @src attribute of a media element
to focus the view on that cue, just like temporal media fragments do
with an arbitrary time range.
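
The script-based jump mentioned above would look something like this
(a sketch, assuming cues expose startTime as per the draft):

  // "Linking into" a cue from script by seeking to its start time.
  var video = document.querySelector('video');
  var chapters = video.tracks[0];   // assume this is the chapter track
  var cue = chapters.cues[2];       // e.g. the cue named "InTheBathroom"
  video.currentTime = cue.startTime;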

For linking out of a cue, there is a need to allow hyperlinks in cues.
IIUC this is currently only possible by using HTML-style markup in the
cue, declaring the cue as kind=metadata, calling getCueAsSource() on
the cue, then running your own overlay and assigning the retrieved
text to the overlay's innerHTML.
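
Spelled out, that workaround looks something like this (a sketch; the
overlay element and its positioning are left to the page author):

  // Current workaround: render a kind=metadata cue that contains an
  // HTML fragment into a self-made overlay.
  var overlay = document.getElementById('caption-overlay');
  var track = document.querySelector('video').tracks[0]; // kind=metadata
  var cue = track.cues[0];
  cue.onenter = function () { overlay.innerHTML = cue.getCueAsSource(); };
  cue.onexit = function () { overlay.innerHTML = ''; };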

While that works, it seems like a lot of hoops to jump through just to
be able to use a bit of HTML markup - in particular having to run your
own overlay. Could we introduce a kind=htmlfragment type where it is
obvious that the text is HTML, so that the fragment parser can be run
automatically and the result displayed through the given display
mechanisms?


** metadata

Many existing subtitle formats and similar media-time-aligned text
formats contain file-wide name-value pairs that provide metadata for
the complete resource. One example is Lyrics (LRC) files, e.g.

On Tue, 20 Apr 2010, Silvia Pfeiffer wrote:
>
> Lyrics (LRC) files typically look like this:
>
> [ti:Can't Buy Me Love]
> [ar:Beatles, The]
> [au:Lennon & McCartney]
> [al:Beatles 1 - 27 #1 Singles]
> [by:Wooden Ghost]
> [re:A2 Media Player V2.2 lrc format]
> [ve:V2.20]
> [00:00.45]Can't <00:00.75>buy <00:00.95>me <00:01.40>love,
> <00:02.60>love<00:03.30>, <00:03.95>love, <00:05.30>love<00:05.60>
> [00:05.70]<00:05.90>Can't <00:06.20>buy <00:06.40>me <00:06.70>love,
> <00:08.00>love<00:08.90>

You can see that there are title, artist, author, album, related
content, version and similar metadata headers on this file. Other
examples contain copyright information and usage rights - important
information to understand and deal with when distributing
media-time-aligned text files on a medium such as the Web.

The current TimedTrack platform does not allow for and does not deal
with such metadata. It would, however, make sense to make such
metadata available to the Web page that includes the video with its
TimedTracks. This is particularly useful so the Web page can expose
this data visibly, and so it can be included in scraped text for the
video, for search and similar purposes.

Could we introduce a means to have such name-value pairs dealt with in
the TimedTrack platform? Maybe by adding something like a list of
HTMLMetaElements to the TimedTrack interface?
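
As a purely hypothetical sketch of what that could look like from
script (no 'metadata' attribute exists in the current draft):

  // Hypothetical: file-wide LRC-style headers exposed on the track.
  var track = document.querySelector('video').tracks[0];
  var title = track.metadata['ti'];   // hypothetical; "Can't Buy Me Love"
  var artist = track.metadata['ar'];  // hypothetical; "Beatles, The"
  document.title = title + ' - ' + artist;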



3.  The set of rules and processing models to hold it all together
=================================================
http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#timed-tracks
and
http://www.whatwg.org/specs/web-apps/current-work/complete/rendering.html#timed-tracks-0


** the display of chapter tracks:

> I've also included support for chapters. Currently this support is not
> really fully fleshed out; in particular it's not defined how a UA should
> get chapter names out of the WebSRT file. I would like implementation
> feedback on this topic -- what do browser vendors envisage exposing in
> their UI when it comes to chapters? Just markers in the timeline? A
> dropdown of times? Chapter titles? Styled, unstyled?
>
> Currently a cue payload can be either cue text (simple markup) or metadata
> text (arbitrary data for scripts). We could add a third form consisting of
> just plain text for chapter titles, or we could reuse cue text, depending
> on what is needed here. Currently the spec requires them to be cue text
> but doesn't say how to get plain text out of them.

I believe cue text is fine as a chapter title. And I think it would be
good to define a standard means of extracting plain text out of any
type of cue, so that it can be handed to e.g. the accessibility API
for reading back.
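
Until such a means is defined, pages would have to approximate it
themselves, e.g. (a sketch; the tag-stripping regexp is naive):

  // Naive plain-text extraction from a cue's source text by stripping
  // anything that looks like markup. A standardised method on the cue
  // would be preferable to this.
  function cuePlainText(cue) {
    return cue.getCueAsSource().replace(/<[^>]*>/g, '');
  }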

I would actually like to see an interface where the chapter markers
can be used for navigating through the media resource, e.g. as you are
playing back the media file, you can press SHIFT-rightarrow and
SHIFT-leftarrow to navigate back and forth within a track (in
particular within a chapter track). This is particularly important for
blind users.

Whether we expose the navigation visually through markers along the
timeline (similar to how e.g. Viddler
http://smallbiztechnology.com/media/viddler.jpg or TED
http://www.arguingwithmyself.com/wordpress/wp-content/uploads/ted-video-player.png
do it) or whether we put it into a menu (similar to e.g. the QuickTime
player https://peepcode.com/system/chapters/httperf-chapters.png), I
am not overly fussed. Incidentally, there are more examples of means
of rendering chapters at
http://wiki.whatwg.org/wiki/Use_cases_for_API-level_access_to_timed_tracks#Chapter_Markers


** using metadata kind tracks

> In WebSRT, this would be:
>
>  10:00.000 --> 20:00.000
>  { title: "Chapter 2", description: "Some blah relating to chapter 2", image: "/images/chapter2.png" }
>
>  20:00.000 --> 30:00.000
>  { title: "Chapter 3", description: "Chapter 3 blah", image: "/images/chapter3.png" }
>
> (Here I'm assuming that you want to store the data as JSON. For
> kind=metadata files, you can put anything you want in the cue so long as
> you don't have a blank line in there.)

I think it is a powerful idea to have a track kind that allows for
everything. This provides a platform to put absolutely anything into a
time-aligned form for a media resource. The standardisation aspect is
the means by which the association between the data and the media
resource happens, such that at least the cues can be extracted in a
standard manner. However, it opens up issues with parsing and display.

What would be displayed for such JSON markup in an overlay? Is there a
means to convert this "any text" into text that can and will be
automatically displayed? IIUC, right now kind=metadata implies that
there is no rendering (see the explanation of mode=showing, though the
text at http://www.whatwg.org/specs/web-apps/current-work/complete/rendering.html#timed-tracks-0
doesn't seem to imply it). This is really kinda annoying when the cue
data is actually an HTML fragment.

Also, the parser for the cue data in the case of kind=metadata would
not be part of what the browser offers, so somebody using this
approach would need to provide their own JSON parser for the data
before they can do anything useful with it. Is there a plan to offer
existing parser functionality of the Web browser (e.g. RSS parsing, or
Firefox's native JSON parser, or the HTML fragment parser) to the user
for this kind of data in some way?
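
For JSON specifically, this is already workable with the browser's
native JSON parser - though note that the example payload above uses
unquoted keys, which strict JSON parsing would reject. A sketch:

  // Parsing a kind=metadata cue whose payload is (valid) JSON, i.e.
  // {"title": "Chapter 2", "description": "...", "image": "..."}.
  var track = document.querySelector('video').tracks[0]; // metadata track
  var cue = track.cues[0];
  var data = JSON.parse(cue.getCueAsSource());
  // data.title, data.description and data.image are now available.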


** display of multiple tracks

Since the @mode IDL attribute of an individual TimedTrack can take on
the value "showing" for several tracks at a time, and all tracks of
kind "subtitle" or "caption" will be displayed, it is possible that
multiple TimedTracks display cues at the same time. The display
mechanism in 14.3.2.1 deals with this, which is really cool. However,
I wonder whether there should be a limit to the number of tracks we
allow to render at the same time.


** Security

> On Fri, 31 Jul 2009, Philip Jägenstedt wrote:
>>
>> * Security. What restrictions should apply for cross-origin loading?
>
> Currently the files have to be same-origin. My plan is to wait for CORS to
> be well established and then use it for timed tracks, video files, images
> on <canvas>, text/event-stream resources, etc.

I would indeed like to see the possibility of re-using tracks from
other locations, such that e.g. a video can be published by one site
while another site provides all the subtitles for it. While the site
with the subtitles can embed the video, since the video runs in its
own unrelated top-level browsing context, the video site cannot
include the subtitles right now. I think that's not a fair situation.
Would it be possible to do the same for the text tracks as for the
video, i.e. let them be rendered in their own unrelated top-level
browsing context? (I believe Henri suggested this earlier, too.) What
would be the advantages/disadvantages?


** rendering

> On Sun, 11 Apr 2010, Silvia Pfeiffer wrote:
>> On Sun, Apr 11, 2010 at 4:18 PM, Robert O'Callahan wrote:
>> >
>> > This needs to be clarified. Authors can position arbitrary content
>> > over the video, and presumably the browser is not supposed to ensure
>> > rendered text doesn't collide with such content. I presume what you
>> > meant is simply that rendered text must not collide with browser
>> > built-in UI. Although I'm not sure how that can be ensured when
>> > arbitrary styling of the rendered text is supported.
>>
>> Yes, the idea was for browser built-in default UI controls. [...]
>>
>> The main issue is to keep the area that captions or subtitles are
>> rendered into and the area that controls are rendered into separately,
>> since you will always want to have access to both if both are activated.
>
> I've made sure that WebSRT titles avoid overlapping the controls.

The biggest issue I have with the way in which this all works is
rendering.

I think it's untenable that we can only render TimedTracks on top of
the video viewport (see
http://www.whatwg.org/specs/web-apps/current-work/complete/rendering.html#timed-tracks-0).
There is no means of rendering for audio and no means of rendering
outside the video element.

I can see where this comes from: only the video has an actual visible
dimension that can be relied upon. This is why I would approach
rendering not from the viewpoint of the media resource, but from the
viewpoint of the containing Web page, which has a lot more space to
deal with than the media element.

My preferred approach for rendering would be something like this:
(1) The Web page provides the rendering area to the text track. This
may be the video viewport, or it may be some other anonymous block on
the page that the text should be rendered into. The dimensions could
be set through CSS directly for all track elements (e.g. video >
track), if possible also for MutableTimedTracks and for in-band
TimedTracks. If no dimensions are given, the default is the video
viewport, or for audio a space defined above the audio controls, e.g.
a box one line high.
(2) A caption format can provide hints to the Web page as to what
rendering it is built towards, such as the video viewport or a
specific minimum width and height for its text.

The algorithm for avoiding overlap with controls is good and still
needs to be executed when the rendering area is the video viewport. It
also still needs to deal with multiple tracks all trying to render
into the same box, so avoidance is important here, too. I think the
approach of creating multiple boxes inside the display box is a good
one and should work for any display box that the Web page provides,
not just the video viewport.


** styling of in-band TimedTracks and MutableTimedTracks

The rendering and CSS styling approach with ::cue described at
http://www.whatwg.org/specs/web-apps/current-work/complete/rendering.html#timed-tracks-0
is only defined for WebSRT. That means no styling is possible for
TimedTracks that come from a different format (assuming we may allow
other formats in future). It also implies that no styling is possible
for in-band TimedTracks and for MutableTimedTracks. I think this is a
bit restrictive and would rather we defined a mechanism to allow CSS
styling of cues that come from any type of TimedTrack, thus making the
CSS styling part independent of the format.

Also, the actual CSS properties that are allowed are very restrictive
- only the following are allowed:
* 'color'
* 'text-shadow'
* 'text-outline'
* the properties corresponding to the 'background' shorthand
* the properties corresponding to the 'outline' shorthand
* the properties corresponding to the 'font' shorthand, including 'line-height'
A similar restriction is given for cues:
* 'color'
* 'text-shadow'
* 'text-outline'
* the properties corresponding to the 'background' shorthand
* the properties corresponding to the 'outline' shorthand
* properties relating to the transition and animation features

IMO that defeats the purpose of using CSS. The argument that all of
CSS, including future extensions, will be available to TimedTracks is
only half-true: the use of CSS is restricted to the given list, so it
is not making use of all of CSS and is not automatically extensible. I
think that's a poor use of the opportunity that CSS offers.


Uff, this took longer to write than I expected. I'm hoping to get some
good discussions out of it on the purpose and aim of the TimedTrack
platform, and more concretely about the individual properties I have
mentioned.

Cheers,
Silvia.
