From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Date: Mon, 26 Jul 2010 16:46:16 +1000
On Fri, Jul 23, 2010 at 3:40 PM, Ian Hickson <ian at hixie.ch> wrote:
>
> I recently added to the HTML spec a mechanism by which external subtitles
> and captions can be added to videos in HTML.
>
> In designing this feature I went through hundreds and hundreds of e-mails,
> blogs, proposals, etc, trying to get all the key use cases that needed
> handling. (Replies to the WHATWG e-mails on the topic are included below.)

Let me start by congratulating Ian on this piece of work. It has been a massive effort, and many good ideas have been raised and are now part of the specification (though not all ;-). While I believe we still have some discussions ahead of us and several improvements to make before implementations should be considered, I certainly think it's a huge step forward.

> The proposal consists of several components:
>
>  - A <track> element for linking to timed tracks from the markup.
>  - A DOM API for manipulating timed tracks dynamically.
>  - A specification for a simple captioning format.
>  - A set of rules and processing models to hold it all together.

I want to give feedback on the captioning format in a different thread, because that is where I have the most issues and I want to look at it in a larger context. Here, I want to give feedback on the other three dimensions, which in my opinion should be independent of the chosen caption format anyway.

1. The <track> element
==================
http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#the-track-element

Seeing as a lot of previous proposals and contributions have gone into this part of the specification, there is not much to criticize. I still have some statements, questions and suggestions.

** One open question is still the one of formats:

> On Thu, 16 Jul 2009, Silvia Pfeiffer wrote:
>> * the "type" attribute is meant to both identify the mime type of the
>> format and the character set used in the file.
>
> It's not clear that the former is useful. The latter may be useful; I
> haven't supported that yet.

If the element is to support a single format in a single character set, then there is no need for a MIME type. So we need to be clear about whether we want to keep the option of multiple formats open. If we choose a format now that is difficult to extend to features we may not be considering yet (e.g. SVG in caption cues, or even video in caption cues as a sort of picture-in-picture), we may need to support a second format later, and then introduce the @type attribute along similar lines to the audio and video elements.

** Further, the charset question:

>> The character set question is actually a really difficult problem to get
>> right, because srt files are created in an appropriate character set for
>> the language, but there is no means to store in a srt file what
>> character set was used in its creation. That's a really bad situation to
>> be in for the Web server, who can then only take an educated guess. By
>> giving the ability to the HTML author to specify the charset of the srt
>> file with the link, this can be solved.
>
> Yeah, if this is a use case people are concerned about, then I agree that
> a solution at the markup level makes sense.

If we really are to use WebSRT because (amongst other reasons) it allows reuse of existing srt files, then we need to introduce a means to provide the charset: almost none of the srt files in the wild that I have looked at were in UTF-8 - they come in all sorts of other character sets.
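To illustrate, here is a markup sketch (the @type and @charset attributes on <track> are hypothetical - neither is in the current draft - and the second MIME type is just for illustration):

  <video src="video.ogv" controls>
    <!-- a legacy srt file whose encoding the page author knows: -->
    <track kind="subtitles" srclang="de" label="Deutsch"
           src="subtitles-de.srt" charset="ISO-8859-1">
    <!-- a second format, selected via @type along the lines of
         <source type="...">, should we ever need one: -->
    <track kind="subtitles" srclang="en" label="English"
           src="subtitles-en.xml" type="application/ttml+xml">
  </video>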
Another solution to this problem would be to have WebSRT files themselves declare what charset their characters are in - then we don't need to add such information to the <track> element. That still won't work with legacy SRT files, though.

** Then the question of default activation:

> On Fri, 31 Jul 2009, Silvia Pfeiffer wrote:
>> * It is unclear which of the given alternative text tracks in different
>> languages should be displayed by default when loading an <itext>
>> resource. A @default attribute has been added to the <itext> elements to
>> allow for the Web content author to tell the browser which <itext>
>> tracks he/she expects to be displayed by default. If the Web author does
>> not specify such tracks, the display depends on the user agent (UA -
>> generally the Web browser): for accessibility reasons, there should be a
>> field that allows users to always turn display of certain <itext>
>> categories on. Further, the UA is set to a default language and it is
>> this default language that should be used to select which <itext> track
>> should be displayed.
>
> It's not clear to me that we need a way to do this; by default presumably
> tracks would all be off unless the user wants them, in which case the
> user's preferences are paramount. That's what I've specced currently.
> However, it's easy to override this from script.

Web page authors probably want a means to turn on certain tracks by default, rather than leaving the selection entirely to UA preferences and then overriding it with JavaScript. In fact, script-only control seems to go against the principled approach where the author suggests, the UA preferences override, and the user has ultimate control. Where the spec currently falls down, IMO, is the "author suggestion" part. The description of the @controls attribute in http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#user-interface mentions caption tracks and thus covers interactive selection by the user. But turning tracks on/off with JavaScript overrules UA preference settings and thus provides not an author suggestion, but an author override. We could fix this by turning the mode IDL attribute into a content attribute on the <track> element as well.

** Next the feature of fixing "stretch" and "drift":

>> * Another typical feature of time-aligned text files is that they may be
>> out of sync with the actual video file. For that purpose, a @delay
>> attribute was suggested as an addition to the <itext> element. This has
>> not been implemented into the demo. In the feedback to this proposal, a
>> further "stretch" or "drift" attribute was suggested.
>
> I haven't added this yet, but it's an interesting idea (possibly best kept
> until a "v2" release though). One can implement this from script by
> creating a new track that simply copies the previous one cue-for-cue with
> an offset applied, so we'll be able to see if this is something for which
> there is real demand by seeing if anyone does that.

I agree that these are "v2" features and we still need to prove that there is a big need for them.

** the list of track kinds:

You mention that karaoke and lyrics are supported by WebSRT, so could we add them to the track kinds?

** a @media media query attribute:

In the proposal at http://www.w3.org/WAI/PF/HTML/wiki/Media_TextAssociations a @media attribute was suggested. The idea is that the @media attribute would contain a media query describing what user environments, e.g. what devices, the text track is suitable for. If, for example, subtitles require a minimum width of 30 characters to be displayed properly, but a certain device cannot support that, the subtitles would be pretty useless on such a device.
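A sketch of what that could look like (the @media attribute on <track> is hypothetical, mirroring the one on <source>; the queries are just illustrations):

  <video src="video.ogv" controls>
    <track kind="subtitles" srclang="en" label="English (full)"
           src="subtitles-wide.srt"
           media="screen and (min-width: 480px)">
    <track kind="subtitles" srclang="en" label="English (condensed)"
           src="subtitles-narrow.srt"
           media="handheld, screen and (max-width: 479px)">
  </video>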
Seeing as the <source> elements on media elements have that attribute, too, it wouldn't be too difficult to implement the same here. Is this a "v2" feature, or is it being considered for addition?

2. The DOM API for manipulating timed tracks dynamically
=============================================
http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#htmltrackelement
and
http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#timed-track-api
and
http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#sourcing-in-band-timed-tracks

This API has been defined by adding a readonly list of TimedTracks to HTMLMediaElement. A TimedTrack in turn consists of a TimedTrackCueList with all available cues in that track and a TimedTrackCueList with only the active cues.

A TimedTrack can be created through three different means:

(1) a <track> element as a child of a media element, which exposes a TimedTrack interface;

(2) the MutableTimedTrack interface, which extends the TimedTrack interface with two functions to add and remove cues on the fly, thus enabling scripted creation of a TimedTrack without a declared <track> element - the track is created using the addTrack() method of the media element; and

(3) sourcing a TimedTrack from in-band data of a media resource, which is also hooked into the resource fetching algorithm.

(NOTE: there is a typo in section 4.8.10.10.5 where MutableTimedTrack is described - in the green box, addCue() is repeated, but the second one should be called removeCue().)

** in-band tracks

I wonder about the order in which <track> elements, mutable tracks, and in-band TimedTracks are held. Section 4.8.10.10.1 states the above order (i.e. <track> first, then mutable, then in-band). That <track> comes first makes sense, since different browsers may choose different media resources, which may come with different in-band tracks; this way, at least the numbering of the TimedTracks created from <track> elements is consistent. However, the in-band tracks only become available after the media resource has been parsed, while the mutable tracks are script-created and could be created dynamically through user interaction. Does that mean that the index of the in-band tracks can change over the lifetime of the Web page, depending on how many mutable tracks exist at any given time? I would probably also state more explicitly that in-band tracks are only sourced from the media resource in @currentSrc.

** TimedTrackCue

I am concerned about the definition of the TimedTrackCue.
http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#timedtrackcue

It has the following IDL attributes:

  readonly attribute DOMString direction;
  readonly attribute boolean snapToLines;
  readonly attribute long linePosition;
  readonly attribute long textPosition;
  readonly attribute long size;
  readonly attribute DOMString alignment;

All of these relate to CSS properties and I wonder how the two interact. For example, what if the cue's direction says "vertical" while the CSS for the cue says direction:rtl;? I believe we do not need these attributes, since diverse CSS properties can cover all of them.
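Purely to illustrate that claim, here is a sketch of expressing the same settings through cue styling instead. Note this is speculative: the current draft neither allows these properties on ::cue nor defines ::cue for anything other than WebSRT, so this assumes cue styling were opened up accordingly:

  <style>
    video::cue {
      writing-mode: vertical-rl;  /* instead of direction="vertical" */
      direction: rtl;             /* no conflict with a separate IDL attribute */
      width: 80%;                 /* instead of the size attribute */
      text-align: center;         /* instead of the alignment attribute */
    }
  </style>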
I am also confused about the snapToLines and linePosition attributes: IIUC, linePosition is meant to be either a percentage of the video dimensions or a line position relative to the first line of the cue. Does the latter mean an offset from where the first line of the cue would theoretically be? What is the purpose of that? Does the former mean that we can only provide text for video, and not for audio, which has no dimensions? What if we have a lyrics file for a piece of music - can that not be rendered? And what if we wanted to render captions underneath a video rather than inside the video dimensions? Can that be achieved somehow?

** adding cue ranges

> On Thu, 16 Jul 2009, Philip Jägenstedt wrote:
>> As far as I can tell no browser wants to implement the addCueRange API
>> (removing this should be the topic of a separate mail), so we really
>> need to re-think this part and I think that timed text plays an
>> important part here.
>
> The addCueRange() API has been removed and replaced with a feature based
> on the subtitle mechanism.

IIRC, the use cases for addCueRange() as a JavaScript function were:

* the alignment of text cues with time ranges, as in captions, subtitles etc.
* enabling the time-accurate activation and deactivation of certain activities, such as showing a slide and moving to the next one

The first one is easily covered by the MutableTimedTrack interface and the addCue(in TimedTrackCue cue) function. The second one should be met by the onenter and onexit events on the TimedTrackCue:
http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#cue-events
I look forward to experimenting with these!

** Linking into and out of a cue-range

In http://www.mail-archive.com/whatwg at lists.whatwg.org/msg10395.html Dave Singer wrote:

> Linking into a cue-range would be using its beginning or end as a seek
> point, or its duration as a restricted view of the media ("only show me
> cue-range called InTheBathroom"). Linking out of a cue-range would be
> establishing a click-through URL that would be dispatched directly if the
> user clicked on the media during that range (dispatched without script).

I believe in these use cases, too. It is possible to jump to a cue through its number in the media element's list using JavaScript, by setting @currentTime to that cue's start time. However, it has not yet been defined whether there is a relationship between media fragment URIs and timed tracks. The media fragment URI specification defines URIs such as http://example.com/video.ogv#id="InTheBathroom", and cues have a textual identifier, so we can put the two together to enable this. Such URIs could then be used in the @src attribute of a media element to focus the view on that cue, just like temporal media fragments do with an arbitrary time range.

For linking out of a cue, there is a need to allow hyperlinks in cues. IIUC, this is currently only possible by using HTML-style markup in the cue, declaring the cue as kind=metadata, calling getCueAsSource() on the cue, and then running your own overlay and shoving the retrieved text into the innerHTML of that overlay. While that works, it seems like a lot of hoops to jump through just to be able to use a bit of HTML markup - in particular having to run your own overlay. Could we introduce a kind=htmlfragment type where it is obvious that the text is HTML, so that the fragment parser can be run automatically and the result displayed through the given display mechanisms?
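For reference, a sketch of the hoops just described. The tracks list on the media element, the cues list, the onenter/onexit events and getCueAsSource() are used here as per my reading of the current draft, and the script assumes the track has already loaded, so treat this as illustrative rather than definitive:

  <video id="v" src="video.ogv" controls>
    <track kind="metadata" src="links.srt" srclang="en">
  </video>
  <div id="overlay"></div>
  <script>
    var video = document.getElementById("v");
    var overlay = document.getElementById("overlay");
    // declared <track> elements come first in the tracks list,
    // so ours is track 0:
    var track = video.tracks[0];
    for (var i = 0; i < track.cues.length; i++) {
      (function (cue) {
        cue.onenter = function () {
          // the cue payload is an HTML fragment, e.g. a hyperlink;
          // we have to render it ourselves via our own overlay:
          overlay.innerHTML = cue.getCueAsSource();
        };
        cue.onexit = function () {
          overlay.innerHTML = "";
        };
      })(track.cues[i]);
    }
  </script>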
** metadata

Many existing subtitle formats and similar media-time-aligned text formats contain file-wide name-value pairs that hold metadata for the complete resource. An example are Lyrics (LRC) files:

On Tue, 20 Apr 2010, Silvia Pfeiffer wrote:
>
> Lyrics (LRC) files typically look like this:
>
> [ti:Can't Buy Me Love]
> [ar:Beatles, The]
> [au:Lennon & McCartney]
> [al:Beatles 1 - 27 #1 Singles]
> [by:Wooden Ghost]
> [re:A2 Media Player V2.2 lrc format]
> [ve:V2.20]
> [00:00.45]Can't <00:00.75>buy <00:00.95>me <00:01.40>love,
> <00:02.60>love<00:03.30>, <00:03.95>love, <00:05.30>love<00:05.60>
> [00:05.70]<00:05.90>Can't <00:06.20>buy <00:06.40>me <00:06.70>love,
> <00:08.00>love<00:08.90>

You can see that this file carries title, artist, author, album, related content, version and similar metadata headers. Other examples contain copyright information and usage rights - important information to understand and deal with when distributing media-time-aligned text files on a medium such as the Web.

The current TimedTrack platform does not allow for such metadata and does not deal with it. It would, however, make sense to make such metadata available to the Web page that includes the video with its TimedTracks. This is particularly useful so the Web page can expose this data visibly and include it in text scraped for the video, for search and similar purposes. Could we introduce a means to have such name-value pairs dealt with in the TimedTrack platform? Maybe by adding something like a list of HTMLMetaElements to the TimedTrack interface?

3. The set of rules and processing models to hold it all together
=================================================
http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#timed-tracks
and
http://www.whatwg.org/specs/web-apps/current-work/complete/rendering.html#timed-tracks-0

** the display of chapter tracks:

> I've also included support for chapters. Currently this support is not
> really fully fleshed out; in particular it's not defined how a UA should
> get chapter names out of the WebSRT file. I would like implementation
> feedback on this topic -- what do browser vendors envisage exposing in
> their UI when it comes to chapters? Just markers in the timeline? A
> dropdown of times? Chapter titles? Styled, unstyled?
>
> Currently a cue payload can be either cue text (simple markup) or metadata
> text (arbitrary data for scripts). We could add a third form consisting of
> just plain text for chapter titles, or we could reuse cue text, depending
> on what is needed here. Currently the spec requires them to be cue text
> but doesn't say how to get plain text out of them.

I believe cue text is fine as a chapter title. And I would think it'd be good to define a standard means of extracting plain text out of any type of cue, so it will be possible to hand it to e.g. the accessibility API for reading back.

I would actually like to see an interface where the chapter markers can be used for navigation through the media resource: e.g. as you are playing back the media file, you can press SHIFT-rightarrow and SHIFT-leftarrow to navigate back and forth within a track (in particular within a chapter track). This is particularly important for blind users.
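A sketch of how a page could wire up such navigation itself today, assuming a chapter track is exposed in the media element's tracks list with kind "chapters" and that cues expose startTime (names per my reading of the draft):

  <script>
    // find the video's chapter track, if any
    function chapterTrack(video) {
      for (var i = 0; i < video.tracks.length; i++) {
        if (video.tracks[i].kind == "chapters") return video.tracks[i];
      }
      return null;
    }

    // seek to the start of the next (offset=1) or previous (offset=-1) chapter
    function seekChapter(video, offset) {
      var track = chapterTrack(video);
      if (!track) return;
      var current = 0;
      for (var i = 0; i < track.cues.length; i++) {
        if (track.cues[i].startTime <= video.currentTime) current = i;
      }
      var target = track.cues[current + offset];
      if (target) video.currentTime = target.startTime;
    }

    document.onkeydown = function (event) {
      var video = document.getElementsByTagName("video")[0];
      if (!event.shiftKey) return;
      if (event.keyCode == 39) seekChapter(video, 1);   // SHIFT-rightarrow
      if (event.keyCode == 37) seekChapter(video, -1);  // SHIFT-leftarrow
    };
  </script>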
Whether we expose the navigation visually through markers along the timeline (as e.g. Viddler http://smallbiztechnology.com/media/viddler.jpg or TED http://www.arguingwithmyself.com/wordpress/wp-content/uploads/ted-video-player.png do) or whether we put it into a menu (as e.g. the QuickTime player does, https://peepcode.com/system/chapters/httperf-chapters.png), I am not overly fussed. Incidentally, there are more examples of means of rendering chapters at http://wiki.whatwg.org/wiki/Use_cases_for_API-level_access_to_timed_tracks#Chapter_Markers .

** using metadata kind tracks

> In WebSRT, this would be:
>
> 10:00.000 --> 20:00.000
> { title: "Chapter 2", description: "Some blah relating to chapter 2", image: "/images/chapter2.png" }
>
> 20:00.000 --> 30:00.000
> { title: "Chapter 3", description: "Chapter 3 blah", image: "/images/chapter3.png" }
>
> (Here I'm assuming that you want to store the data as JSON. For
> kind=metadata files, you can put anything you want in the cue so long as
> you don't have a blank line in there.)

I think it is a powerful idea to have a track kind that allows for everything. This provides a platform to put absolutely anything into a time-aligned form for a media resource. The standardisation aspect of it is the means by which the association between the data and the media resource happens, such that at least the cues can be extracted in a standard manner.

However, it opens up an issue about parsing and display. What would be displayed for such JSON markup in an overlay? Is there a means to convert this "any text" into text that can and will be automatically displayed? IIUC, right now kind=metadata implies that there is no rendering (see the explanation of mode=showing, though the text at http://www.whatwg.org/specs/web-apps/current-work/complete/rendering.html#timed-tracks-0 doesn't seem to imply it). This is really kinda annoying when the cue data is actually an HTML fragment.

Also, a parser for the cue data in the case of kind=metadata would not be part of what the browser offers, so somebody using this approach needs to parse the data themselves before they can do anything useful with it. Is there a plan to offer existing parser functionality of the Web browser (e.g. RSS parsing, or Firefox's native JSON parser, or the HTML fragment parser) to the user for this kind of data in some way?
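To illustrate what authors currently have to do themselves, a sketch that runs JSON parsing over such metadata cues (getCueAsSource() per the draft; JSON.parse is natively available in recent browsers; note that, strictly, JSON.parse requires quoted keys, unlike the example quoted above; the element ids are made up for the example):

  <script>
    var video = document.getElementsByTagName("video")[0];
    var track = video.tracks[0];  // assumed: the <track kind="metadata">
    for (var i = 0; i < track.cues.length; i++) {
      (function (cue) {
        cue.onenter = function () {
          // no browser-provided parsing for kind=metadata cues -
          // the page has to interpret the payload itself:
          var data = JSON.parse(cue.getCueAsSource());
          document.getElementById("chapter-title").textContent = data.title;
          document.getElementById("chapter-image").src = data.image;
        };
      })(track.cues[i]);
    }
  </script>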
** display of multiple tracks

Since the @mode IDL attribute of an individual TimedTrack can take on the value "showing" for several tracks at a time, and all tracks of kind "subtitle" or "caption" will be displayed, it is possible that multiple TimedTracks display cues at the same time. The display mechanism in section 14.3.2.1 deals with this, which is really cool. However, I wonder if there should be a limit to the number of tracks we allow to render at the same time.

** Security

> On Fri, 31 Jul 2009, Philip Jägenstedt wrote:
>>
>> * Security. What restrictions should apply for cross-origin loading?
>
> Currently the files have to be same-origin. My plan is to wait for CORS to
> be well established and then use it for timed tracks, video files, images
> on <canvas>, text/event-stream resources, etc.

I would indeed like to see the possibility to re-use tracks from other locations, such that e.g. a video can be published by one site while another site provides all the subtitles for it. The site with the subtitles can embed the video, since the video runs in its own unrelated top-level browsing context, but the video site cannot include the subtitles right now. I think that's not a fair situation. Would it be possible to do the same for the text tracks as for the video, i.e. let them be rendered in their own unrelated top-level browsing context? (I believe Henri suggested this earlier, too.) What would be the advantages/disadvantages?

** rendering

> On Sun, 11 Apr 2010, Silvia Pfeiffer wrote:
>> On Sun, Apr 11, 2010 at 4:18 PM, Robert O'Callahan wrote:
>> >
>> > This needs to be clarified. Authors can position arbitrary content
>> > over the video, and presumably the browser is not supposed to ensure
>> > rendered text doesn't collide with such content. I presume what you
>> > meant is simply that rendered text must not collide with browser
>> > built-in UI. Although I'm not sure how that can be ensured when
>> > arbitrary styling of the rendered text is supported.
>>
>> Yes, the idea was for browser built-in default UI controls. [...]
>>
>> The main issue is to keep the area that captions or subtitles are
>> rendered into and the area that controls are rendered into separately,
>> since you will always want to have access to both if both are activated.
>
> I've made sure that WebSRT titles avoid overlapping the controls.

The biggest issue I have with the way in which this all works is rendering. I think it's untenable that we can only render TimedTracks on top of the video viewport (see http://www.whatwg.org/specs/web-apps/current-work/complete/rendering.html#timed-tracks-0). There is no means of rendering for audio, and no means of rendering outside the video element. I can see where this comes from: only the video has an actual visible dimension that can be relied upon. This is why I would approach rendering not from the viewpoint of the media resource, but from the viewpoint of the containing Web page, which has a lot more space to work with than the media element.

My preferred approach for rendering would be something like this:

(1) The Web page provides the rendering area to the text track. This may be the video viewport, or it may be some other anonymous block on the page that the text should be rendered into. The dimensions could be set through CSS directly for all track elements (e.g. video > track; see the sketch below), and if possible also for MutableTimedTracks and for in-band TimedTracks. If no dimensions are given, the default is the video viewport, or, for audio, a space defined above the audio controls, e.g. a box one line high.

(2) A caption format can provide hints to the Web page as to what rendering it is built for, such as the video viewport or a specific minimum width and height for its text.

The algorithm for avoiding overlap with the controls is good and still needs to be executed when the rendering area is the video viewport. It also still needs to deal with multiple tracks all trying to render into the same box, so avoidance is important here, too. I think the approach of creating multiple boxes inside the display box is a good one and should work for any display box that the Web page provides, not just the video viewport.
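As a sketch of point (1) above - the video > track selector is taken straight from the proposal and is not something the current draft supports:

  <style>
    /* hypothetical: the page hands all text tracks of this video a
       rendering area, here a one-line box below a 640px-wide video
       rather than an overlay on the viewport */
    video > track {
      display: block;
      width: 640px;
      height: 1.5em;
    }
  </style>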
** styling of in-band TimedTracks and MutableTimedTracks

The rendering and CSS styling approach with ::cue described in http://www.whatwg.org/specs/web-apps/current-work/complete/rendering.html#timed-tracks-0 is only defined for WebSRT. That means no styling is possible for TimedTracks that come in a different format (assuming we may allow other formats in the future). It also implies that no styling is possible for in-band TimedTracks and for MutableTimedTracks.

I think this is a bit restrictive and would rather we define a mechanism that allows CSS styling of cues coming from any type of TimedTrack, thus making the CSS styling part independent of the format.

Also, the actual CSS properties that are allowed are very restrictive - only the following are allowed:

* 'color'
* 'text-shadow'
* 'text-outline'
* the properties corresponding to the 'background' shorthand
* the properties corresponding to the 'outline' shorthand
* the properties corresponding to the 'font' shorthand, including 'line-height'

A similar restriction is given for cues:

* 'color'
* 'text-shadow'
* 'text-outline'
* the properties corresponding to the 'background' shorthand
* the properties corresponding to the 'outline' shorthand
* properties relating to the transition and animation features

IMO that defeats the reason for using CSS. The argument that all of CSS, including future extensions, will be available to TimedTracks is only half-true: the use of CSS is restricted to the lists given here, so it neither makes use of all of CSS nor is it automatically extensible. I think that's a poor use of the opportunity that CSS offers.

Uff, this took longer to write than I expected. I'm hoping to get some good discussions out of it on the purpose and aim of the TimedTrack platform, and more concretely about the individual properties I have mentioned.

Cheers,
Silvia.