- From: Philip Jägenstedt <philipj@opera.com>
- Date: Tue, 15 Feb 2011 11:09:19 +0100
On Tue, 15 Feb 2011 04:28:36 +0100, Silvia Pfeiffer <silviapfeiffer1 at gmail.com> wrote:

> Hi Philip,
>
> On Tue, Feb 15, 2011 at 3:27 AM, Philip Jägenstedt <philipj at opera.com> wrote:
>> On Wed, 09 Feb 2011 03:57:37 +0100, Silvia Pfeiffer
>> <silviapfeiffer1 at gmail.com> wrote:
>>
>>>>> A. Feedback on the WebVTT format
>>>>
>>>>> 1. Introduce file-wide metadata
>>>>>
>>>>> WebVTT requires a structure to add header-style metadata. We are here
>>>>> talking about lists of name-value pairs as typically in use for header
>>>>> information. The metadata can be optional, but we need a defined means
>>>>> of adding them.
>>>>>
>>>>> Required attributes in WebVTT files should be the main language in use
>>>>> and the kind of data found in the WebVTT file - information that is
>>>>> currently provided in the <track> element by the @srclang and @kind
>>>>> attributes. These are necessary to allow the files to be interpreted
>>>>> correctly by non-browser applications, for transcoding or to determine
>>>>> if a file was created as a caption file or something else, in
>>>>> particular the @kind=metadata. @srclang also sets the base
>>>>> directionality for BiDi calculations.
>>>>
>>>> Are there non-browsers that use the language for font-selection or bidi?
>>>> Is auto-detection not likely to give a better user experience? Are there
>>>> any other use cases for knowing the language of the captions *after*
>>>> they've been opened?
>>>
>>>
>>> I can't see a different way to let non-browser applications know what
>>> font to choose, even how to provide the user with a menu of available
>>> caption tracks for a video, or to set the base directionality for
>>> BiDi. Also, language auto-detection is a huge burden to put onto
>>> non-browser applications. Having a readable language tag at the
>>> beginning of the file is useful to quickly figure it all out.
>>>
>>> The language set in <track> would certainly overrule what is in the
>>> file. Also, the last language attribute in the header would probably
>>> win.
>>>
>>> I guess it would also be ok to have language and kind optional -
>>> different applications may then default to interpreting WebVTT files
>>> differently, such as by default English and Captions - or English and
>>> Descriptions, but that's probably acceptable from context.
>>
>> Given that most existing subtitle formats don't have any language metadata,
>> I'm a bit skeptical. However, if implementors of non-browser players want to
>> implement WebVTT and ask for this I won't stand in the way (not that I could
>> if I wanted to). For simplicity, I'd prefer the language metadata from the
>> file to not have any effect on browsers though, even if no language is given
>> on <track>.
>
> There is also the Content-Language response header of HTTP, which
> could have an influence on the browser, too. I'm not sure about the
> best way to deal with all this overlapping information, but I'm sure
> it can be sorted out.

My preference is ignoring everything except what is given in <track>. In
particular language can't be given in the resource or its headers, because
then one has to fetch all the tracks in order to provide a track selection
menu with language information or to automatically activate the suitable
tracks.

>>>> Why do non-browser players need to know the kind? All kinds are processed
>>>> in the same way except metadata, and there's no reason to use metadata
>>>> tracks with external players.
>>>
>>> Maybe I have a different view of what applications will make use of
>>> WebVTT files than most. My thinking is that there will also be uses
>>> for metadata tracks in external applications. Aside from this, there
>>> will be authoring applications and players, yes, but there will also
>>> be automated processing tools.
>>> So, to know what type of content is inside a file without having to
>>> look at more than the file's headers is really important.
>>
>> For both of these cases, putting some magic strings inside comments that are
>> ignored by browsers sounds like it would be sufficient. Name-value metadata
>> that is ignored by browsers would be fine as well.
>
> I'm for the second option: name-value metadata that is ignored by the
> browser. I think in fact the browser should in general ignore all
> name-value metadata with the exception of file-wide cue settings.

I agree, browsers should ignore in-file metadata. (That's one reason I
think using comments for it is quite fine most of the time.)

>>>>> Further metadata fields that are typically used by authors to keep
>>>>> specific authoring information or usage hints are necessary, too. As
>>>>> examples of current use see the format of MPlayer mpsub's header
>>>>> metadata [2], EBU STL's General Subtitle Information block [3], and
>>>>> even CEA-608's Extended Data Service with its StartDate, Station,
>>>>> Program, Category and TVRating information [4]. Rather than specifying
>>>>> a specific subset of potential fields we recommend to just have the
>>>>> means to provide name-value pairs and leave it to the negotiation
>>>>> between the author and the publisher which fields they expect of each
>>>>> other.
>>>>
>>>> This approach has worked very well with Vorbis Comments, probably mostly
>>>> because all interesting fields have been pre-defined in
>>>> http://www.xiph.org/vorbis/doc/v-comment.html
>>>>
>>>> For a web format though, wouldn't some kind of wiki registry be good to
>>>> avoid total mayhem, especially if there are some predefined fields? (Not
>>>> having file-wide metadata would also avoid such mayhem.)
>>>
>>> It might be good to define a base set - the Vorbis Comments one or the
>>> ID3 ones could be appropriate.
>>> Even the old Dublin Core set (the first ones, not the current chaos)
>>> could be good. I could also analyse the sets used in current typical
>>> caption formats and propose a superset of those.
>>>
>>> While I think you're right with suggesting a predefined set of fields,
>>> I am mostly keen right now to agree on the general format of the
>>> fields and how we need to parse them rather than what they actually
>>> are.
>>>
>>> So, I would suggest we allow lines of "name=value" under the WEBVTT
>>> magic string. A blank line defines the end of the header section and
>>> the beginning of the cues. Would be simple enough to parse, right?
>>
>> Sure, it's already handled by the current parsing spec, since it ignores
>> everything up to the first blank line.
>
> That's not quite how I'm reading the spec.
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#webvtt-0
> allows
> "Optionally, either a U+0020 SPACE character or a U+0009 CHARACTER
> TABULATION (tab) character followed by any number of characters that
> are not U+000A LINE FEED (LF) or U+000D CARRIAGE RETURN (CR)
> characters."
> after the "WEBVTT FILE" magic.
> To me that reads like all of the extra stuff has to be on the same line.
> I'd prefer if this read "any character except for two WebVTT line
> terminators", then it would all be ready for such header-style
> metadata.

See steps 12-17 of
<http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#parsing-0>,
it just skips all lines up to the first blank line. Syntax and parsing are
different :)

>>>>> 4. Cue formatting requirements
>>>>>
>>>>> In analysing the available cue formatting functionality, we have found
>>>>> that some features are missing. Most of these features can be added
>>>>> through using CSS on cues that have received a <b>, <i>, <c> or <v>
>>>>> marker. The following features are core to traditional TV and exist in
>>>>> EBU STL and CEA-608/708 captions.
>>>>> Support of these will be a core requirement for browsers as well as
>>>>> non-browser applications and it makes sense to add these to WebVTT
>>>>> rather than relying on external CSS which cannot be used for
>>>>> non-browser captions:
>>>>
>>>> The unstated requirement here seems to be that WebVTT needs to work as an
>>>> interchange format for various TV captioning formats even in user agents
>>>> without any support for CSS (or JavaScript). I'm trying to not make a straw
>>>> man argument, but if we want an interchange format, we should pick TTML,
>>>> which is explicitly designed to be just that and doesn't depend on CSS.
>>>>
>>>> Is it not enough that a lossy conversion can be made from various formats
>>>> into WebVTT+CSS(+JavaScript)? If not, the "Web" in "WebVTT" is highly
>>>> misleading...
>>>
>>>
>>> We're trying to avoid the need for multiple transcodings and are
>>> trying to achieve something like the following pipeline:
>>> broadcast captions -> transcode to WebVTT -> show in browser ->
>>> transcode to broadcast devices -> show
>>>
>>> If we have to plug TTML into this pipeline, too, it will be much
>>> slower and we would need to additionally define a mapping from TTML to
>>> WebVTT and back.
>>>
>>> I'm sure with SMPTE-TT around we will end up seeing things like
>>> broadcast->TTML->WebVTT->browser, but even then we don't want WebVTT
>>> to be a lossy format.
>>
>> I can only disagree. Trying to make WebVTT into an interchange format will
>> inevitably turn it into a highly presentational format with lots of legacy
>> baggage. I can certainly see the use cases for an interchange format, but I
>> don't think it's worth the added complexity. I'd prefer an approach where
>> any format quirks that can't be mapped to WebVTT are expressed using <c.foo>
>> and if it turns out lots of people want the feature, we can add it to a
>> future revision.
>
> I wouldn't go as far as to say it needs to become an interchange
> format. But I can see us specifying what the browser parses, while
> giving options such as the header-metadata and span classes that allow
> with some extra information to fully recover the broadcast
> functionality.

I actually think that is almost possible already. After this thread has run
for a while, it'd be nice to hear where you think <c.foo> isn't enough and
new markup is needed, if anything.

>>>>> * underline: EBU STL, CEA-608 and CEA-708 support underlining of
>>>>> characters. The underline character is also particularly important for
>>>>> some Asian languages. Please make it possible to provide text
>>>>> underlines without the use of CSS in WebVTT.
>>>>
>>>> Which Asian languages? If it's just the Chinese
>>>> <http://en.wikipedia.org/wiki/Proper_name_mark>, then I don't think that
>>>> needs <u> or similar. In my experience, use of the Chinese proper name mark
>>>> is in fact extremely rare in Chinese captions, at least in movies and TV
>>>> series from the mainland and Taiwan. It would be best to use e.g.
>>>> ???<c.pnm>??</c> to make it easy to change the style between
>>>> single/double/wavy/no underline.
>>>
>>> OK. So if we need underlined text, it will need to be
>>> <c.underline>..</c> and CSS underline? I guess in a Web context
>>> underline text is usually a hyperlink so it makes sense to discourage
>>> <u> for the Web. But is that also an argument for
>>> captions/subtitles/descriptions? What is the argument against using
>>> <u> in captions?
>>
>> I don't really have an argument against it, I just questioned that it is
>> important for Asian languages in particular. Adding <u> would be really
>> simple, it's just a question of why. I've seldom seen underlining in
>> captions, so it's not clear to me how it's usually used.
>
> I'm told <u> is fairly common in traditional captions. We don't do
> <c.italics> either for such common stuff.
> But if we really don't want this, I guess <c.u> would work, too and is
> not that much longer.

I can't see any underlining when scanning through the samples at
<http://wiki.whatwg.org/wiki/Use_cases_for_timed_tracks_rendered_over_video_by_the_UA>.
If it is in fact common in some contexts, it'd be great to have samples
added to the wiki, I'm sure we could learn something from it. If <u> is
actually useful for something, then we should just add it.

>>> With "-" you are referring to replacing "-->" with "-" to arrive at things
>>> like:
>>> 15.000-17.950
>>> At the left we can see...
>>>
>>> as compared to:
>>> 15.000+2.950
>>> At the left we can see...
>>
>> Yes, that's what I meant.
>>
>>> I actually think they read fairly well given that people are used to the
>>> double meaning of "-": to mean both "from ... to" and "minus".
>>> But we could use a different character for "absolute time" if you
>>> prefer, e.g. "/".
>>> 15.000/17.950
>>> At the left we can see...
>>>
>>> I find this fairly readable, too.
>>
>> Either would work for me. As I mentioned, the room for improvement here
>> isn't only the syntax of the timing line, but also to make it obvious that
>> cue timestamps like <00:01.000> are relative. Using + for relative
>> timestamps is potentially confusing too, as one might think that many
>> consecutive <+00:01.000> are cumulative, rather than all being 1 second from
>> the start time of the cue.
>
> That's true and in fact the way in which I have authored my examples,
> now that I look back at them. It makes the timings smaller and I think
> it's a bit more logical. But really we just have to decide on one
> meaning:
>
> 5-10
> This <+1>is <+1>a <+1>simple <+1>example.
>
> I find I actually prefer this over
>
> 5-10
> This <+1>is <+2>a <+3>simple <+4>example.

Right, we just have to pick something. I'd like to get the basic structure
down soon, though, as changing the timestamp parsing will be very difficult
once there are implementations.

>>>>> 7. Comments
>>>>
>>>>> we recommend the introduction of comments.
>>>>
>>>> I agree and think it needs to happen before WebVTT starts to get implemented
>>>> and used on the web. In other words: now.
>>>
>>> Agreed. I'm happy for the previously suggested "//" at the line start
>>> to be comments, or, for that matter, "#" or ";" or any other special
>>> character. I would prefer not to use "/*" since it implies a "*/" is
>>> required to end the comment. Similarly we should avoid "<!--" and
>>> "-->" or anything else that requires a special comment end mark and
>>> more than one or two characters.
>>
>> I'd quite like to have block comments, so I think the best options are:
>>
>> 1. // and /* */ like JavaScript
>> 2. <!-- --> like HTML/XML
>
> If the main use case for the comments is to comment out a line,
> something at the line start alone would be sufficient. If we have to
> have both, I would prefer the shorter first option.
>
>> I think that the main difficulty is actually not picking a syntax, but
>> deciding how it works in the parser. Unlike HTML, I don't think we want the
>> comments to show up in the "DOM", since that would only work for intra-cue
>> comments. Ideally it would be preprocessor-ish, but yet the magic bytes
>> ("WEBVTT FILE") should be checked first as otherwise identifying WebVTT
>> would require implementing its preprocessor steps :/
>
> As I would not want the comments to be handed into the DOM or to
> JavaScript, it doesn't matter if they are not like HTML. I would
> regard them more as pre-processor style comments.

For simplicity, perhaps it would be better to have line-comments only. On my
wishlist I have a less convoluted parser definition which operates on lines
instead of sprinkling CR/LF all over, and it'd be easy to add line-comments
to such a parser. Wish-list item requested at
<http://www.w3.org/Bugs/Public/show_bug.cgi?id=12076>.

>>>>> 8. Line wrapping
>>>>>
>>>>> CEA-708 captions support automatic line wrapping in a more
>>>>> sophisticated way than WebVTT -- see
>>>>> http://en.wikipedia.org/wiki/CEA-708#Word_wrap.
>>>>>
>>>>> In our experience with YouTube we have found that in certain
>>>>> situations this type of automatic line wrapping is very useful.
>>>>> Captions that were authored for display in a full-screen video may
>>>>> contain too many words to be displayed fully within the actual video
>>>>> presentation (note that mobile / desktop / internet TV devices may
>>>>> each have a different amount of space available, and embedded videos
>>>>> may be of arbitrary sizes). Furthermore, user-selected fonts or font
>>>>> sizes may be larger than expected, especially for viewers who need
>>>>> larger print.
>>>>>
>>>>> WebVTT as currently specified wraps text at the edge of their
>>>>> containing blocks, regardless of the value of the 'white-space'
>>>>> property, even if doing so requires splitting a word where there is no
>>>>> line breaking opportunity. This will tend to create poor quality
>>>>> captions. For languages where it makes sense, line wrapping should
>>>>> only be possible at carriage return, space, or hyphen characters, but
>>>>> not on &nbsp; characters. (Note that CEA-708 also contains
>>>>> non-breaking space and non-breaking transparent space characters to
>>>>> help control wrapping.) However, this algorithm will not necessarily
>>>>> work for all languages.
>>>>>
>>>>> We therefore suggest that a better solution for line wrapping would be
>>>>> to use the existing line wrapping algorithms of browsers, which are
>>>>> presumably already language-sensitive.
>>>>>
>>>>> [Note: the YouTube line wrapping algorithm goes even further by
>>>>> splitting single caption cues into multiple cues if there is too much
>>>>> text to reasonably fit within the area. YouTube then adjusts the times
>>>>> of these caption cues so they appear sequentially.
>>>>> Perhaps this could be mentioned as another option for server-side
>>>>> tools.]
>>>>
>>>> Yeah, with SRT people are manually line-wrapping when authoring the captions
>>>> and often enough the end result is that you get something rendered:
>>>>
>>>> - Who could have guessed that not all fonts are the same
>>>> size?
>>>> - That's news to me, so I get four lines of text where I
>>>> wanted two!
>>>>
>>>> I'm inclined to say that we should normalize all whitespace during parsing
>>>> and not have explicit line breaks at all. If people really want two lines,
>>>> they should use two cues. In practice, I don't know how well that would
>>>> fare, though. What other solutions are there?
>>>
>>> I don't think I would go that far. The concern has mostly been with
>>> the line wrapping of lines that are too long and the possibility of
>>> splitting words that way. The particular concern was with this
>>> paragraph:
>>>
>>> "Text runs must be wrapped at the edge of their containing blocks,
>>> regardless of the value of the 'white-space' property, even if doing
>>> so requires splitting a word where there is no line breaking
>>> opportunity."
>>> see
>>> http://www.whatwg.org/specs/web-apps/current-work/multipage/rendering.html#timed-text-tracks-0
>>>
>>> So we want to avoid splitting mid-word and we suggest introducing the
>>> ability to have non-breaking spaces.
>>
>> I think splitting in the middle of words would only happen for words that
>> are longer than the whole line.
>
> Ah ok - I guess you can interpret the sentence above in this way as
> in "splitting a word ONLY where there is no line breaking opportunity".
> Then it's probably ok. It would still make sense to accept
> non-breaking spaces.

Perhaps Hixie would like to clarify in the spec precisely what is meant?
There's already a non-breaking space in Unicode: NO-BREAK SPACE (U+00A0)

>> There's still plenty of room for improvements in line wrapping, though.
>> It seems to me that the main reason that people line wrap captions manually
>> is to avoid getting two lines of very different length, as that looks quite
>> unbalanced. There's no way to make that happen with CSS, and AFAIK it's not
>> done by the WebVTT rendering spec either.
>
> People split manually when they want quality captions and can visually
> test what it will look like.
>
> This endeavor has one big problem: when you change the video size,
> e.g. go to full screen, your optimisation for the previous size is
> likely to not be optimal for the new size any more. There, an
> automatic line balancing that makes use of commas and "and"s for
> choosing likely good line break positions would be nice.
>
> A completely different situation appears when the captions are not
> manually created, as is the case in YouTube. Even when you submit a
> perfect transcript and time-align it through speech recognition, you
> will only do the line breaks as you have to render cues. To achieve a
> better quality there, a better line-break algorithm would help
> massively.
>
> So, I agree with you about improving the line wrapping. I also think
> it is likely something that we have to leave to the browsers - at
> least for now.

Right, some experimentation here would be great, as I haven't seen any
feature like this in any media players. In the hope of inspiring someone,
perhaps myself, here's how I tentatively would like things to work:

1. Authors are encouraged to not manually line-break
2. UAs render the text at whatever width the <video> container allows, with
   margins and all
3. The text will have been rendered on n lines.
4. Decrease the width on the container as much as possible while having n
   lines.
5. Use that line-breaking and then do whatever left/center/right-alignment
   relative to the original width.

I really should get around to reading the rendering section for WebVTT to
see what it actually does, perhaps it's already clever...

>>>>> 4. Addressing individual cues through CSS
>>>>>
>>>>> As far as we understand, you can currently address all cues through
>>>>> ::cue and you can address a cue part through ::cue-part(<voice> ||
>>>>> <part> || <position> || <future-compatibility>). However, if we
>>>>> understand correctly, it doesn't seem to be possible to address an
>>>>> individual cue through CSS, even though cues have individual
>>>>> identifiers. This is either an oversight or a misunderstanding on our
>>>>> parts. Can you please clarify how it is possible to address an
>>>>> individual cue through CSS?
>>>>
>>>> Since I've been arguing against the id's in WebVTT, I'm curious about the
>>>> use case here. Isn't using a unique class good enough?
>>>
>>> This links in with the discussion above on CSS styling and classes.
>>> Rather than define classes of cue settings and reference them from the
>>> cues, this allows them to be applied to individual cues in style
>>> sheets. I thought the whole reason of cue identifiers was to have this
>>> addressing functionality, so this would just close the loop.
>>>
>>> For example:
>>>
>>> Style sheet of the Web page:
>>> <style>
>>> video track#t1 ::cue(cue10) {
>>>   text-decoration: blink;
>>> }
>>> </style>
>>>
>>> The Web page (extract):
>>> <video src="video.webm" controls>
>>>   <track id="t1" label="captions" kind="captions" srclang="en-US"
>>>          src="cap1.vtt"/>
>>> </video>
>>>
>>> The caption file cap1.vtt:
>>> WEBVTT
>>> Language=en-US
>>> Kind=Captions
>>>
>>> cue1
>>> 0.000-5.000
>>> blab blah
>>>
>>> cue10
>>> 40.000-60.000
>>> ALERT: Your basement is flooding - evacuate!
>>>
>>>
>>> Cue10 is addressed through CSS and turned into a blinking text without
>>> a need to change the markup at all.
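
For illustration only, here is a minimal sketch of how a non-browser tool
might read the header of a file like cap1.vtt above. It assumes the syntax
proposed in this thread rather than the spec as written: "name=value" lines
follow the WEBVTT magic line, the first blank line ends the header, and the
last occurrence of a name wins. The function name parse_header is
hypothetical.

```python
def parse_header(text):
    """Split a WebVTT-like file into (metadata, body_lines).

    Assumed format (a proposal from this thread, not the spec):
    line 1 starts with "WEBVTT"; "name=value" lines follow until the
    first blank line; a repeated name overrides the earlier value.
    """
    lines = text.splitlines()
    if not lines or not lines[0].startswith("WEBVTT"):
        raise ValueError("missing WEBVTT magic string")

    metadata = {}
    index = 1
    # Walk the header: everything up to the first blank line.
    while index < len(lines) and lines[index].strip():
        name, separator, value = lines[index].partition("=")
        if separator:  # header lines without "=" are skipped
            metadata[name.strip()] = value.strip()
        index += 1

    # Skip the blank line itself; the rest is the cue data.
    return metadata, lines[index + 1:]


if __name__ == "__main__":
    sample = (
        "WEBVTT\n"
        "Language=en-US\n"
        "Kind=Captions\n"
        "\n"
        "cue1\n"
        "0.000-5.000\n"
        "blab blah\n"
    )
    header, body = parse_header(sample)
    print(header)
    print(body)
```

A browser following the parsing steps discussed earlier would simply skip
these header lines, so a reader like this would only matter to external
tools.
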
>>
>> My point was that you could just as well do this:
>>
>> 0.000-5.000
>> <c.cue1>blab blah</c>
>>
>> In my view of things, id's in HTML are primarily for addressing via
>> #fragments and as hooks for scripts, for styling class is quite sufficient,
>> so I'm thinking it would be for WebVTT as well.
>
> I quite like the idea of using the identifiers for named media
> fragment URIs: e.g. http://example.org/video.webm#cue10 . We need
> identifiers for this. Also, I find them less intrusive in the text
> than <c.cue1> which defines a class that is only ever used on this
> single cue.

Hmm, isn't that what we have chapters for? Or do you want to use id's for
some kind of inline chapters?

>>>>> 5. Ability to move captions out of the way
>>>>>
>>>>> Our experience with automated caption creation and positioning on
>>>>> YouTube indicates that it is almost impossible to always place the
>>>>> captions out of the way of where a user may be interested to look at.
>>>>> We therefore allow users to dynamically move the caption rendering
>>>>> area to a different viewport position to reveal what is underneath. We
>>>>> recommend such drag-and-drop functionality also be made available for
>>>>> TimedTrack captions on the Web, especially when no specific
>>>>> positioning information is provided.
>>>>
>>>> This would indeed be rather nice, but wouldn't it interfere with text
>>>> selection? Detaching the captions into a floating, draggable window via
>>>> the context menu would be a theoretically possible solution, but that's
>>>> getting rather far ahead of ourselves before we have basic captioning
>>>> support.
>>>
>>> On YouTube you can only move them within the video viewport. You
>>> should try it - it's really awesome actually.
>>>
>>> When you say "interfere with text selection" are you suggesting that
>>> the text of captions/subtitles should be able to be cut and pasted? I
>>> wonder what copyright holders think about that.
>>
>> Being able to select the captions just like any other text is a great thing
>> that I wouldn't want to disable. It's very useful if you want to pause and
>> look up the definition of a word or to report a typo in the captions without
>> having to retype the whole text.
>
> I guess you can have all of that as you can have it on Web pages, too.
> If you click and hold, it will be grabbing for moving. If you double
> click it is text selection for cut and paste. So, I don't think there
> would be a problem.

That would work, but I have to admit I've never seen a web page/browser
combination that does what you suggest. Just single clicking and dragging is
certainly the most discoverable form of text selection.

>> Premium Captions can be protected using the same tricks that are used to
>> prevent Premium DOM Text Nodes from being copied.
>
> Agreed.

--
Philip Jägenstedt
Core Developer
Opera Software
Received on Tuesday, 15 February 2011 02:09:19 UTC
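
The tentative line-balancing steps listed in the message above (render at
full width, count the lines, then narrow the container as far as possible
without adding a line) can be sketched roughly as follows. This is only an
approximation under stated assumptions: character counts stand in for
rendered pixel widths, Python's textwrap stands in for a UA's line breaker,
and balanced_wrap is a hypothetical name. Final alignment relative to the
original width is left to the caller.

```python
import textwrap


def balanced_wrap(text, max_width):
    """Wrap text, then narrow the width while the line count is unchanged.

    Assumption: widths are measured in characters, not rendered pixels.
    """
    # Steps 2-3: wrap at the full available width and count the lines.
    best = textwrap.wrap(text, width=max_width)
    target_lines = len(best)

    # Step 4: shrink the width one unit at a time; stop as soon as the
    # text would need an extra line.
    for width in range(max_width - 1, 0, -1):
        candidate = textwrap.wrap(text, width=width)
        if len(candidate) != target_lines:
            break
        best = candidate

    # Step 5 (alignment within max_width) is up to the caller.
    return best


if __name__ == "__main__":
    cue = "Who could have guessed that not all fonts are the same size?"
    for line in balanced_wrap(cue, 40):
        print(line.center(40))
```

Shrinking one unit at a time keeps the sketch short; a real renderer would
more likely search over pixel widths and re-run layout for each candidate.
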