Re: Video feedback

From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Date: Fri, 3 Jun 2011 18:21:10 +1000
Message-ID: <BANLkTimAhNfK=brtmLX2Ay+XCmb6_uiZ7A@mail.gmail.com>
To: www-archive@w3.org
Seems this mail was not archived at
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2011-June/
Thus forwarding it for archiving.
Regards,
Silvia.

On Fri, Jun 3, 2011 at 9:28 AM, Ian Hickson <ian@hixie.ch> wrote:
>
> (Note that I have tried to only reply to each suggestion once, so
> subsequent requests for the same feature are not included below.)
>
> (I apologise for the somewhat disorganised state of this e-mail. I
> normally try to group topics together, but the threads I'm responding to
> here jumped back and forth across different issues quite haphazardly and
> trying to put related things together broke some of the flow and context
> of the discussions, so I opted in several places to leave the context as
> it was originally presented, and just jump back and forth amongst the
> topics raised. Hopefully it's not too confusing.)
>
> On Thu, 9 Dec 2010, Silvia Pfeiffer wrote:
>> >> > >
>> >> > > Sure, but this is only a snippet of an actual application. If,
>> >> > > e.g., you want to step through a list of videos (maybe an
>> >> > > automated playlist) using script and you need to provide at least
>> >> > > two different formats with <source>, you'd want to run this
>> >> > > algorithm frequently.
>> >> >
>> >> > Just have a bunch of <video>s in the markup, and when one ends,
>> >> > hide it and show the next one. Don't start dynamically manipulating
>> >> > <source> elements, that's just asking for pain.
>> >> >
>> >> > If you really must do it all using script, just use canPlayType and
>> >> > the <video src=""> attribute, don't mess around with <source>.
>> >>
>> >> Thanks for adding that advice. I think it's important to point that
>> >> out.
>> >
>> > I can add it to the spec too if you think that would help. Where would
>> > a good place for it be?
>>
>> There is a note in the <source> element section that reads as follows:
>> "Dynamically modifying a source element and its attribute when the
>> element is already inserted in a video or audio element will have no
>> effect. To change what is playing, either just use the src attribute on
>> the media element directly, or call the load() method on the media
>> element after manipulating the source elements."
>>
>> Maybe you can add some advice there to use canPlayType to identify what
>> type of resource to add in the @src attribute on the media element.
>> Also, you should remove the last half of the second sentence in this
>> note if that is not something we'd like to encourage.
>
> Done.
>
>
> On Wed, 8 Dec 2010, Kevin Marks wrote:
>>
>> One case where posters come back after playback is complete is when
>> there are multiple videos on the page, and only one has playback focus
>> at a time, such as a page of preview movies for longer ones to purchase.
>>
>> In that case, showing the poster again on blur makes sense conceptually.
>>
>> It seems that getting back into the pre-playback state, showing the
>> poster again would make sense in this context.
>>
>> That would imply adding an unload() method that reverted to that state,
>> and could be used to make any cached media data purgeable in favour of
>> another video that is subsequently loaded.
>
> You don't need unload(), you can just use load(). It essentially resets
> the media element.
>
> It's not hugely efficient, but if we find people are trying to do this a
> lot, then we can add a more efficient variant that just resets the poster
> frame state, I guess. (I'd probably call it stop(), though, not unload().)
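
A minimal sketch of the load()-as-reset idea for Kevin's "one video has playback focus" page. The helper name playExclusively and the "currentTime > 0" test for "has been played" are assumptions made for illustration; only load() and play() are real media element methods.

```javascript
// Hypothetical helper: give one video playback focus and reset the others.
// Calling load() re-runs the media element load algorithm, which returns
// the element to its pre-playback state.
function playExclusively(videos, chosen) {
  for (const video of videos) {
    // Assumption: a non-zero currentTime means the video has been played.
    if (video !== chosen && video.currentTime > 0) {
      video.load();
    }
  }
  return chosen.play();
}
```

As noted above, this is not especially efficient, since each reset discards and re-selects the resource.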
>
>
> On Thu, 9 Dec 2010, David Singer wrote:
>>
>> I think if you want that effect, you flip what's visible in an area of
>> the page between a playing video, and an image.  Relying on the poster
>> is not effective, IMHO.
>
> I don't know, I think it would make semantic sense to have all the videos
> be <video> elements if they're actually going to be played right there.
>
>
> On Thu, 9 Dec 2010, Kevin Marks wrote:
>>
>> I know it's not effective at the moment; it is a common use case.
>> QuickTime had the 'badge' ux for years that hardly anyone took advantage
>> of:
>>
>> http://www.mactech.com/articles/mactech/Vol.16/16.02/Feb00QTToolkit/index.html
>>
>> What we're seeing on the web is a converged implementation of the
>> YouTube-like overlaid grey play button, but this is effectively
>> reimplemented independently by each video site that enables embedding.
>>
>> As we see HTML used declaratively for long-form works like ebooks on
>> lower performance devices, having embedded video that doesn't
>> cumulatively absorb all the memory available is going to be like the old
>> CD-ROM use cases the QT Badge was meant for.
>
> This seems like a presentational issue, for which CSS would be better
> positioned to provide a solution.
>
>
> On Thu, 9 Dec 2010, Boris Zbarsky wrote:
>> On 12/8/10 8:19 PM, Ian Hickson wrote:
>> > Boris wrote:
>> > > You can't sniff in a toplevel browser window.  Not the same way that
>> > > people are sniffing in <video>. It would break the web.
>> >
>> > How so?
>>
>> People actually rely on the not-sniffing behavior of UAs in actual
>> browser windows in some cases.  For example, application/octet-stream at
>> toplevel is somewhat commonly used to force downloads without a
>> corresponding Content-Disposition header (poor practice, but support for
>> Content-Disposition hasn't been historically great either).
>>
>> > (Note that the spec as it stands takes a compromise position: the
>> > content is only accepted if the Content-Type and type="" values are
>> > supported types (if present) and the content sniffs as a supported
>> > type, but nothing in the spec checks that all three values are the
>> > same.)
>>
>> Ah, I see.  So similar to the way <img> is handled...
>>
>> I can't quite decide whether this is the best of both worlds, or the
>> worst. ;)
>
> Yeah, I hear ya.
>
>
>> It certainly makes it simpler to implement video by delegating to
>> QuickTime or the like, though I suspect such an implementation would
>> also end up sniffing types the UA doesn't necessarily claim to
>> support.... so maybe it's not simpler after all.
>
> Indeed.
>
> At this point I'm basically just waiting to see what implementations end
> up doing. When I tried moving us more towards sniffing, there was
> pushback; when I tried moving us more towards honouring types, there was
> equal and opposite pushback. So at this point, I'm letting the market
> decide it. :-)
>
>
> On Thu, 9 Dec 2010, Simon Pieters wrote:
>> On Thu, 09 Dec 2010 02:58:12 +0100, Ian Hickson <ian@hixie.ch> wrote:
>> > On Wed, 1 Sep 2010, Simon Pieters wrote:
>> > >
>> > > I think it might be good to run the media element load algorithm
>> > > when setting or changing src on <source> (that has a media element
>> > > as its parent), but not type and media (what's the use case for type
>> > > and media?). However it would fire an 'emptied' event for each
>> > > <source> that changed, which is kind of undesirable. Maybe the media
>> > > element load algorithm should only be invoked if src is set or
>> > > changed on a <source> that has no previous sibling <source>
>> > > elements?
>> >
>> > What's the use case? Just set .src before you insert the element.
>>
>> The use case under discussion is changing to another video. So the
>> element is already inserted and already has src.
>>
>> Something like:
>>
>> <video controls autoplay>
>> <source src=video1.webm type=video/webm>
>> <source src=video1.mp4 type=video/mp4>
>> </video>
>> <script>
>> function loadVideo(src) {
>>  var video = document.getElementsByTagName('video')[0];
>>  var sources = video.getElementsByTagName('source');
>>  sources[0].src = src + '.webm';
>>  sources[1].src = src + '.mp4';
>> }
>> </script>
>> <input type="button" value="See video 1" onclick="loadVideo('video1')">
>> <input type="button" value="See video 2" onclick="loadVideo('video2')">
>> <input type="button" value="See video 3" onclick="loadVideo('video3')">
>
> Well if you _really_ want to do that, just call video.load() at the end of
> loadVideo(). But really, you're better off poking around with
> canPlayType() and setting video.src directly instead of using <source>
> for these dynamic cases.
>
>
> On Thu, 9 Dec 2010, Kevin Carle wrote something more or less like:
>>
>> function loadVideo(src) {
>>  var video = document.getElementsByTagName('video')[0];
>>  if (video.canPlayType("video/webm") != "")
>>    video.src = src + '.webm';
>>  else
>>    video.src = src + '.mp4';
>> }
>
> Yeah.
>
> And hopefully this will become moot when there's a common video format,
> anyway.
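
A slightly more general sketch of the same canPlayType() approach, for more than two formats. The helper name pickPlayableSource and the { src, type } candidate shape are assumptions made for illustration; canPlayType() itself returns "", "maybe" or "probably".

```javascript
// Hypothetical helper: choose the first candidate the element claims it can
// play, preferring a "probably" answer over a "maybe".
function pickPlayableSource(video, candidates) {
  let maybe = null;
  for (const candidate of candidates) {
    const answer = video.canPlayType(candidate.type);
    if (answer === "probably") return candidate.src;
    if (answer === "maybe" && maybe === null) maybe = candidate.src;
  }
  return maybe; // null when nothing is playable
}

// Sets video.src directly instead of manipulating <source> elements.
function loadVideo(video, baseName) {
  const src = pickPlayableSource(video, [
    { src: baseName + ".webm", type: "video/webm" },
    { src: baseName + ".mp4", type: "video/mp4" },
  ]);
  if (src !== null) video.src = src;
  return src;
}
```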
>
>
> On Fri, 10 Dec 2010, Simon Pieters wrote:
>>
>> You'd need to remove the <source> elements to keep the document valid.
>
> You don't need them in the first place if you're doing things by script,
> as far as I can tell.
>
>
>> The author might want to have more than two <source>s, maybe with
>> media="", onerror="" etc. Then it becomes simpler to rely on the
>> resource selection algorithm.
>
> It's hard to comment without seeing a concrete use case.
>
>
> On Tue, 14 Dec 2010, Philip Jägenstedt wrote:
>> On Wed, 24 Nov 2010 17:11:02 +0100, Eric Winkelman <E.Winkelman@cablelabs.com>
>> wrote:
>> >
>> > I'm investigating how TimedTracks can be used for in-band-data-tracks
>> > within MPEG transport streams (used for cable television).
>> >
>> > In this format, the number and types of in-band-data-tracks can change
>> > over time.  So, for example, when the programming switches from a
>> > football game to a movie, an alternate language track may appear that
>> > wasn't there before. Later, when the programming changes again, that
>> > language track may be removed.
>> >
>> > It's not clear to me how these changes are exposed by the proposed
>> > Media Element events.
>>
>> The thinking is that you switch between different streams by setting the
>> src="" attribute to point to another stream, in which case you'll get an
>> emptied event along with another bunch of events. If you have a single
>> source where audio/video/text streams appear and disappear, there's not
>> really any way to handle it.
>
> As specified, there's no way for a media element's in-band text tracks to
> change after the 'loadedmetadata' event has fired.
>
>
>> > The "loadedmetadata" event is used to indicate that the TimedTracks
>> > are ready, but it appears that it is only fired before playback
>> > begins.  Is this event fired again whenever a new track is discovered?
>> > Is there another event that is intended for this situation?
>> >
>> > Similarly, is there an event that indicates when a track has been
>> > removed? Or is this also handled by the "loadedmetadata" event
>> > somehow?
>>
>> No, the loadedmetadata event is only fired once per resource, it's not
>> the event you're looking for.
>>
>> As for actual solutions, I think that having loadedmetadata fire again
>> if the number or type of streams change would make some sense.
>
> It would be helpful to know more about these cases where there are dynamic
> changes to the audio, video, or text tracks. Does this really happen on
> the Web? Do we need to handle it?
>
>
> On Thu, 16 Dec 2010, Silvia Pfeiffer wrote:
>>
>> I do not know how technically the change of stream composition works in
>> MPEG, but in Ogg we have to end a current stream and start a new one to
>> switch compositions. This has been called "sequential multiplexing" or
>> "chaining". In this case, stream setup information is repeated, which
>> would probably lead to creating a new stream handler and possibly a new
>> firing of "loadedmetadata". I am not sure how chaining is implemented in
>> browsers.
>
> Per spec, chaining isn't currently supported. The closest thing I can find
> in the spec to this situation is handling a non-fatal error, which causes
> the unexpected content to be ignored.
>
>
> On Fri, 17 Dec 2010, Eric Winkelman wrote:
>>
>> The short answer for changing stream composition is that there is a
>> Program Map Table (PMT) that is repeated every 100 milliseconds and
>> describes the content of the stream.  Depending on the programming, the
>> stream's composition could change entering/exiting every advertisement.
>
> If this is something that browser vendors want to support, I can specify
> how to handle it. Anyone?
>
>
> On Sat, 18 Dec 2010, Robert O'Callahan wrote:
>>
>> http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#dom-media-duration says:
>> [...]
>>
>> What if the duration is not currently known?
>
> The user agent must determine the duration of the media resource before
> playing any part of the media data and before setting readyState to a
> value equal to or greater than HAVE_METADATA, even if doing so requires
> fetching multiple parts of the resource.
>
>
>> I think in general it will be very difficult for a user-agent to know
>> that a stream is unbounded. In Ogg or WebM a stream might not contain an
>> explicit duration but still eventually end. Maybe it would make more
>> sense for the last sentence to read "If the media resource is not known
>> to be bounded, ..."
>
> Done.
>
>
> On Sat, 18 Dec 2010, Philip Jägenstedt wrote:
>>
>> Agreed, this is how I've interpreted the spec already. If a server
>> replies with 200 OK instead of 206 Partial Content and the duration
>> isn't in the header of the resource, then the duration is reported to be
>> Infinity. If the resource eventually ends another durationchange event
>> is fired and the duration is reported to be the (now known) length of
>> the resource.
>
> That's fine.
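
A rough sketch of how a page might observe the behaviour described above, where duration starts out as Infinity and a later durationchange event reports the real length. The helper names describeDuration and watchDuration and the label strings are assumptions made for illustration; the duration attribute and the durationchange event are the real API.

```javascript
// Hypothetical helper: turn the duration attribute into a display string.
function describeDuration(video) {
  const d = video.duration;
  if (Number.isNaN(d)) return "unknown (no metadata yet)";
  if (d === Infinity) return "not known to be bounded";
  return d.toFixed(1) + " s";
}

// Re-report whenever the user agent revises the duration, for example when
// a resource served without a usable length eventually ends.
function watchDuration(video, report) {
  video.addEventListener("durationchange", () => {
    report(describeDuration(video));
  });
}
```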
>
>
> On Mon, 20 Dec 2010, Robert O'Callahan wrote:
>>
>> That sounds good to me. We'll probably do that. The spec will need to be
>> changed though.
>
> I changed it as you suggest above.
>
>
> On Fri, 31 Dec 2010, Bruce Lawson wrote:
>> > On Fri, 5 Nov 2010, Bruce Lawson wrote:
>> > >
>> > > http://www.whatwg.org/specs/web-apps/current-work/complete/video.html#sourcing-in-band-timed-tracks
>> > > says to create TimedTrack objects etc for in-band tracks which are
>> > > then exposed in the API - so captions/subtitles etc that are
>> > > contained in the media container file are exposed, as well as those
>> > > tracks pointed to by the <track> element.
>> > >
>> > > But
>> > > http://www.whatwg.org/specs/web-apps/current-work/complete/video.html#timed-track-api
>> > > implies that the array is only of tracks in the track element:
>> > >
>> > > "media . tracks . length
>> > >
>> > > Returns the number of timed tracks associated with the media element
>> > > (e.g. from track elements). This is the number of timed tracks in
>> > > the media element's list of timed tracks."
>> >
>> > I don't understand why you interpret this as implying anything about
>> > the track element. Are you interpreting "e.g." as "i.e."?
>> >
>> > > Suggestion: amend to say "Returns the number of timed tracks
>> > > associated with the media element (e.g.  from track elements and any
>> > > in-band track files inside the media container file)" or some such.
>> >
>> > I'd rather avoid talking about the in-band ones here, in part because
>> > I think it's likely to confuse authors at least as much as help them,
>> > and in part because the terminology around in-band timed tracks is a
>> > little unclear to me and so I'd rather not talk about them in
>> > informative text. :-)
>> >
>> > If you disagree, though, let me know. I can find a way to make it
>> > work.
>>
>> I disagree, but not aggressively vehemently. My confusion was conflating
>> "track elements" with the three instances of the phrase "timed tracks"
>> in close proximity.
>>
>> I suggest that "Returns the number of timed tracks associated with the
>> media element (i.e. from track elements and any packaged along with the
>> media in its container file)" would be clearer and avoid use of the
>> confusing phrase "in-band tracks".
>
> That's still confusing, IMHO. "Packaged" doesn't imply in-band; most
> subtitle files are going to be "packaged" with the video even if they're
> out of band.
>
> Also, your 'i.e.' here is wrong. There's at least one other source of
> tracks: the ones added by the script.
>
> The non-normative text is intentionally not overly precise, because if it
> was precise it would just be the same as the normative text and wouldn't
> be any simpler, defeating its entire purpose.
>
>
> On Mon, 3 Jan 2011, Philip Jägenstedt wrote:
>> >
>> > + I've added a magic string that is required on the format to make it
>> >   recognisable in environments with no or unreliable type labeling.
>>
>> Is there a reason it's "WEBVTT FILE" instead of just "WEBVTT"? "FILE"
>> seems redundant and like unnecessary typing to me.
>
> It seemed more likely that non-WebVTT files would start with a line that
> said just "WEBVTT" than a line that said just "WEBVTT FILE". But I guess
> "WEBVTT FILE FORMAT" is just as likely and it'll be caught.
>
> I've changed it to just "WEBVTT"; there may be existing implementations
> that only accept "WEBVTT FILE" so for now I recommend that authors still
> use the longer header.
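
A minimal sketch of a signature check that accepts both the short and the long header. The function name hasWebVTTSignature is hypothetical, and the exact rule (optional BOM, then "WEBVTT" followed by end of line, a space or a tab) is an assumption modelled on later WebVTT drafts rather than on the text as it stood at the time of this mail.

```javascript
// Hypothetical helper: does this text start with a WebVTT signature line?
function hasWebVTTSignature(text) {
  // Skip a leading byte order mark, if any.
  const body = text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
  const firstLine = body.split(/\r\n|\r|\n/, 1)[0];
  // "WEBVTT" alone, or "WEBVTT" followed by a space or tab and anything else.
  return /^WEBVTT($|[ \t])/.test(firstLine);
}
```

Under this rule "WEBVTT" and "WEBVTT FILE" both pass, while a first line such as "WEBVTTX" or an SRT-style "1" does not.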
>
>
>> > On Wed, 8 Sep 2010, Philip Jägenstedt wrote:
>> > >
>> > > In the discussion on public-html-a11y <trackgroup> was suggested to
>> > > group together mutually exclusive tracks, so that enabling one
>> > > automatically disables the others in the same trackgroup.
>> > >
>> > > I guess it's up to the UA how to enable and disable <track>s now,
>> > > but the only option is making them all mutually exclusive (as
>> > > existing players do) or a weird kind of context menu where it's
>> > > possible to enable and disable tracks completely independently.
>> > > Neither option is great, but as a user I would almost certainly
>> > > prefer all tracks being mutually exclusive and requiring scripts to
>> > > enable several at once.
>> >
>> > It's not clear to me what the use case is for having multiple groups
>> > of mutually exclusive tracks.
>> >
>> > The intent of the spec as written was that a browser would by default
>> > just have a list of all the subtitle and caption tracks (the latter
>> > with suitable icons next to them, e.g. the [CC] icon in US locales),
>> > and the user would pick one (or none) from the list. One could easily
>> > imagine a UA allowing the user to enable multiple tracks by having the
>> > user ctrl-click a menu item, though, or some similar solution, much
>> > like with the commonly seen select box UI.
>>
>> In the vast majority of cases, all tracks are intended to be mutually
>> exclusive, such as English+English HoH or subtitles in different
>> languages. No media player UI (hardware or software) that I have ever
>> used allows enabling multiple tracks at once. Without any kind of hint
>> about which tracks make sense to enable together, I can't see desktop
>> Opera allowing multiple tracks (of the same kind) to be enabled via the
>> main UI.
>
> Personally I think it's quite reasonable to want to see two languages at
> once, or even two forms of the same language at once, especially for,
> e.g., reviewing subtitles. But I don't think it would be a bad thing if
> some browsers didn't expose that in the UI; that's something that could
> be left to bookmarklets, for example.
>
>
>> Using this syntax, I would expect some confusion when you omit the closing
>> </v>, when it's *not* a cue spoken by two voices at the same time, such as:
>>
>> <v Jim>- Boo!
>> <v Bob>- Gah!
>>
>> Gah! is spoken by both Jim and Bob, but that was likely not intended. If
>> this causes confusion, we should make validators warn about multiple
>> voices with no closing </v>.
>
> No need to just warn, the spec says the above is outright invalid, so
> they would raise an error.
>
>
>> > > For captions and subtitles it's less common, but rendering it
>> > > underneath the video rather than on top of it is not uncommon, e.g.
>> > > http://nihseniorhealth.gov/video/promo_qt300.html or
>> >
>> > Conceptually, that's in the video area, it's just that the video isn't
>> > centered vertically. I suppose we could allow UAs to do that pretty
>> > easily, if it's commonly desired.
>>
>> It's already possible to align the video to the top of its content box
>> using <http://dev.w3.org/csswg/css3-images/#object-position>:
>>
>> video { object-position: center top }
>>
>> (This is already supported in Opera, but prefixed: -o-object-position)
>
> Sounds good.
>
>
>> Note that in Sweden captioning for the HoH is delivered via the teletext
>> system, which would allow ASCII-art to be displayed. Still, I've never
>> seen it. The only case of graphics being used in "subtitles" I can
>> remember ever seeing is the DVD of
>> <http://en.wikipedia.org/wiki/Cat_Soup>, where the subtitle system is
>> (ab)used to overlay some graphics.
>
> Yeah, I'm not at all concerned about not supporting graphics in subtitles.
> It's nowhere near the 80% bar.
>
>
>> If we ever want comments, we need to add support in the parser before
>> any content accidentally uses the syntax, in other words pretty soon
>> now.
>
> No, we can use any syntax that the parser currently ignores. It won't
> break backwards compat with content that already uses it by then, since
> the whole point of comments is to be ignored. The only difference is
> whether validators complain or not.
>
>
>> > On Tue, 14 Sep 2010, Anne van Kesteren wrote:
>> > >
>> > > Apart from text/plain I cannot think of a "web" text format that
>> > > does not have comments.
>> >
>> > But what's the use case? Is it really useful to have comments in a
>> > subtitle file?
>>
>> Being able to put licensing/contact information at the top of the file
>> would be useful, just as it is in JavaScript/CSS.
>
> Well the parser explicitly skips over anything in the header block
> (everything up to the first blank line IIRC), so if we find that people
> want this then we can allow it without having to change any UAs except the
> validators.
>
>
>> > On Fri, 22 Oct 2010, Simon Pieters wrote:
>> > > >
>> > > > It can still be inspired by it though so we don't have to change
>> > > > much. I'd be curious to hear what other things you'd clean up
>> > > > given the chance.
>> > >
>> > > WebSRT has a number of quirks to be compatible with SRT, like
>> > > supporting both comma and dot as decimal separators, the weird
>> > > parsing of timestamps, etc.
>> >
>> > I've cleaned the timestamp parsing up. I didn't see others.
>>
>> I consider the cue id line (the line preceding the timing line) to be
>> cruft carried over from SRT. Now that we have both classes and the
>> possibility of getting a cue by index, why do we need it?
>
> It's optional, but it is useful, especially for metadata tracks, as a way
> to grab specific cues. For example, consider a metadata or chapter track
> that contains cues with specific IDs that the site would use to jump to
> particular parts of the video in response to key presses, such as "start
> of content after intro", or maybe for a podcast with different segments,
> where the user can jump to "news" and "reviews" and "final thought" -- you
> need an ID to be able to find the right cue quickly.
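
A minimal sketch of the "jump to a named cue" use case. The helper name seekToCue and the cue ID "news" are assumptions made for illustration, and cues.getCueById() is the name used in the current HTML standard's text track API, which may differ from the TimedTrack naming in the draft under discussion.

```javascript
// Hypothetical helper: seek the video to the start of the cue with this ID.
function seekToCue(video, track, cueId) {
  // track.cues can be null, e.g. while the track is disabled.
  const cue = track.cues ? track.cues.getCueById(cueId) : null;
  if (!cue) return false;
  video.currentTime = cue.startTime;
  return true;
}
```

A key handler for a podcast page could then call something like seekToCue(video, chapterTrack, "news").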
>
>
>> > > There was also some discussion about metadata. Language is sometimes
>> > > necessary for the font engine to pick the right glyph.
>> >
>> > Could you elaborate on this? My assumption was that we'd just use CSS,
>> > which doesn't rely on language for this.
>>
>> It's not in any spec that I'm aware of, but some browsers (including
>> Opera) pick different glyphs depending on the language of the text,
>> which really helps when rendering CJK when you have several CJK fonts on
>> the system. Browsers will already know the language from <track
>> srclang>, so this would be for external players.
>
> How is this problem solved in SRT players today?
>
>
> On Mon, 14 Feb 2011, Philip Jägenstedt wrote:
>>
>> Given that most existing subtitle formats don't have any language
>> metadata, I'm a bit skeptical. However, if implementors of non-browser
>> players want to implement WebVTT and ask for this I won't stand in the
>> way (not that I could if I wanted to). For simplicity, I'd prefer the
>> language metadata from the file to not have any effect on browsers
>> though, even if no language is given on <track>.
>
> Indeed.
>
>
> On Tue, 4 Jan 2011, Alex Bishop wrote:
>>
>> Firefox too. If you visit
>> http://people.mozilla.org/~jdaggett/webfonts/serbianglyphs.html in
>> Firefox 4, the text explicitly marked-up as being Serbian Cyrillic
>> (using the lang="sr-Cyrl" attribute) uses some different glyphs to the
>> text with no language metadata.
>
> This seems to be in violation of CSS; we should probably fix it there
> before fixing it in WebVTT since WebVTT relies on CSS.
>
>
> On Mon, 3 Jan 2011, Philip Jägenstedt wrote:
>>
>> > > * The "bad cue" handling is stricter than it should be. After
>> > > collecting an id, the next line must be a timestamp line. Otherwise,
>> > > we skip everything until a blank line, so in the following the
>> > > parser would jump to "bad cue" on line "2" and skip the whole cue.
>> > >
>> > > 1
>> > > 2
>> > > 00:00:00.000 --> 00:00:01.000
>> > > Bla
>> > >
>> > > This doesn't match what most existing SRT parsers do, as they simply
>> > > look for timing lines and ignore everything else. If we really need
>> > > to collect the id instead of ignoring it like everyone else, this
>> > > should be more robust, so that a valid timing line always begins a
>> > > new cue. Personally, I'd prefer if it is simply ignored and that we
>> > > use some form of in-cue markup for styling hooks.
>> >
>> > The IDs are useful for referencing cues from script, so I haven't
>> > removed them. I've also left the parsing as is for when neither the
>> > first nor second line is a timing line, since that gives us a lot of
>> > headroom for future extensions (we can do anything so long as the
>> > second line doesn't start with a timestamp and "-->" and another
>> > timestamp).
>>
>> In the case of feeding future extensions to current parsers, it's way
>> better fallback behavior to simply ignore the unrecognized second line
>> than to discard the entire cue. The current behavior seems unnecessarily
>> strict and makes the parser more complicated than it needs to be. My
>> preference is to just ignore anything preceding the timing line, but even
>> if we must have IDs it can still be made simpler and more robust than
>> what is currently spec'ed.
>
> If we just ignore content until we hit a line that happens to look like a
> timing line, then we are much more constrained in what we can do in the
> future. For example, we couldn't introduce a "comment block" syntax, since
> any comment containing a timing line wouldn't be ignored. On the other
> hand if we keep the syntax as it is now, we can introduce a comment block
> just by having its first line include a "-->" but not have it match the
> timestamp syntax, e.g. by having it be "--> COMMENT" or some such.
>
> Looking at the parser more closely, I don't really see how doing anything
> more complex than skipping the block entirely would be simpler than what
> we have now, anyway.
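
A much simplified model of the block handling being debated here, not the spec's actual parsing algorithm. The names TIMING, parseBlock and parseCues, the timestamp regular expression and the output shape are all assumptions made for illustration. It shows the strict behaviour: an optional ID line must be followed directly by a timing line, and any other block is skipped whole, which is what leaves room for a future "--> COMMENT" style block.

```javascript
// Assumed timing-line shape: [hh:]mm:ss.ttt --> [hh:]mm:ss.ttt
const TIMING =
  /^(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}[ \t]+-->[ \t]+(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}/;

// Returns a cue object, or null when the whole block is treated as a bad cue.
function parseBlock(lines) {
  let id = "";
  let i = 0;
  // A first line without "-->" is taken as the cue ID.
  if (!lines[0].includes("-->")) {
    id = lines[0];
    i = 1;
  }
  // The next line must be a well-formed timing line; otherwise skip the block.
  if (i >= lines.length || !TIMING.test(lines[i])) return null;
  return { id, timing: lines[i], text: lines.slice(i + 1).join("\n") };
}

function parseCues(text) {
  const cues = [];
  for (const block of text.split(/\n{2,}/)) {
    const lines = block.split("\n").filter((line) => line !== "");
    if (lines.length === 0) continue;
    const cue = parseBlock(lines);
    if (cue !== null) cues.push(cue);
  }
  return cues;
}
```

In this model the "1 / 2 / timing line" block quoted above is dropped entirely, as is a block whose first line is "--> COMMENT".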
>
>
> On Mon, 3 Jan 2011, Glenn Maynard wrote:
>>
>> By the way, the WebSRT hit from Google
>> (http://www.whatwg.org/specs/web-apps/current-work/websrt.html) is 404.
>> I've had to read it out of the Google cache, since I'm not sure where it
>> went.
>
> I added a redirect.
>
>
>> Inline comments (not just line comments) in subtitles are very important
>> for collaborative editing: for leaving notes about a translation, noting
>> where editing is needed or why a change was made, and so on.
>>
>> If a DOM-like interface is specified for this (presumably this will
>> happen later), being able to access inline comments like DOM comment
>> nodes would be very useful for visual editors, to allow displaying
>> comments and to support features like "seek to next comment".
>
> We can add comments pretty easily (e.g. we could say that "<!" starts a
> comment and ">" ends it -- that's already being ignored by the current
> parser), if people really need them. But are comments really that useful?
> Did SRT have problems due to not supporting inline comments? (Or did it
> support inline comments?)
>
>
> On Tue, 4 Jan 2011, Glenn Maynard wrote:
>> On Tue, Jan 4, 2011 at 4:24 AM, Philip Jägenstedt <philipj@opera.com>
>> wrote:
>> > If you need an intermediary format while editing, you can just use any
>> > syntax you like and have the editor treat it specially.
>>
>> If I'd need to write my own parser to write an editor for it, that's one
>> thing--but I hope I wouldn't need to create yet another ad hoc caption
>> format, mirroring the features of this one, just to work around a lack
>> of inline comments.
>
> An editor would need a custom parser anyway to make sure it round-tripped
> syntax errors, presumably.
>
>
>> The cue text already vaguely resembles HTML.  What about <!-- comments
>> -->?  It's universally understood, and doesn't require any new escape
>> mechanisms.
>
> The current parser would end a comment at the first ">", but so long as
> you didn't have a ">" in the comment, "<!--...-->" would work fine within
> cue text. (We would have to be careful in standalone blocks to define it
> in such a way that it could not be confused with a timing line.)
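
A small sketch of the inline-comment behaviour described above, where "<!" opens a comment and the first ">" closes it. The function name stripInlineComments is hypothetical, and treating an unterminated "<!" as running to the end of the cue text is an assumption.

```javascript
// Hypothetical helper: drop "<! ... >" runs from cue text.
function stripInlineComments(cueText) {
  // [^>]* stops at the first ">", so "-->" is not treated specially.
  return cueText.replace(/<![^>]*(?:>|$)/g, "");
}
```

With this rule "<!-- note -->" disappears cleanly, but "<!-- a > b -->" ends at the first ">" and leaves " b -->" in the visible text.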
>
>
> On Wed, 5 Jan 2011, Philip Jägenstedt wrote:
>>
>> The question is rather if the comments should be exposed as DOM comment
>> nodes in getCueAsHTML, which seems to be what you're asking for. That
>> would only be possible if comments were only allowed inside the cue
>> text, which means that you couldn't comment out entire cues, as such:
>>
>> 00:00.000 --> 00:01.000
>> one
>>
>> /*
>> 00:02.000 --> 00:03.000
>> two
>> */
>>
>> 00:04.000 --> 00:05.000
>> three
>>
>> Therefore, my thinking is that comments should be removed during parsing
>> and not be exposed to any layer above it.
>
> We can support both, if there's really demand for it.
>
> For example:
>
>  00:00.000 --> 00:01.000
>  one <! inline comment > one
>
>  COMMENT-->
>  00:02.000 --> 00:03.000
>  two; this is entirely
>  commented out
>
>  <! this is the ID line
>  00:04.000 --> 00:05.000
>  three; last line is a ">"
>  which is part of the cue
>  and is not a comment.
>  >
>
> The above would work today in a conforming UA. The question really is what
> parts of this do we want to support and what do we not care enough about.
>
>
> On Wed, 5 Jan 2011, Anne van Kesteren wrote:
>> On Wed, 05 Jan 2011 10:58:56 +0100, Philip Jägenstedt
>> <philipj@opera.com> wrote:
>> > Therefore, my thinking is that comments should be removed during
>> > parsing and not be exposed to any layer above it.
>>
>> CSS does that too. It has not caused problems so far. It does mean
>> editing tools need a slightly different DOM, but that is always the case
>> as they want to preserve whitespace details, etc., too. At least editors
>> that have both a text and visual interface.
>
> Right.
>
>
> On Fri, 14 Jan 2011, Silvia Pfeiffer wrote:
>>
>> We are concerned, however, about the introduction of WebVTT as a
>> universal captioning format *when used outside browsers*. Since a subset
>> of CSS features is required to bring HTML5 video captions on par with TV
>> captions, non-browser applications will need to support these CSS
>> features, too. However, we do not believe that external CSS files are an
>> acceptable solution for non-browser captioning and therefore think that
>> those CSS features (see [1]) should eventually be made part of the
>> WebVTT specification.
>>
>> [1] http://www.whatwg.org/specs/web-apps/current-work/multipage/rendering.html#the-'::cue'-pseudo-element
>
> I'm not sure what you mean by "made part of the WebVTT specification", but
> if you mean that WebVTT should support inline CSS, that does seem like
> something we can add, e.g. using syntax like this:
>
>   WEBVTT
>
>   STYLE-->
>   ::cue(v[voice=Bob]) { color: green; }
>   ::cue(c.narration) { font-style: italic; }
>   ::cue(c.narration i) { font-style: normal; }
>
>   00:00.000 --> 00:02.000
>   Welcome.
>
>   00:02.500 --> 00:05.000
>   To WebVTT.
>
> I suggest we wait until WebVTT and '::cue' in particular have shipped in
> at least one browser and been demonstrated as being useful before adding
> this kind of feature though.
>
>
>> 1. Introduce file-wide metadata
>>
>> WebVTT requires a structure to add header-style metadata. We are here
>> talking about lists of name-value pairs as typically in use for header
>> information. The metadata can be optional, but we need a defined means
>> of adding them.
>>
>> Required attributes in WebVTT files should be the main language in use
>> and the kind of data found in the WebVTT file - information that is
>> currently provided in the <track> element by the @srclang and @kind
>> attributes. These are necessary to allow the files to be interpreted
>> correctly by non-browser applications, for transcoding or to determine
>> if a file was created as a caption file or something else, in particular
>> the @kind=metadata. @srclang also sets the base directionality for BiDi
>> calculations.
>>
>> Further metadata fields that are typically used by authors to keep
>> specific authoring information or usage hints are necessary, too. As
>> examples of current use see the format of MPlayer mpsub’s header
>> metadata [2], EBU STL’s General Subtitle Information block [3], and
>> even CEA-608’s Extended Data Service with its StartDate, Station,
>> Program, Category and TVRating information [4]. Rather than specifying a
>> specific subset of potential fields we recommend to just have the means
>> to provide name-value pairs and leave it to the negotiation between the
>> author and the publisher which fields they expect of each other.
>>
>> [2] http://www.mplayerhq.hu/DOCS/tech/mpsub.sub
>> [3] https://docs.google.com/viewer?a=v&q=cache:UKnzJubrIh8J:tech.ebu.ch/docs/tech/tech3264.pdf
>> [4] http://edocket.access.gpo.gov/cfr_2007/octqtr/pdf/47cfr15.119.pdf
>
> I don't understand the use cases here.
>
> CSS and JS don't have anything like this, why should WebVTT? What problem
> is this solving? How did SRT solve this problem?
>
>
>> 2. Introduce file-wide cue settings
>>
>> At the moment if authors want to change the default display of cues,
>> they can only set them per cue (with the D:, S:, L:, A: and T: cue
>> settings) or have to use an external CSS file through a HTML page with
>> the ::cue pseudo-element. In particular when considering that all
>> Asian language files would require a “D:vertical” marker, it becomes
>> obvious that this replication of information in every cue is
>> inefficient and a waste of bandwidth, storage, and application speed.
>> A cue setting default section should be introduced into a file
>> header/setup area of WebVTT which will avoid such replication.
>>
>> An example document with cue setting defaults in the header could look
>> as follows:
>> ==
>> WEBVTT
>> Language=zh
>> Kind=Caption
>> CueSettings= A:end D:vertical
>>
>> 00:00:15.000 --> 00:00:17.950
>> 在左边我们可以看到...
>>
>> 00:00:18.160 --> 00:00:20.080
>> 在右边我们可以看到...
>>
>> 00:00:20.110 --> 00:00:21.960
>> ...捕蝇草械.
>> ==
>>
>> Note that you might consider that the solution to this problem is to use
>> external CSS to specify a change to all cues. However, this is not
>> acceptable for non-browser applications and therefore not an acceptable
>> solution to this problem.
>
> Adding defaults seems like a reasonable feature. We could add this just by
> adding the ability to have a block in a VTT file like this:
>
>   WEBVTT
>
>   DEFAULTS --> D:vertical A:end
>
>   00:00.000 --> 00:02.000
>   This is vertical and end-aligned.
>
>   00:02.500 --> 00:05.000
>   As is this.
>
>   DEFAULTS --> A:start
>
>   00:05.500 --> 00:07.000
>   This is horizontal and start-aligned.
>
> However, again I suggest that we wait until WebVTT has been deployed in at
> least one browser before adding more features like this.
>
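A minimal sketch of how a preprocessing tool might expand such blocks into
per-cue settings, for consumers that do not understand them. The "DEFAULTS -->"
syntax is only the proposal above, not a shipped feature. The sketch assumes
that each DEFAULTS block replaces the previous one entirely (as the
"horizontal and start-aligned" cue in the example suggests) and that settings
written on a cue itself override the defaults. The function name is
hypothetical.

```python
import re

# A timing line: start timestamp, "-->", end timestamp, optional cue settings.
TIMING = re.compile(r"^(\S+)\s+-->\s+(\S+)(.*)$")
DEFAULTS_PREFIX = "DEFAULTS -->"  # hypothetical syntax from the proposal


def parse_settings(text):
    """Turn 'D:vertical A:end' into {'D': 'vertical', 'A': 'end'}."""
    settings = {}
    for item in text.split():
        key, _, value = item.partition(":")
        settings[key] = value
    return settings


def expand_defaults(text):
    """Copy the active DEFAULTS settings onto each following cue's timing line."""
    defaults = {}
    out = []
    for block in text.strip().split("\n\n"):
        lines = block.split("\n")
        if lines[0].startswith(DEFAULTS_PREFIX):
            # Assumption: a new DEFAULTS block resets all earlier defaults.
            defaults = parse_settings(lines[0][len(DEFAULTS_PREFIX):])
            continue  # the DEFAULTS block itself is dropped from the output
        for i, line in enumerate(lines):
            match = TIMING.match(line)
            if match:
                merged = dict(defaults)
                merged.update(parse_settings(match.group(3)))  # cue wins
                suffix = "".join(f" {k}:{v}" for k, v in merged.items())
                lines[i] = f"{match.group(1)} --> {match.group(2)}{suffix}"
                break
        out.append("\n".join(lines))
    return "\n\n".join(out) + "\n"
```

Blocks without a timing line, such as the "WEBVTT" signature, pass through
unchanged.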
>
>> * positioning: Generally the way in which we need positioning to work is
>> to provide an anchor position for the text and then explain in which
>> direction font size changes and the addition of more text allows the
>> text segment to grow. It seems that the line position cue (L) provides a
>> baseline position and the alignment cue (A) provides the growing
>> direction start/middle/end. Can we just confirm this understanding?
>
> It's more the other way around: the line boxes are laid out and then the
> resulting line boxes are positioned according to the A: and L: lines. In
> particular, the L: lines when given with a % character position the line
> boxes in the same manner that CSS background-position positions the
> background image, and L: lines without a % character set the position of
> the line boxes based on the height of the first line box. A: lines then
> just set the position of these line boxes relative to the other dimension.
>
>
>> * fontsize: When changing text size in relation to the video changing
>> size or resolution, we need to make sure not to reduce the text size
>> below a specific font size for readability reasons. And we also need to
>> make sure not to make it larger than a specific font size, since
>> otherwise it will dominate the display. We usually want the text to be
>> at least Xpx, but no bigger than Ypx. Also, one needs to pay attention
>> to the effect that significant player size changes have on relative
>> positioning - in particular for the minimum caption text size. Dealing
>> with min and max sizes is missing from the current specification in our
>> understanding.
>
> That's a CSS implementation issue. Minimum font sizes are commonly
> supported in CSS implementations. Maximum font sizes would be similar.
>
>
>> * bidi text: In our experience from YouTube, we regularly see captions
>> that contain mixed languages/directionality, such as Hebrew captions
>> that have a word of English in it. How do we allow for bidi text inside
>> cues? How do we change directionality mid-cue? Do we deal with the
>> zero-width LTR-mark and RTL-mark unicode characters? It would be good to
>> explain how these issues are dealt with in WebVTT.
>
> There's nothing special about how they work in WebVTT; they are handled
> the same as in CSS.
>
>
>> * internationalisation: D:vertical and D:vertical-lr seem to only work
>> for vertical text - how about horizontal-rl? For example, Hebrew is a
>> prime example of a language being written from right to left
>> horizontally. Is that supported and how?
>
> What exactly would horizontal-rl do?
>
>
>> * naming: The usage of single letter abbreviations for cue settings has
>> created quite a discussion here at Google. We all agree that file-wide
>> cue settings are required and that this will reduce the need for
>> cue-specific cue settings. We can thus afford a bit more readability in
>> the cue settings. We therefore believe that it would be better if the
>> cue settings were short names rather than single letter codes. This
>> would be more like CSS, too, and easier to learn and get right. In the
>> interface description, the 5 dimensions have proper names which could be
>> re-used (“direction”, “linePosition”, “textPosition”, “size” and
>> “align”). We therefore recommend replacing the single-letter cue
>> commands with these longer names.
>
> That would massively bloat these files and make editing them a huge pain,
> as far as I can tell. I agree that defaults would make it better, but many
> cues would still need their own positioning and sizing information, and
> anything beyond a very few letters would IMHO quickly become far too
> verbose for most people. "L", "A", and "S" are pretty mnemonic, "T" would
> quickly become familiar to people writing cues, and "D" is only going to
> be relevant to some authors but for those authors it's pretty
> self-explanatory as well, since the value is verbose.
>
> What I really would like to do is use "X" and "Y" instead of "T" and "L",
> but those terms would be very confusing when we flip the direction, which
> is why I used the less obvious "T" and "L".
>
>
>> * textcolor: In particular on European TV it is common to distinguish
>> between speakers by giving their speech different colors. The following
>> colors are supported by EBU STL, CEA-608 and CEA-708 and should be
>> supported in WebVTT without the use of external CSS: black, red, green,
>> yellow, blue, magenta, cyan, and white. As default we recommend white on
>> a grey transparent background.
>
> This is supported as 'color' and 'background'.
>
>
>> * underline: EBU STL, CEA-608 and CEA-708 support underlining of
>> characters.
>
> I've added support for 'text-decoration'.
>
>
>> The underline character is also particularly important for some Asian
>> languages.
>
> Could you elaborate on this?
>
>
>> Please make it possible to provide text underlines without the use of
>> CSS in WebVTT.
>
> Why without CSS?
>
>
>> * blink: As much as we would like to discourage blinking subtitles, they
>> are actually a core requirement for EBU STL and CEA-608/708 captions and
>> in use in particular for emergency messages and similar highly important
>> information. Blinking can be considered optional for implementation, but
>> we should allow for it in the standard.
>
> This is part of 'text-decoration'.
>
>
>> * font face: CEA-708 provides a choice of eight font tags: undefined,
>> monospaced serif, proportional serif, monospaced sans serif,
>> proportional sans serif, casual, cursive, small capital. These fonts
>> should be available for WebVTT as well. Is this the case?
>
> Yes.
>
>
>> We are not sure about the best solution to these needs. Would it be best
>> to introduce specific tags for these needs?
>
> CSS seems to handle these needs adequately.
>
>
>> We have a couple of recommendations for changes mostly for aesthetic and
>> efficiency reasons. We would like to point out that Google is very
>> concerned with the dense specification of data and every surplus
>> character, in particular if it is repeated a lot and doesn’t fulfill a
>> need, should be removed to reduce the load created on worldwide
>> networking and storage infrastructures and help render Web pages faster.
>
> This seems to contradict your earlier request to make the language more
> verbose...
>
>
>> * Time markers: WebVTT time stamps follow no existing standard for time
>> markers. Has the use of NPT as introduced by RTSP[5] for time markers
>> been considered (in particular npt-hhmmss)?
>>
>> [5] http://www.ietf.org/rfc/rfc2326.txt
>
> WebVTT follows the SRT format, with commas replaced by periods for
> consistency with the rest of the platform.
>
>
>> * Suggest dropping “-->”: In the context of HTML, “-->” is an end
>> comment marker. It may confuse Web developers and parsers if such a sign
>> is used as a separator. For example, some translation tools expect HTML
>> or XML-based interchange formats and interpret the “>” as part of a
>> tag. Also, common caption convention often uses “>” to represent
>> speaker identification. Thus it is more difficult to write a filter
>> which correctly escapes “-->” but retains “>” for speaker ID.
>
> "-->" seems pretty mnemonic to me. I don't see why we'd want to drop it.
>
>
>> * Duration specification: WebVTT time stamps are always absolute time
>> stamps calculated in relation to the base time of synchronisation with
>> the media resource. While this is simple to deal with for machines, it
>> is much easier for hand-created captions to deal with relative time
>> stamps for cue end times and for the timestamp markers within cues. Cue
>> start times should continue to stay absolute time stamps. Timestamp
>> markers within cues should be relative to the cue start time. Cue end
>> times should be possible to be specified either as absolute or relative
>> timestamps. The relative time stamps could be specified through a prefix
>> of “+” in front of a “ss.mmm” second and millisecond specification.
>> These are not only simpler to read and author, but are also more compact
>> and therefore create smaller files.
>
> I think if anything is absolute, it doesn't really make anything much
> simpler for anything else to be relative, to be honest. Take the example
> you give here:
>
>> An example document with relative timestamps is:
>> ==
>> WEBVTT
>> Language=en
>> Kind=Subtitle
>>
>> 00:00:15.000   +2.950
>> At the left we can see...
>>
>> 00:00:18.160    +1.920
>> At the right we can see the...
>>
>> 00:00:20.110   +1.850
>> ...the <+0.400>head-<+0.800>snarlers
>> ==
>
> If the author were to change the first time stamp because the video gained
> a 30 second advertisement at the start, then he would still need to change
> the hundreds of subsequent timestamps for all the additional cues. What
> does the author gain from not having to change the relative stamps? It's
> not like he's going to be doing it by hand, and once a tool is involved,
> the tool can change everything just as easily.
>
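To illustrate the point that a tool can shift every timestamp just as easily
as one, here is a minimal sketch that adds a fixed offset to all timestamps in
a file, including the ones inside cue text. The function name is hypothetical,
and the sketch will also rewrite anything in cue text that merely looks like a
timestamp.

```python
import re

# Optional hours, then mm:ss.ttt, as used on timing lines and inside cues.
TIMESTAMP = re.compile(r"(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})")


def shift_timestamps(text, offset_ms):
    """Add offset_ms milliseconds to every timestamp found in text."""
    def shift(match):
        hours, minutes, seconds, millis = match.groups()
        total = ((int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)) * 1000
        total += int(millis) + offset_ms
        total = max(total, 0)  # clamp rather than emit a negative time
        total_seconds, ms = divmod(total, 1000)
        total_minutes, s = divmod(total_seconds, 60)
        h, m = divmod(total_minutes, 60)
        # Keep the hours field if the input had one or the result needs one.
        prefix = f"{h:02d}:" if hours is not None or h else ""
        return f"{prefix}{m:02d}:{s:02d}.{ms:03d}"
    return TIMESTAMP.sub(shift, text)
```

For the 30 second advertisement case, a call such as
shift_timestamps(vtt_text, 30_000) would move every cue by the same amount.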
>
>> We are happy to see the introduction of the magic file identifier for
>> WebVTT which will make it easier to identify the file format. We do not
>> believe the “FILE” part of the string is necessary.
>
> I have removed it.
>
>
>> However, we recommend to also introduce a format version number that the
>> file adheres to, e.g. “WEBVTT 0.7”.
>
> Version numbers are an antipattern on the Web, so I have not added one.
>
>
>> This helps to make non-browser systems that parse such files become
>> aware of format changes.
>
> The format will never change in a non-backwards-compatible fashion once it
> is deployed, so that is not a concern.
>
>
>> It can also help identify proprietary standard metadata sets as used by
>> a specific company, such as “WEBVTT 0.7 ABC-meta1” which could signify
>> that the file adheres to WEBVTT 0.7 format specification with the
>> ABC-meta1 metadata schema.
>
> If we add metadata, then that can be handled just by having the metadata
> include that information itself.
>
>
>> CEA-708 captions support automatic line wrapping in a more sophisticated
>> way than WebVTT -- see http://en.wikipedia.org/wiki/CEA-708#Word_wrap.
>>
>> In our experience with YouTube we have found that in certain situations
>> this type of automatic line wrapping is very useful. Captions that were
>> authored for display in a full-screen video may contain too many words
>> to be displayed fully within the actual video presentation (note that
>> mobile / desktop / internet TV devices may each have a different amount
>> of space available, and embedded videos may be of arbitrary sizes).
>> Furthermore, user-selected fonts or font sizes may be larger than
>> expected, especially for viewers who need larger print.
>>
>> WebVTT as currently specified wraps text at the edge of their containing
>> blocks, regardless of the value of the 'white-space' property, even if
>> doing so requires splitting a word where there is no line breaking
>> opportunity. This will tend to create poor quality captions.  For
>> languages where it makes sense, line wrapping should only be possible at
>> carriage return, space, or hyphen characters, but not on &nbsp;
>> characters.  (Note that CEA-708 also contains non-breaking space and
>> non-breaking transparent space characters to help control wrapping.)
>> However, this algorithm will not necessarily work for all languages.
>>
>> We therefore suggest that a better solution for line wrapping would be
>> to use the existing line wrapping algorithms of browsers, which are
>> presumably already language-sensitive.
>>
>> [Note: the YouTube line wrapping algorithm goes even further by
>> splitting single caption cues into multiple cues if there is too much
>> text to reasonably fit within the area. YouTube then adjusts the times
>> of these caption cues so they appear sequentially.  Perhaps this could
>> be mentioned as another option for server-side tools.]
>
> I've adjusted the text in the spec to more clearly require that
> line-breaking follow normal CSS rules but with the additional requirement
> that there not be overflow, which is what I had intended.
>
>
>> 1. Pop-on/paint-on/roll-up support
>>
>> Three different types of captions are common on TV: pop-on, roll-up and
>> paint-on. Captions according to CEA-608/708 need to support captions of
>> all three of these types. We believe they are already supported in
>> WebVTT, but see a need to re-confirm.
>>
>> For pop-on captions, a complete caption cue is timed to appear at a
>> certain time and disappear a few seconds later. This is the typical way
>> in which captions are presented and also how WebVTT/<track> works in our
>> understanding. Is this correct?
>
> As far as I understand, yes.
>
>
>> For roll-up captions, individual lines of captions are presented
>> successively with older lines moving up a line to make space for new
>> lines underneath. Assuming we understand the WebVTT rendering rules
>> correctly, WebVTT would identify each of these lines as an individual,
>> but time-overlapping cue with the other cues. As more cues are created
>> and overlap in time, newer cues are added below the currently visible
>> ones and move the currently visible ones up, basically creating a
>> roll-up effect. If this is a correct understanding, then this is an
>> acceptable means of supporting roll-up captions.
>
> I am not aware of anything currently in the WebVTT specification which
> will cause a cue to move after it has been placed on the video, so I do
> not believe this is a correct understanding.
>
> However, you can always have a cue be replaced by a cue with the same text
> but on a higher line, if you're willing to do some preprocessing on the
> subtitle file. It won't be a smoothly animated scroll, but it would work.
>
> If there is convincing evidence that this kind of subtitle is used on the
> Web, though, we can support it more natively. So far I've only seen it in
> legacy scenarios that do not really map to expected WebVTT use cases.
>
> For supporting those legacy scenarios, you need script anyway (to handle,
> e.g., backspace and moving the cursor). If you have script, doing
> scrolling is possible either by moving the cue, or by not using the
> default UA rendering of the cues at all and doing it manually (e.g. using
> <div>s or <canvas>).
>
>
>> Finally, for paint-on captions, individual letters or words are
>> displayed successively on screen. WebVTT supports this functionality
>> with the cue timestamps <xx:xx:xx.xxx>, which allows to specify
>> characters or words to appear with a delay from within a cue. This
>> essentially realizes paint-on captions. Is this correct?
>
> Yes.
>
>
>> (Note that we suggest using relative timestamps inside cues to make this
>> feature more usable.)
>
> It makes it modestly easier to do by hand, but hand-authoring a "paint-on"
> style caption seems like a world of pain regardless of the timestamp
> format we end up using, so I'm not sure it's a good argument for
> complicating the syntax with a second timestamp format.
>
>
>> The HTML spec specifies that it is not allowed to have two tracks that
>> provide the same kind of data for the same language (potentially empty)
>> and for the same label (potentially empty). However, we need
>> clarification on what happens if there is a duplicate track, ie: does
>> the most recent one win or the first one or will both be made available
>> in the UI and JavaScript?
>
> They are both available.
>
>
>> The spec only states that the combination of {kind, language, label} must be
>> unique. It doesn't say what happens if they are not.
>
> Nothing different happens if they are not than if they are. It's just a
> conformance requirement.
>
>
>> Further, the spec says nothing about duplicate labels altogether - what
>> is a browser supposed to do when two tracks have been marked with the
>> same label?
>
> The same as it does if they have different labels.
>
>
>> It is very important that there is a possibility for users to
>> auto-activate tracks. Which track is chosen as the default track to
>> activate depends on the language preferences of the user. The user is
>> assumed to have a list of language preferences which leads this choice.
>
> I've added a "default" attribute so that sites can control this.
>
>
>> In YouTube, if any tracks exist that match the first language
>> preference, the first of those is used as the default.  A track with
>> no name sorts ahead of one with a name.  The sorting is done according
>> to that language's collation order. In order to override this you
>> would need (1) a default=true attribute for a track which gives it
>> precedence if its language matches, and (2) a way to force the
>> language preference. If no tracks exist for the first language pref,
>> the second language pref is checked, and so on.
>>
>> If the user's language preferences are known, and there are no tracks
>> in that language, you have other options:
>>   (1) offer to do auto-translation (or just do it)
>>   (2) use a track in the same language that the video's audio is in (if known)
>>   (3) if only one track, use the first available track
>>
>> Also make sure the language choice can be overridden by the user
>> through interaction.
>>
>> We’d like to make sure this or a similar algorithm is the recommended
>> way in which browsers deal with caption tracks.
>
> This seems to me to be a user agent quality of implementation issue. User
> preferences almost by definition can't be interoperable, so it's not
> something we can specify.
>
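For reference, the selection order described in the quoted message can be
sketched roughly as follows. This is only an illustration of that description,
not something the spec requires. The Track class and pick_track function are
hypothetical, plain string comparison stands in for language-specific
collation, and falling back to the first track in document order is a
simplification of the fallback options listed.

```python
from dataclasses import dataclass


@dataclass
class Track:
    language: str
    label: str = ""
    default: bool = False


def pick_track(tracks, preferred_languages):
    """Return the track to auto-activate, or None if there are no tracks."""
    for language in preferred_languages:
        matches = [t for t in tracks if t.language == language]
        if matches:
            # Default-flagged tracks first, then unlabelled ones, then by label.
            return min(
                matches,
                key=lambda t: (not t.default, t.label != "", t.label),
            )
    # No preferred language is available: simplified fallback.
    return tracks[0] if tracks else None
```

A user override through the UI would simply bypass this function.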
>
>> As far as we understand, you can currently address all cues through
>> ::cue and you can address a cue part through ::cue-part(<voice> ||
>> <part> || <position> || <future-compatibility>). However, if we
>> understand correctly, it doesn’t seem to be possible to address an
>> individual cue through CSS, even though cues have individual
>> identifiers. This is either an oversight or a misunderstanding on our
>> parts. Can you please clarify how it is possible to address an
>> individual cue through CSS?
>
> I've made the ID referencable from the ::cue() selector argument as an ID
> on the anonymous root element.
>
>
>> Our experience with automated caption creation and positioning on
>> YouTube indicates that it is almost impossible to always place the
>> captions out of the way of where a user may be interested to look at. We
>> therefore allow users to dynamically move the caption rendering area to
>> a different viewport position to reveal what is underneath. We recommend
>> such drag-and-drop functionality also be made available for TimedTrack
>> captions on the Web, especially when no specific positioning information
>> is provided.
>
> I've added text to explicitly allow this.
>
>
> On Sat, 22 Jan 2011, Philip Jägenstedt wrote:
>>
>> Indeed, repeating settings on each cue would be annoying. However,
>> file-wide settings seems like it would easily be too broad, and you'd
>> have to explicitly reverse the effect on the cues where you don't want
>> it to apply. Maybe classes of cue settings or some kind of macros would
>> work better.
>
> My assumption is that similar cues will typically be grouped together, so
> that one could introduce the group with a "DEFAULTS" block and then
>
>
>> Nitpick: Modern Chinese, including captions, is written left-to-right,
>> top-to-bottom, just like English.
>
> Indeed. I don't expect there will be much vertical text captioning. I
> added it primarily to support some esoteric Anime cases.
>
>
>
>> That the intra-cue timings are relative but the timing lines are
>> absolute has bugged me a bit, so if the distinction was more obvious
>> just from the syntax, that'd be great!
>
> They're all absolute.
>
>
>> [for the file signature] "WebSRT" is prettier than "WEBSRT".
>
> The idea is not to be pretty, the idea is to stand out. :-)
>
>
>> I'm inclined to say that we should normalize all whitespace during
>> parsing and not have explicit line breaks at all. If people really want
>> two lines, they should use two cues. In practice, I don't know how well
>> that would fare, though. What other solutions are there?
>
> I think we definitely need line breaks, e.g. for cases like:
>
>  -- Do you want to go to the zoo?
>  -- Yes!
>  -- Then put your shoes on!
>
> ...which is quite common style in some locales.
>
> However, I agree that we should encourage people to let browsers wrap the
> lines. Not sure how to encourage that more.
>
>
> On Sun, 23 Jan 2011, Glenn Maynard wrote:
>>
>> It should be possible to specify language per-cue, or better, per block
>> of text mid-cue.  Subtitles making use of multiple languages are common,
>> and it should be possible to apply proper font selection and word
>> wrapping to all languages in use, not just the primary language.
>
> It's not clear to me that we need language information to apply proper
> font selection and word wrapping, since CSS doesn't do it.
>
>
>> When both English subtitles and Japanese captions are on screen, it
>> would be very bad to choose a Chinese font for the Japanese text, and
>> worse to choose a Western font and use it for everything, even if
>> English is the predominant language in the file.
>
> Can't you get around this using explicit styles, e.g. against classes?
> Unless this really is going to be a common problem, I'm not particularly
> concerned about it.
>
>
> On Mon, 24 Jan 2011, Philip Jägenstedt wrote:
>>
>> Multi-languaged subtitles/captions seem to be extremely uncommon,
>> unsurprisingly, since you have to understand all the languages to be
>> able to read them.
>>
>> The case you mention isn't a problem, you just specify Japanese as the
>> main language.
>
> Indeed.
>
>
>> There are a few other theoretical cases:
>>
>> * Multi-language CJK captions. I've never seen this, but outside of
>> captioning, it seems like the foreign script is usually transcribed to
>> the native script (e.g. writing Japanese names with simplified Chinese
>> characters).
>>
>> * Use of Japanese or Chinese words in mostly non-CJK subtitles. This
>> would make correct glyph selection impossible, but I've never seen it.
>>
>> * Voice synthesis of e.g. mixed English/French captions. Given that this
>> would only be useful to people who know both languages, it seems not
>> worth complicating the format for.
>
> Agreed on all fronts.
>
>
>> Do you have any examples of real-world subtitles/captions that would
>> benefit from more fine-grained language information?
>
> This kind of information would indeed be useful.
>
>
> On Mon, 24 Jan 2011, Glenn Maynard wrote:
>>
>> They're very common in anime fansubs:
>>
>> http://img339.imageshack.us/img339/2681/screenshotgg.jpg
>>
>> The text on the left is a transcription, the top is a transliteration,
>> and the bottom is a translation.
>
> Aren't these three separate text tracks?
>
>
>> I'm pretty sure I've also seen cases of translation notes mixing
>> languages within the same caption, eg. "jinja (神社): shrine", but
>> it's less common and I don't have an example handy.
>
> Mixing one CJK language with one non-CJK language seems fine. That should
> always work, assuming you specify good fonts in the CSS.
>
>
>> > The case you mention isn't a problem, you just specify Japanese as the
>> > main language. There are a few other theoretical cases:
>>
>> Then you're indicating that English text is Japanese, which I'd expect
>> to cause UAs to render everything with a Japanese font.  That's what
>> happens when I load English text in Firefox and force SJIS: everything
>> is rendered in MS PGothic.  That's probably just what Japanese users
>> want for English text mixed in with Japanese text, too--but it's
>> generally not what English users want with the reverse.
>
> I don't understand why we can't have good typography for CJK and non-CJK
> together. Surely there are fonts that get both right?
>
>
> On Mon, 24 Jan 2011, Glenn Maynard wrote:
>> >
>> > [ use multiple tracks ]
>>
>> Personally I'd prefer that, but it would require a good deal of metadata
>> support--marking which tracks are meant to be used together, tagging
>> auxiliary track types so browsers can choose (eg. an "English subtitles
>> with no song caption tracks" option), and so on.  I'm sure that's a
>> non-starter (and I'd agree).
>
> It's not that much metadata. It's far less effort than making the
> subtitles in the first place.
>
>
>> I don't think you should need to resort to fine-grained font control to get
>> reasonable default fonts.
>
> I agree entirely, but I don't think you should need to resort to
> fine-grained language tagging either...
>
>
>> The above--semantics vs. presentation--brings something else to mind.
>> One of the harder things to subtitle well is when you have two
>> conversations talking on top of each other.  This is generally done by
>> choosing a vertical spot for each conversation (generally augmented with
>> a color), so the viewer can easily follow one or the other.  Setting the
>> line position *sort of* lets you do this, but that's hard to get right,
>> since you don't know how far apart to put them.  You'd have to err
>> towards putting them too far apart (guessing the maximum number of lines
>> text might be wrapped to, and covering up much more of the screen than
>> usually needed), or putting one set on the top of the screen (making it
>> completely impossible to read both at once, rather than just
>> challenging).
>>
>> If I remember correctly, SSA files do this with a hack: wherever there's
>> a blank spot in one or the other conversation, a transparent dummy cue
>> is added to keep the other conversation in the correct relative spot, so
>> the two conversations don't swap places.
>>
>> I mention this because it comes to mind as something well-authored,
>> well-rendered subtitles need to get right, and I'm curious if there's a
>> reliable way to do this currently with WebVTT.  If this isn't handled,
>> some scenes just fall apart.
>
> It's intended to be done using the L: feature to pick the lines. If the
> cues have more line wrapping than the author expected, it'll break. The
> only way around that would be to go through the whole file (or at least,
> the whole scene, somehow marked up as such) pre-rendering each cue to work
> out what the maximum line heights would be and then using that offset for
> each cue, etc, but that seems like a whole lot of complexity for a minor
> use case. Is line wrapping really going to be that unpredictable?
>
>
> On Mon, 24 Jan 2011, Philip Jägenstedt wrote:
>>
>> My main point here is that the use cases are so marginal. If there were
>> more compelling ones, it's not hard to support intra-cue language
>> settings using syntax like <lang en>bla</lang> or similar.
>
> Indeed.
>
>
> On Mon, 24 Jan 2011, Glenn Maynard wrote:
>>
>> Here's one that I think was done very well, rendered statically to make
>> sure we're all seeing the same thing:
>>
>> http://zewt.org/~glenn/multiple%20conversation%20example.mpg
>>
>> The results are pretty straightforward.  One always stays on top, one
>> always stays on the bottom, and most of the time the spacing between the
>> two is correct--the normal distance the UA uses between two vertical
>> captions (which would be lost by specifying the line height explicitly).
>> Combined with the separate coloring (which is already possible, of
>> course), it's possible to read both conversations and intuitively track
>> which is which, and it's also very easy to just pick one or the other to
>> read.
>
> As far as I can tell, the WebVTT algorithm would handle this case pretty
> well.
>
>
>> One example of how this can be tricky: at 0:17, a caption on the bottom
>> wraps and takes two lines, which then pushes the line at 0:19 upward
>> (that part's simple enough).  If instead the top part had appeared
>> first, the renderer would need to figure out in advance to push it
>> upwards, to make space for the two-line caption underneath it.
>> Otherwise, the captions would be forced to switch places.
>
> Right, without lookahead I don't know how you'd solve it. With lookahead
> things get pretty dicey pretty quickly.
>
>
> On Mon, 24 Jan 2011, Tab Atkins Jr. wrote:
>>
>> Right now, the WebVTT spec handles this by writing the text in white on
>> top of a partially-transparent black background.  The text thus never
>> has contrast troubles, at the cost of a dark block covering up part of
>> the display.
>>
>> Stroking text is easy, though.  Webkit has an experimental property for
>> doing it directly.  Using existing CSS, it's easy to adapt text-shadow
>> to produce a good outline - just make four shadows, offset by 1px in
>> each direction, and you're good.
>
> WebVTT allows both text-shadow and text-outline.
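
The four-shadow outline trick described above can be sketched as a small
helper that builds the text-shadow value. The function name outlineShadow
and its parameters are made up for illustration; how the resulting value
gets applied to cue text is left open here.

```typescript
// Hypothetical helper: builds a CSS text-shadow value that approximates a
// text outline using four hard-edged shadows, one offset in each direction.
function outlineShadow(color: string, offsetPx: number = 1): string {
  const offsets: Array<[number, number]> = [
    [-offsetPx, 0], // left
    [offsetPx, 0],  // right
    [0, -offsetPx], // up
    [0, offsetPx],  // down
  ];
  // Each shadow has zero blur so the edge stays crisp.
  return offsets.map(([x, y]) => `${x}px ${y}px 0 ${color}`).join(", ");
}
```

With the default 1px offset, outlineShadow("black") yields four shadows
that together trace a thin dark edge around light text.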
>
>
> On Wed, 9 Feb 2011, Silvia Pfeiffer wrote:
>>
>> We're trying to avoid the need for multiple transcodings and are trying
>> to achieve something like the following pipeline: broadcast captions ->
>> transcode to WebVTT -> show in browser -> transcode to broadcast devices
>> -> show
>
> Why not just do:
>
>   broadcast captions -> transcode to WebVTT -> show in browser
>
> ...for browsers and:
>
>   broadcast captions -> show
>
> ...for legacy broadcast devices?
>
>
> In any case the amount of legacy broadcast captions pales in comparison to
> the volume of new captions we will see for the Web. I'm not really
> convinced that legacy broadcast captions are an important concern here.
>
>
>> What is the argument against using <u> in captions?
>
> What is the argument _for_ using <u> in captions? We don't add features
> due to a lack of reasons not to. We add features due to a plethora of
> reasons to do so.
>
>
>> > [ foolip suggests using multiple cues to do blinking ]
>>
>> But from a captioning/subtitling point of view it's probably hard to
>> convert that back to blinking text, since we've just lost the semantic
>> by ripping it into multiple cues (and every program would use different
>> ways of doing this).
>
> I do not think round-tripping legacy broadcast captions through WebVTT is
> an important use case. If that is something that we should support, then
> we should first establish why it is an important use case, and then
> reconsider WebVTT within that context, rather than adding features to
> handle it piecemeal.
>
>
>> I guess what we are discovering is that we can define the general format
>> of WebVTT for the Web, but that there may be an additional need to
>> provide minimum implementation needs (a "profile" if you want - as much
>> as I hate this word).
>
> Personally I have nothing against the word "profile", but I do have
> something against providing for "minimum implementation needs".
>
> Interoperability means everything works the same everywhere.
>
>
>> [re versioning the file format]
>> In a contract between a caption provider and a caption consumer (I am
>> talking about companies here), the caption consumer will want to tell
>> the caption provider what kind of features they expect the caption files
>> to contain and features they want avoided. This links back to the
>> earlier identified need for "profiles". This is actually probably
>> something outside the scope of this group, but I am sure there is a need
>> for such a feature, in particular if we want to keep the development of
>> the WebVTT specification open for future extensions.
>
> I don't see why there would be a need for anything beyond "make sure it
> works with deployed software", maybe with that being explicitly translated
> to specific features and workarounds for known bugs, e.g. "you can use
> ruby, but make sure you don't have timestamps out of order".
>
> This, however, has no correlation to versions of the format.
>
>
> On Mon, 14 Feb 2011, Philip Jägenstedt wrote:
>> >
>> > [line wrapping]
>>
>> There's still plenty of room for improvements in line wrapping, though.
>> It seems to me that the main reason that people line wrap captions
>> manually is to avoid getting two lines of very different length, as that
>> looks quite unbalanced. There's no way to make that happen with CSS, and
>> AFAIK it's not done by the WebVTT rendering spec either.
>
> WebVTT just defers to CSS for this. I agree that it would be nice for CSS
> to allow UAs to do more clever things here and (more importantly) for UAs
> to actually do more clever things here.
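
As a rough illustration of the "balanced lines" idea raised above, here is
a sketch that splits a caption into two lines of similar length. It is not
what CSS or the WebVTT rendering rules do; balanceTwoLines is a
hypothetical name, and it compares character counts rather than rendered
widths, which is a simplifying assumption.

```typescript
// Hypothetical helper: choose the word boundary that makes the two lines
// as close in character count as possible.
function balanceTwoLines(text: string): [string, string] {
  const words = text.trim().split(/\s+/);
  // Start from "everything on one line" and look for a better split.
  let best: [string, string] = [words.join(" "), ""];
  let bestDiff = best[0].length;
  for (let i = 1; i < words.length; i++) {
    const first = words.slice(0, i).join(" ");
    const second = words.slice(i).join(" ");
    const diff = Math.abs(first.length - second.length);
    if (diff < bestDiff) {
      best = [first, second];
      bestDiff = diff;
    }
  }
  return best;
}
```

A real renderer would have to measure glyph widths in the actual font, so
this only shows the shape of the search, not a usable layout algorithm.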
>
>
> On Tue, 15 Feb 2011, Silvia Pfeiffer wrote:
>> foolip wrote:
>> >
>> > Sure, it's already handled by the current parsing spec, since it
>> > ignores everything up to the first blank line.
>>
>> That's not quite how I'm reading the spec.
>>
>> http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#webvtt-0
>> allows
>> "Optionally, either a U+0020 SPACE character or a U+0009 CHARACTER
>> TABULATION (tab) character followed by any number of characters that
>> are not U+000A LINE FEED (LF) or U+000D CARRIAGE RETURN (CR)
>> characters."
>> after the "WEBVTT FILE" magic.
>> To me that reads like all of the extra stuff has to be on the same line.
>> I'd prefer if this read "any character except for two WebVTT line
>> terminators", then it would all be ready for such header-style
>> metadata.
>
> That's the syntax rules. It's not the parser.
>
>
>> I'm told <u> is fairly common in traditional captions.
>
> I've never seen it. Do you have any data on this?
>
>
>> > Personally, I think we're going to see more and more devices running
>> > full browsers with webfonts support, and that this isn't going to be a
>> > big problem.
>>
>> I tend to agree and in fact I see that as the shiny future. Just not
>> quite yet.
>
> We're not quite at WebVTT yet either. Currently, there's more support for
> WebFonts than WebVTT.
>
>
> On Tue, 15 Feb 2011, Glenn Maynard wrote:
>>
>> I think that, no matter what you do, people will insert line breaks in
>> cues.  I'd follow the HTML model here: convert newlines to spaces and
>> have a separate, explicit line break like <br> if needed, so people
>> don't manually line-break unless they actually mean to.
>
> The line-breaks-are-line-breaks feature is one of the features that
> originally made SRT seem like a good idea. It still seems like the neatest
> way of having a line break.
>
>
>> Related to line breaking, should there be an &nbsp; escape?  Inserting
>> nbsp literally into files is somewhat annoying for authoring, since
>> they're indistinguishable from regular spaces.
>
> How common would &nbsp; be?
>
>
> On Thu, 10 Feb 2011, Silvia Pfeiffer wrote:
>>
>> Further discussions at Google indicate that it would be nice to make
>> more components optional. Can we have something like this:
>>
>>       [[h*:]mm:]ss[.d[c[m]]]  | s*[.d[c[m]]]
>>
>> Examples:
>>     23  = 23 seconds
>>     23.2  = 23 sec, 2 decisec
>>     1:23.45   = 1 min, 23 sec, 45 centisec
>>     123.456  = 123 sec, 456 millisec
>
> Currently the syntax is [h*:]mm:ss.sss; what's the advantage of making
> this more complicated? It's not like most subtitled clips will be shorter
> than a minute. Also, why would we want to support multiple redundant ways
> of expressing the same time? (e.g. 01:00.000 and 60.000)
>
> Readability of VTT files seems like it would be helped by consistency,
> which suggests using the same format everywhere, as much as possible.
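
To make the single-format argument concrete, here is a minimal sketch of a
parser for the [h*:]mm:ss.sss shape mentioned above. The name
parseTimestamp, the millisecond return value, and the null-on-failure
convention are assumptions for illustration, not the spec's parsing
algorithm.

```typescript
// Hypothetical parser for the "[h*:]mm:ss.sss" timestamp shape.
// Returns the time in milliseconds, or null if the text does not match.
function parseTimestamp(text: string): number | null {
  const match = /^(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})$/.exec(text);
  if (match === null) {
    return null;
  }
  const hours = parseInt(match[1] ?? "0", 10); // hours are optional
  const minutes = parseInt(match[2] ?? "0", 10);
  const seconds = parseInt(match[3] ?? "0", 10);
  const millis = parseInt(match[4] ?? "0", 10);
  // Minutes and seconds must stay within 0..59.
  if (minutes > 59 || seconds > 59) {
    return null;
  }
  return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis;
}
```

Under this sketch "01:00.000" is accepted while "60.000" and "23" are
rejected, so each time has only one valid spelling once the hours field is
chosen.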
>
>
> On Sun, 16 Jan 2011, Mark Watson wrote:
>>
>> I have been looking at how the video element might work in an adaptive
>> streaming context where the available media are specified with some kind
>> of manifest file (e.g. MPEG DASH Media Presentation Description) rather
>> than in HTML.
>>
>> In this context there may be choices available as to what to present,
>> many but not all related to accessibility:
>>
>> - multiple audio languages
>> - text tracks in multiple languages
>> - audio description of video
>> - video with open captions (in various languages)
>> - video with sign language
>> - audio with director's commentary
>> - etc.
>>
>> It seems natural that for text tracks, loading the manifest could cause
>> the video element to be populated with associated <track> elements,
>> allowing the application to discover the choices and activate/deactivate
>> the tracks.
>
> Not literal <track> elements, hopefully, but in-band text tracks (known as
> "media-resource-specific text track" in the spec).
>
>
>> But this seems just for text tracks. I know discussions are underway on
>> what to do for other media types, but my question is whether it would be
>> better to have a consistent solution for selection amongst the available
>> media that applies for all media types ?
>
> They're pretty different from each other, so I don't know that one
> solution would make sense for all of these.
>
> Does the current solution (the videoTracks, audioTracks, and textTracks
> attributes) adequately address your concern?
>
>
> On Mon, 17 Jan 2011, Jeroen Wijering wrote:
>>
>> We are getting some questions from JW Player users that HTML5 video is
>> quite wasteful on bandwidth for longer videos (think 10min+). This is
>> because browsers download the entire movie once playback starts,
>> regardless of whether a user pauses the player. If throttling is used,
>> it seems very conservative, which means a lot of unwatched video is in
>> the buffer when a user unloads a video.
>>
>> I did a simple test with a 10 minute video: playing it; pausing after 30
>> seconds and checking download progress after another 30 seconds. With
>> all browsers (Firefox 4, Safari 5, Chrome 8, Opera 11, iOS 4.2), the
>> video would indeed be fully downloaded after 60 seconds. Some throttling
>> seems to be applied by Safari / iOS, but this could also be bandwidth
>> fluctuations on my side. Either way, all browsers downloaded the 10min
>> video while only 30 seconds were being watched.
>>
>> The HTML5 spec is a bit generic on this topic, allowing mechanisms such
>> as stalling and throttling but not requiring them, or prescribing a
>> scripting interface:
>>
>> http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#concept-media-load-resource
>
> Right, this is an area that is left up to implementations; a quality of
> implementation issue.
>
>
>> A suggestion would be to implement / expose a property called
>> "downloadBufferTarget". It would be the amount of video in seconds the
>> browser tries to keep in the download buffer.
>
> Wouldn't this be very situation-specific? e.g. if I know I'm about to go
> into a tunnel for five minutes, I want five minutes of buffered data. If
> my connection has a high packet loss rate and could stall for upwards of
> 10 seconds, I want way more than 10 seconds in my buffer. If my connection
> is such that I can't download data in realtime, I want the whole video in
> my buffer. If my connection is such that I have 8ms latency to the video
> server and enough bandwidth to transfer the whole four hour file in 3
> seconds, then really I don't need anything in my buffer.
>
>
> On Mon, 17 Jan 2011, Roger Hågensen wrote:
>> On 2011-01-17 18:36, Markus Ernst wrote:
>> >
>> > Could this be done at the user side, e.g. with some browser setting?
>> > Or even by a "stop downloading" control in the player? An intuitive
>> > user control would be separate stop and pause buttons, as we know them
>> > from tape and CD players. Pause would then behave as it does now,
>> > while stop would cancel downloading.
>>
>> I think that's the right way to do it, this should be in the hands of
>> the user and exposed as a preference in the browsers.
>
> Agreed.
>
>
>> Although exposing (read only?) the user's preferred buffer setting to
>> the HTML App/Plugin etc. would be a benefit I guess as the desired
>> buffering could be communicated back to the streaming server for example
>> for a better bandwidth utilization.
>
> How would the information be used?
>
>
> On Mon, 17 Jan 2011, Zachary Ozer wrote:
>>
>> What no one has mentioned so far is that the real issue isn't the
>> network utilization or the memory capacity of the devices, it's
>> bandwidth cost.
>>
>> The big issue for publishers is that they're incurring higher costs when
>> using the <video> tag, which is a disincentive for adoption.
>>
>> Since there are situations where both the publisher and the user are
>> potentially incurring bandwidth costs (or have other limitations), we
>> could allow the publisher to specify downloadBufferTarget and the user
>> to specify a setting in the browser's config. The browser would then
>> actually buffer min(user setting, downloadBufferTarget). At that point
>> there would probably need to be another read-only property that
>> specified what value the browser is currently using as its buffer
>> length, but maybe the getter for downloadBufferTarget is sufficient.
>
> I think before we get something that elaborate set up, we should just try
> getting preload="" implemented. :-) That might be sufficient.
>
>
> On Tue, 18 Jan 2011, Robert O'Callahan wrote:
>>
>> One solution that could work here is to honour dynamic changes to
>> 'preload', so switching preload to 'none' would stop buffering. Then a
>> script could do that, for example, after the user has paused the video
>> for ten seconds. The script could also look at 'buffered' to make its
>> decision.
>
> If browsers want to do that I'm quite happy to add something explicitly to
> that effect to the spec. Right now the spec doesn't disallow it.
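
The script-side half of this suggestion might look like the sketch below.
Whether a browser actually stops buffering when preload changes is up to
the implementation, as noted above. MediaLike and applyPausePolicy are
hypothetical names, and the ten-second threshold is just the figure from
the example.

```typescript
// Minimal structural view of the media element properties this sketch uses.
interface MediaLike {
  paused: boolean;
  preload: string;
}

// Hypothetical policy: once the media has been paused for at least
// thresholdSeconds, hint that buffering should stop by setting preload
// to "none". The caller is assumed to track how long the pause has lasted.
function applyPausePolicy(
  media: MediaLike,
  pausedForSeconds: number,
  thresholdSeconds: number = 10
): void {
  if (media.paused && pausedForSeconds >= thresholdSeconds) {
    media.preload = "none";
  }
}
```

A page would call this from a timer started in its pause handler; the
sketch leaves the timer out so the decision itself stays easy to test.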
>
>
> On Wed, 19 Jan 2011, Philip Jägenstedt wrote:
>>
>> The only difference between preload=none and preload=metadata is how
>> much is fetched if the user doesn't interact at all with the video. Once
>> the user has begun playing, I think the two mean the same thing: "please
>> don't waste my bandwidth more than necessary". In other words, I think
>> that for preload=metadata, browsers should be somewhat conservative even
>> after playback has begun, not going all the way to the preload=auto
>> behavior.
>
> The descriptions are somewhat loose, but something like this could work,
> yes. (Though I'd say after playing preload=metadata and preload=auto are
> the same and preload=none is the one that says to avoid bandwidth usage,
> but that's just an artifact of the way I wrote the descriptions.)
>
>
> On Tue, 18 Jan 2011, Zachary Ozer wrote:
>>
>> Currently, there's no way to stop / limit the browser from buffering -
>> once you hit play, you start downloading and don't stop until the
>> resource is completely loaded. This is largely the same as Flash, save
>> the fact that some browsers don't respect the preload attribute. (Side
>> note: I also haven't found a browser that stops loading the resource
>> even if you destroy the video tag.)
>>
>> There have been a few suggestions for how to deal with this, but most
>> have revolved around using downloadBufferTarget - a settable property
>> that determines how much video to buffer ahead in seconds. Originally,
>> it was suggested that the content producers should have control over
>> this, but most seem to favor the client retaining some control since
>> they are the most likely to be in low bandwidth situations. (Publishers
>> who want strict bandwidth control could use a more advanced server and
>> communication layer ala YouTube).
>>
>> The simplest enhancement would be to honor the downloadBufferTarget only
>> when readyState=HAVE_ENOUGH_DATA and playback is paused, as this would
>> imply that there is not a low bandwidth situation.
>
> It seems the simplest enhancement would be to have the browsers do the
> right thing (e.g. download enough to get to HAVE_ENOUGH_DATA and stop if
> the video is paused, or some such), not to add a feature that all Web
> authors would have to handle.
>
>
> On Tue, 18 Jan 2011, Boris Zbarsky wrote:
>>
>> In general, depending on finalizers to release resources (which is
>> what's happening here) is not really a workable setup.  Maybe we need an
>> api to explicitly release the data on an audio/video tag?
>
> The spec suggests removing the element's src="" attribute and <source>
> elements and then calling the element's load() method.
>
> The spec also suggests that implementors release all resources used by a
> media element when that media element is an orphan when the event loop
> spins.
>
> See the "Best practices for authors using media elements" and "Best
> practices for implementors of media elements" sections.
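
The author-side suggestion above (remove src, remove the source children,
call load()) can be sketched as follows. ReleasableMedia is a hypothetical
structural type that lists only the methods the sketch calls; a real media
element is assumed to satisfy it.

```typescript
// Only the methods this sketch needs from a media element.
interface ReleasableMedia {
  removeAttribute(name: string): void;
  querySelectorAll(selector: string): ArrayLike<{ remove(): void }>;
  load(): void;
}

// Hypothetical helper following the steps described above: drop the src
// attribute, drop any <source> children, then call load() so the element
// resets and can let go of what it was fetching.
function releaseMedia(media: ReleasableMedia): void {
  media.removeAttribute("src");
  for (const source of Array.from(media.querySelectorAll("source"))) {
    source.remove();
  }
  media.load();
}
```

How promptly the underlying data is actually freed still depends on the
browser; this only covers the steps a script can take.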
>
>
> On Wed, 19 Jan 2011, Andy Berkheimer wrote:
>>
>> In the case where the viewer does not have enough bandwidth to stream
>> the video in realtime, there are two basic options for the experience:
>> - buffer the majority of the video (per Glenn and Boris' discussion)
>> - switch to a lower bitrate that can be streamed in realtime
>>
>> This thread has focused primarily on the first option and this is an
>> experience that we see quite a bit.  This is the option favored amongst
>> enthusiasts and power users, and also makes sense when a viewer has made
>> a purchase with an expectation of quality.  And there's always the
>> possibility that the user does not have enough bandwidth for even the
>> lowest available bitrate.
>>
>> But the second option is the experience that the majority of our viewers
>> expect.
>>
>> The ideal interface would have a reasonable default behavior but give an
>> application the ability to implement either experience depending on user
>> preference (or lack thereof), viewing context, etc.
>
> Agreed. This is the kind of thing that a good streaming protocol can
> negotiate in realtime.
>
>
>> I believe Chrome's current implementation _does_ stall the HTTP
>> connection (stop reading from the socket interface but keep it open)
>> after some amount of readahead - a magic hardcoded constant. We've run
>> into issues there - their browser readahead buffer is too small and
>> causing a lot of underruns.
>
> It's early days. File bugs!
>
>
>> No matter how much data you pass between client and server, there's
>> always some useful playback state that the client knows and the server
>> does not - or the server's view of the state is stale.  This is
>> particularly true if there's an HTTP proxy between the user agent and
>> the server.  Any behavior that could be implemented through an advanced
>> server/communication layer can be achieved in a simpler, more robust
>> fashion with a solid buffer management implementation that provides
>> "advanced" control through javascript and attributes.
>
> The main difference is that a protocol will typically be implemented a few
> times by experienced programmers writing servers and clients, which will
> then be deployed and used by less experienced (in this kind of thing) Web
> developers, while if we just expose it to JavaScript, the people
> implementing it will be a combination of experienced library authors and
> those same Web developers, and the result will likely be less successful.
>
> However, the two aren't mutually exclusive. We could do one and then later
> (or at the same time) do the other.
>
>
> On Tue, 18 Jan 2011, Roger Hågensen wrote:
>>
>> It may sound odd but in low storage space situations, it may be
>> necessary to unbuffer what has been played. Is this supported at all
>> currently?
>
> Yes.
>
>
>> I think that the buffering should basically be a "moving window" (I hope
>> most here are familiar with this term?), and that the size of the moving
>> window should be determined by storage space and bandwidth and browser
>> preference and server preference, plus make sure the window supports
>> skipping anywhere without needing to buffer up to it, and avoid
>> buffering from the start just because the user skipped back a little to
>> catch something they missed (another annoyance). This is the only
>> logical way to do this really. Especially since HTTP 1.1 has byterange
>> support there is nothing preventing it from being implemented, and I
>> assume other popular streaming protocols support byterange as well?
>
> Implementations are allowed to do that.
>
>
> On Tue, 18 Jan 2011, Silvia Pfeiffer wrote:
>>
>> I think that's indeed one obvious improvement, i.e. when going to pause
>> state, stop buffering when readyState=HAVE_ENOUGH_DATA (i.e. we have
>> reached canplaythrough state).
>
> The spec allows this already.
>
>
>> However, again, I don't think that's sufficient. Because we will also
>> buffer during playback and it is possible that we buffer fast enough to
>> have buffered e.g. the whole of a 10min video by the time we hit pause
>> after 1 min and stop watching. That's far beyond canplaythrough and
>> that's 9min worth of video download wasted bandwidth. This is where the
>> suggested downloadBufferTarget would make sense. It would basically
>> specify how much more to download beyond HAVE_ENOUGH_DATA before pausing
>> the download.
>
> I don't understand how a site can know what the right value is for this.
> Users aren't going to understand that they have to control the buffering
> if (e.g.) they're about to go into a tunnel and they want to make sure
> it's buffered all the way through. It should just work, IMHO.
>
>
> On Tue, 18 Jan 2011, David Singer wrote:
>>
>> If you want a more tightly coupled supply/consume protocol, then use
>> one.  As long as it's implemented by client and server, you're on.
>>
>> Note that the current move of the web towards download in general and
>> HTTP in particular is due in no small part to the fact that getting more
>> tightly coupled protocols -- actually, any protocol other than HTTP --
>> out of content servers, across firewalls, through NATs, and into clients
>> is...still a nightmare.  So, we've been given a strong incentive by all
>> those to use HTTP.  It's sad that some of them are not happy with that
>> result, but it's going to be hard to change now.
>
> Agreed, though in practice there are certainly ways to get two-way
> protocols through. WebSocket does a pretty good job, for example. But
> designing a protocol for this is out of scope for this list, really.
>
>
> On Tue, 18 Jan 2011, David Singer wrote:
>>
>> In RTSP-controlled RTP, there is a tight relationship between the play
>> point, and play state, the protocol state (delivering data or paused)
>> and the data delivered (it is delivered in precisely real-time, and
>> played and discarded shortly after playing).  The server delivers very
>> little more data than is actually watched.
>>
>> In HTTP, however, the entire resource is offered to the client, and
>> there is no protocol to convey play/paused back to the server, and the
>> typical behavior when offered a resource in HTTP is to make a simple
>> binary decision to either load it (all) or not load it (at all).  So, by
>> providing a media resource over HTTP, the server should kinda be
>> expecting this 'download' behavior.
>>
>> Not only that, but if my client downloads as much as possible as soon as
>> possible and caches as much as possible, and yours downloads as little
>> as possible as late as possible, you may get brownie points from the
>> server owner, but I get brownie points from my local user -- the person
>> I want to please if I am a browser vendor.  There is every incentive to
>> be resilient and 'burn' bandwidth to achieve a better user experience.
>>
>> Servers are at liberty to apply a 'throttle' to the supply, of course
>> ("download as fast as you like at first, but after a while I'll only
>> supply at roughly the media rate").  They can suggest that the client be
>> a little less aggressive in buffering, but it's easily ignored and the
>> incentive is to ignore it.
>>
>> So I tend to return to "if you want more tightly-coupled behavior, use a
>> more tightly-coupled protocol"...
>
> Indeed.
>
>
> On Wed, 19 Jan 2011, Philip Jägenstedt wrote:
>>
>> The 3 preload states imply 3 simple buffering strategies:
>>
>> none: don't touch the network at all
>> metadata: buffer as little as possible while still reaching readyState
>> HAVE_METADATA
>> auto: buffer as fast and much as possible
>
> "auto" isn't "as fast and much as possible", it's "as fast and much as
> will make the user happy". In some configurations, it might be the same as
> "none" (e.g. if the user is paying by the byte and hates video).
>
>
>> However, the state we're discussing is when the user has begun playing the
>> video. The spec doesn't talk about it, but I call it:
>>
>> invoked: buffer as little as possible without readyState dropping below
>> HAVE_FUTURE_DATA (in other words: being able to play from currentTime to
>> duration at playbackRate without waiting for the network)
>
> There's also a fifth state, let's call it "aggressive", where even while
> playing the video the UA is trying to download the whole thing in case the
> connection drops.
>
>
>> If the available bandwidth exceeds the bandwidth of the resource, some
>> kind of throttling must eventually be used. There are mainly 2 options
>> for doing this:
>>
>> 1. Throttle at the TCP level by not reading data from the socket (not at all
>> to suspend, or at a controlled rate to buffer ahead)
>> 2. Use HTTP byte ranges, making many smaller requests with any kind of
>> throttling at the TCP level
>
> There's also option 3, to handle the fifth state above: don't throttle.
>
>
>> When HTTP byte ranges are used to achieve bandwidth management, it's
>> hard to talk about a single downloadBufferTarget that is the number of
>> seconds buffered ahead. Rather, there might be an upper and lower limit
>> within which the browser tries to stay, so that each request can be of a
>> reasonable size. Neither an author-provided minimum nor maximum value can
>> be followed particularly closely, but could possibly be taken as a hint
>> of some sort.
>
> Would it be a more useful hint than "preload"? I'm skeptical about adding
> many hints with no requirements. If there's some specific further
> information we can add, though, it might make sense to add more features
> to "preload".
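
The "upper and lower limit" behaviour described above is essentially
hysteresis, and could be sketched like this. The names FetchState and
nextFetchState, and the idea of expressing both limits in seconds of
buffered-ahead media, are assumptions for illustration rather than
anything the spec defines.

```typescript
type FetchState = "fetching" | "suspended";

// Hypothetical decision step for a buffer kept between two limits:
// suspend once enough media is buffered ahead, resume once it runs low,
// and otherwise keep doing whatever we were doing.
function nextFetchState(
  current: FetchState,
  bufferedAheadSeconds: number,
  lowSeconds: number,
  highSeconds: number
): FetchState {
  if (bufferedAheadSeconds >= highSeconds) {
    return "suspended";
  }
  if (bufferedAheadSeconds <= lowSeconds) {
    return "fetching";
  }
  return current; // inside the band: no change, so requests stay chunky
}
```

The gap between the two limits is what lets each byte-range request be of
a reasonable size instead of trickling in tiny fetches.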
>
>
>> The above buffering strategies are still not enough, because users seem
>> to expect that in a low-bandwidth situation, the video will keep
>> buffering until they can watch it through to the end. These seem to be
>> the options for solving the problem:
>>
>> * Make sites that want this behavior set .preload='auto' in the 'paused'
>> event handler
>>
>> * Add an option in the context menu to "Preload Video" or some such
>>
>> * Cause an invoked (see dfn above) but paused video to behave like
>> preload=auto
>>
>> * As above, but only when the available bandwidth is limited
>>
>> I don't think any of these solutions are particularly good, so any input
>> on other options is very welcome!
>
> If users expect something, it seems logical that it should just happen. I
> don't have a problem with saying that it should depend on preload="",
> though. If you like I can make the spec explicitly describe what the
> preload="" hints mean while video is playing, too.
>
>
> On Wed, 19 Jan 2011, Zachary Ozer wrote:
>>
>> What if, instead of trying to solve this problem, we leave it up to the
>> publishers. The current behavior would be unchanged, but we could add
>> explicit bandwidth management API calls, ie startBuffer() and
>> stopBuffer(). This would let developers / site publishers control how
>> much to buffer and when.
>
> We couldn't depend on it (most people presumably won't want to do anything
> but give the src="" of their video).
>
>
>> We might also consider leaning on users a bit to tell us what they want.
>> For example, I think people are pretty used to hitting play and then
>> pause to buffer until the end of the video. What if we just used our
>> bandwidth heuristics while in the play state, and buffered blindly when
>> a pause occurs less than X seconds into a video? I won't argue that this
>> is a wonderful solution (or a habit we should encourage), but I figured
>> I'd throw a random idea out there…
>
> That seems like pretty ugly UI. :-)
>
>
> On Thu, 20 Jan 2011, Glenn Maynard wrote:
>>
>> I think that pausing shouldn't affect read-ahead buffering behavior.
>> I'd suggest another preload value, preload=buffer, sitting between
>> "metadata" and "auto".  In addition to everything loaded by "metadata",
>> it also fills the read-ahead buffer (whether the video is playing or
>> not).
>>
>> - If a page wants prebuffering only (not full preloading), it sets
>> preload=buffer.  This can be done even when the video is paused, so when
>> the user presses play, the video starts instantly without pausing for a
>> server round-trip like preload=metadata.
>
> So this would be to buffer enough to play through assuming the network
> remains at the current bandwidth, but no more?
>
>
>> - If a page wants prebuffering while playing, but unlimited buffering when
>> paused (per Zachary's suggestion), it sets preload=buffer when playing and
>> preload=auto when paused.
>
> Again, note that "auto" doesn't mean "buffer everything", it means "do
> whatever is best for the user".
>
> I don't mind adding new values if the browser vendors are going to use
> them.
>
>
> On Sat, 22 Jan 2011, David Singer wrote:
>>
>> When the HTML5 states were first proposed, I went through a careful
>> exercise to make sure that they were reasonably delivery-technology
>> neutral, i.e. that they applied equally well if say RTSP/RTP was used,
>> some kind of dynamic streaming, simple HTTP, and so on.
>>
>> I am concerned that we all tend to assume that HTML==HTTP, but the
>> source URL for the media might have any protocol type, and the HTML
>> attributes, states etc. should apply (or clearly not apply) to anything.
>>
>> Assuming only HTTP, in the markup, is not a good direction.
>
> Agreed.
>
>
> On Thu, 20 Jan 2011, Matthew Gregan wrote:
>>
>> The media seek algorithm (4.8.10.9) states that the current playback
>> position should be set to the new playback position during the
>> asynchronous part of the algorithm, just before the seeking event is
>> fired. [...]
>
> On Thu, 20 Jan 2011, Philip Jägenstedt wrote:
>>
>> There have been two non-trivial changes to the seeking algorithm in the
>> last year:
>>
>> Discussed at http://lists.w3.org/Archives/Public/public-html/2010Feb/0003.html
>> led to http://html5.org/r/4868
>>
>> Discussed at http://lists.w3.org/Archives/Public/public-html/2010Jul/0217.html
>> led to http://html5.org/r/5219
>
> Yeah. In particular, sometimes there's no way for the UA to know
> asynchronously if the seek can be done, which is why the attribute is set
> after the method returns. It's not ideal, but the alternative is not
> always implementable.
>
>
>> With that said, it seems like there's nothing that guarantees that the
>> asynchronous section doesn't start running while the script is still
>> running.
>
> Yeah. It's not ideal, but I don't really see what we can do about it.
>
>
>> It's also odd that currentTime is updated before the seek has actually
>> been completed, but the reason for this is that the UI should show the
>> new position.
>
> Not just the UI. The current position is what the browser is trying to
> play; if the current position didn't move, then the browser wouldn't be
> trying to play it.
>
>
> On Fri, 4 Feb 2011, Matthew Gregan wrote:
>>
>> For anyone following along, the behaviour has now been changed in the
>> Firefox 4 nightly builds.
>
> On Mon, 24 Jan 2011, Robert O'Callahan wrote:
>>
>> I agree. I think we should change behavior to match author expectations
>> and the other implementations, and let the spec change to match.
>
> How do you handle the cases where it's not possible?
>
>
> If all the browsers can do it, I'm all for going back to having
> currentTime change synchronously.
>
>
> On Sat, 29 Jan 2011, Lubomir Toshev wrote:
>>
>> [W]hen the video tag has embedded browser controls displayed and I click
>> anywhere on the controls, they cause a video tag click event. If I want
>> to toggle play/pause on video area click, then I cannot do this, because
>> clicking on the play control button, fires play, then click event fires
>> for video tag and when I toggle it pauses. So this behavior that every
>> popular flash player has cannot be achieved. There is no way to
>> understand that the click.target is the embedded browser controls area.
>> I think that a nice improvement will be to expose this information, in
>> the target, that it actually is embedded browser controls. Or clicking
>> the embedded browser controls should not produce a click event for video
>> tag. After all browser controls are native and do not have
>> representation in the DOM. Let me know what you think about this?
>
> On Sat, 29 Jan 2011, Aryeh Gregor wrote:
>>
>> Well, to begin with, you could just use your own controls rather than
>> the browser's built-in controls.  Then you have no problem.  If you're
>> using the browser's built-in controls, maybe you should stick with the
>> browser's control conventions throughout, which presumably doesn't
>> include toggling play/pause on click.
>>
>> I'm not sure this is a broad enough problem to warrant exposing the
>> extra information in the target.  Are there any other use-cases for such
>> info?
>
> On Sun, 30 Jan 2011, Lubomir Toshev wrote:
>>
>> To elaborate a bit, I'm a control developer and I have my own custom
>> controls. But we want to allow for the customer to use the default
>> browser controls if they want to. This can be done by switching an
>> option in my jQuery widget - browserControls - true/false. Or through
>> browser context menu shown by default on right click. So I'm trying to
>> be flexible enough for the customer.
>>
>> I was thinking about this
>> 1) that adding a transparent overlay over the browser controls
>> Or
>> 2) to detect the click position and if it is some pixels away from the
>> bottom of the video tag
>>
>> will fix this, but every browser has different height for its embedded
>> controls and I should hardcode this height in my code, which is just not
>> manageable.
>>
>> I can always add a limitation when using browser controls, toggle
>> play/pause on video area click will be turned off, but I want to achieve
>> similar behavior in all the browsers no matter whether they use embedded
>> controls or not.
>>
>> So I think this tiny click.target thing will be very useful.
>
> On Sun, 30 Jan 2011, Glenn Maynard wrote:
>>
>> Even as a bad hack it's simply not possible; for example, there's no way
>> to tell whether a pop-out volume control is open or not.
>>
>> I think the primary use case browser controls are meant for is when
>> scripting isn't available at all.  They aren't very useful when you're
>> using any kind of scripts with the video.  Another problem, related to
>> your other post about captioning, is that it's impossible to put
>> anything between the video and the controls, so your captions will draw
>> *on top of* browser controls.
>
> On Mon, 31 Jan 2011, Simon Pieters wrote:
>>
>> See http://lists.w3.org/Archives/Public/public-html/2009Jun/0395.html
>>
>> I suggested that the browser would not generate an event at all when
>> using the native controls. Seemingly there was no reply to Hixie's
>> request for opinion from other implementors.
>
> On Mon, 31 Jan 2011, Glenn Maynard wrote:
>>
>> There are other meaningful ways to respond to these events; for example,
>> to pull its container to the top of the draw order if it's a floating
>> window. I should be able to capture mousedown on the container to do
>> this, regardless of content.
>
> On Mon, 31 Jan 2011, Simon Pieters wrote:
>>
>> How about just suppressing activation events like click?
>
> On Mon, 31 Jan 2011, Glenn Maynard wrote:
>>
>> That makes more sense than suppressing the entire mousedown/mouseup
>> events (and keydown, touchstart, etc).
>>
>> Also, it means you can completely emulate the event behavior of the
>> default browser controls with scripts: preventDefault on mousedown to
>> prevent click events.  That's probably not what you actually want to do,
>> but it means the default controls aren't doing anything special: their
>> effect on events can be understood entirely in terms of what scripted
>> events can already do.
>
> On Mon, 31 Jan 2011, Lubomir Toshev wrote:
>>
>> I totally agree that events should not be raised, when they originate
>> from the native browser controls. This would make it much simpler. I
>> filed the same bug for Opera 11 last week.
>
> As with the post Simon cites above, I'm happy to do this kind of thing, if
> multiple vendors agree that it makes sense. If you would like this to be
> done, I recommend getting other browser vendors to tell me it sounds good!
>
>
> On Sat, 29 Jan 2011, Lubomir Toshev wrote:
>>
>> [V]ideo should expose API for currentFrame, so that when control
>> developers want to add support for subtitles on their own, to be able to
>> support formats that display the subtitles according to the current
>> video frame. This is a limitation to the current design of the video
>> tag.
>
> On Sun, 30 Jan 2011, Lubomir Toshev wrote:
>>
>> We were trying to add support for subtitles for our player control that
>> uses video tag as its base. There are two popular subtitle formats *.srt
>> which uses currentTime to show the subtitles where they should be. Like
>> 0:01:00 - 0:01:30 - "What a nice hotel." While the other popular format
>> is *.sub which uses the currentFrame to show the proper subtitles. Like
>> {45600}, {45689} - "What a nice hotel". And if I want to add this
>> support it would be good if video tag exposes currentFrame, so that I
>> can show properly the subtitles in a span positioned over the video. Now
>> does it make more sense?
>>
>> I know video will have embedded subtitle support, but I think that it
>> should be flexible enough to allow building such features like the one
>> above. What do you think? To me this is worth adding because, it should
>> be really easy to implement?
>
> We'll probably add that along with the metrics, when we add those, if
> there's a strong use case for it. I'm not sure that supporting frame-based
> subtitles is a good use case though.
>
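
For readers following the frame-based subtitle discussion above, here is a minimal sketch of how a script could map frame-indexed cues onto currentTime without a currentFrame API. It assumes MicroDVD-style lines of the form {start}{end}text and a frame rate supplied by the caller; the function names parseFrameCues and activeCues are hypothetical and not part of any specification. If the assumed frame rate is wrong, or the video has a variable frame rate, the cues will drift.

```javascript
// Hypothetical helper: convert frame-indexed cues to time-indexed cues.
// The frame rate is an assumption supplied by the caller; it is not read
// from the video, because no API discussed here exposes it.
function parseFrameCues(text, framesPerSecond) {
  if (!(framesPerSecond > 0)) {
    throw new RangeError('framesPerSecond must be a positive number');
  }
  const cues = [];
  const linePattern = /^\{(\d+)\}\{(\d+)\}(.*)$/;
  for (const line of text.split(/\r?\n/)) {
    const match = linePattern.exec(line.trim());
    if (!match) continue; // skip blank or unrecognised lines
    cues.push({
      start: Number(match[1]) / framesPerSecond, // seconds
      end: Number(match[2]) / framesPerSecond,   // seconds
      text: match[3],
    });
  }
  return cues;
}

// Return the cues that should be visible at a given playback time (seconds).
function activeCues(cues, currentTime) {
  return cues.filter((cue) => currentTime >= cue.start && currentTime < cue.end);
}
```

In a page, activeCues(cues, video.currentTime) could be called from a timeupdate handler to decide what text to show in an element positioned over the video.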
>
> On Mon, 14 Feb 2011, David Flanagan wrote:
>>
>> The draft specification defines 20+ media event handler IDL attributes
>> on HTMLElement.  These events are non-bubbling and are always targeted
>> at <audio> and <video> tags, so I wonder if they wouldn't be better
>> defined on HTMLMediaElement instead.
>
> All event handlers are on HTMLElement, to make implementations easier and
> to make the platform simpler.
>
>
> On Tue, 15 Feb 2011, David Flanagan wrote:
>>
>> Fair enough, though I do think it will confuse developers who will think
>> that those media events bubble.  (I'll be documenting them as properties
>> of HTMLMediaElement).
>
> Whether an event bubbles or not is up to the place that dispatches the
> event, not the place that hears the event.
>
>
>> What about Document and Window?  What's the justification for defining
>> the media event handler attributes on those objects?
>
> Same. It allows the same logic to be used everywhere.
>
>
> On Mon, 14 Feb 2011, Kevin Marks wrote:
>> On Mon, Feb 14, 2011 at 2:39 PM, Ian Hickson <ian@hixie.ch> wrote:
>> > On Fri, 19 Nov 2010, Per-Erik Brodin wrote:
>> > >
>> > > We are about to start implementing stream.record() and
>> > > StreamRecorder. The spec currently says that "the file must be in
>> > > a format supported by the user agent for use in audio and video
>> > > elements" which is a reasonable restriction. However, there is
>> > > currently no way to set the output format of the resulting File that
>> > > you get from recorder.stop(). It is unlikely that specifying a
>> > > default format would be sufficient if you in addition to container
>> > > formats and codecs consider resolution, color depth, frame rate etc.
>> > > for video and sample size and rate, number of channels etc. for
>> > > audio.
>> > >
>> > > Perhaps an argument should be added to record() that specifies the
>> > > output format from StreamRecorder as a MIME type with parameters?
>> > > Since record() should probably throw when an unsupported type is
>> > > supplied, it would perhaps be useful to have a canRecordType() or
>> > > similar to be able to test for supported formats.
>> >
>> > I haven't added anything here yet, mostly because I've no idea what to
>> > add. The ideal situation here is that we have one codec that everyone
>> > can read and write and so don't need anything, but that may be
>> > hopelessly optimistic.
>>
>> That isn't the ideal, as it locks us into the current state of the art
>> forever. The ideal is to enable multiple codecs +formats that can be
>> swapped out over time. That said, uncompressed audio is readily
>> codifiable, and we could pick a common file format, sample rate,
>> bitdepth and channel count specification.
>
> It doesn't lock us in to one format, we can always add more formats later.
> Right now, we have zero formats, so one format would be a huge step up.
>
>
> On Fri, 4 Mar 2011, Philip Jägenstedt wrote:
>> On Thu, 03 Mar 2011 22:15:58 +0100, Aaron Colwell <acolwell@google.com>
>> wrote:
>> >
>> > I was looking at the resource fetch algorithm
>> > <http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#concept-media-load-resource>
>> > section and fetching resources
>> > <http://www.whatwg.org/specs/web-apps/current-work/multipage/urls.html#fetch>
>> > sections of the HTML5 spec to determine what the proper behavior is
>> > for handling redirects. Both YouTube and Vimeo do 302 redirects to
>> > different hostnames from the URLs specified in the src attribute. It
>> > looks like the spec says that playback should fail in these cases
>> > because they are from different origins (Section 2.7 Fetching
>> > resources bullet 7). This leads me to a few questions.
>> >
>> > 1. Is my interpretation of the spec correct? Sample YouTube & Vimeo URLs are
>> >   shown below.
>> >   YouTube : src      : http://v22.lscache6.c.youtube.com/videoplayback? ...
>> >             redirect : http://tc.v22.cache6.c.youtube.com/videoplayback?
>> > ...
>> >
>> >   Vimeo   : src      : http://player.vimeo.com/play_redirect? ...
>> >             redirect : http://av.vimeo.com/05 ...
>>
>> Yes, from what I can tell you're correct, but I think it's not
>> intentional. The behavior was changed by <http://html5.org/r/5111> on
>> 2010-06-25, and this is the first time I've noticed it. Opera (and I
>> assume most if not all other browsers) already supports HTTP redirects
>> for <video> and I don't think it makes much sense to disallow it. For
>> security purposes, the origin of the resource is considered to be the
>> final destination, not any of the origins in the redirect chain.
>
> This was fixed recently.
>
>
> On Fri, 18 Mar 2011, Eric Winkelman wrote:
>>
>> For in-band metadata tracks, there is neither a standard way to
>> represent the type of metadata in the HTMLTrackElement interface nor is
>> there a standard way to represent multiple different types of metadata
>> tracks.
>
> There can be a standard way. The idea is that all the types of metadata
> tracks that browsers will support should be specified so that all browsers
> can map them the same way. I'm happy to work with anyone interested in
> writing such a mapping spec, just let me know.
>
>
>> Proposal:
>>
>> For TimedTextTracks with kind=metadata the @label attribute should
>> contain a MIME type for the metadata and that a track only contain Cues
>> created from metadata of that MIME type.
>>
>> This implies that streams with multiple types of metadata require the
>> creation of multiple metadata track objects, one for each MIME type.
>
> This might make sense if we had a defined way of getting such a MIME type
> (and assuming you're talking about the IDL attributes, not the content
> attributes).
>
>
> On Tue, 22 Mar 2011, Eric Winkelman wrote:
>>
>> Ah, yes, now I understand the confusion.  Within the whatwg specs, the
>> word "attribute" is generally used and I was trying to be consistent.
>
> The WHATWG specs refer to content attributes (those on elements) and IDL
> attributes (those on objects, which generate properties in JS). The @foo
> syntax is never used in the WHATWG specs. It's usually used in a W3C
> context just to refer to content attributes, by analogy to the XPath
> syntax. (Personally I prefer foo="" since it's less ambiguous.)
>
>
> On Mon, 21 Mar 2011, Eric Winkelman wrote:
>>
>> No, I'm not saying that, but as far as I can tell from the spec, it is
>> undefined how the user agent should map in-band data to metadata tracks.
>> I am proposing that the algorithm should be that different types of data
>> should go into different Timed Text Tracks, and that the track's @label
>> should reflect the type.
>
> To the extent that it is defined, it is defined here:
>
>   http://www.whatwg.org/specs/web-apps/current-work/complete.html#sourcing-in-band-text-tracks
>
> But the theory, as mentioned above, is that specific types of in-band
> metadata tracks would have explicit specs written to define how the
> mapping is done.
>
>
>> Recent updates to the spec, section 4.8.10.12.2
>> (http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#sourcing-in-band-text-tracks)
>> appear to address my concern in step 2:
>>
>> "2.  Set the new text track's kind, label, and language based on the
>> semantics of the relevant data, as defined by the relevant
>> specification."
>>
>> Provided that the relevant specification defines the metadata type
>> encoding to be put in the label, e.g. application/x-eiss,
>> application/x-scte35, application/x-contentadvisory, etc.
>
> Well the problem is that there typically is no applicable specification,
> or that it is too vague.
>
>
> On Tue, 22 Mar 2011, Lachlan Hunt wrote:
>>
>> This is regarding the recently added audioTracks and videoTracks APIs to
>> the HTMLMediaElement.
>>
>> The design of these APIs seems to be done a little strangely, in that
>> dealing with each track is done by passing an index to each method on
>> the TrackList interfaces, rather than treating the audioTracks and
>> videoTracks as collections of individual audio/video track objects. This
>> design is inconsistent with the design of the TextTrack interface, and
>> seems sub-optimal.
>
> It is intended to avoid an explosion of objects. TextTrack needs to be an
> object because it has separate state, gets targeted for events, has
> different versions (e.g. MutableTextTrack), etc. Audio and Video tracks
> are, on the other hand, rather trivial constructs.
>
>
>> The use of ExclusiveTrackList for videoTracks also seems rather
>> limiting. What about cases where the second video track is a
>> sign-language track, or some other video overlay.
>
> You use a separate <video> element.
>
> I considered this in some depth. The main problem is that you end up
> having to define a layout mechanism for videos if you allow multiple
> videos to be enabled from script (e.g. consider what the behaviour should
> be if you enable the main video, then the PiP sign language video, then
> disable the main video. What is the intrinsic dimension of the <video>
> element? Does it matter if you do it in a different order?).
>
> By making <video> be a single video's output layer, we can bypass many of
> these problems without removing expressibility (the author can still
> support multiple PiP videos).
>
>
>> There are also the use cases for controlling the volume of individual
>> tracks that are not addressed by the current spec design.
>
> Can you elaborate on these use cases?
>
> My assumption has been that in the long term, if you want to manipulate
> specific audio tracks, you would use an <audio> element and plug it into
> the Audio API for separate processing.
>
>
> On Sat, 2 Apr 2011, Bruce Lawson wrote:
>>
>> From a comment in a blog post of mine about longdesc
>> (http://www.brucelawson.co.uk/2011/longdesc-in-html5/comment-page-1/#comment-749853)
>> I'm wondering if this is an appropriate use of <details>
>>
>> <details>
>>   <summary>
>>   <img src=chart.png alt="Graph of percentage of total U.S.
>> non-institutionalized population age 16-64 declaring one or more
>> disabilities">
>>   </summary>
>> <p>The bar graph shows the percentage of total U.S. noninstitutionalized
>> population age 16-64 declaring one or more disabilities. The percentage
>> value for each category is as follows:</p>
>>   <ul>
>>     <li>Total declaring one or more disabilities: 18.6 percent</li>
>>     <li>Sensory (visual and hearing): 2.3 percent</li>
>>     <li>Physical: 6.2 percent</li>
>>     <li>Mental: 3.8 percent</li>
>>     <li>Self-care: 1.8 percent</li>
>>     <li>Difficulty going outside the home: 6.4 percent</li>
>>     <li>Employment disability: 11.9 percent</li>
>>   </ul>
>>   <p>data retrieved from <a
>> href="http://www.census.gov/prod/2003pubs/c2kbr-17.pdf" title="Link to
>> External Site" class="external">2000 U.S. Census<span> -
>>          external link</span></a></p>
>> </details>
>>
>> .. thereby acting as a discoverable-by-anyone longdesc. (The example is
>> adapted from the longdesc example at
>> http://webaim.org/techniques/images/longdesc#longdesc)
>>
>> Note to grumpy people: I'm not trying to advocate abolishing longdesc,
>> just seeing whether details can be used as an alternative.
>
> It's a bit weird, but sure.
>
> (Well, except for your alt="" text, which is a title="", not an alt="".)
>
>
> On Sat, 2 Apr 2011, John Foliot wrote:
>>
>> Interesting question. Referring to the spec, I think that you may have
>> in fact uncovered a bug in the text. The spec states:
>>
>>       "The user agent should allow the user to request that the details
>> be shown or hidden."
>>
>> The problem (or potential problem) here is that the behaviour is defined
>> in visual terms -
>
> The spec explicitly says that these terms have non-visual meaning.
>
>
> On Mon, 4 Apr 2011, Bjartur Thorlacius wrote:
>>
>> IMO, the specification of the <details> element is overly focused on
>> expected renderings. Rather than explicitly defining the semantics of
>> <details> with or without an @open attribute, and with or without a
>> <summary> child, sane renderings for medium to large displays with which
>> the user can interact are described, and usage is to be inferred
>> therefrom. This is suboptimal, as it allows hiding <details open>s on
>> small output windows but shoulds against it as strongly as ignoring
>> addition of the open attribute. Note that the <details> element
>> represents a disclosure widget, but the contents are nowhere defined
>> (neither as additional information (that a user-agent may or may not
>> render, depending on factors such as scarcity of screen estate), nor as
>> spoiling information that shouldn't be provided to the user without
>> explicit consent). I regard the two different use cases as different,
>> even though vendors might implement both with { binding: details; } on
>> some media. <Details> can't serve both. It's often spoken of as if
>> intended for something else than the YouTube video description use case.
>> <Details> mustn't be used for hiding spoilers, or else browsers won't be
>> able to intelligently choose to render the would-be concealed contents.
>
> I've clarified <details> to be better defined in this respect. I hope it
> addresses your concern.
>
>
> On Fri, 22 Apr 2011, Dimitri Glazkov wrote:
>>
>> I wonder if it makes sense to introduce a set of pseudo-classes on the
>> video/audio elements, each reflecting a state of the media on the
>> controls (playing/paused/error/etc.)? Then, we could use just CSS to
>> style media controls (whether native or custom), and not have to listen
>> to DOM events just to tweak their appearance.
>
> On Sat, 23 Apr 2011, Philip Jägenstedt wrote:
>>
>> With a sufficiently large set of pseudo-classes it might be possible to
>> *display* most of the interesting state, but how would you *change*
>> the state without using scripts? Play/pause, seek, volume, etc...
>
> On Sat, 23 Apr 2011, Dimitri Glazkov wrote:
>>
>> This is not the goal of using pseudo-classes: they just provide you with
>> a uniform way to react to changes.
>
> On Sat, 23 Apr 2011, Philip Jägenstedt wrote:
>>
>> In other words, one would still have to rely heavily on scripts to
>> actually implement custom controls?
>>
>> Also, how would one style a progress bar using pseudo-classes? How about
>> displaying elapsed/remaining time in the form MM:SS?
>
> On Sat, 23 Apr 2011, Dimitri Glazkov wrote:
>>
>> I am not in any way trying to invent a magical way to style media
>> controls entirely in CSS. Just trying to make the job of controls
>> developers easier and use CSS where it's well... useful? :)
>
> On Sat, 23 Apr 2011, Philip Jägenstedt wrote:
>>
>> Very well, what specific set of pseudo-classes do you think would be
>> useful?
>
> On Sat, 23 Apr 2011, Dimitri Glazkov wrote:
>>
>> I can infer what would be useful from WebKit's media controls as a first
>> stab?
>
> On Mon, 25 Apr 2011, Silvia Pfeiffer wrote:
>>
>> A markup and CSS example would make things clearer. How do you think it
>> would look?
>
> On Sun, 24 Apr 2011, Dimitri Glazkov wrote:
>>
>> Based on WebKit's current media controls, let's start with these pseudo-classes:
>>
>> Play state:
>> - loading
>> - playing
>> - streaming
>> - error
>>
>> Capabilities:
>> - no-audio
>> - no-video
>> - has-closed-captioning
>>
>> So, to show a status message while the control is loading or streaming
>> and hide when it's done:
>>
>> video -webkit-media-controls-status-display {
>>     display: none;
>> }
>>
>>
>> video:loading -webkit-media-controls-status-display, video:streaming
>> -webkit-media-controls-status-display {
>>     display: initial;
>>     ...
>> }
>>
>> Similarly, to hide volume controls when there's no audio:
>>
>> video:no-audio -webkit-media-controls-volume-slider-container {
>>     display: none;
>> }
>>
>> Once I put these pseudo-classes in place for WebKit, a lot of the code in
>> http://codesearch.google.com/codesearch/p#OAMlx_jo-ck/src/third_party/WebKit/Source/WebCore/html/shadow/MediaControlRootElement.cpp&exact_package=chromium
>> will go away, being replaced with straight CSS.
>
> Sounds to me like a poor man's XBL. I'd much rather see this addressed
> using a full-on binding solution, since it seems like it would be only a
> little more complex yet orders of magnitude more powerful.
>
>
> On Fri, 13 May 2011, Narendra Sisodiya wrote:
>>
>> What I want is a general-purpose synchronization mechanism where
>> resources (text, video, graphics, etc.) are played over a general-purpose
>> timer (timeline) with interaction.
>>
>> Ex -
>>
>>        <resource type="html" src="asd.html" x="50%"  y="50%"  width="10%"
>> height="10%" z="6" xpath="page1" tIn="5000ms" tOut="9400ms"
>> inEffect="fadein" outEffect="fadeout" inEffectDur="1000ms"
>> outEffectDur="3000ms"/>
>>
>>        <resource type="html" src="Indian.ogv" x="50%"  y="50%"  width="10%"
>> height="10%" z="6" xpath="page2" tIn="5000ms" tOut="9400ms"
>> inEffect="fadein" outEffect="fadeout" inEffectDur="1000ms"
>> outEffectDur="3000ms"/>
>
> Sounds like SMIL. I recommend looking into SMIL and SVG (which includes
> parts of SMIL).
>
>
> On Fri, 13 May 2011, Philip Jägenstedt wrote:
>>
>> Problem:
>>
>> <video src="video.webm"></video>
>> ...
>> <script>
>> document.querySelector('video').oncanplay = function() {
>>  /* will it run? */
>> };
>> </script>
>>
>> In the above the canplay event can be replaced with many others, like
>> loadedmetadata and loadeddata. Whether or not the event handler has been
>> registered by the time the event is fired depends on how fast decoding
>> is, how fast the network is and how much "..." there is.
>
> Yes, if you add an event listener in a task that runs after the task that
> fires the event could have run, you won't always catch the event.
>
> That's just a bug in the JS.
>
>
> On Fri, 13 May 2011, Henri Sivonen wrote:
>>
>> <iframe src=foo.html></iframe>
>> <script>
>> document.querySelector('iframe').onload = function() {
>>    /* will it run? */
>> };
>> </script>
>> has the same problem. The solution is using the onload markup attribute
>> that calls a function declared in an earlier <script>:
>>
>> <script>
>> function iframeLoaded() {
>>   /* It will run! */
>> }
>> </script>
>> <iframe src=foo.html onload=iframeLoaded()></iframe>
>
> Exactly.
>
>
> On Sat, 14 May 2011, Ojan Vafai wrote:
>>
>> If someone proposed a workable solution, browsers would likely implement
>> it. I can't think of a backwards-compatible solution to this, so I agree
>> that developers just need to learn that this is a bad pattern. I
>> could imagine browsers logging a warning to the console in these cases,
>> but I worry that it would fire too much in today's web.
>
> Indeed.
>
>
>> It's unfortunate that you need to use an inline event handler instead of
>> one registered via addEventListener to avoid the race condition.
>> Exposing something to the platform like jquery's live event handlers (
>> http://api.jquery.com/live/) could mitigate this problem in practice,
>> e.g. it would be just as easy or easier to register the event handler
>> before the element is created.
>
> You can also work around it by setting src="" from script after you've
> used addEventListener, or by checking the state manually after you've
> added the handler and calling the handler if it is too late (though you
> have to be aware of the situation where the event is actually already
> scheduled and you added the listener between the time it was scheduled and
> the time it fired, so your function really has to be idempotent).
>
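
The second workaround described above (register the handler, then check the state by hand, and tolerate being triggered twice) can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name whenCanPlay is hypothetical, and it treats readyState >= HAVE_FUTURE_DATA (3) as meaning that canplay has already been, or is about to be, fired.

```javascript
// Hypothetical helper: run `callback` exactly once when `media` can play,
// whether or not the canplay event fired before this code ran.
const HAVE_FUTURE_DATA = 3; // same value as HTMLMediaElement.HAVE_FUTURE_DATA

function whenCanPlay(media, callback) {
  let done = false;
  function runOnce() {
    // Guard: the manual state check and an already-queued event can both
    // end up calling this, so the second call must be a no-op.
    if (done) return;
    done = true;
    media.removeEventListener('canplay', runOnce);
    callback(media);
  }
  // Register first, so an event fired from now on is not missed.
  media.addEventListener('canplay', runOnce);
  // Then check the state, in case the event fired before we registered.
  if (media.readyState >= HAVE_FUTURE_DATA) {
    runOnce();
  }
}
```

In a page this might be called as whenCanPlay(document.querySelector('video'), start), where start is whatever the author wanted to run on canplay. The done flag is what makes the function safe in the case where the event was already scheduled when the listener was added.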
>
> On Sun, 15 May 2011, Olli Pettay wrote:
>>
>> There is no need to use inline event handler.
>> One can always add capturing listener to window for example.
>> window.addEventListener("canplay",
>>   function(e) {
>>     if (e.target == document.querySelector('video')) {
>>       // Do something.
>>     }
>>   }
>> , true);
>> And just do that before the <video> element occurs in the page.
>> That is simple, IMHO.
>
> Indeed, that is another option.
>
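
The capturing-listener approach above can be wrapped in a small reusable function. The sketch below is an illustration only: captureMediaEvent and its parameters are hypothetical names, and it relies on nothing beyond addEventListener with the capture flag set, which is what allows a listener on an ancestor to see events that do not bubble.

```javascript
// Hypothetical helper: listen for a (possibly non-bubbling) event on an
// ancestor such as window during the capture phase, so the listener can be
// registered before the target element exists in the document.
function captureMediaEvent(root, type, matches, handler) {
  function listener(event) {
    // Only react to events whose target passes the caller's test.
    if (matches(event.target)) {
      handler(event);
    }
  }
  root.addEventListener(type, listener, true); // true = capture phase
  // Return a function that removes the listener again.
  return () => root.removeEventListener(type, listener, true);
}
```

In a page, a script placed before the <video> element could call captureMediaEvent(window, 'canplay', (t) => t.tagName === 'VIDEO', onCanPlay), where onCanPlay is the author's own handler.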
>
>> (I wonder why the "Firing a simple event named e" defaults to
>> non-bubbling. It makes many things harder than they should be.)
>
> The default is arbitrary and doesn't affect the platform (since I have
> to decide with each event whether to use the default or not). Changing the
> default would make no difference (I'd just have to go to every site that
> calls the algorithm and switch it from "bubbles" to nothing and nothing to
> "does not bubble").
>
>
> On Sun, 15 May 2011, Glenn Maynard wrote:
>>
>> If a MediaController is being used it's more complicated; there seems to
>> be no way to query the readyState of a MediaController (almost, but not
>> quite, the "most recently reported readiness state"), or to get a list
>> of slaved media elements from a MediaController without searching for
>> them by hand.
>
> If you're scripting the MediaController, the assumption is that you
> created it so there's no problem. The implied MediaControllers are for the
> declarative case where you don't need scripting at all.
>
>
> On Mon, 16 May 2011, Simon Pieters wrote:
>>
>> The state can have changed before the event has actually fired, since
>> state changes are sync but the events are queued. So if the script
>> happens to run in between then func is run twice.
>
> That's true.
>
>
> On Mon, 16 May 2011, Remy Sharp wrote:
>>
>> Now you're right, whoever pointed out the 7am alarm example, if you
>> attach the event too late, then you'll miss the boat.  However, it's a
>> chicken and egg situation.  You don't have the DOM so you can't attach
>> the event handler, and if you do have the DOM, the damn event has fired
>> already.
>>
>> What's the fix?  Well, the work arounds are certainly viable, again from
>> an everyman developer point of view:
>>
>> 1) Attach higher up, on the window object and listen for the
>> canplay/loadedmetadata/etc and check the event.target
>>
>> 2) Attach an inline event handler (not nice, but will do)
>>
>> The fix?  Since ultimately we have exactly the same potential "bug" with
>> image load events
>
> Not just those, also iframes, own document navigation, sockets, XHR,
> anything that does asynchronous work, in fact.
>
>
>> is to update the specification and make it clear: that depending on the
>> speed of the connection and decoding, the following "xyz" events can
>> fire **before** your script runs.  Therefore, here's a couple of work
>> arounds - or just be aware.
>
> I don't really know where to put this that would actually help.
>
>
> On Tue, 17 May 2011, Philip Jägenstedt wrote:
>>
>> Still, I don't think just advocacy is any kind of solution. Given that
>> you (the co-author of an HTML5 book) make certain assumptions about the
>> outcome of this race condition, it's safe to assume that hordes of web
>> developers will do the same.
>>
>> To target this specific pattern, one hypothetical solution would be to
>> special-case the first script that attaches event handlers to a <video>
>> element. After it has run, all events that were already fired before the
>> script are fired again. However, this seems awfully messy if the script
>> also observes readyState or networkState. It might also interfere with
>> browsers that use scripts behind the scenes to implement the native
>> controls.
>>
>> Although a kludge, another solution might be to block events from being fired
>> until x more bytes of the document have been parsed or it has finished
>> loading.
>
> On Wed, 18 May 2011, Robert O'Callahan wrote:
>>
>> For certain kinds of events ("load", the video events, maybe more),
>> delay the firing of such events until, say, after DOMContentLoaded has
>> fired. If you're careful you might be able to make this a strict subset
>> of the behaviors currently allowed by the spec ... i.e. you're
>> pretending that your frame, image and video loads simply didn't complete
>> until after DOMContentLoaded fired in the outer page. That would mean
>> it's compatible with properly-written legacy content ... if there is
>> any.
>>
>> Of course I have no idea whether that approach is actually feasible :-).
>> It obviously isn't compatible with what browsers currently do, so
>> authors wouldn't want to rely on it for a long time if ever.
>
> These don't seem like workable solutions. We can't delay load events for
> every image on the Web, surely. Remembering every event that's ever fired
> for any <img> or <video> just in case a handler is later attached seems a
> bit intractable, too.
>
> This has been a problem since JavaScript was added in the 90s. I find it
> hard to believe that we have to suddenly fix it now.
>
>
> On Tue, 24 May 2011, Silvia Pfeiffer wrote:
>>
>> Ian and I had a brief conversation recently where I mentioned a problem
>> with extended text descriptions with screen readers (and worse still
>> with braille devices) and the suggestion was that the "paused for user
>> interaction" state of a media element may be the solution. I would like
>> to pick this up and discuss in detail how that would work to confirm my
>> sketchy understanding.
>>
>> *The use case:*
>>
>> In the specification for media elements we have a <track> kind of
>> "descriptions", which are:
>> "Textual descriptions of the video component of the media resource,
>> intended for audio synthesis when the visual component is unavailable
>> (e.g. because the user is interacting with the application without a
>> screen while driving, or because the user is blind). Synthesized as a
>> separate audio track."
>>
>> I'm for now assuming that the synthesis will be done through a screen
>> reader and not through the browser itself, thus making the
>> descriptions available to users as synthesized audio or as braille if
>> the screen reader is set up for a braille device.
>>
>> The textual descriptions are provided as chunks of text with a start
>> and an end time (so-called "cues"). The cues are processed during video
>> playback as the video's playback time starts to fall within the time
>> frame of the cue. Thus, it is expected that the cues are consumed
>> during the cue's time frame and are not present any more when the end
>> time of the cue is reached, so they don't conflict with the video's
>> normal audio.
>>
>> However, on many occasions, it is not possible to consume the cue text
>> in the given time frame. In particular not in the following
>> situations:
>>
>> 1. The screen reader takes longer to read out the cue text than the
>> cue's time frame provides for. This is particularly the case with long
>> cue text, but also when the screen reader's reading rate is slower
>> than what the author of the cue text expected.
>>
>> 2. A braille device is used for reading. Since reading braille is
>> much slower than listening to read-out text, the cue time frame will
>> invariably be too short.
>>
>> 3. The user seeked right into the middle of a cue and thus the time
>> frame that is available for reading out the cue text is shorter than
>> the cue author allowed for.
>>
>> Correct me if I'm wrong, but it seems that what we need is a way for
>> the screen reader to pause the video element from continuing to play
>> while the screen reader is still busy delivering the cue text. (In
>> a11y talk: what is required is a means to deal with "extended
>> descriptions", which extend the timeline of the video.) Once it's
>> finished presenting, it can resume the video element's playback.
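
The three situations above reduce to one comparison: the time needed to deliver the cue text against the time left in the cue. A minimal sketch follows, assuming delivery time can be estimated from a per-word rate. The `Cue` type, the `required_hold` function and the rates used in the usage note are illustrative assumptions, not part of any specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cue:
    start: float  # seconds on the media timeline
    end: float    # seconds on the media timeline
    text: str


def required_hold(cue: Cue, entered_at: float, seconds_per_word: float) -> float:
    """Seconds playback would have to be held at cue.end to deliver the full text.

    entered_at is the media time at which the cue became active: cue.start in
    normal playback, or a later time if the user seeked into the middle of it.
    seconds_per_word is an assumed delivery rate (speech or braille).
    """
    words = len(cue.text.split())
    delivery_time = words * seconds_per_word
    available = cue.end - max(entered_at, cue.start)
    return max(0.0, delivery_time - available)
```

For a nine-word cue spanning 10 s to 14 s, a fast speech rate of 0.25 s per word needs no hold and a slower rate of 0.5 s per word needs 0.5 s. An assumed braille rate of 1.5 s per word needs 9.5 s, and seeking in at 12 s with the 0.5 s rate needs 2.5 s.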
>
> Is it a requirement that the user be able to use the regular video pause,
> play, rewind, etc, controls to seek inside the extended descriptions, or
> should they literally pause the video while playing, with the audio
> descriptions being controlled by the same UI as the screen reader?
>
>
>> IIUC, a video is "paused for user interaction" basically when the UA has
>> decided to pause the video without the user asking to pause it (i.e. the
>> paused attribute is false) and the pausing happened not for network
>> buffering reasons, but for other reasons. IIUC one concrete situation
>> where this state is used is when the UA has reached the end of the
>> resource and is waiting for more data to come (e.g. on a live stream).
>
> That latter state is not "paused for user interaction", it's just stalled
> due to lack of data. The rest is accurate though.
>
>
>> To use "paused for user interaction" for extending descriptions, we need
>> to introduce a means for the screen reader to tell the UA to pause the
>> video when it reaches the end of the cue and it's still busy delivering
>> a cue's text. Then, as it finishes, it will un-pause the video to let it
>> continue playing.
>>
>> To me it sounds like a feasible solution.
>>
>> The screen reader could even provide a user setting and a short-cut so a
>> user can decide that they don't want this pausing to happen or that they
>> want to move on from the current cue.
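
The interaction described above can be sketched as a small simulation. This is a toy model under stated assumptions, not a browser or accessibility API: `SimulatedPlayer`, `held_for_description`, `extend_descriptions` and `skip_cue` are hypothetical names for the UA-side hold, the user setting and the shortcut.

```python
class SimulatedPlayer:
    """Toy model of a media element whose playback a screen reader can hold."""

    def __init__(self, cue_end: float, extend_descriptions: bool = True):
        self.current_time = 0.0
        self.paused = False                 # the user never asked to pause
        self.held_for_description = False   # models "paused for user interaction"
        self.cue_end = cue_end
        self.extend_descriptions = extend_descriptions  # hypothetical user setting
        self.at_busy = False                # screen reader still delivering the cue?

    def at_started_cue(self):
        self.at_busy = True

    def at_finished_cue(self):
        # The screen reader is done, so the UA releases the hold.
        self.at_busy = False
        self.held_for_description = False

    def skip_cue(self):
        # Hypothetical shortcut: the user moves on from the current cue.
        self.at_finished_cue()

    def tick(self, dt: float):
        """Advance the playback clock by dt seconds unless playback is held."""
        if self.paused or self.held_for_description:
            return
        new_time = self.current_time + dt
        crossing_cue_end = self.current_time < self.cue_end <= new_time
        if crossing_cue_end and self.at_busy and self.extend_descriptions:
            # Stop exactly at the cue's end time and wait for the screen reader.
            self.current_time = self.cue_end
            self.held_for_description = True
        else:
            self.current_time = new_time
```

While the hold is active, `paused` stays false, matching the description of a pause that the user did not request. With `extend_descriptions` set to false the clock runs straight through the end of the cue.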
>>
>> Another advantage of this approach is that e.g. a deaf-blind user could
>> hook up their braille device such that it will deliver the extended
>> descriptions and also deliver captions through braille with such
>> extension pausing happening. (Not sure that such a user would even want
>> to play the video, but it would be possible.)
>>
>> Now, I think there is one problem though (at least as far as I can
>> tell). Right now, IIUC, screen readers are only passive listeners on the
>> UA. They don't influence the behaviour of the UA. The accessibility API
>> is basically only a one-way street from the UA to the AT. I wonder if
>> that is a major inhibitor of using this approach or whether it's easy
>> for UAs to overcome this limitation? (Or if such a limitation even
>> exists - I don't know enough about how AT work...).
>>
>> Is that an issue? Are there other issues that I have overlooked?
>
> That seems to be entirely an implementation issue.
>
> --
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Friday, 3 June 2011 08:22:06 GMT
