[whatwg] Timed tracks: feedback compendium

On Wed, Sep 8, 2010 at 9:19 AM, Ian Hickson <ian at hixie.ch> wrote:

>
> On Fri, 23 Jul 2010, Philip Jägenstedt wrote:
> >
> > I'm not a fan of pauseOnExit, though, mostly because it seems
> > non-trivial to implement. Since it is last in the argument list of
> > TimedTrackCue, it will be easy to just ignore when implementing. I still
> > don't think the use cases for it are enough to motivate the
> > implementation cost.
>
> Really? It seems like automatically pausing video half-way would be a very
> common thing to do; e.g. to play an interstitial ad, or to play a specific
> sound effect in a sound file containing multiple sound effects, or to play
> a video up to the point where the user has to make a choice or has to ask
> to move on to the next slide. There's basically no good way to do this
> kind of thing without this feature.
>

Also, some text cues will be fairly long, and some users will not be able
to read them within the time allocated to the cue. So, making pauseOnExit
available is a good thing for accessibility.
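To gauge the implementation cost: the core of pauseOnExit can be sketched as a check run on each time update. The cue shape and function name here are illustrative, not from the spec:

```javascript
// Sketch: decide whether playback should pause because a cue with
// pauseOnExit set was exited (its end time was crossed) between two
// time updates. Cue shape and names are illustrative.
function shouldPauseOnExit(cues, previousTime, currentTime) {
  return cues.some(
    (cue) =>
      cue.pauseOnExit &&
      cue.endTime > previousTime &&
      cue.endTime <= currentTime
  );
}
```

A player would call this from its time-update loop and pause (seeking back to the cue's end time) whenever it returns true - e.g. at an ad break or before a slide transition.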



>
> > > On Fri, 31 Jul 2009, Silvia Pfeiffer wrote:
> > > >
> > > > * It is unclear, which of the given alternative text tracks in
> > > > different languages should be displayed by default when loading an
> > > > <itext> resource. A @default attribute has been added to the <itext>
> > > > elements to allow for the Web content author to tell the browser
> > > > which <itext> tracks he/she expects to be displayed by default. If
> > > > the Web author does not specify such tracks, the display depends on
> > > > the user agent (UA - generally the Web browser): for accessibility
> > > > reasons, there should be a field that allows users to always turn
> > > > display of certain <itext> categories on. Further, the UA is set to
> > > > a default language and it is this default language that should be
> > > > used to select which <itext> track should be displayed.
> > >
> > > It's not clear to me that we need a way to do this; by default
> > > presumably tracks would all be off unless the user wants them, in
> > > which case the user's preferences are paramount. That's what I've
> > > specced currently. However, it's easy to override this from script.
> >
> > It seems to me that this is much like <video autoplay> in that if we
> > don't provide a markup solution, everyone will use scripts and it will
> > be more difficult for the UA to override with user prefs.
>
> What would we need for this then? Just a way to say "by the way, in
> addition to whatever the user said, also turn this track on"? Or do we
> need something to say "by default, override the user's preferences for
> this video and instead turn on this track and turn off all others"? Or
> something else? It's not clear to me what the use case is where this
> would be useful declaratively.
>


You have covered all the user requirements, and that is good; they should
dominate all other settings. But I think we have neglected the authors.
What about tracks that the author has defined and wants activated by
default for users who have not specified anything in their preferences?
For example, if an author knows the audio on their video is pretty poor
and wants the subtitles on by default (otherwise a user may not notice
they are available and may miss what is going on), they currently have to
activate them with script.

A user whose preferences are not set would thus see this track. For a user
whose preferences are set, the browser would additionally turn on the
appropriate tracks, or substitute a more appropriate track in the same
language (e.g. a caption track in place of the default subtitle track). If
we do this with script, won't it have the wrong effect and turn off what
the browser has selected? That would not express an author preference, but
an author override.
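To make the intended selection behaviour concrete, here is one way the combination could work, sketched as a function a browser might run when enabling tracks; the property names are illustrative:

```javascript
// Sketch: enable tracks according to user preferences, falling back
// to the author's declared default only when the user has expressed
// no preference. Property names are illustrative.
function selectTracks(tracks, userPrefKinds) {
  if (userPrefKinds.length > 0) {
    // User preferences are paramount: enable only matching tracks.
    return tracks.filter((t) => userPrefKinds.includes(t.kind));
  }
  // No user preference: honour the author's default track, if any.
  return tracks.filter((t) => t.default);
}
```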



> > > On Thu, 15 Apr 2010, Silvia Pfeiffer wrote:
> > > >
> > > > Further, SRT has no way to specify which language it is written in
> > >
> > > What's the use case?
> >
> > As hints for font selection
>
> Are independent SRT processors really going to do per-language font
> selection? How do they do it today?
>

In VLC there is an "Advanced Open File..." option in which you can open a
subtitle file with the video and set the following parameters:
* FPS
* delay
* font size
* subtitle alignment
* subtitle text encoding, which chooses the charset.



> > and speech synthesis.
>
> Are independent SRT processors really going to do audio descriptions any
> time soon? I've only ever seen this in highly experimental settings.
>

Once this is usable in the Web context, accessibility people will jump at
this opportunity. It has not been possible before. You should see the
excitement I always get from blind people when I demonstrate the Elephants
Dream video with text audio descriptions. It will totally take off.


> > [...] the positioning of individual cues is still not controlled by CSS
> > but rather by e.g. L:50%.
>
> I considered this issue carefully when speccing WebSRT. My conclusion
> (after watching a lot more TV than I'm used to) was that in practice
> subtitle positioning is not strictly a presentational issue -- that is,
> you can't just swap one set of styles for another and have equally good
> results, you have to control the positioning on a per-cue basis regardless
> of the styling. This is because you have to avoid burnt-in text, or
> overlap burnt-in text, or because you need to align text with a speaker,
> or show which audio channel the text came from (e.g. for people talking
> off camera in a very directional sense), etc.
>

I agree. However, what stops us from specifying the positioning in CSS? Why
a new mechanism? The output of rendering the cues ends up as a set of CSS
boxes anyway.



> > Alternatively, might it not be better to simply use the voice "sound"
> > for this and let the default stylesheet hide those cues? When writing
> > subtitles I don't want the maintenance overhead of 2 different versions
> > that differ only by the inclusion of [doorbell rings] and similar.
> > Honestly, it's more likely that I just wouldn't bother with
> > accessibility for the HoH at all. If I could add it with <sound>doorbell
> > rings, it's far more likely I would do that, as long as it isn't
> > rendered by default. This is my preferred solution, then keeping only
> > one of kind=subtitles and kind=captions. Enabling the HoH-cues could
> > then be a global preference in the browser, or done from the context
> > menu of individual videos.
>
> I don't disagree with this, but I fear it might be too radical a step for
> the caption-authoring community to take at this point.
>

I think we have to get over the notion that the existing subtitling
community is our target for this format. In fact, the new subtitling
community are all the Web developers out there. They are the ones we should
target and for them we should make things easier.


> > If we must have both kind=subtitles and kind=captions, then I'd suggest
> > making the default subtitles, as that is without a doubt the most common
> > kind of timed text. Making captions the default only means that most
> > timed text will be mislabeled as being appropriate for the HoH when it
> > is not.
>
> Ok, I've changed the default. However, I'm not fighting this battle if it
> comes up again, and will just change it back if people don't defend having
> this as the default. (And then change it back again if the browsers pick
> "subtitles" in their implementations after all, of course.)
>
> Note that captions aren't just for users that are hard-of-hearing. Most of
> the time when I use timed tracks, I want captions, because the reason I
> have them enabled is that I have the sound muted.
>

Hmm, you both have good points. Maybe we should choose something as the
default that is not visible on screen, such as "descriptions"? That would
avoid the issue and make it explicit for people who provide captions or
subtitles that they have to make a choice.


> > > - Use existing technologies where appropriate.
> > > [...]
> > > - Try as much as possible to have things Just Work.
> >
> > I think by specifying a standalone cue text parser WebSRT fails on these
> > counts compared to reusing the HTML fragment parsing algorithm for
> > parsing cue text.
>
> HTML parsing is a disaster zone that we should avoid at all costs, IMHO. I
> certainly don't think it would make any sense to propagate that format
> into anywhere where we don't absolutely have to propagate it.
>

A WebSRT authoring application would not have to create all the markup
that an HTML fragment parser supports. It would only use what it deems
necessary for the use cases it targets.

Browsers are WebSRT players that will consume the HTML fragments created
by such authoring applications.
In addition, browsers will also be able to consume richer HTML fragments
that were created as time-aligned overlays for video with fancier styling
by Web developers. Something like
http://people.mozilla.com/~prouget/demos/vp8/ (you need Firefox for it):
where it says "This movie will eat your planet", you could have fancy
timed text.

Just as much as there is a need for basic captions and subtitles, there is
also a need for fancy time-aligned HTML fragments. It would be very strange
if, in order to get that working, people would need to use the "metadata"
part of the WebSRT spec.


> > > If we don't use HTML wholesale, then there's really no reason to use
> > > HTML at all. (And using HTML wholesale is not really an option, as you
> > > say above.)
> >
> > I disagree. The most obvious way of reusing existing infrastructure in
> > browsers, the most obvious way of getting support for future syntax
> > changes that support attributes or new tag names and the most obvious
> > way to get error handling that behaves in the way the appearance of the
> > syntax suggests is to reuse the HTML fragment parsing algorithm for
> > parsing the cue text.
>
> HTML parsing is one of the most convoluted, quirk-laden, unintuitive and
> expensive syntaxes... Its extensibility story is a disaster (there's so
> many undocumented and continually evolving constraints that any addition
> is massively expensive), its implementation drags with it all kinds of
> crazy dependencies on the DOM, event loop interactions, scripting, and so
> forth, and it has a highly inconsistent syntax.
>
> I'm not at all convinced reusing it would be "obvious".
>

It is obvious to anyone who is not on a standards body. :-)

But seriously: all the things you mention above are advantages: all this
stuff has been solved for HTML and will not have to be solved again if we
reuse it. Anything new will inevitably go through a similar development
path. I don't see this as an opportunity to re-invent HTML when, for
everyone out there, HTML is in fact working just fine.


On Sun, 25 Jul 2010, Silvia Pfeiffer wrote:
> >
> > I think if we have a mixed set of .srt files out there, some of which
> > are old-style srt files (with line numbers, without WebSRT markup) and
> > some are WebSRT files with all the bells and whistles and with
> > additional external CSS files, we create such a mess for that existing
> > ecosystem that we won't find much love.
>
> I'm not sure our goal is to find love here, but in general I would agree
> that it would be better to have one format than two. I don't see why we
> wouldn't just have one format here though. The idea of WebSRT is to be
> sufficiently backwards-compatible that that is possible.
>

With "finding love" I referred to your expressed goals:
 - Keep implementation costs for standalone players low.
 - Use existing technologies where appropriate.
 - Try as much as possible to have things Just Work.

With WebSRT, we will have one label for two different types of files: the
old-style SRT files and the new WebSRT files. Just putting a single label
on them doesn't make them one format, in particular when most old files
will not conform to the new label and many new files will not play in the
software created for the old spec.



> On Mon, 26 Jul 2010, Silvia Pfeiffer wrote:
> > > On Thu, 16 Jul 2009, Silvia Pfeiffer wrote:
> > >> * the "type" attribute is meant to both identify the mime type of the
> > >> format and the character set used in the file.
> > >
> > > It's not clear that the former is useful. The latter may be useful; I
> > > haven't supported that yet.
> >
> > If the element is to support a single format in a single character set,
> > then there is no need for a MIME type. So, we need to be clear whether
> > we want to restrict our option here for multiple formats.
>
> As specified the spec supports multiple formats, it just only talks about
> WebSRT currently. (If it becomes likely that browsers will have different
> sets of supported formats, we can add a type="" attribute to help browsers
> find the right files without checking each one, but that's not necessary
> unless that becomes a likely problem.)
>

OK, understood.



> > >> The character set question is actually a really difficult problem to
> > >> get right, because srt files are created in an appropriate character
> > >> set for the language, but there is no means to store in a srt file
> > >> what character set was used in its creation. That's a really bad
> > >> situation to be in for the Web server, who can then only take an
> > >> educated guess. By giving the ability to the HTML author to specify
> > >> the charset of the srt file with the link, this can be solved.
> > >
> > > Yeah, if this is a use case people are concerned about, then I agree
> > > that a solution at the markup level makes sense.
> >
> > If we really are to use WebSRT because (amongst other reasons) it allows
> > reuse of existing srt files, then we need to introduce a means to
> > provide the charset, since almost none of the srt files in the wild that
> > I have looked at were in UTF-8, but in all sorts of other character
> > sets. Another solution to this problem would be to have WebSRT know what
> > charset their characters are in - then we don't need to add such
> > information to the <track> element. It will still not work with legacy
> > SRT files though.
>
> I've added a charset="" attribute to allow authors to provide the
> character encoding for legacy SRT files. WebSRT files are required to be
> UTF-8, however (legacy SRT files that are not UTF-8 are considered
> non-conforming).
>

This supports my understanding that SRT files are a different format to
WebSRT files.
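In markup, the attribute would be used something like this (a sketch; the file names are made up, and charset="" applies only to legacy SRT files as described above):

```html
<!-- Legacy SRT file in a non-UTF-8 encoding, declared on the track. -->
<video src="video.webm" controls>
  <track src="subtitles-de.srt" kind="subtitles" srclang="de"
         charset="windows-1252">
</video>
```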



> > You mention that karaoke and lyrics are supported by WebSRT, so could we
> > add them to the track kinds?
>
> Why would they need new script kinds? Isn't "subtitles" enough?
>

Interesting idea.

This actually gets back to the issue that I have mentioned before: we are
actually overloading the meaning of the @kind attribute with many different
things:
* what the data is semantically: subtitle, caption, textual description,
chapters or "metadata" (i.e. "anything")
* whether the data will be visually displayed
* how the data will be parsed

What if, from a semantic viewpoint, people want to have subtitles or
captions always show, but not karaoke or lyrics? I *think* putting karaoke
and lyrics in one pot is ok, but I wonder if we can just throw subtitles and
karaoke in one pot. I'll have to think about it...


> > Does the earlier mean that we can only provide text for video and not
> > for audio, which has no dimensions?
>
> If you want the browser to render cues, you have to use a <video> element
> so that there is somewhere to render them. You can play audio with the
> <video> element, and you can use <audio> and manually render the cues from
> JS if desired.
>

I see. That wasn't obvious to me, but I can see how that might make sense.


> We could provide an API dedicated to making it easier to render cues
> manually if desired (firing an event or callback with the actual cue for
> each cue that shows, for example).
>

I think that might be a good idea. What would you suggest? Is oncuechange
not sufficient?
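One possible shape for such an API, sketched here as the diffing a browser (or a JS library, today) would do on each time update; the callback names are illustrative, not a proposal from the spec:

```javascript
// Sketch: fire per-cue enter/exit callbacks by diffing the set of
// active cues on each time update. Names are illustrative.
function updateActiveCues(cues, time, active, onEnter, onExit) {
  const nowActive = new Set(
    cues.filter((c) => c.startTime <= time && time < c.endTime)
  );
  for (const cue of nowActive) {
    if (!active.has(cue)) onEnter(cue); // cue just started showing
  }
  for (const cue of active) {
    if (!nowActive.has(cue)) onExit(cue); // cue just stopped showing
  }
  return nowActive;
}
```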



> > And what if we wanted to render captions underneath a video rather than
> > inside video dimensions? Can that be achieved somehow?
>
> You'd need to script it, currently. (I didn't see many (any?) cases of
> this in my research, so I didn't provide a declarative solution.)
>

I've seen it done often on the Web, in particular for descriptions (or timed
transcripts) - it won't appear on TV or desktop caption applications though,
for obvious reasons.

For example, the descriptions on TED are rendered into a container that is
not overlaid onto the video: e.g.
http://www.ted.com/talks/dan_cobley_what_physics_taught_me_about_marketing.html
(click the interactive transcript on the right to display it).
Or the interactive transcript on YouTube is timed text that is not rendered
on top of the video but in a box underneath: e.g.
http://www.youtube.com/watch?v=nF3yhZrtLRw .
For captions and subtitles it's less common, but rendering them underneath
the video rather than on top of it is not unusual, e.g.
http://nihseniorhealth.gov/video/promo_qt300.html or
http://www.fs.fed.us/greatestgood/film/moviefiles/TheGreatestGood_Tr_C_L.mov
or
http://www.veotag.com/player/Default.aspx?mode=sample&sid=1&pid={516D49AA-72F4-4DA6-91BA-6D225C2782D8}.
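The scripted version of this is small if the time lookup is kept pure; a sketch, with the DOM wiring commented out since it assumes element ids that are made up:

```javascript
// Sketch: render the currently active cue into a container outside
// the video, e.g. a transcript box underneath it. Cue shape is
// illustrative.
function activeCueText(cues, time) {
  const cue = cues.find((c) => c.startTime <= time && time < c.endTime);
  return cue ? cue.text : '';
}

// Browser-only wiring (assumed ids, not from any spec):
// video.addEventListener('timeupdate', () => {
//   document.getElementById('transcript').textContent =
//     activeCueText(cues, video.currentTime);
// });
```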



> > It is possible to jump to a cue range through its number in the list in
> > the media element using JavaScript and setting the @currentTime to that
> > cue range's start time. However, it has not yet been defined whether
> > there is a relationship between media fragment URIs and timed tracks.
> > The media fragment URI specification has such URIs defined as e.g.
> > http://example.com/video.ogv#id="InTheBathroom" and cues have a textual
> > identifier, so we can put these two together to enable this. Such URIs
> > will then be able to be used in the @src attribute of a media element
> > and focus the view on that cue, just like temporal media fragments do
> > with a random time range.
>
> I'm not sure I follow. Presumably this is all for in-band timed tracks, in
> which case the HTML spec isn't really involved.
>

Yes, I don't think we can make it work for external tracks. It's a media
fragment URI thing, so potentially something for browser vendors to
implement. But indeed, this was taking a bit of a jump into the future.
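For what it's worth, the mapping itself is straightforward once cues expose their identifier; a sketch, with the fragment syntax taken from the example above and the cue shape assumed:

```javascript
// Sketch: resolve a media fragment URI like
// http://example.com/video.ogv#id="InTheBathroom" to the start time
// of the cue with that identifier. Cue shape is assumed.
function fragmentStartTime(url, cues) {
  const match = /#id="([^"]*)"/.exec(url);
  if (!match) return null;
  const cue = cues.find((c) => c.id === match[1]);
  return cue ? cue.startTime : null;
}
```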



> > For linking out of a cue, there is a need to allow having hyperlinks in
> > cues. IIUC this is currently only possible by using a HTML-style markup
> > in the cue, declaring the cue as kind=metadata and calling
> > getCueAsSource() on the cue, then running your own overlays and shoving
> > the retrieved text to the innerHTML of that overlay.
>
> Having a hyperlink in a cue seems like really bad UI (having any temporal
> interactive UI is typically highly inaccessible, and is generally only
> considered a good idea in games). If you want to make the whole video
> into a link (as Dave suggested in the e-mail above, if I understood it
> correctly) then you don't need anything to do with timed tracks.
>

You can always pause the presentation to follow a given hyperlink. That's
definitely better than having to re-type a URL, which is what currently
happens with many of the timed annotations on YouTube that link out of
YouTube. I don't see why this is bad UI design. In fact, for people with
accessibility issues it is much easier to stop a video and activate a
hyperlink than to re-type a hyperlink given in captions, subtitles, or
worse: in descriptions. I see support for hyperlinks in cues as really
important for accessibility and usability reasons.


> > While that works, it seems like a lot of hoops to jump through just to
> > be able to use a bit of HTML markup - in particular having to run your
> > own overlay. Could we introduce a kind=htmlfragment type where it is
> > obvious that the text is HTML and that the fragment parser can be run
> > automatically and display it through the given display mechanisms?
>
> I would on the contrary think that that would be something we should
> _discourage_, not encourage!
>

All that is going to achieve is that we will end up with HTML fragments in
metadata type cues and have to deal with them through JavaScript. I'd much
prefer we have a defined way of dealing with this situation rather than
having it be created inconsistently in JS libraries.


> > Many existing subtitle formats and similar media-time-aligned text
> > formats contain file-wide name-value pairs that explain metadata for the
> > complete resource. An example are Lyrics files, e.g.
> >
> > On Tue, 20 Apr 2010, Silvia Pfeiffer wrote:
> > >
> > > Lyrics (LRC) files typically look like this:
> > >
> > > [ti:Can't Buy Me Love]
> > > [ar:Beatles, The]
> > > [au:Lennon & McCartney]
> > > [al:Beatles 1 - 27 #1 Singles]
> > > [by:Wooden Ghost]
> > > [re:A2 Media Player V2.2 lrc format]
> > > [ve:V2.20]
> > > [00:00.45]Can't <00:00.75>buy <00:00.95>me <00:01.40>love,
> > > <00:02.60>love<00:03.30>, <00:03.95>love, <00:05.30>love<00:05.60>
> > > [00:05.70]<00:05.90>Can't <00:06.20>buy <00:06.40>me <00:06.70>love,
> > > <00:08.00>love<00:08.90>
> >
> > You can see that there are title, artist, author, album, related
> > content, version and similar metadata information headers on this file.
> > Other examples contain copyright information and usage rights -
> > important information to understand and deal with when distributing
> > media-time-aligned text files on a medium such as the Web.
>
> I don't really see why we would want to embed this in a timed track. Even
> in HTML embedding this kind of information has never taken off. We would
> need to have very compelling use cases, implementation experience, and
> implementation committements to move in such a direction, IMHO.
>

Dublin Core has been a huge success. Every archive in the world uses that
kind of metadata. I am confused what you mean by metadata in HTML hasn't
taken off. I believe it's only search engines that stopped using metadata
and only because people started mis-using the system. That search engines
stopped using meta elements is a good thing and gave the use of the meta
element back to what it is for: providing machine-readable information, not
for SEO. Such metadata is also relevant to audio and video, just look at the
success of ID3 tags or Vorbis Comment. Similarly, we will need this
capability in timed text files.

I do wonder if Sam Dutton has an opinion on this and may even be keen on
implementation commitment? Sam?


> > I would think it'd be good to define a standard means of extracting
> > plain text out of any type of cue, so it will be possible to hand this
> > to e.g. the accessibility API for reading back.
>
> Getting the raw data is already possible, unless I misunderstood what you
> meant.
>

What I meant is a getter in TimedTrackCueList that would return the cue
not with its specific markup (WebSRT, JSON or HTML fragment), but stripped
of any special markers. This can be very useful when wanting to hand the
text to speech synthesis, for example.
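As a sketch of what such a getter would do - here reduced to a regex tag-stripper; a real implementation would follow the cue grammar rather than a regex:

```javascript
// Sketch: strip cue markup (voice spans like <v Fred>, styling tags,
// karaoke timestamps like <00:01.40>) down to plain text suitable
// for handing to an accessibility API or a speech synthesizer.
function cueToPlainText(cueSource) {
  return cueSource.replace(/<[^>]*>/g, '');
}
```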


> > The rendering and CSS styling approach with ::cue described in
> > http://www.whatwg.org/specs/web-apps/current-work/complete/rendering.html#timed-tracks-0
> > is only defined on WebSRT. That means that there is no styling possible
> > for TimedTracks that come from a different format (assuming we may allow
> > other formats in future).
>
> Styling such formats would be quite possible, it just has to be defined.
>

... and implemented by the browsers, I guess.


> > I think this is a bit restrictive and would rather we define a mechanism
> > to allow CSS styling of cues that come from any type of TimedTrack, and
> > thus make the CSS styling part independent of the format.
>
> I don't know how to do that.
>

I believe when you said "it just has to be defined", that actually answered
this question.



> > I think by understanding this and by making this explicit in the spec,
> > we can more clearly decide what track kinds are still missing and also
> > what we actually need to implement.
>
> I'm not sure what to add to make this clearer. Can you elaborate?
>

What I meant by this was that in the section where @kind is introduced
(http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#attr-track-kind),
there is no mention of the implications of choosing between these @kind
values. E.g. if I choose a "description", it will not be visible unless I
implement that in JavaScript - that is a pretty big implication that I
only found out about when I finally got to the rendering section. Also,
that section does not provide any hint on what type of markup will be
expected in the cue text - I think that is also a pretty big implication
that should be mentioned there.


On Fri, 6 Aug 2010, Silvia Pfeiffer wrote:
> >
> > Note that the subtitling community has traditionally been using the
> > Subrip (srt) or SubViewer (sub) formats as a simple format and
> > SubStation alpha (ssa/ass) as the comprehensive format. Aegisub, the
> > successor of SubStation Alpha, is still the most popular subtitling
> > software and ASS is the currently dominant format. However, even this
> > community is right now developing a new format called AS6. This shows
> > that the subtitling community also hasn't really converged on a "best"
> > format yet.
>
> Also it's worth noting that the SubStation Alpha formats are very
> presentational in nature, and do not follow the HTML school of semantic
> language design at all. That is the main reason I didn't use those formats
> for HTML <video> captions.
>

The new AS6 format is probably more of a semantic language than SubStation
Alpha. BTW: it also includes metadata.


> > So, given this background and the particular needs that we have with
> > implementing support for a time-synchronized text format in the Web
> > context, it would probably be best to start a new format from a clean
> > slate rather than building it on an existing format.
>
> I don't follow your reasoning here. As you said, SRT is a common subset of
> most of the formats you listed; why would the conclusion not be that we
> should therefore work with SRT?
>

SRT is not a subset of the formats listed above format-wise; it is a
subset functionality-wise only.



> > In contrast to being flexible about what goes into the cues, WebSRT is
> > completely restrictive and non-extensible in all the content that is
> > outside the cues. In fact, no content other than comments are allowed
> > outside the cues.
>
> None is allowed today, but it would be relatively straight-forward to
> introduce metadata before the cues (or even in between the cues). For
> example, we could add defaults:
>
>   *
>   DEFAULTS
>   L:-1 T:50% A:middle
>
>   00:00:20,000 --> 00:00:24,400
>   Altocumulus clouds occur between six thousand
>
>   00:00:24,600 --> 00:00:27,800
>   and twenty thousand feet above ground level.
>
> We could add metadata (here using a different syntax that is similarly
> backwards-compatible with what the spec parser does today):
>
>   @charset --> win-1252
>   @language --> en-US
>
>   00:00:20,000 --> 00:00:24,400
>   Altocumulus clouds occur between six thousand
>
>   00:00:24,600 --> 00:00:27,800
>   and twenty thousand feet above ground level.
>
>

When I read the following:
"A WebSRT file body consists of an optional U+FEFF BYTE ORDER MARK (BOM)
character, followed by zero or more WebSRT line terminators, followed by
zero or more WebSRT cues separated from each other by two or more WebSRT
line terminators, followed by zero or more WebSRT line terminators."
(http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#websrt-cue)
then that doesn't imply for me that we can add anything in front of the
WebSRT cues without breaking the spec, or that we can define cues that are
not time ranges around the "-->" sign.

I would be very happy in particular with the addition of metadata in this
way.

If the DEFAULTS proposal implies something like inline specified default
styling and positioning (like inline CSS), then that may be useful, too.
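The backwards-compatibility argument rests on parsers ignoring any block that lacks a timing line. A sketch of that classification - keying on a full timing line rather than just "-->", which is my assumption, since the metadata syntax above itself uses "-->":

```javascript
// Sketch: split a (Web)SRT body into blank-line-separated blocks and
// keep only those that contain a cue timing line, so DEFAULTS or
// metadata blocks placed before the cues are skipped.
const TIMING = /^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}/m;

function cueBlocks(body) {
  return body.split(/\n{2,}/).filter((block) => TIMING.test(block));
}
```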


There are a variety of syntaxes we could use. So long as whatever we do is
> backwards compatible with what the first set of deployed parsers do, we're
> fine.
>
> Currently comments aren't allowed, but we could add those too (e.g. by
> saying that any block of text that doesn't contain a "-->" is a comment).
>

I must be very confused, but I seemed to remember a comment line being one
that starts with a ";" having been defined for WebSRT. I don't know where
that notion came from and I apologize. I am not actually fussed about adding
comments.


> > * there is no possibility to add file-wide metadata to WebSRT; things
> > about authoring and usage rights as well as information about the media
> > resource that the file relates to should be kept within the file. Almost
> > all subtitle and caption format have the possibility for such metadata
>
> This is something we could add if there is a clear use case, but I'm not
> sure that there is. Why does SRT not have it today?
>

Because SRT is a quick hack - the simplest possible format, and one that
doesn't even fulfill the most basic needs. :-)
But seriously: most formats have metadata, and I would rather go with
those experiences than with SRT in this respect.



> > and we know from image, music and video resources how important it is to
> > have the ability to keep such metadata inside the resource.
>
> Do we? I thought from image, music, and video we learnt that it didn't
> make much difference! :-)
>

I think ID3 is very successful, in particular in iTunes, see
http://en.wikipedia.org/wiki/ITunes#File_metadata . The vorbiscomment
header on Xiph files enjoys a similar popularity. And EXIF has been hugely
successful for images - written by every digital camera and used by every
photo application. These formats make a huge difference.




> > * there is no style sheet association for a WebSRT resource; this can be
> > resolved by having the style sheet linked into the Web page where the
> > resource is used with the video, but that's not possible when the
> > resource is used by itself. It needs something like a <link> to a CSS
> > resource inside the WebSRT file.
>
> Do standalone SRT players want to support CSS? If not, it doesn't much
> matter.
>

Stand-alone SRT players wouldn't want to see any of the WebSRT extensions.
Stand-alone WebSRT players - if we define the styling to be in CSS - would
probably want to support whatever is in WebSRT - if that includes CSS, then
that's it. But any of this is just guesswork until we have implementations.


> > * there is no magic identifier for a WebSRT resource, i.e. what the
> > <wmml> element is for WMML. This makes it almost impossible to create a
> > program to tell what file type this is, in particular since we have made
> > the line numbers optional. We could use "-->" as an indicator, but it's
> > not a good signature.
>
> Yeah, that's a problem. I considered adding "WEBSRT" at the start of every
> file but we couldn't use it reliably since WebSRT parsers presumably want
> to support SRT using the same parser, and that has no signature.
>

I continue to doubt that you can support WebSRT without changing your SRT
parser. Thus, you might as well make such a change and make it easy for SRT
parsers to identify that it's a WebSRT file to parse and not legacy SRT.
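If a signature were added, detection would be trivial; a sketch, treating "WEBSRT" on the first line as the magic identifier (that string is only the one floated above, not anything specced):

```javascript
// Sketch: distinguish a signed WebSRT file from legacy SRT by a
// "WEBSRT" magic identifier at the start, tolerating a leading BOM.
// The signature string is hypothetical, not from the spec.
function isSignedWebSRT(text) {
  return /^\uFEFF?WEBSRT\b/.test(text);
}
```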


> (Note that XML, and anything based on XML, as well as HTML, JS, and CSS,
> have no signature either. It's a common problem of text formats.)
>

Well, there are typical things to parse at the head of XML files, such as
the <?xml?> processing instruction, and at the head of HTML files, such as:
<!DOCTYPE html>
<html
These *are* magic identifiers.
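To illustrate, a tool could sniff the first line of a track resource in much
the same way. A minimal sketch, assuming the hypothetical "WEBSRT" signature
discussed above (it is a proposal, not something in the current spec), with
heuristics for the legacy layout:

```javascript
// Hypothetical format sniffer. The "WEBSRT" magic string is an assumption
// (proposed, not specified); the SRT checks rely on the legacy file layout.
function sniffTrackFormat(text) {
  // Strip a UTF-8 BOM if present, then look at the first line.
  const firstLine = text.replace(/^\uFEFF/, "").split(/\r?\n/)[0].trim();
  if (firstLine === "WEBSRT") return "websrt";    // proposed signature
  if (/^\d+$/.test(firstLine)) return "srt";      // legacy cue number
  if (firstLine.includes("-->")) return "websrt"; // cue numbers are optional in WebSRT
  return "unknown";
}
```

Without a signature, the heuristics are all we have, which is exactly the
problem: "-->" works in practice but is not a good signature.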


> > * there is no means to identify which parser is required in the cues (is
> > it "plain text", "minimal markup", or "anything"?) and therefore it is
> > not possible for an application to know how it should parse the cues.
>
> Timed track cues are not context-free. In standalone players, the user
> says to play a particular cue file, so using the "cue text" mode is a good
> assumption (why would you give mplayer a metadata cue file to display?).
>

Because it is a .srt file and thus assumed to be supported by mplayer.


> Browsers have the <track> context.
>

Yes, indeed, this is not a problem for browsers.


> > * there is no version number on the format, thus it will be difficult to
> > introduce future changes.
>
> Version numbers are an antipattern in multivendor formats. This is an
> intentional feature, not an unfortunate omission. HTML itself has dropped
> the version number in its format; CSS has never had one. Most programming
> languages don't have one.
>

I've accepted this, though I can still see a version number being useful
outside the Web. But I can see both the advantages and the disadvantages,
and I can live without one.



> > I can understand that the definition of WebSRT took inspiration from SRT
> > for creating a simple format. But realistically most SRT files will not
> > be conformant WebSRT files because they are not written in UTF-8.
>
> I don't think they need to be conforming. They're already published.
> Conformance is just a quality assurance tool, it's only relevant for
> documents being written in the future.
>

Conformance is also a problem if players and other tools do not accept
files that are not conformant. I would think Web browsers will be highly
restrictive in what they accept - otherwise the spec isn't all that useful
and we are starting down the quirks path again.



> > Further, realistically, all WebSRT files that use more than just the
> > plain text markup are not conformant SRT files.
>
> What's a "conformant SRT file"?
>

Lacking any other formal registration, it is what Wikipedia defines at
http://en.wikipedia.org/wiki/SubRip . :-)
But to be serious again: just because a format doesn't have a formal
registration doesn't mean that it's not specified. For years, this forum
post has served as the specification of SRT:
http://forum.doom9.org/archive/index.php/t-73953.html and it has been all
that the community needed. It's suboptimal, but that's not an invitation to
re-define the format.


> > So, let's stop pretending there is compatibility and just call WebSRT a
> > new format.
>
> Compatibility has nothing to do with conformance. It has to do with what
> user agents do. As far as I can tell, WebSRT is backwards-compatible with
> legacy SRT user agents, and legacy SRT files are compatible with WebSRT
> user agents as described by the spec.
>

Legacy SRT files come in many different character encodings, which makes
them non-conformant WebSRT. I would not expect new WebSRT implementations,
such as those the Web browsers will need, to make exceptions to the spec in
order to support non-conformant files and become compatible with legacy SRT
files. That again confirms to me that these are two different formats. Yes,
they can be supported by the same piece of code, but that doesn't make them
the same format.



> > In fact, the subtitling community itself has already expressed their
> > objections to building an extension of SRT, see
> > http://forum.doom9.org/showthread.php?p=1396576 , so we shouldn't try to
> > enforce something that those for whom it was done don't want.
>
> The subtitling community in question is looking for a presentational
> format. I think it is very reasonable to say that SRT is not interesting
> for that purpose. However, a presentational format isn't, as far as I can
> tell, suitable for the Web.
>

I am not concerned here with what type of new format they want. I am
concerned about them expressing that it is not desirable to redefine SRT.


> > * the mime type of WebSRT resources should be a different mime type to
> > SRT files, since they are so fundamentally different; e.g. text/websrt
>
> That's what I originally suggested, and you said we should use text/srt
> because it is what people use, even though it's not registered. I think
> you were right; it makes no sense to invent a new MIME type here.
>

I don't seem to remember that discussion. Maybe it was a misunderstanding
and I thought you were asking what the mime type of the original SRT files
was. I certainly would not have suggested using it for a new format.


> > * the file extension of WebSRT resources should be different from SRT
> > files, e.g. wsrt
>
> Extensions are irrelevant on the Web. People can use whatever extension
> they want.
>


Excellent. They are relevant outside the Web, so if we can at least agree to
have WebSRT resources have a different extension, I would be very happy with
that.



> > Right now, there is "plain text", "minimum markup" and "anything"
> > allowed in the cues.
>
> As far as I can tell there's just two modes -- plain text and text with
> WebSRT markup.
>

@kind=metadata tracks can have "anything" in them, which is what I regarded
as the third type of markup.
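For example, a metadata cue can carry structured data that only a script
interprets; nothing in the cue text is markup to the browser. A sketch of
the consuming side (the cue object shape and the choice of JSON here are
illustrative assumptions, not anything the spec mandates):

```javascript
// A metadata cue's payload is arbitrary text; here we assume the author
// chose JSON. The { text: ... } cue shape is illustrative.
function handleMetadataCue(cue) {
  const data = JSON.parse(cue.text);
  // A page script could then, e.g., pause for an ad break or flip a slide.
  return data;
}

const result = handleMetadataCue({ text: '{"type":"ad-break","duration":30}' });
// result.type is "ad-break", result.duration is 30
```

This is what I mean by a third parsing mode: the UA cannot know how to
parse such a cue, only the consuming script can.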

> > Seeing as WebSRT is built with the particular purpose of bringing
> > time-synchronized text for HTML5 media elements, it makes no sense to
> > exclude all the capabilities of HTML.
>
> I would on the contrary say that it makes no sense to take on all the HTML
> baggage when all we want to do is introduce subtitles to video. :-)
>

We are introducing functionality for text and events that are executed in a
time-synchronized manner with media elements - this is broader than just
subtitles.


> > In the current form, WebSRT only makes limited use of existing CSS. I
> > see particularly the following limitations:
> >
> > * no use of the positioning functionality is made and instead a new
> > means of positioning is introduced; it would be nicer to just have this
> > reuse CSS functionality. It would also avoid having to repeat the
> > positioning information on every single cue.
>
> It doesn't make sense to position cues with CSS, because the position of
> cues is an intrinsic part of the cue semantic. Where a cues appears can
> change the plot of a show, for example (was it the evil twin who said
> something or the good twin?).
>

When I say "CSS" I mean the CSS mechanism of providing inline @style
information. That is just a different means of providing positioning and
styling information in a cue.



> > * cue-related metadata ("voice") could be made more generic; why not
> > reuse "class"?
>
> I don't know what this means. What is "class" and how does it differ from
> "voice"?
>

I am talking about the @class attribute in use by all HTML elements. It
could be used with a <span> to provide voice metadata and it would be more
flexible than "voice" because it can be associated with text fragments, not
with whole lines of text.


> > * I noticed that it is not possible to make a language association with
> > segments of text and thus it is not possible to have text with mixed
> > languages.
>
> Are mixed language subtitles common? I don't know that I've ever seen
> that.
>

I have seen several caption files that have at least two languages,
possibly even in the same cue. There are even examples at
http://wiki.whatwg.org/wiki/Use_cases_for_timed_tracks_rendered_over_video_by_the_UA.


> > * Is it possible to reuse the HTML font systems?
>
> What is the HTML font system?
>

Basically stuff defined here:
http://www.whatwg.org/specs/web-apps/current-work/multipage/rendering.html#fonts-and-colors



> On Thu, 12 Aug 2010, Philip Jägenstedt wrote:
> >
> > The core "problem" is that WebSRT is far too compatible with existing
> > SRT usage. Regardless of the file extension and MIME type used, it's
> > quite improbable that anyone will have different parsers for the same
> > format. Once media players have been forced to handle the extra markup
> > in WebSRT (e.g. by ignoring it, as many already do) the two formats will
> > be the same, and using WebSRT markup in .srt files will just work, so
> > that's what people will do. We may avoid being seen as arrogant
> > format-hijackers, but the end result is two extensions and two different
> > MIME types that mean exactly the same thing.
>
> I think we'll look equally arrogant if we ignore years of experience with
> subtitling formats and just make up an entirely new format. It's not like
> the world is short of subtitling formats.
>

No matter how we twist it, WebSRT *is* a new format to the subtitling world.


> On Wed, 18 Aug 2010, Silvia Pfeiffer wrote:
> >
> > It actually burns down to the question: do we want the simple SRT format
> > to survive as its own format and be something that people can rely upon
> > as not having "weird stuff" in it - or do we not. I believe that it's
> > important that it survives.
>
> Does that format still exist? Is it materially different than WebSRT?
>

What do you mean? All existing SRT files adhere to the simple form of SRT.
None of them adhere to the WebSRT specification.



> On Sat, 21 Aug 2010, Silvia Pfeiffer wrote:
> >
> > It's not just about implementation cost - it's also the problem of
> > maintaining another spec that can grow to have eventually all the
> > features that HTML5 has and more. Do you really eventually want to
> > re-spec and re-implement a whole innerHTML parser plus the extra <t>
> > element when we start putting <svg> and <canvas> and all sorts of other
> > more complex HTML features into captions? Just because the <t> element
> > is making trouble now? Is this really the time to re-invent HTML?
>
> No, it's not. We should never let subtitles get that crazy.
>

Hmm, where have I heard that said before ...
http://www.ibiblio.org/pioneers/lee.html
"Berners-Lee was concerned over some of the new directions the Web was
taking. There were decided differences between his original vision and the
visions of Andreesen and the Netscape crowd. The Web was designed to be a
serious medium."
I think it's a myth to believe one has control over the path a technology
will take or the ways in which it will be used.


> On Mon, 23 Aug 2010, Philip Jägenstedt wrote:
> >
> > I don't expect that SVG, <canvas>, images, etc will ever natively be
> > made part of captions. Rather, I would hope that the metadata state
> > together with scripts is used. If we think that e.g. images in captions
> > are an important use case, then WebSRT is not a good solution.
>
> Indeed.
>

Images in captions will be used, I can guarantee that.



> > If we allow arbitrary HTML and expect browsers to handle it well, it
> > adds some complexity. For example, any videos and images in the cue
> > would have to be fully loaded and ready to be decoded by the time the
> > cue is to be shown, which I really don't want to implement the logic
> > for. Simply having an iframe-like container where the document is
> > replaced for each cue wouldn't be enough, rather one would have to
> > create one document per cue during parsing and wait for all of those to
> > finish loading before beginning playback. I'm not sure, but I'm guessing
> > that amounts to significant memory overhead.
>
> Quite.
>

People will do it by putting HTML in the metadata, decoding it through
JavaScript, and throwing it at the HTML fragment parser, with all the side
effects that may have and that they will have to deal with. I'm sure this
will eventually catch up with us. Would it not be better to think about it
now and address it - in particular if you are saying that WebSRT is not the
right solution for this?
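To be concrete, the pattern I expect authors to use looks roughly like this
(the helper name and the cue/overlay wiring are illustrative assumptions,
not anything specified):

```javascript
// Sketch of HTML smuggled through a metadata cue: the cue's text is an
// HTML string that a page script later hands to the browser's fragment
// parser via innerHTML. extractCueHTML is a hypothetical helper.
function extractCueHTML(cue) {
  return cue.text; // e.g. '<img src="map.png" alt="Map"> <b>Look here</b>'
}

// In a page, something like:
//   overlay.innerHTML = extractCueHTML(activeCue);
// which invokes the full fragment parser, kicks off image loads, and
// triggers whatever other side effects the markup implies.
```

None of the loading and timing problems Philip raises go away in this
version; they are merely pushed onto authors.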



> On Tue, 24 Aug 2010, Silvia Pfeiffer wrote:
> >
> > I believe [SVG etc] will be [added to WebSRT]. But since we are only
> > looking at the ways in which captions and subtitles are used currently,
> > we haven't accepted this as an important use case, which is fair enough.
> > I am considering likely future use though, which is always hard to
> > argue.
>
> In all my research for subtitles, I found very few cases of anything like
> this. Even DVDs, whose subtitle tracks are just hardcoded bitmap images,
> don't do anything fancy with them... just plain text and italics,
> generally. Why haven't people started doing fancy stuff with subtitles in
> all the years that we've had TVs? It's not like they can't do it.
>

SVG on the TV? All that was possible was teletext-type graphics, and
indeed people did a lot of graphics there, e.g.
http://www.google.com.au/images?q=teletext .


> My guess is that the real reason is that when you get so fancy that you're
> including graphics and the like, you're no longer doing timed tracks,
> you're just doing content, and the right thing to do is to either burn it
> in, or consider it a separate construct animated on top of the video, e.g.
> an <svg:video> and SMIL.
>

There was no authoring format for such things that any player would
support and display. Even the more complex caption formats were really not
supported in any player. Putting this on the Web is a game changer: it will
be easy to author (plenty of people know how to write HTML and will be able
to throw HTML fragments into WebSRT cues) and easy to display (using some
JavaScript and the framework we're putting in place).



> On Wed, 25 Aug 2010, Philip Jägenstedt wrote:
>
> > The main reason to care about the MIME type is some kind of "doing the
> > right thing" by not letting people get away with misconfigured servers.
> > Sometimes I feel it's just a waste of everyone's time though, it would
> > generally be less work for both browsers and authors to not bother.
>
> Agreed. Not sure what to do for WebSRT though, since there's no good way
> to recognise a WebSRT file as opposed to some other format.
>

You could put the "WebSRT" string at the start as proposed earlier.


> On Thu, 26 Aug 2010, Silvia Pfeiffer wrote:
> >
> > You misunderstand my intent. I am by no means suggesting that no WebSRT
> > content is treated as SRT by any application. All I am asking for is a
> > different file extension and a different mime type and possibly a magic
> > identifier such that *authoring* applications (and authors) can clearly
> > designate this to be a different format, in particular if they include
> > new features.
>
> Wouldn't an authoring application just have two (or more) different "save
> as" or "export" format options? "Save as SRT with no formatting", "Save as
> SRT with <b> only", "Save as WebSRT", or whatnot. Or a list of checkboxes
> for standalone user agents to be compatible with, so that it can pick the
> common subset.
>

Yes, that sounds sensible. So let's make sure they cannot be confused with
or overwrite each other, by giving WebSRT files a .wsrt extension and
keeping the legacy format at the .srt extension.


> Then a *playback application* has the chance to identify them as a
> > different format and provide a specific parser for it, instead of
> > failing like Totem. They can also decide to extend their existing SRT
> > parser to support both WebSRT and SRT. And I also have no issue with a
> > user deciding to give a WebSRT file a go by renaming it to .srt.
>
> I think you think there's more difference between WebSRT and SRT than
> there is. In practice, there is less difference between WebSRT and the
> equivalent SRT file than there is between two random SRT files today. The
> difference between WebSRT and SRT is well within the "error bars" of what
> SRT is today.
>

A WebSRT file with JSON in its cues is more different from anything called
.srt today than any two legacy SRT files are from each other.


> By keeping WebSRT and SRT as different formats we give the applications
> > a choice to support either, or both in the same parser. If we don't, we
> > force them to deal in a single parser with all the oddities of SRT
> > formats as well as all the extra features and all the extensibility of
> > WebSRT.
>
> I don't understand what the difference would be.
>

An authoring application that loads a WebSRT file should support all
features of WebSRT, even the metadata type, and should know what to do with
it. If such a file is clearly marked as .wsrt, the authoring application
has a chance to do the right thing with the file and let you continue
editing your JSON content in a special interface for it. If such a file is
marked as .srt, it will just use the cues as they are, as caption text.
Worse still: if we have thrown any kind of XML into the cues, all the tags
will be stripped.

Cheers,
Silvia.

Received on Thursday, 9 September 2010 06:08:43 UTC