W3C home > Mailing lists > Public > whatwg@whatwg.org > August 2010

[whatwg] Discussing WebSRT and alternatives/improvements

From: Philip Jägenstedt <philipj@opera.com>
Date: Fri, 06 Aug 2010 17:50:07 +0200
Message-ID: <op.vg0wxtbmsr6mfa@philip-pc.gothenburg.osa>
Snipping liberally...

On Thu, 05 Aug 2010 17:01:47 +0200, Silvia Pfeiffer  
<silviapfeiffer1 at gmail.com> wrote:

> CONSIDERING EXISTING FORMATS

> Note that the subtitling community has traditionally been using the  
> Subrip
> (srt) or SubViewer (sub) formats as a simple format and SubStation alpha
> (ssa/ass) as the comprehensive format. Aegisub, the successor of  
> SubStation
> Alpha, is still the most popular subtitling software and ASS is the
> currently dominant format. However, even this community is right now
> developing a new format called AS6. This shows that the subtitling  
> community
> also hasn't really converged on a "best" format yet.

Maybe it's just me, but I've very rarely seen ASS in the wild. In fact,  
almost all fansubbed videos I've seen were hardsubbed (subtitles encoded  
as part of the video), presumably because the subbers thought their choice  
of font and colors was really important. Perhaps they were still using ASS  
or similar to author the subtitles, I don't know. (In my own limited  
fansubbing efforts, SRT was quite sufficient.)

> A BRIEF OVERVIEW OF WMML

> I developed WMML as a xml-based caption format that will not have the
> problems that have been pointed out for DFXP/TTML, namely: there are no
> namespaces, it doesn't use XSL-FO but instead fully reuses CSS, and it
> supports innerHTML markup in the cues instead of inventing its own  
> markup.
> Check out the examples at
> https://wiki.mozilla.org/Accessibility/Video_Text_Format .
>
> WMML's root element contains some attributes that are important for
> specifying how to use the resource:
> * a @lang attribute which specifies what language the resource is written
> for
> * a @kind attribute which specifies the intended use of the resource,  
> e.g.
> caption, subtitle, chapter, description
> * a @profile attribute which specifies the format used in the cues and  
> thus
> the parser that should be chosen, including "plainText", "minimalMarkup",
> "innerHTML", "JSON", "any" (other formats can be developed)

If @profile is to have any influence on the parser, it sounds like this  
isn't actually XML at all. In particular, the "HTML" would have to be  
well-formed XML, but would still end up in the null namespace. I guess  
simply cloning the child nodes of <cue> and changing their namespace to  
the HTML namespace before inserting them into an iframe-like document  
might work, but it would be quite odd, I think you'll agree.

An XML parser can't do anything special with JSON either, so I guess  
there's a post-processing step if an XML parser is going to be used?

> WMML completely reuses the HTML <head> element. This has the following
> advantages:
> * there is a means to associate metadata in the form of name-value pairs
> with the time-synchronized text resource. There is a particular need to  
> be
> able to manage at least the following metadata for time-synchronized text
> resources:
>   ** the association with the media resource and its metadata, such as
> title, actors, url, duration
>   ** the author, authoring date, copyright, ownership, license and usage
> rights for the time-synchronized text resource
> * there is a means to include in-line styles and a means to include a  
> link
> to an external style sheet
> * there is a possibility to provide script that just affects the
> time-synchronized text resource

I agree that some metadata would be useful, more on that below. I'm not  
sure why we would want to run scripts inside the text document, though,  
when that can be accomplished by using the TimedTrack API from the  
containing page.

> The <cue> elements have a start and end time attribute and contain
> innerHTML, thus there is already parsing code available in Web browsers  
> to
> deal with this content. Any Web content can be introduced into a <cue>  
> and
> the Web browsers will already be able to render it.

Yes, but if the HTML parser can't be used for all of WMML, it makes the  
parser quite odd, being neither XML nor HTML. I think that realistically  
the best way to make an XML-like format is to simply use XML.

> COMPARING WebSRT and WMML
>
> Examples that I experimented with are at
> https://wiki.mozilla.org/Accessibility/Video_Text_Format_Comparison .
>
> There are a few things I like about WebSRT.
>
> 1. First and foremost I believe that the ability to put different types  
> of
> content into a cue is really powerful.
> It turns WebSRT into a platform for delivering time-synchronized text  
> rather
> than just markup for a particular application of time-synchronized text.  
> It
> makes it future-proof to allow absolutely anything in cues.
>
> 2. There is a natural mapping of WebSRT into in-band text tracks.
> Each cue naturally maps into a encoding page (just like a WMML cue does,
> too). But in WebSRT, because the setup information is not brought in a
> hierarchical element surrounding all cues, it is easier to just chuck
> anything that comes before the first cue into an encoding header page.  
> For
> WMML, this problem can be solved, but it is less natural.

I really like the idea of letting everything before the first timestamp in  
WebSRT be interpreted as the header. I'd want to use it like this:

# author: Fan Subber
# voices: <1> Boy
#         <2> Girl

01:23:45.678 --> 01:23:46.789
<1> Hello

01:23:48.910 --> 01:23:49.101
<2> Hello

It's not critical that the format of the header be machine-readable, but  
we could of course make up a key-value syntax, use JSON, or something else.
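
As an illustration only, here is a minimal sketch (Python) of how a  
consumer might recover key-value pairs from such a comment header. The  
"# key: value" syntax is my own assumption from the example above, not  
part of any draft:

```python
import re

# Everything before the first timing line is treated as the header;
# '#'-prefixed lines containing a colon are read as key-value pairs.
# (Hypothetical syntax, sketched from the example in this mail.)
TIMESTAMP_LINE = re.compile(r"\d{2}:\d{2}(:\d{2})?\.\d{3}\s*-->")

def parse_header(srt_text):
    metadata = {}
    for line in srt_text.splitlines():
        if TIMESTAMP_LINE.match(line):
            break  # header ends at the first cue timing line
        if line.startswith("#") and ":" in line:
            key, _, value = line[1:].partition(":")
            metadata[key.strip()] = value.strip()
    return metadata
```

A continuation line like "#         <2> Girl" would simply be skipped  
here; a real key-value grammar would have to decide how to fold it in.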

> 3. I am not too sure, but the "voice" markup may be useful.
> At this point I do wonder whether it has any further use than a @class
> attribute has in normal markup, but the idea of providing some semantic
> information about the content in cues is interesting. Right now it's only
> used to influence styling but it could have a semantic use, too - uses  
> that
> microformats or RDFa are also targeting.

I think the main benefit is that the voices are all pre-defined, while  
classes can be anything. This limitation makes it possible to write CSS  
that applies consistently to all WebSRT files, whereas otherwise one has  
no idea what classes to target. By the same reasoning it's easier for the  
UA to override those styles.

However, all in all, I'm not particularly fond of the current voice  
markup, mainly for 2 reasons:

First, a cue can only have 1 voice, which makes it impossible to style  
cues spoken/sung simultaneously by 2 or more voices. There's a karaoke  
example of this in  
<http://wiki.whatwg.org/wiki/Use_cases_for_timed_tracks_rendered_over_video_by_the_UA#Multiple_voices>

I would prefer if voices could be mixed, as such:

00:01.000 --> 00:02.000
<1> Speaker 1

00:03.000 --> 00:04.000
<2> Speaker 2

00:05.000 --> 00:06.000
<1><2> Speaker 1+2

Second, it makes it impossible to target a smaller part of the cue for  
styling. We have <i> and <b>, but there are also cases where part of the  
cue should be in a different color, see  
<http://wiki.whatwg.org/wiki/Use_cases_for_timed_tracks_rendered_over_video_by_the_UA#Multiple_colors>

If one allows multiple voices, it's not hard to predict that people will  
start using magic numbers just to work around this, which would both be  
wrong semantically and ugly to look at:

00:01.000 --> 00:02.000
<1> I like <1234>blue</1234> words.

They'd then target 1234 with CSS to color it blue.

I'm not sure of the best solution. I'd quite like the ability to use  
arbitrary voices, e.g. to use the names/initials of the speaker rather  
than a number, or to use e.g. <shouting> in combination with CSS :before  
{ content: 'Shouting: ' } or similar to adapt the display for different  
audiences (accessibility, basically).
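
The mixed/arbitrary-voice idea above could be sketched like this  
(Python; the "collect every leading <...> tag" behaviour is my proposal,  
not current WebSRT):

```python
import re

# Collect every leading <...> tag on a cue line as a voice name, so
# "<1><2> Hi" yields two voices and "<shouting> Hey" a named one.
# (Sketch of a proposed behaviour, not the specified parser.)
LEADING_VOICE = re.compile(r"^<([^<>]+)>\s*")

def split_voices(cue_line):
    voices = []
    rest = cue_line
    while True:
        m = LEADING_VOICE.match(rest)
        if not m:
            break
        voices.append(m.group(1))
        rest = rest[m.end():]
    return voices, rest
```

Each voice name could then be exposed to CSS for styling, much like a  
class list.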

> 4. It's a light-weight format in that it is not very verbose.
> It is nice for hand-authoring if you don't have to write so much. This is
> particularly true for the simple case. E.g. if new-lines that you author  
> are
> automatically kept as newlines when interpreted. The drawbacks here are  
> that
> as soon as you include more complicated markup into the cues (e.g. HTML
> markup or a SVG image), you're not allowed to put empty lines into it
> because they have a special meaning. So, while it is true that the  
> number of
> characters for WebSRT will always be less than for any markup-based  
> format,
> this may be really annoying in any of the cases that need more than plain
> text.

It would be easy to just let the parser consume all lines until the next  
timestamp, but do you really want to separate two lines with a blank line?  
If the two lines aren't really related, one could instead have two cues  
with different vertical positioning.
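
To make the trade-off concrete, a minimal sketch (Python) of the  
"consume all lines until the next timestamp" behaviour, under which  
blank lines no longer terminate a cue (my assumption of how it could  
work, not the parser as specified):

```python
import re

# A cue starts at a timing line and runs until the next timing line;
# blank lines inside the cue body are kept, trailing blanks dropped.
TIMING = re.compile(
    r"(\d{2}:)?\d{2}:\d{2}\.\d{3}\s*-->\s*(\d{2}:)?\d{2}:\d{2}\.\d{3}")

def parse_cues(text):
    cues = []
    current = None
    for line in text.splitlines():
        if TIMING.match(line):
            if current:
                cues.append(current)
            current = {"timing": line.strip(), "lines": []}
        elif current is not None:
            current["lines"].append(line)
    if current:
        cues.append(current)
    for cue in cues:  # strip trailing blank lines from each cue body
        while cue["lines"] and not cue["lines"][-1].strip():
            cue["lines"].pop()
    return cues
```

The cost is exactly the one mentioned above: a blank line can no longer  
separate two unrelated cues without a new timing line between them.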

> Now, I've tried to include point 1 into WMML, but because WMML is  
> xml-based,
> the ability to include any kind of markup into cues is not so elegant. It
> is, however, controlled by the @profile attribute on the <wmml> element,  
> so
> applications should be able to deal with it.

As outlined above, unless one can actually use an XML parser to parse  
WMML, its similarity with XML isn't very useful.

> Point 2 is possible in WMML through "encoding" all outer markup in a  
> header
> and the cues in the data packets.

To be clear, this would be a new codec type for the container, since I'm  
not aware of any that allow stating that the cue text is HTML. The same is  
true of WebSRT, muxing it into e.g. WebM would require the ability to  
express the kind from <track kind="captions"> (although in practice such  
metadata in binary files ends up almost always being incorrect).

> Point 3 is also possible in WMML through the use of the @class attribute  
> on
> cues.

I'd want this or something like it in WebSRT.

> Now to the things that WMML provides where WebSRT needs to improve.
>
>
> 1. Extensibility with header data.
>
> In contrast to being flexible about what goes into the cues, WebSRT is
> completely restrictive and non-extensible in all the content that is  
> outside
> the cues. In fact, no content other than comments are allowed outside the
> cues. This creates the following problems:
>
> * there is no possibility to add file-wide metadata to WebSRT; things  
> about
> authoring and usage rights as well as information about the media  
> resource
> that the file relates to should be kept within the file. Almost all  
> subtitle
> and caption format have the possibility for such metadata and we know  
> from
> image, music and video resources how important it is to have the ability  
> to
> keep such metadata inside the resource.
>
> * there is no language specification for a WebSRT resource; while this  
> will
> not be a problem when used in conjunction with a <track> element, it  
> still
> is a problem when the resource is used just by itself, in particular as a
> hint for font selection and speech synthesis.

The language inside the WebSRT file wouldn't end up being used for  
anything by a browser, as it needs to know the language before downloading  
it to know whether or not to download it at all. Still, I'd like a header  
section in WebSRT. I think the parser is already defined so that it would  
ignore garbage before the first cue, so this is more a matter of making it  
legal syntax.

> * there is no magic identifier for a WebSRT resource, i.e. what the  
> <wmml>
> element is for WMML. This makes it almost impossible to create a program  
> to
> tell what file type this is, in particular since we have made the line
> numbers optional. We could use "-->" as an indicator, but it's not a good
> signature.

If it's more important than ease of authoring, we could require WebSRT  
files to begin with a magic string and require browsers to reject them  
otherwise. I don't support this, though; there's not much benefit.

> * there is no means to identify which parser is required in the cues (is  
> it
> "plain text", "minimal markup", or "anything"?) and therefore it is not
> possible for an application to know how it should parse the cues.

All the types that are actually for visual rendering are parsed in the  
same way, aren't they? Of course there's no way for non-browsers to know  
that metadata tracks aren't interesting to look at as subtitles, but I  
think showing the user the garbage is a quicker way to communicate that  
the file isn't for direct viewing than hiding the text or similar.

> * there is no version number on the format, thus it will be difficult to
> introduce future changes.

I think we shouldn't have a version number, for the same reason that CSS  
and HTML don't really have versions. If we evolve the WebSRT spec, it  
should be in a backwards-compatible way.

> 2. Break the SRT link.

> * the mime type of WebSRT resources should be a different mime type to  
> SRT
> files, since they are so fundamentally different; e.g. text/websrt
>
> * the file extension of WebSRT resources should be different from SRT  
> files,
> e.g. wsrt

I'm not sure if either of these would make a difference.

> 3. Introduce a innerHTML type for cues
>
> Right now, there is "plain text", "minimum markup" and "anything"  
> allowed in
> the cues. Seeing as WebSRT is built with the particular purpose of  
> bringing
> time-synchronized text for HTML5 media elements, it makes no sense to
> exclude all the capabilities of HTML. Also, with all the typical parsers  
> and
> renderers available in UAs, support of innerHTML in cues should be  
> simple to
> implement. The argument that offline applications don't support it is not
> relevant since we have no influence on whether standalone media  
> applications
> will actually follow the HTML5 format choice. That WebSRT with "plain  
> text"
> and "minimal markup" can be supported easily in standalone media
> applications is a positive side effect, but not an aim in itself for  
> HTML5
> and it should have no influence on our choices.

Right, using the fragment parser wouldn't be too difficult. If we do that,  
I think it should completely replace the current thinking with voices. At  
this point I don't have a strong opinion on which is better, though.

> 4. Make full use of CSS
>
> In the current form, WebSRT only makes limited use of existing CSS. I see
> particularly the following limitations:
>
> * no use of the positioning functionality is made and instead a new  
> means of
> positioning is introduced; it would be nicer to just have this reuse CSS
> functionality. It would also avoid having to repeat the positioning
> information on every single cue.

I agree, the positioning syntax isn't something I'm happy about in  
WebSRT. I think treating everything that follows the timestamps as CSS  
that applies to the whole cue would be better.
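
For illustration, a sketch (Python) of that idea: whatever follows the  
timestamps on the timing line is read as CSS declarations for the cue.  
The declaration syntax is plain CSS; none of this is in the WebSRT  
draft:

```python
import re

# Parse "start --> end prop: value; prop: value" into timestamps plus a
# dict of CSS declarations applying to the whole cue. (Sketch of the
# idea proposed in this mail, not specified behaviour.)
TIMING = re.compile(
    r"(?P<start>(?:\d{2}:)?\d{2}:\d{2}\.\d{3})\s*-->\s*"
    r"(?P<end>(?:\d{2}:)?\d{2}:\d{2}\.\d{3})\s*(?P<style>.*)")

def parse_timing_line(line):
    m = TIMING.match(line)
    if not m:
        return None
    style = {}
    for decl in m.group("style").split(";"):
        if ":" in decl:
            prop, _, value = decl.partition(":")
            style[prop.strip()] = value.strip()
    return m.group("start"), m.group("end"), style
```

This would also sidestep inventing new positioning keywords, since  
existing CSS positioning properties could be reused directly.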

> * little use of formatting functionality is made by restricting it to  
> only
> use 'color', 'text-shadow', 'text-outline', 'background', 'outline' and
> 'font'

I agree, there should be no white-listing of CSS properties. Should we be  
the first to implement this, I would advocate ignoring the spec on this  
point.

> * cue-related metadata ("voice") could be made more generic; why not  
> reuse
> "class"?

I agree and would like something more open-ended, perhaps by allowing the  
voice to be any string as per above.

> * there is no definition of the "canvas" dimensions that the cues are
> prepared for (width/height) and expected to work with other than saying  
> it
> is the video dimensions - but these can change and the proportions  
> should be
> changed with that

I'm not sure what you're saying here. Should the subtitle file be  
hard-coded to a particular size? In the quite peculiar case where the same  
subtitles really don't work at two different resolutions, couldn't we just  
have two files? In what cases would this be needed?

> * it is not possible to associate CSS styles with segments of text, but  
> only
> with a whole cue using ::cue-part; it's thus not possible to just  
> highlight
> a single word in a cue
> * when HTML markup is used in cues, as the specification stands, that  
> markup
> is not parsed and therefore cannot be associated with CSS; again, this  
> can
> be fixed by making innerHTML in cues valid

I agree with both of the above.

> IN SUMMARY
>
> Having proposed a xml-based format, it would be good to understand  
> reasons
> for why it is not a good idea and why a plain text format that has no
> structure other than that provided through newlines and start/end time
> should be better and more extensible.
>
> Also, if we really are to go with WebSRT, I am looking for a discussion  
> on
> those suggested improvements.

Thanks, lots of good suggestions and feedback. To sum it up, I wouldn't be  
opposed to an XML format as such, but it seems that WMML isn't quite XML.  
WebSRT also has its problems, of course...

-- 
Philip Jägenstedt
Core Developer
Opera Software
Received on Friday, 6 August 2010 08:50:07 UTC