[whatwg] Fwd: Discussing WebSRT and alternatives/improvements from Philip Jägenstedt on 2010-08-11 (public-whatwg-archive@w3.org from August 2010)

From: Philip Jägenstedt <philipj@opera.com>
Date: Wed, 11 Aug 2010 14:30:24 +0200
Message-ID: <op.vg9w0yl6sr6mfa@philip-pc.gothenburg.osa>
On Wed, 11 Aug 2010 01:43:01 +0200, Silvia Pfeiffer  
<silviapfeiffer1 at gmail.com> wrote:

> On Tue, Aug 10, 2010 at 7:49 PM, Philip Jägenstedt
> <philipj at opera.com> wrote:
>
>> On Tue, 10 Aug 2010 01:34:02 +0200, Silvia Pfeiffer <
>> silviapfeiffer1 at gmail.com> wrote:
>>
>>> On Tue, Aug 10, 2010 at 12:04 AM, Philip Jägenstedt
>>> <philipj at opera.com> wrote:
>>>
>>>  On Sat, 07 Aug 2010 09:57:39 +0200, Silvia Pfeiffer <
>>>> silviapfeiffer1 at gmail.com> wrote:
>>>>
>>>> I guess this is in support of Henri's proposal of parsing the cue using
>>>> the HTML fragment parser (same as innerHTML)? That would be easy to
>>>> implement, but how do we then mark up speakers? Using
>>>> <span class="narrator"></span> around each cue is very verbose. HTML
>>>> isn't very good for marking up dialog, which is quite a limitation when
>>>> dealing with subtitles...
>>>>
>>>
>>> I actually think that the <span @class> mechanism is much more flexible
>>> than what we have in WebSRT right now. If we want multiple speakers to be
>>> able to speak in the same subtitle, then that's not possible in WebSRT.
>>> It's a little more verbose in HTML, but not massively.
>>>
>>> We might be able to add a special markup similar to the <[timestamp]>
>>> markup that Hixie introduced for Karaoke. This is beyond the innerHTML
>>> parser and I am not sure if it breaks it. But if it doesn't, then maybe
>>> we can also introduce a <[voice]> marker to be used similarly?
>>>
>>
>> An HTML parser parsing <1> or <00:01:30> will produce text nodes "<1>"
>> and "<00:01:30>". Without having read the HTML parsing algorithm I guess
>> that elements need to begin with a letter or similar. So, it's not
>> possible to (ab)use the HTML parser to handle inner timestamps or
>> numerical voices; we'd have to replace those with something else,
>> probably more verbose.
>
>
>
> I have checked the parse spec and
> http://www.whatwg.org/specs/web-apps/current-work/#tag-open-state indeed
> implies that a tag starting with a number is a parse error. Both the
> timestamps and the voice markers thus seem to be problems when going with
> an innerHTML parser. Is there a way to resolve this? I mean: I'd quite
> happily drop the voice markers for a <span @class> but I am not sure what
> to do about the timestamps. We could do what I did in WMML and introduce a
> <t> element with the timestamp as an @at attribute, but that is again more
> verbose. We could also introduce an @at attribute in <span> which would
> then at least end up in the DOM and can be dealt with specially.

What should numerical voices be replaced with? Personally I'd much rather  
write <philip> and <silvia> to mark up a conversation between us two, as I  
think it'd be quite hard to keep track of the numbers if editing subtitles  
with many different speakers. However, going with that and using an HTML  
parser is quite a hack. Names like <mark> and <li> may already have  
special parsing rules or default CSS.

Going with HTML in the cues, we either have to drop voices and inner
timestamps or invent new markup, as HTML can't express either. I don't
think either of those is a really good solution, so right now I'm not
convinced that reusing the innerHTML parser is a good way forward.
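
To make the problem with <1> and <00:01:30> concrete, here is a rough
sketch in Python of the decision an HTML tokenizer makes right after a "<".
It is a simplification of the "tag open state" linked above, not the real
algorithm, and the function name classify_tag_open is made up for this
example.

```python
import string


def classify_tag_open(text: str) -> str:
    """Roughly classify what follows a '<', loosely modelled on the HTML
    tokenizer's "tag open state". Simplified and hypothetical."""
    if len(text) < 2 or text[0] != "<":
        return "text"
    nxt = text[1]
    if nxt in string.ascii_letters:
        # Only an ASCII letter can start a tag name.
        return "start-tag"
    if nxt == "/":
        return "end-tag"
    if nxt in "!?":
        return "comment-or-declaration"
    # Anything else (digits included) is a parse error and the '<' is
    # treated as ordinary character data.
    return "text"


for cue_markup in ["<1>", "<00:01:30>", "<philip>", "<ruby>"]:
    print(cue_markup, "->", classify_tag_open(cue_markup))
```

Numerical voices and inner timestamps fall into the "text" branch, while
names like <philip> become elements, which is where the clash with existing
element names such as <mark> and <li> comes from.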

>>> Think for example about the case where we had a requirement that a double
>>> newline starts a new cue, but now we want to introduce a means where the
>>> double newline is escaped and can be made part of a cue.
>>>
>>> Other formats keep track of their version, such as MS Word files. It is to
>>> be hoped that most new features can be introduced without breaking
>>> backwards compatibility and we can write the parsing requirements such
>>> that certain things will be ignored, but in and of itself, WebSRT doesn't
>>> provide for this extensibility. Right now, there is for example
>>> extensibility with the "WebSRT settings parsing" (that's the stuff behind
>>> the timestamps) where further "setting:value" settings can be introduced.
>>> But for example the introduction of new "cue identifiers" (that's the <>
>>> marker at the start of a cue) would be difficult without a version string,
>>> since anything that doesn't match the given list will just be parsed as a
>>> cue-internal tag and thus end up as part of the cue text where plain text
>>> parsing is used.
>>>
>>
>> The bug I filed suggested allowing arbitrary voices, to simplify the
>> parser and to make future extensions possible. For a web format I think
>> this is a better approach than versioning. I haven't done a full review of
>> the parser, but there are probably more places where it could be more
>> forgiving so as to allow future tweaking.
>
>
>
> That's a good approach and will reduce the need for breaking
> backwards-compatibility. In an xml-based format that need is 0, while with
> a text format where the structure is ad-hoc, that need can never be reduced
> to 0. That's what I am concerned about and that's why I think we need a
> version identifier. If we end up never using/changing the version
> identifier, so much the better. But I'd much rather we have it now and can
> identify what specification a file adheres to than not being able to do so
> later.

Perhaps I'm too influenced by HTML and its failed attempts at versioning,  
but I think that if you want to know which version of a spec a document is  
written against, you can run it through a parser for each version. This  
doesn't tell you the author intent, but I'm not sure that's very  
interesting to know. If the author thinks it's important, perhaps it can  
be put in a comment in the header.

>>>> On the other hand, keeping the same extension and (unregistered) MIME
>>>> type as SRT has plenty of benefits, such as immediately being able to use
>>>> existing SRT files in browsers without changing their file extension or
>>>> MIME type.
>>>>
>>>
>>>
>>> There is no harm in browsers accepting both MIME types if they are sure
>>> they can parse old srt as well as new websrt. But these two formats are
>>> different enough that they should be given a different extension and mime
>>> type. I do not see a single advantage in stealing the MIME type of an
>>> existing format for a new specification.
>>>
>>
>> But there's no spec for the old SRT, the only thing one could do is parse
>> it with a WebSRT parser.
>
>
> I can write that spec in an afternoon and register the mime type with IANA.
> That really isn't a problem. People have managed to write correct SRT files
> without having a spec, because it's so trivial. Creating a spec is just a
> formality. For now, the Wikipedia page really is sufficient.

Having a separate spec isn't really useful unless we expect people to  
implement it. Perhaps some new implementations would follow the spec, but  
browsers sure wouldn't implement two different parsers.

>> That would make text/srt and text/websrt synonymous, which is kind of
>> pointless.
>
>
> No, it's only pointless if you are a browser vendor. For everyone else it
> is a huge advantage to be able to choose between a guaranteed simple format
> and a complex format with all the bells and whistles.
>
>
>
>> The advantage of taking text/srt is that all existing software to create
>> SRT can be used to create WebSRT
>
>
> That's not strictly true. If they load a WebSRT file that was created by
> some other software for further editing and that WebSRT file uses advanced
> WebSRT functionality, the authoring software will break.

Right, especially settings appended after the timestamps are quite likely  
to be stripped when saving the file.

>> and servers that already send text/srt don't need to be updated. In either
>> case I think we should support only one mime type.
>
>
> What's the harm in supporting two mime types but using the same parser to
> parse them?

Most content will most likely be plain old SRT without voices, <ruby> or  
similar. People will create them using existing software with the .srt  
extension and serve them using the text/srt MIME type. When they later  
decide to add some <ruby> or similar, it will just work without changing  
the extension or MIME type. The net result is that text/srt and  
text/websrt mean exactly the same thing, making it a wasted effort.

>>>>>>> * there is no definition of the "canvas" dimensions that the cues are
>>>>>>> prepared for (width/height) and expected to work with other than
>>>>>>> saying it is the video dimensions - but these can change and the
>>>>>>> proportions should be changed with that
>>>>>>>
>>>>>> I'm not sure what you're saying here. Should the subtitle file be
>>>>>> hard-coded to a particular size? In the quite peculiar case where the
>>>>>> same subtitles really don't work at two different resolutions, couldn't
>>>>>> we just have two files? In what cases would this be needed?
>>>>>>
>>>>>>
>>>>>
>>>>> Most subtitles will be created with a specific width and height in mind.
>>>>> For example, the width in characters relies on the video canvas having
>>>>> at least that size and the number of lines used usually refers to a
>>>>> lower third of a video - where that is too small, it might cover the
>>>>> whole video. So, my proposal is not to hard-code the subtitles to a
>>>>> particular size, but to put the minimum width and height that are being
>>>>> used for the creation of the subtitles into the file. Then, the file can
>>>>> be scaled below or above this size to adjust to the actual available
>>>>> space.
>>>>>
>>>>>
>>>> In practice, does this mean scaling font-size by
>>>> width_actual/width_intended or similar? Personally, I prefer subtitles to
>>>> be something like 20 screen pixels regardless of video size, as that is
>>>> readable. Making them bigger hides more of the video, while making them
>>>> smaller makes them hard to read. But I guess we could let the CSS media
>>>> query min-width and similar be evaluated against the size of the
>>>> containing video element, to make it possible anyway.
>>>>
>>>
>>>
>>>
>>> Have you ever tried to keep the small font size of subtitles on a 320x240
>>> video when going full-screen? They are almost unusable at that size.
>>> YouTube doesn't do a good job at that, incidentally, so you can go check it
>>> out there - go full-screen and see how tiny the captions become then step
>>> back from your screen to where you'd want to watch the video from and
>>> notice how the captions are basically unreadable.
>>>
>>> When you scale the font-size with the video, you do not hide more of the
>>> video - you hide the exact same part of the video. Video and font get
>>> larger in the same way. And that's exactly the need that we have.
>>>
>>
>> Existing media players have basically two different ways of handling this.
>> The kind you're describing is like MPlayer, where subtitles appear to
>> actually be rendered on to the video frames and then scaled together with
>> the video. The kind I've used more is like Totem, where subtitles are
>> rendered in a separate layer at a fixed size in pixels, regardless of
>> whether or not you're watching in fullscreen. This means that word wrapping
>> will be different depending on screen size.
>
>
> In the Totem case, does the font size increase with a change in screen size?

Oops, on closer inspection I am completely wrong, the text is actually
rendered and scaled with the video, just a bit prettier than MPlayer does
it. Maybe the prettiness led me to believe it was somehow different. Sigh.

> My suggestion is to have them in different layers, but there is knowledge
> about the intended anchoring, i.e. where is the text supposed to appear on
> the video screen. Then keep that anchoring intact no matter what the video
> size.
>
>
>
>> I find both MPlayer's and Totem's behavior annoying in some situations, but
>> personally prefer Totem most of the time.
>>
>
> Do you find MPlayer's behavior annoying because by rescaling already
> rendered text, the text loses resolution and becomes less readable? This is
> definitely not the behaviour I am after.

Scaling with the video is annoying with small videos, as the text ends up  
being huge in fullscreen. I assume we're going to do scaling as well as we  
can, so that's not an argument in either direction.
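
For reference, the width_actual/width_intended scaling mentioned earlier in
the thread could be sketched like this in Python. The function name and the
1920-pixel fullscreen width are assumptions made for this example only;
nothing like this is specified anywhere.

```python
def scaled_font_size(base_px: float, width_intended: float,
                     width_actual: float) -> float:
    """Scale a font size linearly with the video width (hypothetical helper).

    base_px is the size the subtitles were authored for at width_intended.
    """
    if width_intended <= 0:
        raise ValueError("width_intended must be positive")
    return base_px * width_actual / width_intended


# 20px text authored for a 320-pixel-wide video, shown at an assumed
# fullscreen width of 1920 pixels.
print(scaled_font_size(20, 320, 1920))
```

With those numbers the text comes out six times larger than authored, which
is the "huge in fullscreen" effect for small videos.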

I'll have to withdraw any opinion for now, I don't know how to best deal  
with this.

-- 
Philip Jägenstedt
Core Developer
Opera Software
Received on Wednesday, 11 August 2010 05:30:24 UTC