[whatwg] WebSRT feedback from philipj@opera.com on 2010-10-08 (public-whatwg-archive@w3.org from October 2010)

From: <philipj@opera.com>
Date: Fri, 08 Oct 2010 13:54:53 -0700
Message-ID: <201010131634.o9DGYuhp011724@smtp.opera.com>
On Fri, 08 Oct 2010 04:39:43 -0700, Silvia Pfeiffer
<silviapfeiffer1 at gmail.com> wrote:

>
> On 08/10/2010, at 1:28 PM, "Philip Jägenstedt" <philipj at opera.com> wrote:
>
>> On Thu, 07 Oct 2010 13:18:37 -0700, Silvia Pfeiffer  
>> <silviapfeiffer1 at gmail.com> wrote:
>>
>>> On Thu, Oct 7, 2010 at 4:06 PM, Philip Jägenstedt <philipj at opera.com>
>>> wrote:
>>>
>>>> On Thu, 07 Oct 2010 01:57:17 -0700, James Graham <jgraham at opera.com>
>>>> wrote:
>>>>
>>>> On 10/06/2010 04:04 AM, Philip Jägenstedt wrote:
>>>>>
>>>>> As an aside, the idea of using an HTML parser for the cue text wasn't
>>>>>> very popular.
>>>>>>
>>>>>
>>>>> Why? Were any technical reasons given?
>>>>>
>>>>
>>>> The question was directed at the media player/framework developers
>>>> present. One of them didn't care and one was strongly opposed on the
>>>> basis of bloat. This was an aside; if anyone is serious about using the
>>>> HTML fragment parser for WebSRT, we really should approach the developer
>>>> mailing lists of media players/frameworks. I doubt we will find much
>>>> love, but would be happy to be shown wrong.
>>>
>>>
>>>
>>> The one I talked to said that HTML markup should totally be used in cues
>>> (he even mentioned more generally why we didn't pick up USF). The reason
>>> is that it clearly defines extensibility and would in fact already cover
>>> any use case that anyone can come up with, thus stopping people from
>>> inventing their own screwed-up extensions, such as the use of ASS commands
>>> in {} inside SRT subtitles.
>>>
>>> The thing is: while the full set of features of HTML fragments seems like
>>> bloat, not every subtitle will consist of all the possible markup. Just
>>> like Web pages are often created with very simple markup which uses less
>>> than 1% of what HTML is capable of, we will see the same happening with
>>> subtitle cues. But the availability and clear definition of how such
>>> features should be used prevents the introduction of crappy extensions.
>>
>> Even if very few subtitles use inline SVG, SVG in <object>, <img>,  
>> <iframe>, <video>, self-referencing <track>, etc in the cue text, all  
>> implementations would have to support it in the same way for it to be  
>> interoperable. That's quite an undertaking and I don't think it's  
>> really worth it.
>>
>
> They all need to be interoperable on all of these features already. It  
> should be easier to keep them interoperable on something known and  
> already implemented than on a set of new features, in particular when  
> the new feature set is restricted and features beyond the limited given  
> set are not available such that custom "markup" will be produced by  
> plugins etc.
>
>
>> As for extensibility, I suggest that we generalize the WebSRT parser  
>> somewhat to produce a normal DOM with elements in a non-HTML namespace  
>> and then use CSS to style them as usual. Unknown element names  
>> shouldn't be valid, of course, but they'd still appear in the DOM. If  
>> "XML5" (http://annevankesteren.nl/2007/10/xml5) was ready, I'd suggest  
>> we use that, with the constraint that it should only be able to output  
>> elements in that non-HTML namespace. (Just thinking out loud here.)
>
> I think that's ok, even though I think it makes more sense to have HTML  
> fragments than arbitrary markup that is related but somewhat different.  
> I think we are then just re-inventing HTML.

On Fri, 08 Oct 2010 05:20:28 -0700, Robert O'Callahan
<robert at ocallahan.org> wrote:

> User agents only need to be interoperable over the common subset of HTML
> features they support. HTML is mostly designed to degrade gracefully when a
> user agent encounters elements it doesn't support. The simplest possible
> video player would use an HTML parser (hopefully off-the-shelf) to build
> some kind of DOM structure. Then it can group text into paragraphs for
> rendering, and ignore the rest of the content.
>
> In practice, we'll have to deal with user agents that support different sets
> of WebSRT features --- when version 2 of WebSRT is developed, if not before.
> Why not use existing, proven machinery --- HTML --- to cope with that
> situation?
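
To make the "simplest possible video player" in the quoted message a bit more
concrete, here is a rough Python sketch. It assumes the standard library's
html.parser module as the off-the-shelf parser, which is a lenient tokenizer
and not the HTML fragment parsing algorithm. The BLOCK_TAGS set and the names
CueTextExtractor and cue_paragraphs are made up for illustration. Unknown or
unsupported tags are dropped and their text is kept.

```python
from html.parser import HTMLParser

# Hypothetical choice of elements that separate paragraphs; not from any spec.
BLOCK_TAGS = {"p", "div", "br"}


class CueTextExtractor(HTMLParser):
    """Groups cue text into paragraphs and ignores every other tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._current = []

    def flush(self):
        # Collapse runs of whitespace and store the paragraph if non-empty.
        text = " ".join("".join(self._current).split())
        if text:
            self.paragraphs.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        # Only paragraph-level tags matter; <img>, <video>, etc. are skipped.
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._current.append(data)


def cue_paragraphs(cue_html):
    """Return the text paragraphs found in one cue's markup."""
    parser = CueTextExtractor()
    parser.feed(cue_html)
    parser.close()   # forces any buffered trailing text through handle_data
    parser.flush()   # emit the last paragraph, if any
    return parser.paragraphs
```

A player built this way never loads external resources, so it sidesteps
rather than answers the interoperability questions about richer markup.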

I'm making a few assumptions here:

* The cue text format of WebSRT will also be used in WebM when we add
support for in-band captions.

* We want non-browser players and tools to support WebSRT.

It's also worth noting that HTML is lacking semantics for the two most
important aspects of subtitles/captions -- timing and speakers.

If so, non-browsers get only bloat and no benefit from using an HTML parser.

I do think that a syntax that looks similar to HTML and XML should have
similar parsing, which WebSRT currently doesn't. However,

Main points:

* non-browsers won't want to implement it

* how do browsers implement it?


Even in browsers, it seems to me that using HTML as the cue text format
creates lots of complications. I'd like to understand in more detail what
exactly is being suggested to be done with the HTML fragments returned by the
parser. The following questions mostly revolve around which document the
fragment will be made part of.

* What are relative links relative to? Is it the containing document or  
the WebSRT resource? When following links, which window is navigated?

* When are external resources like <img>, <object> and <video> loaded?
What should the Referer header be for those requests?

* If a WebSRT cue includes <video autoplay>, when should the video play?

* If a WebSRT cue starting at time 0 includes a self-referring
<video><track> that will be enabled by default, what should happen?
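
To make the first question above concrete, the following Python sketch
resolves the same relative link against both candidate base URLs. The two
URLs and the function name resolve_candidates are hypothetical and chosen only
so that the bases differ. urljoin from the standard library does the
resolution.

```python
from urllib.parse import urljoin

# Hypothetical URLs: the page embedding the video and the subtitle file.
PAGE_URL = "http://example.com/watch/page.html"
TRACK_URL = "http://cdn.example.net/subs/en/movie.srt"


def resolve_candidates(href):
    """Resolve href against each base a cue link could plausibly use."""
    return {
        "containing document": urljoin(PAGE_URL, href),
        "WebSRT resource": urljoin(TRACK_URL, href),
    }


for base, url in resolve_candidates("notes.html").items():
    print(base, "->", url)
```

The same relative link ends up on different hosts depending on the answer, so
the choice would be visible to authors as soon as subtitles are served from a
different origin than the page.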


On Fri, 08 Oct 2010 06:00:25 -0700, Jeroen Wijering
<jeroen at longtailvideo.com> wrote:

> The requests we receive on the captioning functionality of the JW Player  
> always revolve around styling. Font size, color, style, weight, outline  
> and family. Block x, y, width, height, text-align, vertical-align,  
> padding, margin, background and alpha. Both for an entire SRT file, for  
> distinct captioning entries and for specific parts of a captioning  
> entry. Not to say that a full parsing engine wouldn't be nice or useful,  
> but at present there are simply no requests for it (not even for <a> ;).
> Plus, more advanced timed track applications can easily be built with
> JavaScript (timed bouncing 3D balls using WebGL).
>
> W3C's timed text does a decent job in facilitating the styling needs for  
> captioning authors. Overall regions, single paragraphs and inline chunks  
> (through <span>) can be styled. There are a few small misses, such as  
> text outline, and vertical alignment (which can be done with separate  
> regions though). IMO the biggest con of TT is that it uses its own,  
> in-document styling namespace, instead of relying upon page CSS.


-- 
Philip Jägenstedt
Core Developer
Opera Software
Received on Friday, 8 October 2010 13:54:53 UTC