
[whatwg] SRT research: separating cues

From: Simon Pieters <simonp@opera.com>
Date: Tue, 25 Oct 2011 09:18:32 +0200
Message-ID: <op.v3wbw6f1idj3kv@simon-pieterss-macbook.local>
On Mon, 24 Oct 2011 22:50:43 +0200, Silvia Pfeiffer  
<silviapfeiffer1 at gmail.com> wrote:

> So, in your opinion, should there be a change to the WebVTT spec that
> separates cues differently?
> Is there a recommendation you have from your analysis?

My recommendation is http://www.w3.org/Bugs/Public/show_bug.cgi?id=14550

> Cheers,
> Silvia.
>
> On Mon, Oct 24, 2011 at 6:26 PM, Simon Pieters <simonp at opera.com> wrote:
>> I wanted to research how common it is to fail to separate cues in SRT, and
>> for what reason.
>>
>> SRT parsers usually interpret a timings line as a new cue, while WebVTT
>> wants two consecutive newlines (i.e. a blank line) for a new cue.
>>
>> I took the 65k SRT files we've got, replaced comma with dot and prepended
>> "WEBVTT\n\n", then ran them in Opera's <track> impl, looking for '-->' in
>> cue data.
>>
>> There were 840 files with --> in cue data. This is 1.3% of the files.
>>
>> Looking at the cue data, there were 11,118 lines that contained -->. There
>> were 8,830 lines of only whitespace.
>>
>> In the cue data, if I look at valid-looking timing lines
>> (/^\d\d:\d\d:\d\d\.\d\d\d\s*-->\s*\d\d:\d\d:\d\d\.\d\d\d(\s|$)/) and check
>> the line before that, or the line before *that* if it looks like an SRT id
>> (/^\d+\s*$/), then I see 7,030 lines of only whitespace and 3,761 lines of
>> something else.
>>
>> Failing to separate cues results in an unpleasant experience for the user,
>> since basically the screen is filled with several "cues" including their IDs
>> and timing lines.
>>
>> Some files had most or all of their cues parsed as a single cue with the
>> WebVTT parser, e.g. because all lines ended with one or more spaces. Looking
>> at such a file in a text editor, it's not immediately obvious that there's
>> an error, because the spaces are not visible. Moreover, the file is not
>> non-conforming, so a validator wouldn't help either.
>>
>> So what about the cases that aren't whitespace? It seems to be mostly just
>> missing the newline completely. Some omitted the ID also. One file had a "|"
>> between all cues.
>>
>> --
>> Simon Pieters
>> Opera Software
>>
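
For anyone wanting to repeat the first step of the quoted research without
Opera's <track> implementation, here is a minimal sketch in Python. It is
not the parser that was actually used: the function names are hypothetical,
and the cue splitting is a simplification that only assumes a cue ends at a
truly empty line, so a line holding only spaces does not end it.

```python
def srt_to_webvtt(srt_text: str) -> str:
    """Apply the conversion from the quoted message: comma -> dot, add header.

    Note: this replaces every comma, including commas in cue text, which is
    harmless for counting '-->' lines but not for real subtitle conversion.
    """
    return "WEBVTT\n\n" + srt_text.replace(",", ".")


def split_blocks(vtt_text: str) -> list[list[str]]:
    """Split text into blocks separated by one or more empty lines.

    Assumption: only a zero-length line ends a block; whitespace-only
    lines stay inside the current block.
    """
    lines = vtt_text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    blocks, current = [], []
    for line in lines:
        if line == "":
            if current:
                blocks.append(current)
                current = []
        else:
            current.append(line)
    if current:
        blocks.append(current)
    return blocks


def payload_lines_with_arrow(vtt_text: str) -> list[str]:
    """Return cue payload lines that contain '-->'.

    The first line containing '-->' in a block is taken as its timing line;
    everything after it is treated as cue data.
    """
    hits = []
    for block in split_blocks(vtt_text)[1:]:  # skip the "WEBVTT" header block
        timing_idx = next((i for i, l in enumerate(block) if "-->" in l), None)
        if timing_idx is None:
            continue
        hits.extend(l for l in block[timing_idx + 1:] if "-->" in l)
    return hits
```

A file counts as affected when payload_lines_with_arrow() returns a
non-empty list for its converted text.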

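The classification step from the quoted message (what precedes a
valid-looking timing line inside cue data) could be sketched like this.
The two regular expressions are the ones quoted above; the function name,
the category labels and the handling of a timing line with nothing before
it are assumptions made for illustration.

```python
import re
from collections import Counter

# Patterns taken from the quoted message.
TIMING_RE = re.compile(
    r"^\d\d:\d\d:\d\d\.\d\d\d\s*-->\s*\d\d:\d\d:\d\d\.\d\d\d(\s|$)"
)
ID_RE = re.compile(r"^\d+\s*$")


def classify_separators(payload_lines: list[str]) -> Counter:
    """Count what sits before each timing line found in one cue's data.

    For every valid-looking timing line, look at the previous line, or the
    line before that if the previous line looks like an SRT id.
    """
    counts = Counter()
    for i, line in enumerate(payload_lines):
        if not TIMING_RE.match(line):
            continue
        j = i - 1
        if j >= 0 and ID_RE.match(payload_lines[j]):
            j -= 1  # step over the SRT id
        if j < 0:
            counts["no_preceding_line"] += 1  # hypothetical extra category
        elif payload_lines[j].strip() == "":
            counts["whitespace_only"] += 1
        else:
            counts["other"] += 1
    return counts
```

Summing the counters over all cues in all files would give totals
comparable to the "only whitespace" and "something else" figures above,
under the same assumptions.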

-- 
Simon Pieters
Opera Software
Received on Tuesday, 25 October 2011 00:18:32 UTC