[whatwg] SRT research: separating cues from Silvia Pfeiffer on 2011-10-24 (public-whatwg-archive@w3.org from October 2011)

From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Date: Tue, 25 Oct 2011 07:50:43 +1100
Message-ID: <CAHp8n2nA_RdHdbCCvHZgRo6fXdb1E4XosSOFb-7OamHCXL63UQ@mail.gmail.com>

So, in your opinion, should there be a change to the WebVTT spec that
separates cues differently?
Is there a recommendation you have from your analysis?
Cheers,
Silvia.

On Mon, Oct 24, 2011 at 6:26 PM, Simon Pieters <simonp at opera.com> wrote:
> I wanted to research how common it is to fail to separate cues in SRT, and
> for what reason.
>
> SRT parsers usually treat a timing line as the start of a new cue, while
> WebVTT requires a blank line (two consecutive newlines) before a new cue.
>
> I took the 65k SRT files we've got, replaced comma with dot and prepended
> "WEBVTT\n\n", then ran them in Opera's <track> impl, looking for '-->' in
> cue data.
>
> There were 840 files with --> in cue data. This is 1.3% of the files.
>
> Looking at the cue data, there were 11,118 lines that contained -->. There
> were 8830 lines of only whitespace.
>
> In the cue data, if I look at valid-looking timing lines
> (/^\d\d:\d\d:\d\d\.\d\d\d\s*-->\s*\d\d:\d\d:\d\d\.\d\d\d(\s|$)/) and check
> the line before that, or the line before *that* if it looks like an SRT id
> (/^\d+\s*$/), then I see 7030 lines of only whitespace and 3761 lines of
> something else.
>
> Failing to separate cues results in an unpleasant experience for the user,
> since basically the screen is filled with several "cues" including their IDs
> and timing lines.
>
> Some files had most or all of their cues parsed as a single cue with the
> WebVTT parser, e.g. because all lines ended with one or more spaces. Looking
> at such a file in a text editor, it's not immediately obvious that there's
> an error, because the spaces are not visible. Moreover, the file is still
> conforming, so a validator wouldn't flag it either.
>
> So what about the cases that aren't whitespace? Mostly the blank line
> between cues is missing entirely. Some files also omitted the ID. One file
> had a "|" between all cues.
>
> --
> Simon Pieters
> Opera Software
>
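
For anyone wanting to repeat the experiment described in the quoted message,
here is a minimal Python sketch of the conversion step (comma to dot, prepend
"WEBVTT\n\n") followed by a search for '-->' in cue data. It is not Opera's
<track> implementation: the cue splitting is a deliberately simplified
stand-in for a WebVTT parser, and the function names srt_to_webvtt,
cue_payloads and leaked_timing_lines are hypothetical.

```python
import re


def srt_to_webvtt(srt_text: str) -> str:
    """Naive conversion as described in the message (not a real converter)."""
    # Normalize line endings so the later splitting only has to handle "\n".
    text = srt_text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: every comma is replaced, so commas in cue text change too.
    return "WEBVTT\n\n" + text.replace(",", ".")


def cue_payloads(vtt_text: str):
    """Yield the payload lines of each cue (simplified WebVTT-style split).

    Only truly empty lines separate cues here. A line holding just spaces
    does not, which is the failure mode discussed in the message.
    """
    parts = vtt_text.split("\n\n", 1)  # drop the "WEBVTT" header block
    body = parts[1] if len(parts) == 2 else ""
    for block in re.split(r"\n{2,}", body):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            if "-->" in line:
                # Everything after the first timing line counts as cue data.
                yield lines[i + 1:]
                break


def leaked_timing_lines(vtt_text: str):
    """Return cue-data lines that contain '-->' (signs of fused cues)."""
    return [line
            for payload in cue_payloads(vtt_text)
            for line in payload
            if "-->" in line]
```

Calling leaked_timing_lines(srt_to_webvtt(text)) on each file and counting
the files with a non-empty result would give a rough analogue of the
"files with --> in cue data" figure, under the simplifications noted above.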

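The "line before the timing line" check from the quoted message can be
sketched in the same way. The two regular expressions are copied from the
message; the names TIMING, SRT_ID and classify_separators, and the choice to
skip a timing line that has nothing before it, are assumptions made for this
illustration.

```python
import re
from collections import Counter

# Patterns quoted in the message.
TIMING = re.compile(
    r"^\d\d:\d\d:\d\d\.\d\d\d\s*-->\s*\d\d:\d\d:\d\d\.\d\d\d(\s|$)")
SRT_ID = re.compile(r"^\d+\s*$")


def classify_separators(payload_lines):
    """Count what precedes each timing line found inside cue data.

    payload_lines is the list of text lines of one (possibly fused) cue.
    Returns a Counter with the keys "whitespace" and "other".
    """
    counts = Counter()
    for i, line in enumerate(payload_lines):
        if not TIMING.match(line):
            continue
        j = i - 1
        # If the previous line looks like an SRT id, look one line further up.
        if j >= 0 and SRT_ID.match(payload_lines[j]):
            j -= 1
        if j < 0:
            continue  # assumption: nothing to classify at the very top
        if payload_lines[j].strip() == "":
            counts["whitespace"] += 1
        else:
            counts["other"] += 1
    return counts
```

Summing the counters over all cues of all files would correspond to the
"only whitespace" versus "something else" split reported in the message.
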
Received on Monday, 24 October 2011 13:50:43 UTC