[whatwg] SRT research: timestamps

On Thu, Oct 6, 2011 at 4:22 AM, Simon Pieters <simonp at opera.com> wrote:
> I did some research on authoring errors in SRT timestamps to inform whether
> WebVTT parsing of timestamps should be changed.
>
> Our starting point was 70,000 files provided to Opera (for research
> purposes) by opensubtitles.org (thanks!) supposedly being SRT files. We are
> not allowed to share the files.
>
> Filtering out files that don't contain "-->" left 65,000 files.
>
> Grepping for lines that contain "-->" resulted in 52,000,000 lines (which
> should represent roughly the total number of cues). Of those, there were
> 31,900 lines that are invalid, i.e. don't match the python regexp
> '\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d\s*-->\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d($|\s)'.
>
> Those are categorized as follows. Note that a line can belong to several
> categories (except for "none of the above"):
>
>
> hours too few '(^|\s|>)\d[:\.,]\d+[:\.,]\d+'
> 57
> hours too many '(^|\s|>)\d{3,}[:\.,]\d+[:\.,]\d+'
> 834

IIUC this means more than two digits are used for the hours. I think
that's a bug in your regex, then. It was always going to be possible to
have more than 99 hours, and WebVTT timestamps are no different:
http://www.whatwg.org/specs/web-apps/current-work/webvtt.html#webvtt-timestamp
says "two or more characters...".
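
To make that concrete, here is a rough sketch of the validity regexp with
the hours relaxed to two or more digits. The only change from Simon's
pattern is \d{2,} in place of \d\d. The function name is made up for
illustration, and I am assuming the pattern is applied with re.match
(anchored at the start of the line), which the original mail does not say.

```python
import re

# Simon's validity regexp, with hours changed from \d\d to \d{2,}.
# Assumption: the rest of the pattern is kept exactly as in his mail.
VALID_CUE_TIMING = re.compile(
    r'\s*\d{2,}:[0-5]\d:[0-5]\d,\d\d\d\s*-->\s*\d{2,}:[0-5]\d:[0-5]\d,\d\d\d($|\s)'
)

def is_valid_timing(line: str) -> bool:
    """Return True if the line looks like a well-formed timing line."""
    # re.match anchors at the start of the line (an assumption, see above).
    return VALID_CUE_TIMING.match(line) is not None
```

Note the trade-off: a line such as "34500:24:01,000 --> 00:24:03,000",
where the id has run into the timestamp, would also pass this check.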


> minutes too few '(^|\s|>)\d+[:\.,]\d[:\.,]\d+'
> 16
> minutes too many '(^|\s|>)\d+[:\.,]\d{3,}[:\.,]\d+'
> 11
> seconds too few '(^|\s|>)\d+[:\.,]\d+[:\.,]\d([:.,-]|\s|$)'
> 889
> seconds too many '(^|\s|>)\d+[:\.,]\d+[:\.,]\d{3,}'
> 154
> decimals too few '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{1,2}(\s|$|-)'
> 2085
> decimals too many '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{4,}'
> 62
> decimals missing '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+(\s|$|-)'
> 132
> minutes gt 59 '(^|\s|>)\d+[:\.,]0{0,}[6-9]\d+[:\.,]\d+'
> 6

That's small.

> seconds gt 59 '(^|\s|>)\d+[:\.,]\d+[:\.,]0{0,}[6-9]\d+'
> 184

That's fairly small, in particular considering that spaces in
timestamps or an elongated arrow create a lot more problems.
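
For anyone wanting to reproduce this kind of comparison on their own
files, a minimal sketch of the tallying might look like the following. It
uses two of Simon's category patterns verbatim; the function name and the
use of re.search are my assumptions, not a description of his actual
script.

```python
import re
from collections import Counter

# Two of the category patterns from Simon's mail, copied as-is.
CATEGORIES = {
    "seconds gt 59": re.compile(r'(^|\s|>)\d+[:\.,]\d+[:\.,]0{0,}[6-9]\d+'),
    "too long arrow": re.compile(r'\d\s*-{3,}>\s*\d'),
}

def tally(lines):
    """Count how many lines fall into each category.

    A line can belong to several categories, as in the original research.
    """
    counts = Counter()
    for line in lines:
        for name, pattern in CATEGORIES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```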

> leading garbage '^[^\s\d]+\d+[:\.,]\d+[:\.,]\d+'
> 599
> trailing garbage '-->\s*(\d+[:\.,]){2,3}\d+(\s+[^\s]|[^\s\d:\.,])'
> 532
> colon instead of comma '\d+[:\.,]\d+[:\.,]\d+[:\.,]\d+:\d+'
> 26
> dot instead of comma '\d+[:\.,]\d+[:\.,]\d+\.\d+'
> 25372
> comma instead of colon '\d+,\d+[:\.,]\d+'
> 82
> dot instead of colon '\d+\.\d+[:\.,]\d+'
> 41
> id before timestamp '^\s*\d+\s+\d+[:\.,]\d+'
> 115
> spaces in timestamp '(\d[\d\s]*[:\.,]\s*){2,3}\d[\d\s]*' and not
> '(\d+[:\.,]){2,3}\d+'
> 922
> too long arrow '\d\s*-{3,}>\s*\d'
> 326
> none of the above
> 969
>
>
> The most common error is to use a dot instead of a comma.

They're WebVTT files already. ;-)
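
(WebVTT uses a dot before the milliseconds where SRT uses a comma.) Going
the other way is a one-line substitution. A minimal sketch, with a made-up
function name, that only touches the separator and leaves everything else
on the line alone:

```python
import re

# hours:minutes:seconds followed by a comma and exactly three digits.
TIMESTAMP_COMMA = re.compile(r'(\d+:\d{2}:\d{2}),(\d{3})')

def srt_timing_to_webvtt(line: str) -> str:
    """Replace the SRT decimal comma with the WebVTT decimal dot."""
    return TIMESTAMP_COMMA.sub(r'\1.\2', line)
```

Lines that already use a dot are returned unchanged.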


> Some appear to be a different format, and some appear to be just garbage.
>
> Too few or too many hours might not technically be an error, however it
> appeared that some of too many hours were cases where the line between the
> id and the timestamp was missing (and no whitespace between), e.g.:
>
> 34500:24:01,000 --> 00:24:03,000
>
> The trailing garbage is mostly the line between the timestamp and the cue
> text being missing, e.g.:
>
> 00:00:01,000 --> 00:00:03,000Hello.

So we have a lot more errors coming from missing newlines than from
mis-authoring the hour, minute or second values? That's encouraging.
The only common numeric mistake seems to be writing the decimals with
fewer than 3 digits. Maybe we can resolve this by just having a rule
for how that should be interpreted?
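
There are two obvious candidate rules, and I don't know which one authors
intend: treat the digits as a decimal fraction of a second (so ",5" means
500 ms), or treat them as a millisecond count (so ",5" means 5 ms). A
sketch of both, with a hypothetical function name and flag, just to show
how different the results are:

```python
def fraction_to_ms(digits: str, pad_right: bool = True) -> int:
    """Interpret 1 to 3 fraction digits as milliseconds.

    pad_right=True  reads the digits as a decimal fraction: "5" -> 500.
    pad_right=False reads the digits as a millisecond count: "5" -> 5.
    Neither rule comes from a spec; both are assumptions for discussion.
    """
    if not (digits.isascii() and digits.isdigit()) or len(digits) > 3:
        raise ValueError("expected 1 to 3 ASCII digits: %r" % digits)
    if pad_right:
        return int(digits.ljust(3, '0'))
    return int(digits)
```

For three-digit input both rules agree, so valid files are unaffected
either way.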

Cheers,
Silvia.

Received on Wednesday, 5 October 2011 14:07:17 UTC