[whatwg] SRT research: timestamps

I did some research on authoring errors in SRT timestamps to inform  
whether WebVTT parsing of timestamps should be changed.

Our starting point was 70,000 files, supposedly all SRT, provided to Opera  
(for research purposes) by opensubtitles.org (thanks!). We are not allowed  
to share the files.

Filtering out files that don't contain "-->" left 65,000 files.

Grepping for lines that contain "-->" resulted in 52,000,000 lines (which  
should represent roughly the total number of cues). Of those, 31,900 lines  
were invalid, i.e. didn't match the Python regexp  
'\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d\s*-->\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d($|\s)'.
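
For illustration, the validity check above could be sketched roughly as follows. This is not the script that was actually used; the function name count_invalid and the sample lines are made up, and anchoring the pattern with re.match (rather than re.search) is an assumption.

```python
import re

# The validity pattern quoted above. Using re.match (anchored at the start
# of the line) is an assumption; the original script is not shown.
VALID = re.compile(
    r'\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d\s*-->\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d($|\s)'
)

def count_invalid(lines):
    """Return (total, invalid) counted over lines that contain '-->'."""
    total = invalid = 0
    for line in lines:
        if '-->' not in line:
            continue  # not a timing line
        total += 1
        if not VALID.match(line):
            invalid += 1
    return total, invalid

# Hypothetical sample input, not taken from the research files.
sample = [
    "1",
    "00:00:01,000 --> 00:00:03,000",
    "00:00:04.000 --> 00:00:06.000",        # dot instead of comma
    "00:00:01,000 --> 00:00:03,000Hello.",  # missing line break before text
    "Hello.",
]
print(count_invalid(sample))
```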

Those are categorized as follows. Note that a line can belong to several  
categories (except for "none of the above"):


hours too few '(^|\s|>)\d[:\.,]\d+[:\.,]\d+'
57
hours too many '(^|\s|>)\d{3,}[:\.,]\d+[:\.,]\d+'
834
minutes too few '(^|\s|>)\d+[:\.,]\d[:\.,]\d+'
16
minutes too many '(^|\s|>)\d+[:\.,]\d{3,}[:\.,]\d+'
11
seconds too few '(^|\s|>)\d+[:\.,]\d+[:\.,]\d([:.,-]|\s|$)'
889
seconds too many '(^|\s|>)\d+[:\.,]\d+[:\.,]\d{3,}'
154
decimals too few '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{1,2}(\s|$|-)'
2085
decimals too many '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{4,}'
62
decimals missing '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+(\s|$|-)'
132
minutes gt 59 '(^|\s|>)\d+[:\.,]0{0,}[6-9]\d+[:\.,]\d+'
6
seconds gt 59 '(^|\s|>)\d+[:\.,]\d+[:\.,]0{0,}[6-9]\d+'
184
leading garbage '^[^\s\d]+\d+[:\.,]\d+[:\.,]\d+'
599
trailing garbage '-->\s*(\d+[:\.,]){2,3}\d+(\s+[^\s]|[^\s\d:\.,])'
532
colon instead of comma '\d+[:\.,]\d+[:\.,]\d+[:\.,]\d+:\d+'
26
dot instead of comma '\d+[:\.,]\d+[:\.,]\d+\.\d+'
25372
comma instead of colon '\d+,\d+[:\.,]\d+'
82
dot instead of colon '\d+\.\d+[:\.,]\d+'
41
id before timestamp '^\s*\d+\s+\d+[:\.,]\d+'
115
spaces in timestamp '(\d[\d\s]*[:\.,]\s*){2,3}\d[\d\s]*' and not  
'(\d+[:\.,]){2,3}\d+'
922
too long arrow '\d\s*-{3,}>\s*\d'
326
none of the above
969
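
A rough sketch of how an invalid line can be sorted into the categories above, using only a subset of the listed patterns (copied verbatim). The function name categorize is hypothetical, and applying the patterns with re.search is an assumption. Here "none of the above" only means none of this subset matched.

```python
import re

# A subset of the category patterns from the list above.
CATEGORIES = [
    ("hours too many", r'(^|\s|>)\d{3,}[:\.,]\d+[:\.,]\d+'),
    ("leading garbage", r'^[^\s\d]+\d+[:\.,]\d+[:\.,]\d+'),
    ("trailing garbage", r'-->\s*(\d+[:\.,]){2,3}\d+(\s+[^\s]|[^\s\d:\.,])'),
    ("dot instead of comma", r'\d+[:\.,]\d+[:\.,]\d+\.\d+'),
    ("too long arrow", r'\d\s*-{3,}>\s*\d'),
]

def categorize(line):
    """Return every matching category name; a line can belong to several."""
    names = [name for name, pattern in CATEGORIES if re.search(pattern, line)]
    return names or ["none of the above"]

# Intended for lines that already failed the validity check.
for line in [
    "34500:24:01,000 --> 00:24:03,000",
    "00:00:04.000 --> 00:00:06.000",
    "00:00:01,000 --> 00:00:03,000Hello.",
    "00:00:01,000 ---> 00:00:03,000",
]:
    print(line, categorize(line))
```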


The most common error by far is using a dot instead of a comma (25,372 of  
the 31,900 invalid lines).

Some invalid lines appear to be in a different format, and some appear to  
be just garbage.

Too few or too many hour digits might not technically be an error. However,  
it appeared that some of the "hours too many" cases were ones where the  
line break between the id and the timestamp was missing (with no whitespace  
in between), e.g.:

34500:24:01,000 --> 00:24:03,000

The trailing garbage is mostly a missing line break between the timestamp  
and the cue text, e.g.:

00:00:01,000 --> 00:00:03,000Hello.

-- 
Simon Pieters
Opera Software

Received on Wednesday, 5 October 2011 10:22:51 UTC