- From: Simon Pieters <simonp@opera.com>
- Date: Wed, 05 Oct 2011 19:22:51 +0200
I did some research on authoring errors in SRT timestamps to inform whether WebVTT parsing of timestamps should be changed.

Our starting point was 70,000 files, supposedly SRT files, provided to Opera (for research purposes) by opensubtitles.org (thanks!). We are not allowed to share the files.

Filtering out files that don't contain "-->" left 65,000 files. Grepping for lines that contain "-->" resulted in 52,000,000 lines (which should represent roughly the total number of cues). Of those, 31,900 lines are invalid, i.e. don't match the python regexp

'\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d\s*-->\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d($|\s)'

Those are categorized as follows. Note that a line can belong to several categories (except for "none of the above"):

hours too few           '(^|\s|>)\d[:\.,]\d+[:\.,]\d+'  57
hours too many          '(^|\s|>)\d{3,}[:\.,]\d+[:\.,]\d+'  834
minutes too few         '(^|\s|>)\d+[:\.,]\d[:\.,]\d+'  16
minutes too many        '(^|\s|>)\d+[:\.,]\d{3,}[:\.,]\d+'  11
seconds too few         '(^|\s|>)\d+[:\.,]\d+[:\.,]\d([:.,-]|\s|$)'  889
seconds too many        '(^|\s|>)\d+[:\.,]\d+[:\.,]\d{3,}'  154
decimals too few        '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{1,2}(\s|$|-)'  2085
decimals too many       '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{4,}'  62
decimals missing        '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+(\s|$|-)'  132
minutes gt 59           '(^|\s|>)\d+[:\.,]0{0,}[6-9]\d+[:\.,]\d+'  6
seconds gt 59           '(^|\s|>)\d+[:\.,]\d+[:\.,]0{0,}[6-9]\d+'  184
leading garbage         '^[^\s\d]+\d+[:\.,]\d+[:\.,]\d+'  599
trailing garbage        '-->\s*(\d+[:\.,]){2,3}\d+(\s+[^\s]|[^\s\d:\.,])'  532
colon instead of comma  '\d+[:\.,]\d+[:\.,]\d+[:\.,]\d+:\d+'  26
dot instead of comma    '\d+[:\.,]\d+[:\.,]\d+\.\d+'  25372
comma instead of colon  '\d+,\d+[:\.,]\d+'  82
dot instead of colon    '\d+\.\d+[:\.,]\d+'  41
id before timestamp     '^\s*\d+\s+\d+[:\.,]\d+'  115
spaces in timestamp     '(\d[\d\s]*[:\.,]\s*){2,3}\d[\d\s]*' and not '(\d+[:\.,]){2,3}\d+'  922
too long arrow          '\d\s*-{3,}>\s*\d'  326
none of the above       969

The most common error is to use a dot instead of a comma. Some lines appear to be in a different format, and some appear to be just garbage.

Too few or too many hours might not technically be an error. However, some of the lines with too many hours appeared to be cases where the line break between the id and the timestamp was missing (with no whitespace between them), e.g.:

34500:24:01,000 --> 00:24:03,000

The trailing garbage is mostly cases where the line break between the timestamp and the cue text is missing, e.g.:

00:00:01,000 --> 00:00:03,000Hello.

-- 
Simon Pieters
Opera Software
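
For readers who want to try this on their own files, the validity check and a few of the categories above can be sketched in Python roughly as follows. The regexps are copied from the message. The names VALID_CUE, CATEGORIES, is_valid and classify are hypothetical, and the use of re.match for the validity check and re.search for the categories is an assumption, since the message does not say how the patterns were applied. Only four of the categories are included.

```python
import re

# Validity pattern from the message; assumed to be anchored at line start (re.match).
VALID_CUE = re.compile(
    r'\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d\s*-->\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d($|\s)'
)

# A subset of the category patterns from the message; assumed to be
# applied anywhere in the line (re.search).
CATEGORIES = {
    "hours too many": re.compile(r'(^|\s|>)\d{3,}[:\.,]\d+[:\.,]\d+'),
    "trailing garbage": re.compile(
        r'-->\s*(\d+[:\.,]){2,3}\d+(\s+[^\s]|[^\s\d:\.,])'
    ),
    "dot instead of comma": re.compile(r'\d+[:\.,]\d+[:\.,]\d+\.\d+'),
    "too long arrow": re.compile(r'\d\s*-{3,}>\s*\d'),
}


def is_valid(line):
    """Return True if the timing line matches the validity pattern."""
    return VALID_CUE.match(line) is not None


def classify(line):
    """Return the names of all categories whose pattern occurs in the line.

    A line can fall into several categories, so the result is a list.
    """
    return [name for name, pattern in CATEGORIES.items() if pattern.search(line)]


if __name__ == "__main__":
    samples = [
        "00:00:01,000 --> 00:00:03,000",
        "00:00:01.000 --> 00:00:03.000",
        "34500:24:01,000 --> 00:24:03,000",
        "00:00:01,000 --> 00:00:03,000Hello.",
    ]
    for line in samples:
        print(repr(line), is_valid(line), classify(line))
```

A line for which is_valid returns False and classify returns an empty list would, with the full category list, correspond to "none of the above".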
Received on Wednesday, 5 October 2011 10:22:51 UTC