- From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
- Date: Thu, 6 Oct 2011 08:07:17 +1100
On Thu, Oct 6, 2011 at 4:22 AM, Simon Pieters <simonp at opera.com> wrote:
> I did some research on authoring errors in SRT timestamps to inform whether
> WebVTT parsing of timestamps should be changed.
>
> Our starting point was 70,000 files provided to Opera (for research
> purposes) by opensubtitles.org (thanks!) supposedly being SRT files. We are
> not allowed to share the files.
>
> Filtering out files that don't contain "-->" leaved 65,000 files.
>
> Grepping for lines that contain "-->" resulted in 52,000,000 lines (which
> should represent roughly the total number of cues). Of those, there were
> 31,900 lines that are invalid, i.e. don't match the python regexp
> '\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d\s*-->\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d($|\s)'.
>
> Those are categorized as follows. Note that a line can belong to several
> categories (except for "none of the above"):
>
>
> hours too few '(^|\s|>)\d[:\.,]\d+[:\.,]\d+'
> 57
> hours too many '(^|\s|>)\d{3,}[:\.,]\d+[:\.,]\d+'
> 834

IIUC this means there are more than 2 characters used for the hours. I
think that's a bug of your regex then. There was always going to be
more than 99 hours possible and WebVTT Timestamps are no different:
http://www.whatwg.org/specs/web-apps/current-work/webvtt.html#webvtt-timestamp
. It says "two or more characters...".

> minutes too few '(^|\s|>)\d+[:\.,]\d[:\.,]\d+'
> 16
> minutes too many '(^|\s|>)\d+[:\.,]\d{3,}[:\.,]\d+'
> 11
> seconds too few '(^|\s|>)\d+[:\.,]\d+[:\.,]\d([:.,-]|\s|$)'
> 889
> seconds too many '(^|\s|>)\d+[:\.,]\d+[:\.,]\d{3,}'
> 154
> decimals too few '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{1,2}(\s|$|-)'
> 2085
> decimals too many '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{4,}'
> 62
> decimals missing '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+(\s|$|-)'
> 132
> minutes gt 59 '(^|\s|>)\d+[:\.,]0{0,}[6-9]\d+[:\.,]\d+'
> 6

That's small.

> seconds gt 59 '(^|\s|>)\d+[:\.,]\d+[:\.,]0{0,}[6-9]\d+'
> 184

That's fairly small, in particular considering that spaces in
timestamps or an elongated arrow create a lot more problems.

> leading garbage '^[^\s\d]+\d+[:\.,]\d+[:\.,]\d+'
> 599
> trailing garbage '-->\s*(\d+[:\.,]){2,3}\d+(\s+[^\s]|[^\s\d:\.,])'
> 532
> colon instead of comma '\d+[:\.,]\d+[:\.,]\d+[:\.,]\d+:\d+'
> 26
> dot instead of comma '\d+[:\.,]\d+[:\.,]\d+\.\d+'
> 25372
> comma instead of colon '\d+,\d+[:\.,]\d+'
> 82
> dot instead of colon '\d+\.\d+[:\.,]\d+'
> 41
> id before timestamp '^\s*\d+\s+\d+[:\.,]\d+'
> 115
> spaces in timestamp '(\d[\d\s]*[:\.,]\s*){2,3}\d[\d\s]*' and not
> '(\d+[:\.,]){2,3}\d+'
> 922
> too long arrow '\d\s*-{3,}>\s*\d'
> 326
> none of the above
> 969
>
>
> The most common error is to use a dot instead of a comma.

They're WebVTT files already. ;-)

> Some appear to be a different format, and some appear to be just garbage.
>
> Too few or too many hours might not technically be an error, however it
> appeared that some of too many hours were cases where the line between the
> id and the timestamp was missing (and no whitespace between), e.g.:
>
> 34500:24:01,000 --> 00:24:03,000
>
> The trailing garbage is mostly the line between the timestamp and the cue
> text being missing, e.g.:
>
> 00:00:01,000 --> 00:00:03,000Hello.

So we have a lot more errors coming from missing new lines than from
mis-authoring the hour, minute or seconds number? That's encouraging.

The only common number mistake seems to be to make the decimals
shorter than 3 numbers. Maybe we can resolve this by just having a
rule for what that should be interpreted as?

Cheers,
Silvia.
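
A minimal sketch of two points from the reply above, written in Python because the quoted regexps are Python ones. STRICT is Simon's validity regexp as quoted. RELAXED_HOURS is an assumption: the same pattern with the hours field widened to "two or more" digits, which is how the reply reads the WebVTT timestamp definition. The names is_valid and short_decimals_to_ms and both "rules" for short decimals are hypothetical; the thread does not settle which interpretation, if any, a parser should use.

```python
import re

# Simon's validity regexp, copied from the quoted message
# (split over two string literals only for line length).
STRICT = re.compile(
    r'\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d\s*-->\s*'
    r'\d\d:[0-5]\d:[0-5]\d\,\d\d\d($|\s)'
)

# Assumption: identical, except hours may have two or more digits.
RELAXED_HOURS = re.compile(
    r'\s*\d{2,}:[0-5]\d:[0-5]\d\,\d\d\d\s*-->\s*'
    r'\d{2,}:[0-5]\d:[0-5]\d\,\d\d\d($|\s)'
)


def is_valid(line, pattern=STRICT):
    """Return True if the line matches the pattern from its start."""
    return pattern.match(line) is not None


def short_decimals_to_ms(digits, rule):
    """Two hypothetical readings of a decimals field of 1 to 3 digits.

    'fraction': the digits are a decimal fraction of a second ("5" is 500 ms).
    'count':    the digits are a millisecond count ("5" is 5 ms).
    """
    if rule == 'fraction':
        return int(digits.ljust(3, '0'))
    if rule == 'count':
        return int(digits)
    raise ValueError(rule)


if __name__ == '__main__':
    samples = [
        '00:00:01,000 --> 00:00:03,000',
        '100:00:01,000 --> 100:00:03,000',
        '34500:24:01,000 --> 00:24:03,000',
        '00:00:01,000 --> 00:00:03,000Hello.',
    ]
    for s in samples:
        print(s, is_valid(s), is_valid(s, RELAXED_HOURS))
```

Note that widening the hours field also accepts the "34500:24:01,000" line, where the cue id has run into the timestamp, so the relaxed pattern alone cannot tell a long hours value from a missing line break.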
Received on Wednesday, 5 October 2011 14:07:17 UTC