- From: Simon Pieters <simonp@opera.com>
- Date: Thu, 06 Oct 2011 10:27:41 +0200
On Wed, 05 Oct 2011 23:07:17 +0200, Silvia Pfeiffer <silviapfeiffer1 at gmail.com> wrote:

> On Thu, Oct 6, 2011 at 4:22 AM, Simon Pieters <simonp at opera.com> wrote:
>> I did some research on authoring errors in SRT timestamps to inform
>> whether WebVTT parsing of timestamps should be changed.
>>
>> Our starting point was 70,000 files provided to Opera (for research
>> purposes) by opensubtitles.org (thanks!) supposedly being SRT files.
>> We are not allowed to share the files.
>>
>> Filtering out files that don't contain "-->" left 65,000 files.
>>
>> Grepping for lines that contain "-->" resulted in 52,000,000 lines
>> (which should represent roughly the total number of cues). Of those,
>> there were 31,900 lines that are invalid, i.e. don't match the Python
>> regexp
>> '\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d\s*-->\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d($|\s)'.

Forgot to mention here that this regexp used re.match rather than re.search, which basically means that a leading '^' is implied.

>> Those are categorized as follows. Note that a line can belong to several
>> categories (except for "none of the above"):
>>
>>
>> hours too few            '(^|\s|>)\d[:\.,]\d+[:\.,]\d+'
>>       57
>> hours too many           '(^|\s|>)\d{3,}[:\.,]\d+[:\.,]\d+'
>>       834
>
> IIUC this means there are more than 2 characters used for the hours. I
> think that's a bug of your regex then. There was always going to be
> more than 99 hours possible and WebVTT Timestamps are no different:
> http://www.whatwg.org/specs/web-apps/current-work/webvtt.html#webvtt-timestamp
> . It says "two or more characters...".

Right. However, since movies are seldom longer than 99 hours, I figured that it was worth inspecting to see what kinds of mistakes were hidden there.
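
To make the re.match point concrete, here is a minimal sketch of the validity check in Python. The regexp is the one quoted above; the function name is_valid_timing_line and the sample lines are assumptions for illustration, not the script actually used for the research.

```python
import re

# The validity regexp quoted in the message, compiled once.
VALID_TIMING = re.compile(
    r'\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d\s*-->\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d($|\s)'
)


def is_valid_timing_line(line):
    """Return True if the line is a valid SRT timing line.

    Pattern.match only tries to match at the start of the string, so the
    pattern behaves as if it began with '^'. (Hypothetical helper name.)
    """
    return VALID_TIMING.match(line) is not None


if __name__ == "__main__":
    # Illustrative sample lines (assumed, not taken from the corpus).
    for line in [
        "00:00:01,000 --> 00:00:03,000",
        "00:00:01.000 --> 00:00:03.000",
        "x 00:00:01,000 --> 00:00:03,000",
    ]:
        print(repr(line), is_valid_timing_line(line))
```

With re.search instead of re.match, the third sample line would be accepted, because the pattern could then start matching after the leading "x".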

>> minutes too few          '(^|\s|>)\d+[:\.,]\d[:\.,]\d+'
>>       16
>> minutes too many         '(^|\s|>)\d+[:\.,]\d{3,}[:\.,]\d+'
>>       11
>> seconds too few          '(^|\s|>)\d+[:\.,]\d+[:\.,]\d([:.,-]|\s|$)'
>>       889
>> seconds too many         '(^|\s|>)\d+[:\.,]\d+[:\.,]\d{3,}'
>>       154
>> decimals too few         '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{1,2}(\s|$|-)'
>>       2085
>> decimals too many        '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{4,}'
>>       62
>> decimals missing         '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+(\s|$|-)'
>>       132
>> minutes gt 59            '(^|\s|>)\d+[:\.,]0{0,}[6-9]\d+[:\.,]\d+'
>>       6
>
> That's small.
>
>> seconds gt 59            '(^|\s|>)\d+[:\.,]\d+[:\.,]0{0,}[6-9]\d+'
>>       184
>
> That's fairly small, in particular considering that spaces in
> timestamps or an elongated arrow create a lot more problems.

What problems?

>> leading garbage          '^[^\s\d]+\d+[:\.,]\d+[:\.,]\d+'
>>       599
>> trailing garbage         '-->\s*(\d+[:\.,]){2,3}\d+(\s+[^\s]|[^\s\d:\.,])'
>>       532
>> colon instead of comma   '\d+[:\.,]\d+[:\.,]\d+[:\.,]\d+:\d+'
>>       26
>> dot instead of comma     '\d+[:\.,]\d+[:\.,]\d+\.\d+'
>>       25372
>> comma instead of colon   '\d+,\d+[:\.,]\d+'
>>       82
>> dot instead of colon     '\d+\.\d+[:\.,]\d+'
>>       41
>> id before timestamp      '^\s*\d+\s+\d+[:\.,]\d+'
>>       115
>> spaces in timestamp      '(\d[\d\s]*[:\.,]\s*){2,3}\d[\d\s]*' and not
>>                          '(\d+[:\.,]){2,3}\d+'
>>       922
>> too long arrow           '\d\s*-{3,}>\s*\d'
>>       326
>> none of the above
>>       969
>>
>>
>> The most common error is to use a dot instead of a comma.
>
> They're WebVTT files already. ;-)

Unlikely. :-)

>
>> Some appear to be a different format, and some appear to be just
>> garbage.
>>
>> Too few or too many hours might not technically be an error, however it
>> appeared that some of too many hours were cases where the line between
>> the id and the timestamp was missing (and no whitespace between), e.g.:
>>
>> 34500:24:01,000 --> 00:24:03,000
>>
>> The trailing garbage is mostly the line between the timestamp and the
>> cue text being missing, e.g.:
>>
>> 00:00:01,000 --> 00:00:03,000Hello.
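
For anyone who wants to try the categorization on their own files, the following is a rough sketch using a handful of the patterns from the list above. It assumes the category patterns are applied with re.search (only the validity regexp is stated to use re.match), and the names CATEGORY_PATTERNS and categorize are made up for this example. It is not the script that produced the counts.

```python
import re

# A subset of the category patterns quoted in the list (name -> regexp).
# Applying them with search() rather than match() is an assumption.
CATEGORY_PATTERNS = {
    'hours too many': r'(^|\s|>)\d{3,}[:\.,]\d+[:\.,]\d+',
    'decimals too few': r'(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{1,2}(\s|$|-)',
    'trailing garbage': r'-->\s*(\d+[:\.,]){2,3}\d+(\s+[^\s]|[^\s\d:\.,])',
    'dot instead of comma': r'\d+[:\.,]\d+[:\.,]\d+\.\d+',
    'too long arrow': r'\d\s*-{3,}>\s*\d',
}
COMPILED = {name: re.compile(p) for name, p in CATEGORY_PATTERNS.items()}


def categorize(line):
    """Return every category (from the subset above) whose pattern is found.

    A line can belong to several categories; an empty list means none of
    the patterns in this subset applied. (Hypothetical helper name.)
    """
    return [name for name, rx in COMPILED.items() if rx.search(line)]


if __name__ == "__main__":
    # The two example lines from the message, plus a dot-separated one.
    for line in [
        "34500:24:01,000 --> 00:24:03,000",
        "00:00:01,000 --> 00:00:03,000Hello.",
        "00:00:01.000 --> 00:00:03.000",
    ]:
        print(repr(line), categorize(line))
```

Counting per category over a corpus would then be a matter of feeding each invalid "-->" line through categorize and tallying the returned names.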
>
> So we have a lot more errors coming from missing new lines than from
> mis-authoring the hour, minute or seconds number? That's encouraging.
> The only common number mistake seems to be to make the decimals
> shorter than 3 numbers. Maybe we can resolve this by just having a
> rule for what that should be interpreted as?

That's still very rare in this sample: 2,085/52,000,000 ≈ 0.004% of all cues.

-- 
Simon Pieters
Opera Software
Received on Thursday, 6 October 2011 01:27:41 UTC