[whatwg] SRT research: timestamps from Simon Pieters on 2011-10-06 (public-whatwg-archive@w3.org from October 2011)

From: Simon Pieters <simonp@opera.com>
Date: Thu, 06 Oct 2011 10:27:41 +0200
Message-ID: <op.v2w8gff7idj3kv@simon-pieterss-macbook.local>
On Wed, 05 Oct 2011 23:07:17 +0200, Silvia Pfeiffer  
<silviapfeiffer1 at gmail.com> wrote:

> On Thu, Oct 6, 2011 at 4:22 AM, Simon Pieters <simonp at opera.com> wrote:
>> I did some research on authoring errors in SRT timestamps to inform
>> whether WebVTT parsing of timestamps should be changed.
>>
>> Our starting point was 70,000 files provided to Opera (for research
>> purposes) by opensubtitles.org (thanks!), supposedly SRT files. We are
>> not allowed to share the files.
>>
>> Filtering out files that don't contain "-->" left 65,000 files.
>>
>> Grepping for lines that contain "-->" resulted in 52,000,000 lines
>> (which should represent roughly the total number of cues). Of those,
>> 31,900 lines are invalid, i.e. don't match the Python regexp
>> '\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d\s*-->\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d($|\s)'.

Forgot to mention here that this regexp used re.match rather than  
re.search, which basically means that a leading '^' is implied.
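
To illustrate the difference, here is a minimal sketch of the validity check. The pattern is the one quoted above; the helper name is_valid and the sample lines are made up for this example and are not taken from the data set.

```python
import re

# Validity pattern from the message above, copied verbatim.
VALID = re.compile(
    r'\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d\s*-->\s*\d\d:[0-5]\d:[0-5]\d\,\d\d\d($|\s)'
)


def is_valid(line):
    """Return True if the timing line matches from its very first character.

    re.match only tries the pattern at position 0, so the check behaves
    as if the pattern started with '^'.
    """
    return VALID.match(line) is not None


# Hypothetical sample lines, for illustration only.
good = "00:00:01,000 --> 00:00:03,000"
leading_garbage = "x00:00:01,000 --> 00:00:03,000"
dotted = "00:00:01.000 --> 00:00:03.000"
no_line_break = "00:00:01,000 --> 00:00:03,000Hello."

for sample in (good, leading_garbage, dotted, no_line_break):
    print(repr(sample), is_valid(sample))

# With re.search the leading-garbage line would slip through, because the
# pattern is allowed to start matching at position 1.
print(VALID.search(leading_garbage) is not None)
```

Only the first sample passes is_valid; the final ($|\s) group is what rejects the line where the cue text follows the timestamp directly.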

>> Those are categorized as follows. Note that a line can belong to several
>> categories (except for "none of the above"):
>>
>>
>> hours too few '(^|\s|>)\d[:\.,]\d+[:\.,]\d+'
>> 57
>> hours too many '(^|\s|>)\d{3,}[:\.,]\d+[:\.,]\d+'
>> 834
>
> IIUC this means there are more than 2 characters used for the hours. I
> think that's a bug of your regex then. There was always going to be
> more than 99 hours possible and WebVTT Timestamps are no different:
> http://www.whatwg.org/specs/web-apps/current-work/webvtt.html#webvtt-timestamp
> . It says "two or more characters...".

Right. However, since movies are seldom longer than 99 hours, I figured  
that it was worth inspecting to see what kinds of mistakes were hidden  
there.


>> minutes too few '(^|\s|>)\d+[:\.,]\d[:\.,]\d+'
>> 16
>> minutes too many '(^|\s|>)\d+[:\.,]\d{3,}[:\.,]\d+'
>> 11
>> seconds too few '(^|\s|>)\d+[:\.,]\d+[:\.,]\d([:.,-]|\s|$)'
>> 889
>> seconds too many '(^|\s|>)\d+[:\.,]\d+[:\.,]\d{3,}'
>> 154
>> decimals too few '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{1,2}(\s|$|-)'
>> 2085
>> decimals too many '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{4,}'
>> 62
>> decimals missing '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+(\s|$|-)'
>> 132
>> minutes gt 59 '(^|\s|>)\d+[:\.,]0{0,}[6-9]\d+[:\.,]\d+'
>> 6
>
> That's small.
>
>> seconds gt 59 '(^|\s|>)\d+[:\.,]\d+[:\.,]0{0,}[6-9]\d+'
>> 184
>
> That's fairly small, in particular considering that spaces in
> timestamps or an elongated arrow create a lot more problems.

What problems?


>> leading garbage '^[^\s\d]+\d+[:\.,]\d+[:\.,]\d+'
>> 599
>> trailing garbage '-->\s*(\d+[:\.,]){2,3}\d+(\s+[^\s]|[^\s\d:\.,])'
>> 532
>> colon instead of comma '\d+[:\.,]\d+[:\.,]\d+[:\.,]\d+:\d+'
>> 26
>> dot instead of comma '\d+[:\.,]\d+[:\.,]\d+\.\d+'
>> 25372
>> comma instead of colon '\d+,\d+[:\.,]\d+'
>> 82
>> dot instead of colon '\d+\.\d+[:\.,]\d+'
>> 41
>> id before timestamp '^\s*\d+\s+\d+[:\.,]\d+'
>> 115
>> spaces in timestamp '(\d[\d\s]*[:\.,]\s*){2,3}\d[\d\s]*' and not
>> '(\d+[:\.,]){2,3}\d+'
>> 922
>> too long arrow '\d\s*-{3,}>\s*\d'
>> 326
>> none of the above
>> 969
>>
>>
>> The most common error is to use a dot instead of a comma.
>
> They're WebVTT files already. ;-)

Unlikely. :-)
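
For anyone wanting to repeat this on their own files, a rough sketch of the categorization step follows. It uses four of the patterns listed above, copied verbatim. That the category patterns are applied with re.search is an assumption (several of them carry an explicit '^', which suggests it); the function name categorize and the sample lines are invented for the example.

```python
import re

# A subset of the category patterns from the list above, copied verbatim.
CATEGORIES = {
    "hours too many": re.compile(r'(^|\s|>)\d{3,}[:\.,]\d+[:\.,]\d+'),
    "trailing garbage": re.compile(
        r'-->\s*(\d+[:\.,]){2,3}\d+(\s+[^\s]|[^\s\d:\.,])'
    ),
    "dot instead of comma": re.compile(r'\d+[:\.,]\d+[:\.,]\d+\.\d+'),
    "too long arrow": re.compile(r'\d\s*-{3,}>\s*\d'),
}


def categorize(line):
    """Return every category whose pattern is found anywhere in the line.

    Intended for lines that already failed the validity check. A line can
    land in several categories; if none applies it is reported as
    "none of the above".
    """
    hits = [name for name, rx in CATEGORIES.items() if rx.search(line)]
    return hits or ["none of the above"]


# Hypothetical invalid lines, for illustration only.
samples = [
    "00:00:01.000 --> 00:00:03.000",
    "34500:24:01,000 --> 00:24:03,000",
    "00:00:01,000 --> 00:00:03,000Hello.",
    "00:00:01,000 ---> 00:00:03,000",
    "00:00:01.000 --> 00:00:03.000Hello",
]

for sample in samples:
    print(sample, "=>", categorize(sample))
```

The last sample falls into two categories at once, which is why the per-category counts add up to more than the number of invalid lines.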

>
>> Some appear to be a different format, and some appear to be just
>> garbage.
>>
>> Too few or too many hours might not technically be an error; however,
>> it appeared that some of the too-many-hours cases were ones where the
>> line break between the id and the timestamp was missing (with no
>> whitespace between them), e.g.:
>>
>> 34500:24:01,000 --> 00:24:03,000
>>
>> The trailing garbage is mostly cases where the line break between the
>> timestamp and the cue text is missing, e.g.:
>>
>> 00:00:01,000 --> 00:00:03,000Hello.
>
> So we have a lot more errors coming from missing new lines than from
> mis-authoring the hour, minute or seconds number? That's encouraging.
> The only common number mistake seems to be to make the decimals
> shorter than 3 numbers. Maybe we can resolve this by just having a
> rule for what that should be interpreted as?

That's still very rare in this sample: 2,085/52,000,000 ≈ 0.004% of all
cues.
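
The arithmetic, spelled out as a small sketch. The counts are the rounded figures from this thread; the variable names and the percent helper are made up for the example.

```python
# Rounded counts taken from the figures quoted in this thread.
total_cue_lines = 52_000_000   # lines containing "-->"
invalid_lines = 31_900         # lines failing the validity regexp
decimals_too_few = 2_085
dot_instead_of_comma = 25_372


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Share of all cue lines.
print("decimals too few: %.4f%% of all cues"
      % percent(decimals_too_few, total_cue_lines))
print("all invalid lines: %.4f%% of all cues"
      % percent(invalid_lines, total_cue_lines))

# Share of the invalid lines only. Categories overlap, so such shares
# need not sum to 100%.
print("dot instead of comma: %.1f%% of invalid lines"
      % percent(dot_instead_of_comma, invalid_lines))
```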

-- 
Simon Pieters
Opera Software
Received on Thursday, 6 October 2011 01:27:41 UTC