Re: Why we need SMPTE timecodes to do frame-accurate processing from Jack Jansen on 2009-01-15 (public-media-fragment@w3.org from January 2009)

From: Jack Jansen <Jack.Jansen@cwi.nl>
Date: Thu, 15 Jan 2009 22:07:58 +0100
To: Dave Singer <singer@apple.com>
Cc: Media Fragment <public-media-fragment@w3.org>
Message-Id: <B26D6F7B-98DB-4025-BBC0-491C632C2E98@cwi.nl>
On 14-Jan-2009, at 23:50, Dave Singer wrote:

> some comments here.

And some comments from me, again.

Dave: you obviously know a lot more about the subject than I do, so  
don't hesitate to point out any problems with my reasoning.

Also, as an introduction: I wrote this email as a reaction to  
something that came up at the face-to-face: can't we just do  
microsecond timestamps only, and simply convert any other timestamps  
(if we allow them in, say, the URL fragment specifier) to  
microseconds? When we standardised clipBegin/clipEnd for SMIL this  
same question came up, so I summarized what I know about the subject.
>
>
> At 14:25  +0100 14/01/09, Jack Jansen wrote:
>> As promised, here's a short explanation of why having only seconds
>> (or microseconds, etc)-based time is not good enough for
>> frame-accurate selection of video content.
>>
>> SMPTE timecodes come in a number of flavors, among them SMPTE-24  
>> (from film, originally, 24fps) and SMPTE-25 (from PAL television).  
>> Most of these flavors are easy to convert from and to linear time,
>> because they are themselves linear and monotonic. I.e. SMPTE-24
>> hh:mm:ss:ff can be converted to seconds by doing
>> hh*3600+mm*60+ss+ff/24.
>>
>> The problem starts with NTSC timecodes. NTSC is commonly thought of  
>> as 30 frames per second, but it's actually 29.97 frames per second.
>
> Actually, that's not quite right either.  The frame duration is  
> actually 1001 ticks of a 30,000 per second clock (29.97003).

Agreed.
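
A minimal sketch of the two conversions discussed so far, using exact rationals rather than floats. The function names are hypothetical and chosen only for illustration; the NTSC rate is taken as 30000/1001 frames per second, as Dave describes.

```python
from fractions import Fraction

# Exact NTSC frame rate: one frame lasts 1001 ticks of a 30000 Hz clock.
NTSC_FPS = Fraction(30000, 1001)

def parse_timecode(tc):
    """Split 'hh:mm:ss:ff' into four integers."""
    hh, mm, ss, ff = (int(part) for part in tc.split(":"))
    return hh, mm, ss, ff

def linear_seconds(tc, fps):
    """hh*3600 + mm*60 + ss + ff/fps, for integer-rate, non-drop timecodes."""
    hh, mm, ss, ff = parse_timecode(tc)
    return hh * 3600 + mm * 60 + ss + Fraction(ff, fps)

def ntsc_frame_start(frame_index):
    """Exact start time, in seconds, of a zero-based frame at 30000/1001 fps."""
    return frame_index / NTSC_FPS

print(linear_seconds("00:00:01:12", 24))  # 3/2
print(ntsc_frame_start(30))               # 1001/1000
```

With these definitions, 30 NTSC frames take 1001/1000 of a second rather than exactly one second, which is the whole source of the trouble.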

>> There are two common ways to solve this issue:
>>
>> 1. smpte-30 ignores the problem, it just says "there's 30 frames in  
>> every second".
>
> Well, if you are doing true 30 fps material, then SMPTE-30 is not  
> ignoring the problem, it is the correct labelling.

Agreed.

>> So, consecutive frames are numbered consecutively. Conversion  
>> between timestamps and (milli)seconds is just as easy as for  
>> smpte-24. However, there is a playback problem: 30 frames should  
>> play back not in 1.000 second but in 1.001 second. Ignoring this is  
>> not an option, especially not when there is an audio track too: if  
>> we have a 44 kHz audio track we should play back 44044 audio samples
>> in the same time as we play back 30 video frames. If we don't do
>> this then by the time we're 15 minutes into a presentation, audio
>> and video will be out of sync by almost 1 second. Not everyone can
>> spot off-by-one-frame sync errors, but off-by-one-second is clearly
>> too much:-)
>>
>> 2. smpte-30drop fixes the problem with a solution similar to leap  
>> years: at the beginning of every minute *except if the minute is  
>> divisible by 10* there are no frames 00 and 01. So, it's not frames  
>> that are dropped, but numbers. So, after frame 00:00:59:29 we get  
>> frame 00:01:00:02. But, after frame 00:09:59:29 we get 00:10:00:00.
>> Now we can blissfully ignore the audio/video sync problem and over  
>> the course of 10 minutes audio and video will slowly drift apart,  
>> but by at most 2 frames.
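
The numbering rule in point 2 can be sketched as a pair of conversions between a zero-based frame count and a smpte-30drop label. This is only an illustration of the rule as stated above; the function names are hypothetical, and hours are not wrapped at 24.

```python
LABELS_PER_SHORT_MIN = 60 * 30 - 2      # 1798: a minute that skips labels 00 and 01
FRAMES_PER_10MIN = 10 * 60 * 30 - 18    # 17982: nine short minutes plus one full one

def dropframe_to_index(tc):
    """Map a 'hh:mm:ss:ff' drop-frame label to a zero-based frame count."""
    hh, mm, ss, ff = (int(part) for part in tc.split(":"))
    total_minutes = hh * 60 + mm
    # Two labels are skipped per minute, except minutes divisible by 10.
    skipped = 2 * (total_minutes - total_minutes // 10)
    return (total_minutes * 60 + ss) * 30 + ff - skipped

def index_to_dropframe(n):
    """Map a zero-based frame count back to its drop-frame label."""
    tens, rest = divmod(n, FRAMES_PER_10MIN)
    if rest < 1800:
        # First minute of each 10-minute block keeps all 1800 labels.
        minute_in_block, label_in_minute = 0, rest
    else:
        minute_in_block, r = divmod(rest - 1800, LABELS_PER_SHORT_MIN)
        minute_in_block += 1
        label_in_minute = r + 2         # labels 00 and 01 do not exist here
    hh, mm = divmod(tens * 10 + minute_in_block, 60)
    ss, ff = divmod(label_in_minute, 30)
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"

print(index_to_dropframe(dropframe_to_index("00:00:59:29") + 1))  # 00:01:00:02
print(index_to_dropframe(dropframe_to_index("00:09:59:29") + 1))  # 00:10:00:00
```

No frame is ever discarded: consecutive frame counts always map to valid labels, and only the labels jump.
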
>
> I think it should be clear that audio-video *sync* is within the  
> media engine and not really to do with fragments, right?
>
>
>
> Overall, I don't mind using SMPTE time-codes as one possible way to  
> index into a time-line,  but they really should only be used if the  
> content actually contains embedded definitive time-codes.  Guessing  
> the zero-origin of the SMPTE time-code is not more accurate than  
> using simple wall-clock time.

Absolutely agreed. A point that I did not make in my original message  
is that frame accuracy is only achieved if whoever creates the  
URL fragment specifier does so after inspection of the original media,  
and uses the exact same encoding for the fragment as is used in the  
media stream.

> Wall-clock time is perfectly diagnostic of a time in a presentation,  
> and can be expressed with enough accuracy to be frame-accurate,  
> ideally by using a rational (e.g. if I say please start at time
> 190+(3003/30000) seconds in, for NTSC material, it's clear I am
> starting at the end of the 3rd frame (i.e. at the 4th frame) after
> the point 3 minutes 10 seconds in).

Interesting... I had not thought about using rationals. (I do think  
that the assertion "wall clock time is good enough" hinges on the  
availability of rationals for specifying the time). Indeed, if  
rationals are good enough for QuickTime they must be good enough for  
us:-)
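
A small sketch of why the rational matters, under the assumption that frame boundaries fall at origin + k * 1001/30000 seconds. The helper name and the `origin` parameter are hypothetical; in Dave's example the origin is taken to be the point 190 seconds in.

```python
from fractions import Fraction

FRAME_DURATION = Fraction(1001, 30000)  # seconds per NTSC frame

def frames_elapsed(t, origin=Fraction(0)):
    """Whole frames completed between origin and t (both in seconds)."""
    return (t - origin) // FRAME_DURATION

# Dave's example: 190 + 3003/30000 seconds is exactly 3 frames past 190 s,
# so playback starts on the 4th frame.
print(frames_elapsed(190 + Fraction(3003, 30000), origin=Fraction(190)))  # 3

# One frame lasts 33366.66... microseconds. Truncating that to a whole
# number of microseconds lands just before the frame boundary.
print(frames_elapsed(FRAME_DURATION))                # 1
print(frames_elapsed(Fraction(33366, 1_000_000)))    # 0
```

Rounding to the nearest microsecond instead of truncating would rescue this particular case, but the consumer of the timestamp then has to know which rounding rule the producer used.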

> This area is further complicated if you drop the assumption that  
> frame rates (and hence durations) are constant, and also by wanting  
> sample-accuracy in audio (does anyone remember the sampling rates  
> used by the original Mac?).

If frame rates can vary then the only thing that can give frame  
accuracy (I think) is SMPTE timecodes, and those only if each frame  
has a timecode. Now timecodes become identifiers, really.

If we want to do sample accuracy for audio (and we don't want
to go to 64 bit integers) then rationals are the only option.
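
One way to read this point, as a sketch: with the sample rate as denominator, sample positions are exact and the numerator stays small, whereas microsecond counts are neither exact nor small. The 44100 Hz rate and the signed 32-bit limit are assumptions made for the illustration, and the names are hypothetical.

```python
from fractions import Fraction

INT32_MAX = 2**31 - 1
SAMPLE_RATE = 44100  # assumed rate, for illustration only

def sample_time(n, rate=SAMPLE_RATE):
    """Exact time of sample n, as a rational number of seconds."""
    return Fraction(n, rate)

def exact_in_microseconds(t):
    """True if t is a whole number of microseconds."""
    return (t * 1_000_000).denominator == 1

one_hour_in_samples = 3600 * SAMPLE_RATE      # numerator for a position one hour in
one_hour_in_microseconds = 3600 * 1_000_000

print(exact_in_microseconds(sample_time(1)))      # False
print(one_hour_in_samples <= INT32_MAX)           # True
print(one_hour_in_microseconds <= INT32_MAX)      # False
```
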
--
Jack Jansen, <Jack.Jansen@cwi.nl>, http://www.cwi.nl/~jack
If I can't dance I don't want to be part of your revolution -- Emma  
Goldman
Received on Thursday, 15 January 2009 21:08:41 UTC