Re: Why we need SMPTE timecodes to do frame-accurate processing

On Thu, 15 Jan 2009, Jack Jansen wrote:

>
> On  14-Jan-2009, at 23:50 , Dave Singer wrote:
>
>> some comments here.
>
> And some comments from me, again.
>
> Dave: you obviously know a lot more about the subject than I do, so don't 
> hesitate to point out any problems with my reasoning.
>
> Also, as an introduction: I wrote this email as a reaction to something that 
> came up at the face-to-face: can't we just do microsecond timestamps only, 
> and simply convert any other timestamps (if we allow them in, say, the URL 
> fragment specifier) to microseconds? When we standardised clipBegin/clipEnd 
> for SMIL this same question came up, so I summarized what I know about the 
> subject.

If you know which version of smpte is in use, then you might be able to 
reverse the algorithm to compute the starting frame based on the time 
selected (and identify its true time), no?

The burden of doing this mapping is on the server, but it avoids having 
too many units in the URI definition of the fragment.

>> At 14:25  +0100 14/01/09, Jack Jansen wrote:
>>> As promised, here's a short explanation why having only seconds (or 
>>> microseconds, etc)-based time is not good enough for 
>>> frame-accurate selection of video content.
>>> 
>>> SMPTE timecodes come in a number of flavors, among them SMPTE-24 (from 
>>> film, originally, 24fps) and SMPTE-25 (from PAL television). Most of these 
>>> flavors are easy to convert from and to linear time, because they are 
>>> themselves linear and monotonic. I.e. SMPTE-24 hh:mm:ss:ff can be 
>>> converted to seconds by doing hh*3600+mm*60+ss+ff/24.
>>> 
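As a rough illustration of the linear case above, here is a minimal Python sketch. The function name `timecode_to_seconds` is hypothetical, and it assumes a non-drop timecode written as hh:mm:ss:ff with an integer frame rate. It returns an exact rational rather than a float, so no rounding creeps in.

```python
from fractions import Fraction


def timecode_to_seconds(tc: str, fps: int) -> Fraction:
    """Convert a non-drop hh:mm:ss:ff timecode to exact seconds.

    Implements hh*3600 + mm*60 + ss + ff/fps, as described for SMPTE-24.
    """
    hh, mm, ss, ff = (int(part) for part in tc.split(":"))
    if not 0 <= ff < fps:
        raise ValueError(f"frame number {ff} out of range for {fps} fps")
    return hh * 3600 + mm * 60 + ss + Fraction(ff, fps)


# 3 minutes 10 seconds plus 12 frames at 24 fps
print(timecode_to_seconds("00:03:10:12", 24))  # 381/2, i.e. 190.5 s
```

The same function covers SMPTE-25 by passing `fps=25`.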
>>> The problem starts with NTSC timecodes. NTSC is commonly thought of as 30 
>>> frames per second, but it's actually 29.97 frames per second.
>> 
>> Actually, that's not quite right either.  The frame duration is actually 
>> 1001 ticks of a 30,000 per second clock (29.97003).
>
> Agreed.
>
>>> There are two common ways to solve this issue:
>>> 
>>> 1. smpte-30 ignores the problem, it just says "there's 30 frames in every 
>>> second".
>> 
>> Well, if you are doing true 30 fps material, then SMPTE-30 is not ignoring 
>> the problem, it is the correct labelling.
>
> Agreed.
>
>>> So, consecutive frames are numbered consecutively. Conversion between 
>>> timestamps and (milli)seconds is just as easy as for smpte-24. However, 
>>> there is a playback problem: 30 frames should play back not in 1.000 
>>> second but in 1.001 second. Ignoring this is not an option, especially not 
>>> when there is an audio track too: if we have a 44 kHz audio track we should 
>>> play back 44044 audio samples in the same time as we play back 30 video 
>>> frames. If we don't do this then by the time we're 15 minutes into a 
>>> presentation, audio and video will be out-of-sync by 1 second. Not 
>>> everyone can spot off-by-one-frame sync errors, but off-by-one-second is 
>>> clearly too much:-)
>>> 
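The arithmetic in the paragraph above can be checked with a small sketch. It assumes the 1001/30000 second frame duration mentioned earlier in the thread and an audio rate of exactly 44000 Hz; the names `audio_samples_for` and `drift_after` are made up for illustration.

```python
from fractions import Fraction

NTSC_FRAME = Fraction(1001, 30000)  # true duration of one NTSC frame, in seconds
AUDIO_RATE = 44000                  # assumed sample rate ("44 kHz" taken literally)


def audio_samples_for(frames: int) -> Fraction:
    """Audio samples that must play while `frames` video frames are shown."""
    return frames * NTSC_FRAME * AUDIO_RATE


def drift_after(nominal_seconds: int) -> Fraction:
    """Seconds of drift once `nominal_seconds` worth of frames, counted at
    30 per second, have actually been shown at 1001/30000 s each."""
    frames = nominal_seconds * 30
    return frames * NTSC_FRAME - nominal_seconds


print(audio_samples_for(30))  # 44044
print(drift_after(15 * 60))   # 9/10, i.e. 0.9 s after 15 minutes
```

So the drift is one part in a thousand: 0.9 seconds at the 15 minute mark, which is the "roughly one second" in the text.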
>>> 2. smpte-30drop fixes the problem with a solution similar to leap years: 
>>> at the beginning of every minute *except if the minute is divisible by 10* 
>>> there are no frames 00 and 01. So, it's not frames that are dropped, but 
>>> numbers. So, after frame 00:00:59:29 we get frame 00:01:00:02. But, after 
>>> frame 00:09:59:29 we get 00:10:00:00.
>>> Now we can blissfully ignore the audio/video sync problem and over the 
>>> course of 10 minutes audio and video will slowly drift apart, but at most 
>>> 2 frames.
>> 
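For reference, the numbering rule described above can be sketched as follows. This is only an illustration under the stated rule (labels 00 and 01 are skipped at the start of every minute not divisible by 10) and the 1001/30000 second frame duration; the function names are hypothetical and the hh:mm:ss:ff separator follows the notation used in this thread.

```python
from fractions import Fraction

FRAME_DURATION = Fraction(1001, 30000)  # NTSC frame duration in seconds


def drop_frame_to_index(tc: str) -> int:
    """Map an smpte-30-drop hh:mm:ss:ff label to a 0-based frame index."""
    hh, mm, ss, ff = (int(part) for part in tc.split(":"))
    if not 0 <= ff < 30:
        raise ValueError(f"frame number {ff} out of range")
    if ss == 0 and ff < 2 and mm % 10 != 0:
        raise ValueError(f"{tc} is a skipped label in drop-frame timecode")
    total_minutes = hh * 60 + mm
    # Two labels are skipped per minute, except in minutes divisible by 10.
    skipped = 2 * (total_minutes - total_minutes // 10)
    return (hh * 3600 + mm * 60 + ss) * 30 + ff - skipped


def drop_frame_to_seconds(tc: str) -> Fraction:
    """Exact start time of the frame carrying the given drop-frame label."""
    return drop_frame_to_index(tc) * FRAME_DURATION


# The two transitions mentioned above are consecutive frames:
print(drop_frame_to_index("00:00:59:29"), drop_frame_to_index("00:01:00:02"))
print(drop_frame_to_index("00:09:59:29"), drop_frame_to_index("00:10:00:00"))
```

Note that the label 00:10:00:00 maps to a time slightly short of 600 seconds, which is the residual drift the scheme accepts.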
>> I think it should be clear that audio-video *sync* is within the media 
>> engine and not really to do with fragments, right?
>> 
>> 
>> 
>> Overall, I don't mind using SMPTE time-codes as one possible way to index 
>> into a time-line,  but they really should only be used if the content 
>> actually contains embedded definitive time-codes.  Guessing the zero-origin 
>> of the SMPTE time-code is not more accurate than using simple wall-clock 
>> time.
>
> Absolutely agreed. A point that I did not make in my original message is that 
> frame accuracy is only achieved if whoever creates the URL fragment 
> specifier does so after inspection of the original media, and uses the exact 
> same encoding for the fragment as is used in the media stream.
>
>> Wall-clock time is perfectly diagnostic of a time in a presentation, and 
>> can be expressed with enough accuracy to be frame-accurate, ideally by 
>> using a rational.  (e.g. if I say please start at time 190+(3003/30000) 
>> seconds in, for NTSC material, it's clear I am starting at the end 
>> of the 3rd frame (i.e. the 4th frame) at time 3 minutes 10 seconds in.)
>
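To make the rational approach concrete, here is a small sketch of the reverse mapping a server might do: given an exact time, find the frame being displayed and that frame's true start time. It assumes constant-rate NTSC material starting at time zero; `frame_at` is a hypothetical name, not part of any proposal.

```python
from fractions import Fraction

FRAME_DURATION = Fraction(1001, 30000)  # NTSC frame duration in seconds


def frame_at(t: Fraction) -> tuple[int, Fraction]:
    """Return (0-based index, exact start time) of the frame shown at time t."""
    index = t // FRAME_DURATION  # floor division on Fractions yields an int
    return index, index * FRAME_DURATION


# Exactly three frame durations in: the 4th frame (index 3) starts here.
print(frame_at(Fraction(3003, 30000)))

# An arbitrary time need not sit on a frame boundary; the true start time
# of the selected frame is then slightly earlier than the requested time.
index, start = frame_at(190 + Fraction(3003, 30000))
print(index, float(start))
```

With floats or fixed-precision decimals the boundary case in the first call could land on either side of the frame edge; the rational keeps it exact.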
> Interesting... I had not thought about using rationals. (I do think that the 
> assertion "wall clock time is good enough" hinges on the availability of 
> rationals for specifying the time). Indeed, if rationals are good enough for 
> QuickTime they must be good enough for us:-)
>
>> This area is further complicated if you drop the assumption that frame 
>> rates (and hence durations) are constant, and also by wanting 
>> sample-accuracy in audio (does anyone remember the sampling rates used by 
>> the original Mac?).
>
> If frame rates can vary then the only thing that can give frame accuracy (I 
> think) is SMPTE timecodes, and those only if each frame has a timecode. Now 
> timecodes become identifiers, really.
>
> If we want to do sample accuracy for audio (and we don't want to go to 
> 64 bit integers) then rationals are the only option.
> --
> Jack Jansen, <Jack.Jansen@cwi.nl>, http://www.cwi.nl/~jack
> If I can't dance I don't want to be part of your revolution -- Emma Goldman
>
>
>

-- 
Baroula que barouleras, au tiéu toujou t'entourneras.

         ~~Yves

Received on Monday, 19 January 2009 13:44:58 UTC