Re: Why we need SMPTE timecodes to do frame-accurate processing from Dave Singer on 2009-01-14 (public-media-fragment@w3.org from January 2009)

From: Dave Singer <singer@apple.com>
Date: Wed, 14 Jan 2009 14:50:38 -0800
To: Jack Jansen <Jack.Jansen@cwi.nl>, Media Fragment <public-media-fragment@w3.org>
Message-Id: <p0624081fc594184ced79@[17.202.35.52]>

Some comments here.

At 14:25  +0100 14/01/09, Jack Jansen wrote:
>As promised, here's a short explanation of why having only seconds 
>(or microseconds, etc.)-based time is not good enough for 
>frame-accurate selection of video content.
>
>SMPTE timecodes come in a number of flavors, among them SMPTE-24 
>(from film, originally, 24fps) and SMPTE-25 (from PAL television). 
>Most of these flavors are easy to convert from and to linear time, 
>because they are themselves linear and monotonic. I.e. SMPTE-24 
>hh:mm:ss:ff can be converted to seconds by doing 
>hh*3600+mm*60+ss+ff/24.
>
>The problem starts with NTSC timecodes. NTSC is commonly thought of 
>as 30 frames per second, but it's actually 29.97 frames per second.

Actually, that's not quite right either.  The frame duration is 
exactly 1001 ticks of a 30,000-per-second clock, i.e. 30000/1001 
frames per second (about 29.97003).
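
To make that arithmetic concrete, here is a minimal Python sketch using exact rationals. The names NTSC_FRAME_DURATION and drift_seconds are made up for illustration and do not come from any standard or library.

```python
from fractions import Fraction

# One NTSC frame lasts 1001 ticks of a 30,000 Hz clock.
NTSC_FRAME_DURATION = Fraction(1001, 30000)   # seconds per frame
NTSC_FRAME_RATE = 1 / NTSC_FRAME_DURATION     # 30000/1001 frames per second

def drift_seconds(elapsed_seconds):
    """Error accumulated when NTSC frames are treated as exactly 1/30 s.

    Hypothetical helper, for illustration only.
    """
    frames = Fraction(elapsed_seconds) * 30   # frames counted at a nominal 30 fps
    return frames * NTSC_FRAME_DURATION - elapsed_seconds

print(float(NTSC_FRAME_RATE))          # approximately 29.97003
print(30 * NTSC_FRAME_DURATION)        # 1001/1000, i.e. 30 frames take 1.001 s
print(float(drift_seconds(15 * 60)))   # 0.9 s after 15 minutes
```

Keeping the duration as a rational avoids the rounding that a decimal such as 29.97 introduces.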

>There are two common ways to solve this issue:
>
>1. smpte-30 ignores the problem: it just says "there are 30 frames 
>in every second".

Well, if you are doing true 30 fps material, then SMPTE-30 is not 
ignoring the problem, it is the correct labelling.

>So, consecutive frames are numbered consecutively. Conversion 
>between timestamps and (milli)seconds is just as easy as for 
>smpte-24. However, there is a playback problem: 30 frames should 
>play back not in 1.000 seconds but in 1.001 seconds. Ignoring this 
>is not an option, especially not when there is an audio track too: 
>if we have a 44 kHz audio track we should play back 44044 audio 
>samples in the same time as we play back 30 video frames. If we 
>don't do this then by the time we're 15 minutes into a 
>presentation, audio and video will be out of sync by about 1 
>second. Not everyone can spot off-by-one-frame sync errors, but 
>off-by-one-second is clearly too much :-)
>
>2. smpte-30drop fixes the problem with a solution similar to leap 
>years: at the beginning of every minute *except if the minute is 
>divisible by 10* there are no frames 00 and 01. So, it's not frames 
>that are dropped, but numbers. So, after frame 00:00:59:29 we get 
>frame 00:01:00:02. But, after frame 00:09:59:29 we get 00:10:00:00.
>Now we can blissfully ignore the audio/video sync problem and over 
>the course of 10 minutes audio and video will slowly drift apart, 
>but by at most 2 frames.
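
The numbering rule quoted above can be sketched in Python as follows. This is only an illustration, assuming a constant 1001/30000-second frame duration and a timecode that starts at 00:00:00:00 on the first frame; the function names are hypothetical.

```python
from fractions import Fraction

FRAME_DURATION = Fraction(1001, 30000)  # seconds per NTSC frame

def dropframe_to_frame_index(hh, mm, ss, ff):
    """Number of frames before the one labelled hh:mm:ss:ff (smpte-30drop)."""
    total_minutes = 60 * hh + mm
    if ss == 0 and ff < 2 and total_minutes % 10 != 0:
        raise ValueError("labels 00 and 01 do not exist at this minute")
    # Two labels are skipped at the start of every minute not divisible by 10.
    skipped = 2 * (total_minutes - total_minutes // 10)
    return (total_minutes * 60 + ss) * 30 + ff - skipped

def dropframe_to_seconds(hh, mm, ss, ff):
    """Exact presentation time of the start of the labelled frame."""
    return dropframe_to_frame_index(hh, mm, ss, ff) * FRAME_DURATION

# Labels on either side of a skipped pair are still consecutive frames.
print(dropframe_to_frame_index(0, 0, 59, 29))    # 1799
print(dropframe_to_frame_index(0, 1, 0, 2))      # 1800
# Ten minutes of labels stay within a millisecond of ten minutes of real time.
print(float(dropframe_to_seconds(0, 10, 0, 0)))  # approximately 599.9994
```

Only labels are skipped, so the frame index stays contiguous and the conversion to seconds remains a single multiplication.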

I think it should be clear that audio-video *sync* is within the 
media engine and not really to do with fragments, right?

Overall, I don't mind using SMPTE time-codes as one possible way to 
index into a time-line,  but they really should only be used if the 
content actually contains embedded definitive time-codes.  Guessing 
the zero-origin of the SMPTE time-code is not more accurate than 
using simple wall-clock time.

Wall-clock time is perfectly diagnostic of a time in a presentation, 
and can be expressed with enough accuracy to be frame-accurate, 
ideally by using a rational.  (E.g. if I say please start at time 
190+(3003/30000) seconds in, for NTSC material, it's clear I am 
starting at the end of the 3rd frame (i.e. at the 4th frame) after 
3 minutes 10 seconds.)
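
As a rough illustration of why a rational is enough, here is a small Python sketch that maps a rational time to a frame. It assumes a constant NTSC frame duration and that the first frame starts at time zero; frame_index_at and frame_start are hypothetical names.

```python
from fractions import Fraction

NTSC_FRAME_DURATION = Fraction(1001, 30000)  # seconds per frame

def frame_index_at(t, frame_duration=NTSC_FRAME_DURATION):
    """Zero-based index of the frame being shown at rational time t (seconds)."""
    return t // frame_duration   # floor division of Fractions yields an int

def frame_start(n, frame_duration=NTSC_FRAME_DURATION):
    """Exact start time, in seconds, of the frame with zero-based index n."""
    return n * frame_duration

# An offset of 3003/30000 s is exactly the end of the 3rd frame,
# i.e. the start of the 4th frame (zero-based index 3).
print(frame_index_at(Fraction(3003, 30000)))   # 3
print(frame_index_at(Fraction(3002, 30000)))   # 2, still inside the 3rd frame
print(frame_start(3) == Fraction(3003, 30000)) # True
```

Because every boundary is an exact multiple of 1001/30000, the comparison is exact and there is no rounding to argue about.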

This area is further complicated if you drop the assumption that 
frame rates (and hence durations) are constant, and also by wanting 
sample-accuracy in audio (does anyone remember the sampling rates 
used by the original Mac?).
-- 
David Singer
Multimedia Standards, Apple Inc.

Received on Wednesday, 14 January 2009 22:53:27 UTC