Re: Starting from Srikumar Karaikudi Subramanian on 2013-05-09 (public-audio@w3.org from April to June 2013)

From: Srikumar Karaikudi Subramanian <srikumarks@gmail.com>
Date: Thu, 9 May 2013 18:06:27 +0530
To: Ehsan Akhgari <ehsan.akhgari@gmail.com>
Cc: Joseph Berkovitz <joe@noteflight.com>, Chris Rogers <crogers@google.com>, Stuart Memo <stuartmemo@gmail.com>, "public-audio@w3.org" <public-audio@w3.org>
Message-Id: <FEAD0D15-C9CB-41A5-9C17-B1743BDAF75C@gmail.com>
> How much of that is implementable in the practice across the wide range of the available devices and operating systems in existence today and conceivable in the future?

What is useful to take from that design is the notion of what a "time stamp" is. "Time stamp" should mean "a snapshot of all the clocks". While "a snapshot of all the clocks" seems to demand the impossible -- who knows how many clocks the system is running? -- we only need a snapshot of enough clocks, taken at the right time, to infer all the others. In practice, this means that a time stamp necessarily includes a sufficiently high-resolution host time such as performance.now(). In the language of the Khronos spec, I'm saying a time stamp is at least "UST+MSC".
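
To make the "UST+MSC" idea concrete, here is a rough sketch of what such a time stamp could look like and how the other clock can be inferred from it. The names TimeStamp, hostTimeForFrame and frameForHostTime are hypothetical and are not part of the Web Audio API or the Khronos spec; the sketch also assumes the two clocks do not drift between the snapshot and the moment of interest.

```typescript
// Hypothetical shape of a "time stamp": one snapshot of two clocks.
// None of these names come from the Web Audio API or the Khronos spec.
interface TimeStamp {
  hostTimeMs: number;  // UST-like: host clock reading, e.g. from performance.now()
  sampleFrame: number; // MSC-like: audio frame counter read at that same instant
  sampleRate: number;  // frames per second, needed to relate the two clocks
}

// Predict the host time (ms) at which a given audio frame falls,
// assuming no drift between the clocks since the snapshot.
function hostTimeForFrame(ts: TimeStamp, frame: number): number {
  return ts.hostTimeMs + ((frame - ts.sampleFrame) / ts.sampleRate) * 1000;
}

// Inverse mapping: the audio frame that corresponds to a given host time (ms).
function frameForHostTime(ts: TimeStamp, hostTimeMs: number): number {
  return ts.sampleFrame + ((hostTimeMs - ts.hostTimeMs) / 1000) * ts.sampleRate;
}
```

With a single such snapshot, a drawing callback that knows its own host time can ask which audio frame it lines up with, and an audio callback can ask when a given frame will be reached on the host clock.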

Such a definition of "time stamp" simplifies communicating what a particular time value means. For example, in the current design, it is difficult for a developer to infer what "playbackTime" would mean in a graph like "ScriptNode1 -> DelayNode -> ScriptNode2 -> Destination". If we used proper time stamps, then the time stamp of the output buffer in a script node would be "the earliest time at which the audio can leave the system's output jack" (for whatever definition of "output jack").

Now, it is (obviously) not enough to just say "we have currentTime and performance.now(), just call them separately when you need them". For synchronization, we need the system time at which the "currentTime" property took on that particular value. The problem becomes clear if the audio system is made to run with a ridiculous buffer size like 10 seconds -- "currentTime" would change only every 10 seconds. With complete time stamp information it is surely possible to produce synchronized audio and visuals even with that kind of buffer size, but without it, that would be impossible. A buffer size of 1024 samples (about 23 ms at 44.1 kHz) is already longer than one frame duration at 60fps (about 16.7 ms), preventing tight synchronization. If 4k is used, we're dead.
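
The arithmetic behind this can be sketched as follows. The function names are hypothetical, the 44.1 kHz sample rate used in the note afterwards is my assumption, and the estimate assumes the audio and host clocks do not drift over the interval.

```typescript
// Duration of one audio buffer in milliseconds.
function bufferDurationMs(bufferSize: number, sampleRate: number): number {
  return (bufferSize / sampleRate) * 1000;
}

// Estimate the audio clock "now" from a stamped pair.
//   audioTimeSec: the value that currentTime took on
//   hostTimeMs:   the host clock reading at the moment it took on that value
//   nowMs:        the host clock reading right now
// Without hostTimeMs there is nothing to extrapolate from, and the estimate
// can be wrong by up to one full buffer duration.
function estimateAudioTimeSec(
  audioTimeSec: number,
  hostTimeMs: number,
  nowMs: number
): number {
  return audioTimeSec + (nowMs - hostTimeMs) / 1000;
}
```

For instance, bufferDurationMs(1024, 44100) comes out larger than 1000 / 60, the duration of one frame at 60fps, and a 4096-sample buffer is four times longer still. With the stamped pair, a drawing callback can call estimateAudioTimeSec with its own host time and get a usable audio time even when currentTime itself has not moved for a while.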

One example: in Apple's CoreAudio, the AudioTimeStamp structure is an aggregate of multiple time coordinate systems. The time stamps given in audio callbacks in CoreAudio also include a high resolution host time (on both OS X and iOS), which helps synchronize drawing in CADisplayLink callbacks.

Best,
-Kumar

On 9 May, 2013, at 8:33 AM, Ehsan Akhgari <ehsan.akhgari@gmail.com> wrote:

> 
> On Tue, May 7, 2013 at 10:01 PM, Srikumar Karaikudi Subramanian <srikumarks@gmail.com> wrote:
> 
> On 7 May, 2013, at 6:58 PM, Joseph Berkovitz <joe@noteflight.com> wrote:
>> 
>> playbackTime isn't something that is "accurate" or "inaccurate".  playbackTime is completely deterministic since it describes a sample block's time relationship with other schedulable sources in the graph, not the actual time at which the sample block is heard. So it has nothing to do with buffering. The value of playbackTime in general must advance by (bufferSize/sampleRate) in each successive call, unless blocks of samples are being skipped outright by the implementation to play catch-up for some reason.
>> 
>> Of course any schedulable source whatsoever has a risk of being delayed or omitted from the physical output stream due to unexpected event handling latencies. Thus, playbackTime (like the argument to AudioBufferSourceNode.start()) is a prediction but not a guarantee. The job of the implementation is to minimize this risk by various buffering strategies, but this does not require any ad-hoc adjustments to playbackTime.
> 
> Many years ago when I was looking at audio-visual synchronization approaches for another system, one of the easiest to understand approaches I found was the "UST/MSC/SBC" approach described in the Khronos OpenML documents [1]. In essence, it says (according to my understanding) that every signal coming into the computing system is time stamped w.r.t. when it arrived on some "input jack", and every computed signal intended to leave the system is time stamped w.r.t. when it will leave the respective "output jack". This holds for both video and audio signals. 
> 
> Whether a signal actually leaves the system at the stamped time is up to the scheduler and the other system constraints, but from the perspective of the process computing the signal, it has done its job once the time stamp is set.
> 
> UST/MSC/SBC may serve as an adequate framework for explaining the various time stamps in the system and the relationships between them, as well as provide an API to the various schedulers. We already have to deal with three now - graphics, audio samples and MIDI events.
> 
> How much of that is implementable in the practice across the wide range of the available devices and operating systems in existence today and conceivable in the future?
> 
> --
> Ehsan
> <http://ehsanakhgari.org/>
>
Received on Thursday, 9 May 2013 12:37:04 UTC