
Re: How to play back synthesized 22kHz audio in a glitch-free manner?

From: Joseph Berkovitz <joe@noteflight.com>
Date: Tue, 18 Jun 2013 15:26:40 -0400
Cc: Jer Noble <jer.noble@apple.com>, Chris Rogers <crogers@google.com>, Kevin Gadd <kevin.gadd@gmail.com>, "Robert O'Callahan" <robert@ocallahan.org>, "public-audio@w3.org" <public-audio@w3.org>
Message-Id: <D62304DD-E9EC-431F-8654-468AC29AAB5A@noteflight.com>
To: Jukka Jylänki <jujjyl@gmail.com>
Thanks for the very complete reply. I think this adds a lot of clarity to what's being requested.

With respect to use cases only -- and you raised other important points as well (like push vs. pull, FP vs. integer, and more)  -- I think the key new element here is: "applications that load/receive/synthesize buffers of audio at *different* frequencies directly in JavaScript and need them to be played back as one continuous high-quality stream". The emphasis is on *different* frequencies: in other words, different from the prevailing AudioContext sample rate. That is a case that is not being handled by the API today.


On Jun 18, 2013, at 3:01 PM, Jukka Jylänki <jujjyl@gmail.com> wrote:

> Two examples of the use case were demonstrated in the first post I made (the other is even a fully running application, not just a toy demo). Other use cases include implementing streamed audio playback, music players, software music synthesizers, VOIP calls, audio synchronized to video and games, and other applications that load/receive/synthesize buffers of audio at different frequencies directly in JavaScript and need them to be played back as one continuous high-quality stream.
> 
> Here is an example proposal of an addition to the spec:
> 
> In https://dvcs.w3.org/hg/audio/raw-file/tip/webaudio/specification.html#AudioBufferSourceNode , add to the AudioBufferSourceNode interface
> 
> double startImmediatelyAfter(AudioBufferSourceNode predecessor);
> 
> "The startImmediatelyAfter method
> 
> Schedules a sound to be played back with a seamless join to the given predecessor sound.
> 
> Use this function to guarantee a continuous, glitch-free join of the preceding and current sound source nodes. This sound buffer will be timed to start playing immediately after its predecessor finishes. Both sound buffers must contain an identical number of sound channels, and their sampling rates and playback rates must be identical. Neither of the sound source nodes may be looping.
> 
> An exception MUST be thrown if the predecessor node was not queued for playback, that is, if neither start() nor startImmediatelyAfter() has been called on the predecessor node, or if stop() has been called on the predecessor node. If the predecessor node has already finished its playback, this source node will start its playback immediately.
> 
> An exception MUST be thrown if this source node and the predecessor source node are not connected to the same destination, or if the predecessor source node is not connected to any destination.
> 
> Any given source node may be used only once as a predecessor for another source node. If a node is specified as a predecessor twice, an exception MUST be thrown.
> 
> Either start() or startImmediatelyAfter() may be called at most once on a source node, and only one of the two may be called on any given node.
> 
> This function returns the time (in seconds) this sound is scheduled to start playing. It is in the same time coordinate system as AudioContext.currentTime. "
> 
> That would allow a push model for feeding continuous audio buffers to the device, and the return value of the function would enable measuring over-/underbuffering. This functionality is more or less identical to what XAudio2, OpenAL, the Mozilla Audio Data API, DirectShow, DirectSound, and most likely every other native audio library implement. Also, this would let the JS application control both buffer sizing (fixed or variable, as needed) and the scheduling of when new data is needed, and would enable a flexible, non-millisecond-critical way to push new audio.
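As a sketch of how the proposal's return value could be used to measure over-/underbuffering (the proposed startImmediatelyAfter() does not exist, so the helper here is hypothetical and shows only the plain arithmetic, with no Web Audio objects):

```javascript
// Hypothetical sketch: track how much audio is queued ahead of the
// playback position, using the start times that start() or the
// proposed startImmediatelyAfter() would return. Pure arithmetic,
// so the accounting can be shown on its own.
function makeQueueTracker(sampleRate) {
  let scheduledEnd = null; // seconds; end time of the last queued buffer
  return {
    // Record a buffer of `frames` samples queued at `startTime` seconds
    // (the value the scheduling call returned).
    onQueued(startTime, frames) {
      scheduledEnd = startTime + frames / sampleRate;
    },
    // Positive: seconds of audio still buffered ahead of playback.
    // Zero or negative: the queue has underrun and the next buffer is late.
    leadTime(currentTime) {
      return scheduledEnd === null ? 0 : scheduledEnd - currentTime;
    },
  };
}
```

An application would push a new buffer whenever leadTime(ctx.currentTime) drops below some threshold.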
> 
> ScriptProcessorNode offers a pull model, but it is constrained compared to a push model. If one misses a sample (no data is available) when the callback fires, there is no way (to my knowledge) to feed the data immediately when it becomes available; one must wait until the next callback period, which is always a multiple of the block size. In a push model, data can be fed as soon as it is available. For example, for an application that uses buffers of 2048 samples at 22 kHz, missing data in an audio callback with ScriptProcessorNode causes a 2048/22050 ≈ 93 ms delay until the next callback fires, but in a push model, if the data became available within this period, it could be played immediately. 93 ms is a long pause.
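The worst-case delay above is simply one callback period; as a runnable check of the arithmetic:

```javascript
// Worst-case extra latency when a ScriptProcessorNode callback fires
// with no data available: one full block period, since the next
// chance to supply data is the next callback.
function callbackPeriodMs(blockSize, sampleRate) {
  return (blockSize / sampleRate) * 1000;
}

console.log(callbackPeriodMs(2048, 22050).toFixed(1)); // prints "92.9"
```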
> 
> Also, ScriptProcessorNode requires a constant buffer size, but some applications may fetch, synthesize, or decode data in variable block sizes. Additionally, it does not currently allow specifying the sample playback frequency, but requires that one be able to synthesize at the native device frequency, which the spec doesn't even specify. In practice, implementors would be required either to synthesize at whatever arbitrary rate from 22kHz to 96kHz the browser reports supporting, or to implement resampling on their own. It is already worrying that the Web Audio API supports only the Float32 format, so that JS code needs to implement per-sample U8/S8/U16/S16/U24/S24/etc. -> Float32 format conversions, which should definitely be the task of C/C++ SSE-optimized signal processing code. (I would hurry to add support for these formats to the spec as well, but that's another story.) Forcing users to write signal resamplers in JS would be even more catastrophic.
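The per-sample conversion mentioned above is the kind of loop every JS application currently has to hand-roll; a minimal S16 -> Float32 version looks like this (the divide-by-32768 scaling is one common convention, not something mandated by any spec):

```javascript
// Convert signed 16-bit PCM samples to Float32 in [-1, 1).
// This is the per-sample work the email argues belongs in optimized
// native code rather than in every JS application.
function s16ToFloat32(int16Samples) {
  const out = new Float32Array(int16Samples.length);
  for (let i = 0; i < int16Samples.length; i++) {
    out[i] = int16Samples[i] / 32768; // 32768 = 2^15
  }
  return out;
}
```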
> 
> Ideally, one would like to see both a pull and a push model supported with a strong capability set, but if only one were chosen, the push model is superior. Even if start(double when) is designed to be sample-precise, the very fact that we are discussing floating-point precision here is a bad smell, and a better API that allows an explicit contract is needed, because
> - it is, well, explicit,
> - it is easier to program against (there have now been at least three attempts in the Emscripten community to do buffer queueing, none of which got it right on the first, or even the second, try),
> - it is easier to implement (neither Chrome nor Firefox nightly currently produces glitch-free buffer joins), and
> - it does not require a mathematical proof based on http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html to convince that FP computation won't produce a drift along time to miss a sample.
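The explicit contract argued for above can be partially emulated today by doing the bookkeeping in integer sample frames and converting to seconds only at the last moment, so that no floating-point error accumulates across joins. A sketch of that accounting (the actual start(when) call is omitted; whether the implementation then lands sample-exactly is precisely what the proposal would make explicit):

```javascript
// Keep the queue position in integer sample frames. Floating-point
// error then cannot accumulate across many queued buffers, because
// each start time is derived from a single exact integer division
// rather than a running sum of inexact durations.
function makeFrameClock(sampleRate, startTime) {
  let frames = 0; // exact as long as it stays below 2^53
  return {
    // Returns the `when` (in seconds) at which the next buffer of
    // `bufferFrames` samples should be scheduled, then advances.
    nextStartTime(bufferFrames) {
      const when = startTime + frames / sampleRate;
      frames += bufferFrames;
      return when;
    },
    totalFrames: () => frames,
  };
}
```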
> 
> I do not see why the web should be a special case compared to the native world, or why the web audio spec could not just stick to the tried-and-true solutions that native audio APIs have offered for well over 15 years. The current spec gives the impression that 'the web is float-32-bit-only, likely-48-but-can-vary-by-browser-kHz', which to me is not good enough. Solutions like Emscripten try to blur the native-web boundary, but it's difficult to do so if even the most modern web specs settle for an 'almost-there' level.
> 
> Alternative solutions to the above spec proposal are of course welcome (be it enabling setting the device playback rate, or a push variant queue node of ScriptProcessorNode, or similar), but whatever is finally decided, I hope that when the final spec is released, there is an example application shipped with the spec that demonstrates how e.g. 16-bit, 22kHz stereo audio is synthesized and streamed continuously and guaranteed to be glitch-free.
> 
> 
> 2013/6/18 Jer Noble <jer.noble@apple.com>
> 
> On Jun 18, 2013, at 10:24 AM, Chris Rogers <crogers@google.com> wrote:
> 
>> 
>> 
>> 
>> On Tue, Jun 18, 2013 at 8:55 AM, Jer Noble <jer.noble@apple.com> wrote:
>> 
>> On Jun 18, 2013, at 6:55 AM, Joe Berkovitz <joe@noteflight.com> wrote:
>> 
>>> Actually, as co-editor of the use case document I am very interested in understanding why the arbitrary concatenation of buffers is important. When would this technique be used by a game? Is this for stitching together prerecorded backgrounds?
>>> 
>> 
>> Here's a good example of such a use case: http://labs.echonest.com/Uploader/index.html
>> 
>> The WebAudio app slices an uploaded piece of music into discrete chunks, calculates paths between similar chunks, and "stitches" together an infinitely long rendition of the song by jumping in the timeline between similar chunks.
>> 
>> This app currently implements its queueing model by calling setTimeout(n), where n is 10ms before the anticipated end time of the current sample. However, this causes stuttering and gaps whenever the timer is late by more than 10ms. WebKit Nightlies implement JavaScript timer coalescing when pages are not visible, which has led the Infinite Jukebox page to pause playback when it gets a 'visibilitychange'/'hidden' event.
>> 
>> A lookahead scheduling of 10ms is a bit optimistic.  Chris Wilson has written an excellent article about this topic:
>> http://www.html5rocks.com/en/tutorials/audio/scheduling/
>>  
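The lookahead pattern from that article keeps the JS timer deliberately sloppy but schedules audio events a window ahead on the AudioContext clock. Its core selection step is pure logic (the setTimeout loop and the AudioContext/start() calls are omitted; the function name here is illustrative):

```javascript
// One tick of a lookahead scheduler in the style of Chris Wilson's
// article, reduced to its pure selection step: given the audio
// clock's current time and a schedule-ahead window, pick the pending
// events (sorted by time) that must be handed to start(when) now.
// The timer driving this can be imprecise (e.g. a 25 ms setTimeout)
// as long as the window exceeds the timer's worst-case jitter.
function dueEvents(pending, currentTime, scheduleAheadSec) {
  const due = [];
  while (pending.length > 0 && pending[0].when < currentTime + scheduleAheadSec) {
    due.push(pending.shift());
  }
  return due;
}
```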
> 
> Even so, timer coalescing can delay timers by very large amounts (perhaps even 1000ms!) so even some of the techniques Chris mentions in that article will fail unless very large lookahead queues are built up.  
> 
> For stitching together separate AudioBuffers seamlessly, having a buffer queue node available would be much preferable to having web authors implement their own queueing model.
> 
> -Jer
> 
> 

.            .       .    .  . ...Joe

Joe Berkovitz
President

Noteflight LLC
Boston, Mass.
phone: +1 978 314 6271
www.noteflight.com
"Your music, everywhere"


Received on Tuesday, 18 June 2013 19:27:10 UTC

This archive was generated by hypermail 2.4.0 : Friday, 17 January 2020 19:03:18 UTC