Re: Web Audio API Proposal

On Jun 14, 2010, at 8:27 PM, Robert O'Callahan wrote:

> That API looks extremely complicated. It looks like it will be a huge amount of work to get a precise spec with interoperability across diverse implementations.
> 
> Dave Humphrey's proposed API ( https://wiki.mozilla.org/Audio_Data_API ) is far simpler because it leaves almost all audio processing to JS. This "Web Audio API" proposal has a section "Javascript Issues with real-time Processing and Synthesis:" which lists several problems, but they boil down to two underlying issues:
> 1) JS is slower than native code.
> 2) Processing audio on a Web page's "main thread" has latency risks.
> 
> Issue #2 can be addressed by extending the Audio Data API so it can be used from Web Workers.
> 
> For issue #1, there is experimental data showing that many kinds of effects can be done "fast enough" in JS. See the Audio Data API demos, and https://bugzilla.mozilla.org/show_bug.cgi?id=490705#c49 for some performance numbers. Certainly there's still a performance gap between current JS implementations and hand-vectorized code, but it seems to me more profitable to work on addressing that gap directly (e.g. improving JS implementations, or adding vector primitives to JS, or providing a library of standard signal processing routines that work on WebGLArrays, or even NaCl/PNaCl) than hardcoding a ton of audio-specific functionality behind a complex API. The latter approach will be a lot more work, not reusable beyond audio, and always limited, as people find they need specific effects that aren't yet supported in the spec or in the browser(s) they want to deploy on.

I disagree with the characterization of this API as "complicated", requiring a "huge amount of work". You could make the opposite characterization of the Mozilla proposal as being too simplistic to be useful for any real applications across a wide variety of hardware. But I don't think that characterization is accurate either.

I believe the two proposals mark the two ends of the spectrum, from simple to complex, of what is needed for an audio processing API in a web browser, and I think that is an excellent starting point. The Mozilla proposal would severely limit the types of audio processing possible on many devices, especially mobile devices. Providing native APIs for the most common processing models makes as much sense as providing filters in SVG or spline curves in Canvas.

One of the tasks of this group is to determine what is necessary for a sufficient audio API. I think there are areas of Chris' proposal that could be scaled back for simplicity. But many parts of it are components of a basic minimal set I believe any API we produce must contain.

In my mind this audio API can be separated into 3 parts:

1) Access to audio samples

These nodes make audio available from existing media elements (audio/video) as well as from audio buffers created in JavaScript, generated in script or loaded from source media for use as input assets. Importantly, these nodes DO NOT expose the samples themselves. I believe this is important because it allows very efficient audio processing chains to be created and optimized without exposing the underlying details of how buffering occurs.
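To make this concrete, here is a rough sketch of what creating such source nodes might look like. The constructor and method names are my own illustration, loosely modeled on the shape of Chris' proposal, not a definitive API:

    // Illustrative sketch only -- node and method names are assumptions.
    var context = new AudioContext();

    // A source node wrapping an existing <audio> element ("music" is a
    // hypothetical element id). The samples stay inside the engine; the
    // script only holds an opaque node.
    var mediaSource = context.createMediaElementSource(
        document.getElementById("music"));

    // A source node playing a buffer created in JavaScript
    // (1 channel, 1 second at 44.1 kHz in this example).
    var buffer = context.createBuffer(1, 44100, 44100);
    var bufferSource = context.createBufferSource();
    bufferSource.buffer = buffer;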

2) Audio processing

These are nodes that take audio from one or more sources, process the samples, and output them to the next stage of processing. These nodes can be chained, and at some point the end of the chain is realized as audio output or as access to the samples. But again, these nodes don't provide access to the samples themselves. Some of the effects here might be hard-coded (like mixers or panners) because their functionality is so common, while others might be handled by general-purpose effects, such as a convolution filter.
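Continuing the sketch above, a chain of processing nodes might be assembled like this. Again the node names are assumptions for illustration, and impulseResponseBuffer is assumed to have been loaded as an input asset:

    // Source -> gain -> panner -> convolution reverb -> output.
    // Samples never surface to JavaScript along the way.
    var gain   = context.createGainNode();    // hard-coded, common case
    var panner = context.createPanner();      // hard-coded, common case
    var reverb = context.createConvolver();   // general-purpose effect
    reverb.buffer = impulseResponseBuffer;    // impulse response loaded elsewhere

    bufferSource.connect(gain);
    gain.connect(panner);
    panner.connect(reverb);
    reverb.connect(context.destination);      // realized as audio output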

3) Access to the output of the audio processing chain

This is the final stage of processing. It can be as simple as sending the output to the speakers, or it can provide access to the audio samples, either directly or through an FFT. This is the only place where direct access to the audio samples is possible. It could also expose a down-sampled version of the processed audio, making it possible to process full-quality audio for output while giving access to samples at reasonable data rates for additional processing, such as visualizers.
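As a sketch of this stage (a hypothetical analyser node; the names are again my own illustration), the end of the chain above could be tapped for a visualizer without the page ever touching the full-rate stream:

    var analyser = context.createAnalyser();
    analyser.fftSize = 2048;                  // FFT performed natively by the engine

    reverb.connect(analyser);
    analyser.connect(context.destination);

    var spectrum = new Uint8Array(analyser.frequencyBinCount);
    function draw() {
        // A reduced-rate, read-only view of the processed audio.
        analyser.getByteFrequencyData(spectrum);
        // ... drive a visualizer from 'spectrum' ...
        setTimeout(draw, 16);
    }
    draw();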


I feel like your criticism of Chris' spec has mostly to do with section (2). He does have many built-in audio processors and perhaps we can reduce that set to a minimum or come up with a more general way to describe such filters. There is much interesting discussion to come in this area!

Section (1) is the bread and butter of the API. I'm concerned about using DOM events to manage access to audio samples, and you may have noticed that I believe there needs to be some mechanism for processing audio without ever getting access to the samples. But ultimately we need a mechanism for plugging into (and generating) audio samples.
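For comparison, my reading of the Audio Data API draft is that sample access is event-driven along these lines (details may differ from the current wiki text):

    var input  = document.getElementById("music");
    var output = new Audio();

    input.addEventListener("loadedmetadata", function () {
        output.mozSetup(input.mozChannels, input.mozSampleRate);
    }, false);

    input.addEventListener("MozAudioAvailable", function (event) {
        var samples = event.frameBuffer;    // every sample crosses into JS
        // ... process samples in script, on the page's main thread ...
        output.mozWriteAudio(samples);
    }, false);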

I believe section (3) is a very important one to get right. I see two problems with leaving FFT processing to JavaScript. First, an FFT is such a standard algorithm that it seems a very reasonable and obvious thing to include in an API. Second, some JavaScript implementations will not be able to keep up with processing 48 kHz stereo audio. Those implementations will have to reduce the sample rate, which will make the FFT calculations less accurate. And even if an implementation can keep up with the data rate, it will leave very little time for any other JavaScript on the page to run at any sort of reasonable frame rate.
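Some rough budget arithmetic (illustrative figures only) makes the point:

    var sampleRate = 48000;
    var channels   = 2;
    var samplesPerSecond = sampleRate * channels;                 // 96,000 samples/s into JS
    var samplesPerEvent  = 2048;                                  // per-callback block size, say
    var callbacksPerSecond = samplesPerSecond / samplesPerEvent;  // ~47 callbacks/s
    var budgetMs = 1000 / callbacksPerSecond;                     // ~21 ms per callback to copy
                                                                  // the block, window it, run the
                                                                  // FFT, and still leave time for
                                                                  // everything else on the page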

Anyway, that's my take on where we are and where we need to get to. I look forward to the conversations.

-----
~Chris
cmarrin@apple.com

Received on Tuesday, 15 June 2010 13:36:51 UTC