Re: Web Audio API Proposal from Chris Rogers on 2010-07-02 (public-xg-audio@w3.org from July 2010)

From: Chris Rogers <crogers@google.com>
Date: Fri, 2 Jul 2010 13:36:58 -0700
To: Ricard Marxer Piñón <ricardmp@gmail.com>
Cc: Chris Marrin <cmarrin@apple.com>, Jer Noble <jer.noble@apple.com>, public-xg-audio@w3.org
Message-ID: <AANLkTiko2IW5WPbExJ5LdD3kZhKzHthxCFol0sbvCZ3C@mail.gmail.com>

Hi Ricard,

Thanks for your interest in the graph-based (node) approach.  I really
appreciate your comments, and will try to address your questions/ideas the
best I can:

On Thu, Jul 1, 2010 at 11:18 AM, Ricard Marxer Piñón <ricardmp@gmail.com> wrote:
>
> AudioPannerNode + AudioListener:
> Maybe I'm wrong, but I think these nodes perform some processes that
> are quite tied to data (HRTF) or that may be implemented in many
> different ways that could lead to different outputs depending on the
> method. Maybe they could be broken up into smaller blocks that have a
> much more defined behavior and let the user of the API specify what
> data to use or what algorithm to implement.
>

The approach I took with AudioPannerNode was to define a common interface
used by all the panning models, for the source/panner position, orientation,
velocity, and cone settings, distance model, etc.
These are the attributes which are commonly used in current 3D game engines
for spatializing/panning.

Then I defined constants for different panning approaches:

        const unsigned short PASSTHROUGH = 0;
        const unsigned short EQUALPOWER = 1;
        const unsigned short HRTF = 2;
        const unsigned short SOUNDFIELD = 3;
        const unsigned short MATRIXMIX = 4;

In looking back at this now, I realize that MATRIXMIX (and arguably
PASSTHROUGH) does not really belong here and should be a different type of
AudioNode, since it ignores position, orientation, etc.
But EQUALPOWER (vector-based panning), HRTF, and SOUNDFIELD all make use of
the common attributes such as position, orientation, etc.
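
To illustrate how precisely an equal-power model can be pinned down, here is
a minimal sketch of one common equal-power pan law (cosine/sine gains).  The
function name and the pan range of -1 to +1 are assumptions made for the
example; they are not part of the proposed AudioPannerNode interface, and an
actual engine may derive the pan position from the 3D attributes differently.

```javascript
// Equal-power pan law sketch. `pan` runs from -1 (hard left) to +1 (hard right).
// The function name and the [-1, 1] range are illustrative assumptions.
function equalPowerGains(pan) {
  const clamped = Math.max(-1, Math.min(1, pan));
  // Map [-1, 1] onto a quarter circle [0, pi/2].
  const angle = (clamped + 1) * Math.PI / 4;
  return { left: Math.cos(angle), right: Math.sin(angle) };
}

// Total power left^2 + right^2 stays at 1 for every pan position.
const center = equalPowerGains(0);
console.log(center.left, center.right); // both about 0.7071
```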

Instead of defining a 'panningModel' attribute, it would also be possible to
subclass AudioPannerNode with these three types.  Then we would have:

EqualPowerPannerNode (with very mathematically precise behavior)

SoundFieldPannerNode (with very mathematically precise behavior)

HRTFPannerNode

Here's my take on the HRTF (spatialization data sets):
The browser would be free to use any generic HRTF data set here.  Browsers
vary in many specific ways such as the exact fonts and anti-aliasing
algorithms used to render text, the exact algorithms used to
resample/re-scale images in both <img> and <canvas>, and the audio
resampling algorithms currently used for <audio>, so I think it's OK to not
specify the exact HRTF data set.
We could have a method to optionally set a custom HRTF data set measured
from a specific person.  This would be an advanced and very rare use case I
think and would require us to define a data format.  In principle, I think
having this method is fine, but I would defer it to a more advanced
implementation.  As long as we get the base API correct, then we can always
add this method later.  I think it would be a bad idea to always require the
javascript developer to specify a specific URL for the HRTF data set,
because these files are very large and would incur large download costs.
Web pages today have built-in default fonts for rendering text and don't
require downloading fonts just to render text.  Similarly, I think there
should be a default HRTF (spatialization) data set which would automatically
be used.



>
> ConvolverNode
> The convolver node has an attribute that is an AudioBuffer.  I think
> it should just have a float array with the impulse response or
> multiple float arrays if we want to convolve differently the different
> channels.  The fact of having the AudioBuffer could make the user
> believe that the impulse response would adapt to different sample
> rates, which doesn't seem to be the case.
>

Effectively, an AudioBuffer is a list of Float32Arrays (one for each
channel).  And I've recently added direct buffer access with the
getChannelData() method, so javascript can generate/modify the buffers.  I
wouldn't worry about the sample-rate too much.  We may be able to remove
this attribute from AudioBuffer entirely if we can assume that the entire
audio rendering graph is operating at the same sample-rate and all
AudioBuffer objects also implicitly have this sample-rate.  Otherwise, if we
keep sampleRate, then we can define the behavior such that a sample-rate
conversion automatically happens if necessary, or require that the
sample-rate match the ConvolverNode's sample-rate.
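
As a rough sketch of the "list of Float32Arrays" view, the snippet below
generates a synthetic impulse response through getChannelData().  So that it
runs outside a browser, it uses a hypothetical stand-in class,
SketchAudioBuffer; only the getChannelData() method name comes from the
proposal, and the constructor, fields, and helper function are assumptions.

```javascript
// Hypothetical stand-in for AudioBuffer so the sketch runs outside a browser.
// Only getChannelData() mirrors the method named above; everything else is assumed.
class SketchAudioBuffer {
  constructor(numberOfChannels, length) {
    this.numberOfChannels = numberOfChannels;
    this.length = length;
    this.channels = [];
    for (let c = 0; c < numberOfChannels; c++) {
      this.channels.push(new Float32Array(length));
    }
  }
  getChannelData(channel) {
    return this.channels[channel]; // a live array, so writes modify the buffer
  }
}

// Fill every channel with an exponentially decaying noise burst,
// a crude synthetic impulse response.
function fillDecayingNoise(buffer, decayFrames) {
  for (let c = 0; c < buffer.numberOfChannels; c++) {
    const data = buffer.getChannelData(c);
    for (let i = 0; i < data.length; i++) {
      const noise = Math.random() * 2 - 1;
      data[i] = noise * Math.exp(-i / decayFrames);
    }
  }
  return buffer;
}

const impulseResponse = fillDecayingNoise(new SketchAudioBuffer(2, 1024), 200);
```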



> This is a quite important node because it will be used for many
> different tasks.  Its behavior should be clearly defined.


I agree, and you're right in pointing out that we need to add more detail
about the exact behavior.


>  Can the
> user modify the impulse response on the fly (must the filter keep the
> past N samples in memory for this)?


There are some technical challenges to modifying the impulse response on the
fly.  The convolution is applied using FFT block-processing and in
multiple threads.  This is an implementation detail, but it must be
considered in practice, since direct convolution is far less efficient
and not feasible for long impulse responses.  Because the processing is
block-based, it's possible to introduce glitches into the processed audio
stream when modifying the impulse response in real-time.  Some of these
problems can be
minimized by changing the impulse response slowly and only one block (time
segment) at a time.  The current state-of-the-art for desktop audio
convolution engines does allow modifying the impulse responses in real-time
with fancy user-interfaces.  It would be interesting to be able to create
these interfaces in canvas or WebGL!  I would like to keep this possibility
open, and make sure the API is flexible enough to add this feature.  That
said, I would also like the API to be fairly simple in the common use case,
and wouldn't necessarily expect an initial implementation to have special
engine support for glitch-free impulse response editing.
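
The block-based structure can be sketched with overlap-add convolution.  To
keep the example short, each block is convolved directly rather than with an
FFT, so this shows the block bookkeeping only, not the engine's actual
method; the function names are hypothetical.

```javascript
// Direct convolution of one block with the impulse response h.
// A real engine would do this step with FFTs; direct form keeps the sketch short.
function convolveDirect(x, h) {
  const y = new Float64Array(x.length + h.length - 1);
  for (let n = 0; n < x.length; n++) {
    for (let k = 0; k < h.length; k++) {
      y[n + k] += x[n] * h[k];
    }
  }
  return y;
}

// Overlap-add: process the input in fixed-size blocks and sum the tails.
// `input` must be a typed array so that subarray() is available.
function convolveOverlapAdd(input, h, blockSize) {
  const out = new Float64Array(input.length + h.length - 1);
  for (let start = 0; start < input.length; start += blockSize) {
    const end = Math.min(start + blockSize, input.length);
    const partial = convolveDirect(input.subarray(start, end), h);
    // Each block's output is longer than the block itself, so its tail
    // overlaps the output of the following blocks.
    for (let i = 0; i < partial.length; i++) {
      out[start + i] += partial[i];
    }
  }
  return out;
}

const signal = Float64Array.from([1, 0, 0, 0, 1, 0, 0, 0]);
const ir = Float64Array.from([1, 0.5, 0.25]);
console.log(Array.from(convolveOverlapAdd(signal, ir, 4)));
```

Because the tail of every block is still being added into later output, an
impulse response swapped between blocks leaves the old tail ringing under
the new response, which is one way an abrupt change can become audible.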

>  Does the impulse response have a
> limit in length?  Should the user set the maximum length of the
> impulse response at the beginning?
>

This is a great question.  The longer the impulse response (and the more
channels), the more CPU-intensive it becomes.  A very long impulse response
that works fine on a desktop machine might have trouble on a mobile device.
This is a scalability issue similar to what we already face with the
graphics APIs.  With WebGL, it's easy to draw far too much for either the
javascript itself or the GPU to handle at anything near a reasonable
frame-rate.
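
For a feel of the scaling, the sketch below counts multiply-adds for direct
(time-domain) convolution, where cost grows linearly with both impulse
response length and channel count.  FFT-based engines are far cheaper, but
they still grow with length and channels.  The function name and the example
figures (3 seconds, stereo, 44.1 kHz) are assumptions.

```javascript
// Multiply-adds per second for direct (time-domain) convolution:
// every output sample needs one multiply-add per impulse-response tap.
function directConvolutionMacsPerSecond(irSeconds, sampleRate, channels) {
  const taps = Math.round(irSeconds * sampleRate);
  return taps * sampleRate * channels;
}

// Example figures are assumptions: a 3 second stereo response at 44.1 kHz.
const macs = directConvolutionMacsPerSecond(3, 44100, 2);
console.log((macs / 1e9).toFixed(1) + " billion multiply-adds per second"); // 11.7
```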


>
> RealtimeAnalyserNode
> From my POV this node should be replaced by a FftNode.  The FFT is not
> only used for audio visualization but for many audio
> analysis/processing/synthesis methods (transient detection,
> coding/compression, transcription, pitch estimation, classification,
> effects, etc.).  Therefore I think the user should be able to have
> access to a proper FFT, without smoothing, band processing nor
> magnitude scaling (in dBs or in intensity). It should be also possible
> to access the magnitude and phase or the complex values themselves,
> many methods are based on the complex representation.  Additionally I
> would propose the possibility to select the window, frameSize, fftSize
> and hopSize used when performing the FFT.  I would also propose an
> IfftNode that would perform the inverse of this one and the overlap
> and add process to have the full loop and be able to go back to the
> time domain.  I will get back to this once I have Chris's WebKit
> branch running.  The implementation of this addition should be trivial
> since most FFT libraries also perform the IFFT.
>

The current RealtimeAnalyserNode API was quickly put together just to get
basic visualizer support.  Whatever we do, I hope this basic case can still
be reasonably simple API-wise if we decide to go with a more elaborate
approach.  Believe me, I understand your interest in doing more by
effectively creating a complete analysis and re-synthesis engine with
arbitrary frequency-domain processing in between.  A long time ago, in a
previous life I worked at IRCAM on SVP (now SuperVP) and wrote the first
version of AudioSculpt for doing exactly these types of transforms.

For analysis, let's see what we can do API-wise to keep the simple cases
simple, but allow for more sophisticated use cases later on.  Like I said,
the current API was very quickly designed, so maybe we can do much better.
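
As a sketch of what the simple visualizer case needs, the snippet below
computes a Hann-windowed magnitude spectrum for one frame.  It uses a naive
O(N^2) DFT in place of an FFT purely for brevity, and the function name and
the 1/N scaling are assumptions, not the RealtimeAnalyserNode's behavior.

```javascript
// Naive O(N^2) DFT magnitude spectrum with a Hann window.
// A real analyser would use an FFT; this form is only for illustration.
function magnitudeSpectrum(frame) {
  const n = frame.length;
  const bins = Math.floor(n / 2);
  const magnitudes = new Float64Array(bins);
  for (let k = 0; k < bins; k++) {
    let re = 0;
    let im = 0;
    for (let i = 0; i < n; i++) {
      const w = 0.5 - 0.5 * Math.cos(2 * Math.PI * i / n); // periodic Hann window
      const phase = -2 * Math.PI * k * i / n;
      re += frame[i] * w * Math.cos(phase);
      im += frame[i] * w * Math.sin(phase);
    }
    magnitudes[k] = Math.hypot(re, im) / n; // assumed 1/N scaling
  }
  return magnitudes;
}

// A sine that completes exactly 8 cycles per 64-sample frame peaks in bin 8.
const frame = new Float64Array(64);
for (let i = 0; i < frame.length; i++) {
  frame[i] = Math.sin(2 * Math.PI * 8 * i / frame.length);
}
const spectrum = magnitudeSpectrum(frame);
```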



>
> AudioParam
> This one is a very tricky one.  Currently parameters are only floats
> and can have a minimum and maximum.  This information is mostly useful
> when automatically creating GUI for nodes or for introspection.  But
> finding a set of information that can completely describe a parameter
> space is extremely hard.  I would say that the parameter should just
> be a variant value with a description attribute that contains a
> dictionary with some important stuff about the parameter.  The
> description could look somewhat like this (beware of my lack of
> expertise in JS, there is surely a better way):
> gain parameter: {'type': 'float', 'min': 0, 'max': 1, 'default': 1,
> 'units': 'intensity', 'description': 'Controls the gain of the
> signal', 'name': 'gain'}
> windowType parameter: {'type': 'enum', 'choices': [RECTANGULAR, HANN,
> HAMMING, BLACKMANHARRIS], 'default': BLACKMANHARRIS, 'name': 'window',
> 'description': 'The window function used before performing the FFT'}
>
> I think this would make it more flexible for future additions to the API.
> I also think that the automation shouldn't belong in the AudioParam
> class, since for some parameters it doesn't make sense to have it.  The
> user can easily perform the automation using JavaScript and since the
> rate of parameter change (~100 Hz) is usually much lower than the
> audio rate (~>8000 Hz), there should be no problems with performance.
>

I designed the AudioParam API very much in the same way that I did for
AudioUnits which are used as the plugin model for Mac OS X (and iOS).  I
think it has worked pretty well in a large variety of processing/synthesis
plugins which are sold commercially.  Although it's true that not everything
can be represented by a float, most can be and it's useful to be able to
attach automation curves to these types of objects for implementing
envelopes, volume fades, etc.  Almost all DAW (digital audio workstation)
software has the concept of a timeline where different parameters can be
automated in time.  For the few cases which are not represented by floats,
such as the "impulse response" of the ConvolverNode, it's not too difficult
to have specific attributes on these objects (which are not AudioParams, and
thus not automatable using a simple curve).
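
A volume fade is the simplest case of such a curve.  The sketch below applies
a linear gain ramp sample by sample; the function name and signature are
illustrative only and are not the proposed AudioParam interface.

```javascript
// Apply a linear gain ramp across a block of samples, in place.
// Names are illustrative; this is not the proposed AudioParam interface.
function applyLinearFade(samples, startGain, endGain) {
  const lastIndex = Math.max(1, samples.length - 1);
  for (let i = 0; i < samples.length; i++) {
    const t = i / lastIndex; // 0 at the first sample, 1 at the last
    samples[i] *= startGain + (endGain - startGain) * t;
  }
  return samples;
}

// Fade a constant signal from full volume to silence.
const faded = applyLinearFade(new Float32Array([1, 1, 1, 1, 1]), 1, 0);
console.log(Array.from(faded)); // [1, 0.75, 0.5, 0.25, 0]
```

Because the gain is computed for every sample, the fade is smooth regardless
of how often the controlling script gets to run.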

I'm not sure that I agree that parameters can always easily be automated
directly in javascript at a rate of ~100 Hz.  Sometimes, parameter changes
need to be scheduled at relatively precise, rhythmically exact times.  The
resolution of javascript setTimeout() is not good enough for these cases,
and it is not reliable enough to guarantee that parameters change smoothly
without glitches and pauses.  As an example, SuperCollider has a control
rate (krate) which defaults to one update every 64 sample-frames, which is
on the order of 1000 Hz (about 690 Hz at a 44.1 kHz sample rate).
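
The exact control rate depends on the sample rate.  A quick calculation,
assuming a 44.1 kHz sample rate (the helper names are hypothetical):

```javascript
// Control-rate arithmetic for a block-based engine.
function controlRateHz(sampleRate, blockFrames) {
  return sampleRate / blockFrames;
}

function controlPeriodMs(sampleRate, blockFrames) {
  return 1000 * blockFrames / sampleRate;
}

// Assumed figures: 44.1 kHz audio, 64-frame control blocks.
console.log(controlRateHz(44100, 64).toFixed(1));   // "689.1"
console.log(controlPeriodMs(44100, 64).toFixed(2)); // "1.45"
```

A control period of roughly 1.5 ms is finer than browser timers can
generally be relied on to deliver, since timer callbacks are commonly clamped
to several milliseconds and can be delayed further by other work on the page.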

The "automation" attribute of AudioParam is just speculation on my part of
how the API would actually work.  Soon, I hope to implement the automation
directly in the underlying engine code in such a way that we can experiment
with several different javascript API approaches.



> Anyway these are just my 2 cents.  I just had a first look at the API,
> I might come up with more comments once I get my hands on Chris'
> implementation and am able to try it out.
>
> ricard


Thanks Ricard, I really appreciate your ideas and look forward to more
discussions with you on refining the AudioNode approach.

Cheers,
Chris
Received on Friday, 2 July 2010 20:37:31 UTC