AudioNode API Review - Part 1 (StandingWave3 Comparison)

Hi folks,

Thanks for the great call today. I am very enthusiastic about the work  
this group has taken on, and about its present direction.

As a new member I want to do my best to respond to the Web Audio API  
Proposal with some initial observations.  I know that a lot of  
thinking has gone into the current draft, so please forgive my  
ignorance of the many points that have already been discussed -- I'm  
admitting now to having only briefly skimmed the mailing list  
archives.  I did try out the sample code, though, and read some of  
it.  Very impressive!

I'm going to break my response up into two main parts.  Part 1 (this  
one) will be a high-level comparison of the Web Audio API Proposal  
with StandingWave 3, the internal synthesis engine used by Noteflight  
and available on GitHub as the standingwave3 project.  Part 2 (to  
follow in the next day or two) will be a commentary on various aspects  
of the Web Audio API, responding to the draft on a feature-by-feature  
basis.

-----------------------------------

COMPARISON OF THE WEB AUDIO API ("Web API") WITH STANDINGWAVE 3 ("SW3")

My goal in this writeup is to highlight and compare different  
approaches to problems taken by the two libraries, with the aim of  
stimulating some thought and discussion. In general, I'm not going to  
focus on the things that are very similar, since there are so many  
points of similarity.  I'm also not going to go through the many great  
things that the Web API does which SW3 omits -- obviously we should  
keep those unless they turn out to be superfluous.

Let me say at the outset that I am not looking for the literal  
adoption of SW3 concepts within the Web API.  If the group feels that  
there is some advantage in using some ideas from SW3 within the Web  
API, that could be valuable. And if the comparison moves the group to  
feel that the API is just fine as it is, I consider that just as  
valuable an outcome.

I will call out essential features for Noteflight's use cases with a  
triplet of asterisks (***).


RESOURCES

The SW3 API documentation can be browsed here:
      http://blog.noteflight.com/projects/standingwave3/doc/

Noteflight is here (the best-known StandingWave app, though not the  
only one):
      http://www.noteflight.com

FUNDAMENTALS

Requirements: SW3 was designed to support the synthesis of Western  
musical styles from semantic music notation, using an instrument  
sample library, applying effects and filters as needed for  
ornamentation and interpretation. But it was also intended to serve as  
an all-purpose package for Flash sound developers, and does take a  
general approach to many issues. I believe it would be possible to  
write a number of the Web API demo apps using SW3, where capabilities  
overlap.

Underlying Approach: SW3 is written in ActionScript 3 with low-level  
DSP processing in C.  The role of nodes is similar in both packages,  
allowing low-level constructs in C to be encapsulated out of view from  
the application programmer who works at a scripting-language level.   
Our finding with SW3 has been that nodes are a very useful way of  
surfacing audio synthesis and that application builders are able to  
work with them effectively, but that there is a learning curve for  
people who aren't used to programming with declarative objects in this  
way.
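
To give a flavor of what I mean by "declarative objects", here's a  
tiny sketch of the style involved (the names here, including player,  
are illustrative pseudocode rather than exact SW3 classes):

     // Describe a small graph as plain objects; nothing makes sound until
     // the graph is handed to a player, which pulls samples through it.
     var tone = new ToneSource(440);            // a source node
     var echoed = new EchoFilter(tone, 0.25);   // a filter node wrapping the source
     player.play(echoed);                       // playback is separate from construction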


EFFECTS AND FILTERS

Loop Points***: SW3 allows a single loop point to be specified for an  
audio buffer.  This means that the loop "goes back" to a particular  
nonzero sample index after the end is reached.  This feature is really  
essential for wavetable synthesis, since one is commonly seeking to  
produce a simulated note of indefinite length by looping a rather  
featureless portion of an actual note being played, a portion that  
must follow the initial attack.
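
For instance, if an instrument sample's attack occupies roughly the  
first half second and the material after it is suitably featureless,  
the idea is something like this (names and numbers are illustrative,  
not the exact SW3 surface):

     // Loop back to a point *after* the attack so the note can sustain
     // indefinitely; looping back to sample 0 would replay the attack.
     var note = ... ;          // an audio buffer source for the sampled note
     note.loopPoint = 22050;   // illustrative: 0.5 sec at 44.1 kHz, just past the attack
     // playback runs [0 .. end], then repeats [loopPoint .. end] until release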

Resampling / Pitch Shifting: SW3 uses an explicit filter node  
(ResamplingFilter) which resamples its input at an arbitrary sampling  
rate. This allows any audio source to be sped up or slowed down  
(making its overall duration shorter/longer).  Contrast this with the  
Web API, in which AudioBufferSourceNode "bakes in" resampling, via the  
playbackRate attribute.  It appears that in the Web API no composite  
source or subgraph can be resampled.  Now, the Web API approach would  
actually be sufficient for Noteflight's needs (since we only apply  
resampling directly to audio buffers) but it's worth asking whether  
breaking this function out as a filter is useful.
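
To illustrate where the capability lives in each case (the SW3  
constructor arguments shown are illustrative):

     // SW3 style: resampling is itself a node that can wrap *any* source,
     // including a composite subgraph of mixed and filtered material.
     var subgraph = ... ;                                // any composite source
     var faster = new ResamplingFilter(subgraph, 1.5);   // ~50% faster and higher in pitch

     // Web API draft style: the equivalent capability is the playbackRate
     // attribute of AudioBufferSourceNode, so it applies only to a single
     // buffer source rather than to an arbitrary subgraph.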

Looping-as-Effect: SW3 also breaks out looping as an explicit filter  
node, allowing any composite source to be looped.
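
Hypothetically (the node name here is illustrative, not an actual SW3  
class):

     var phrase = ... ;                       // any composite source, e.g. a mixed pattern
     var repeated = new LoopFilter(phrase);   // loops the entire subgraph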


SEQUENCING***

SW3's approach to time-sequencing of audio playback is very different  
from the Web API's noteOn(when) approach.  I feel that each  
approach has distinct strengths and weaknesses.  This is probably the  
biggest architectural difference between the projects.

In SW3, all sources are always considered to begin generating a signal  
at time zero.  Consequently, there is no triggering construct such as  
noteOn() at all. Instead, SW3 provides a special scheduling object  
called a Performance, which aggregates and schedules a list of sources  
to start at specific onset times. The output of the Performance is  
thus a mixdown of all the sources scheduled within it, each delayed to  
start at the correct time.  It's a bit like this:

     var node1 = ... // create some audio source starting nominally at t=0
     var node2 = ... // create some other source, also at t=0
     var perf = new Performance();
     perf.addNode(node1, 1.0);    // schedule node1 to occur at offset 1.0 sec within the Performance
     perf.addNode(node2, 2.0);    // schedule node2 to occur at offset 2.0 sec within the Performance
     context.destination = perf;  // play the Performance consisting of the scheduled sources

Note that a Performance is *just an audio source*.  Thus, you can play  
a Performance just like a standalone source, pipe a Performance into  
some other graph of filters/effects, or even schedule it into a higher- 
level Performance made out of shorter sub-Performances.  Performances  
can also be looped, meaning that they repeatedly schedule their  
contents, which is better for a longish run of material than rendering  
into a buffer and looping the buffer.  You might say a Performance is  
sort of a smart time-aware mixer that maintains an internal index.  It  
efficiently ignores inputs that haven't started yet, or which have  
already stopped.  You can schedule stuff into a Performance while it's  
playing, too.
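
Since a Performance is just a source, nesting falls out naturally;  
continuing the illustrative snippet above:

     var intro = new Performance();   // a sub-Performance of shorter material
     intro.addNode(chord, 0.0);       // (chord, melody and verse are sources built elsewhere)
     intro.addNode(melody, 0.5);

     var piece = new Performance();
     piece.addNode(intro, 0.0);       // the whole sub-Performance starts at 0.0 sec
     piece.addNode(verse, 4.0);       // some other composite source at 4.0 sec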

So... isn't this functionally equivalent to having many different  
sources in the Web API that are routed into a mixer or destination,  
and calling noteOn() on the head end of each source with a different  
value?  Well no, not exactly.  The difference has to do with how many  
objects you have to schedule, and how complicated it gets to figure  
out what time values to pass to these invocations, and what knowledge  
you need in order to schedule a composite subgraph to occur at some  
point.  This is where I think the problem lies in the current API draft.

Consider that a single musical note in a typical wavetable synth is a  
subgraph that might include stuff like this:
- a basic audio sample, possibly more than one for a composite sound
- an amplifier
- an envelope modulation that controls the amp gain
- a low-pass filter
- an envelope modulation that controls the filter center frequency
- other assorted modulations for musical expression (e.g. vibrato,  
whammy bar, whatever)

So a single note is a little subgraph with moving parts, which require  
their own independent scheduling in time (though they are coordinated  
with each other in a predictable way).  The sample has to be timed,  
and the modulations in particular are going to have customized onsets  
and durations for each note, based on the duration, volume,  
articulation and ornamentation of the note.

Now... in the Web API approach, one has to calculate the onset time of  
each time-dependent element of the subgraph and individually schedule  
it, adding an overall start time into each one so that each call to  
noteOn()/scheduleAutomation() references an absolute time rather than  
a time offset relative to the start of the note.  In other words...  
key point coming up here... *scheduling an audio subgraph at a  
specific point in time requires knowledge of its internals* in order  
to schedule its bits and pieces to occur in coordination with each  
other.
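
Concretely, a note factory under the current draft ends up looking  
roughly like this (a sketch only: I'm using the draft's factory  
methods and noteOn() as I understand them, plus a placeholder  
scheduleAutomation() for parameter curves, with sampleBuffer and  
ampEnvelope standing in for data prepared elsewhere):

     // The factory must be told *when* the note occurs, and must fold that
     // offset into every internal scheduling call it makes.
     function playNote(context, when) {
         var source = context.createBufferSource();
         source.buffer = sampleBuffer;

         var amp = context.createGainNode();
         source.connect(amp);
         amp.connect(context.destination);

         source.noteOn(when);                              // absolute time
         scheduleAutomation(amp.gain, ampEnvelope, when);  // placeholder: the envelope
                                                           // must also be offset by 'when'
     }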

Compare this with the Performance approach, in which you have some  
factory code that simply makes the right subgraph, without any  
scheduling information at all.  It gets scheduled as a whole, outside  
of the code that made it, by other code that is only concerned with  
when some event should happen.
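
With a Performance, the same factory needs no notion of absolute time  
at all:

     // The factory just builds the subgraph; everything inside it is
     // expressed relative to the note's own t=0.
     function makeNote(duration) {
         var note = ... ;   // source + amp + envelopes sized to 'duration', all relative to t=0
         return note;       // a plain audio source
     }

     // Elsewhere, code that knows nothing about the note's internals
     // decides when it should happen (reusing perf from the earlier snippet):
     perf.addNode(makeNote(0.5), 3.25);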

The net result is that in the Web API approach, if you want to  
encapsulate knowledge of a subgraph's internals, you have to pass an  
onset time into the code that makes that subgraph.  This doesn't seem  
good to me because it conflates the construction and the scheduling of  
a complex sound.  I am still thinking about what to recommend instead  
(other than just adding a Performance-like construct to the Web API),  
but would first like to hear others' reaction to this point.


MODULATORS***

SW3 has a concept of Modulators, whereas the Web API uses an  
AudioCurve (being fleshed out by Chris in real time as I type).  SW3  
Modulators have some normalized value at each sample frame.  They are  
used to modulate pitch (i.e. resampling rate) and gain at various  
places in the synthesis pipeline.  Due to the performance overhead and  
complexity of allowing *any* parameter to be continuously modulated,  
SW3 only allows Modulators to be plugged into certain key parameter  
types of certain nodes, typically gain, pitch-shift or frequency  
parameters.  We needed to make our tight loops as tight as possible  
without checking to see if some variable needs to change its value on  
each iteration.

SW3 currently supports just two kinds of modulators: piecewise-linear  
(where you supply a list of time/value tuples) and envelope generators  
(ADHSR).  LFOs would be great, but SW3 doesn't have LFOs per se;  
instead we use a piecewise-linear modulator as a triangle/sawtooth/ 
square-wave source.  An ADHSR (Attack/Decay/Hold/Sustain/Release)  
modulator is particularly important since it supplies a musical shape  
to a note.
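
A quick sketch of the two flavors and how they get attached (class  
names and constructor shapes here are illustrative, not the exact SW3  
signatures):

     // Piecewise-linear modulator: a list of (time, value) breakpoints.
     // Here it stands in for a triangle-wave "LFO" with a 0.2 sec period.
     var lfo = new PiecewiseLinearModulator([[0.0, -1.0], [0.1, 1.0], [0.2, -1.0]]);

     // Envelope-generator modulator, with ADHSR settings chosen per note.
     var env = ... ;   // attack/decay/hold/sustain/release appropriate to the note

     // Either kind plugs into one of the designated parameters, e.g. gain:
     var shaped = new AmpFilter(note, env);   // the amp's gain follows the envelope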

Since many modulating functions with a psychoacoustic or musical  
result are linear in the *log* of the parameter, not in the parameter  
itself, the interpretation of the modulator often requires an implicit  
exponentiation layered on top of a piecewise-linear modulator.  For  
instance, one might want an LFO that makes a tone wobble between a  
semitone below and a semitone above.  This LFO would have a range from  
-k to +k (no pitch change = 0), but the corresponding pitch shift  
would use a multiplicative factor ranging between exp(-k) and exp(k)  
(no pitch change = exp(0) = 1).
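
To put numbers on that example: a semitone is a frequency ratio of  
2^(1/12), so k = ln(2)/12 is approximately 0.0578, and the pitch-shift  
factor swings between 2^(-1/12), about 0.944, and 2^(1/12), about  
1.059.  Equivalently, with a normalized modulator output m in [-1, +1]:

     var k = Math.log(2) / 12;       // one semitone in log-frequency terms, approx. 0.0578
     var factor = Math.exp(m * k);   // resampling factor between ~0.944 and ~1.059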


------

That's it -- I'll try to get Part 2 together shortly.  In the  
meantime, I hope this is helpful.

... .  .    .       Joe

Joe Berkovitz
President
Noteflight LLC
160 Sidney St
Cambridge, MA 02139
phone: +1 978 314 6271

Received on Monday, 4 October 2010 21:48:29 UTC