Web Audio API Feedback from Spaceport.io

About a month ago, I reached out to Ben and Matt from http://spaceport.io/.
Spaceport is a mobile gaming company providing an SDK that lets people
build games with JavaScript, using Flash for the audio.

I asked them if they would be kind enough to review one of the specs. They
wanted to review the Web Audio API, and took the time to go through it in
some detail.

I wanted to thank Matt for the detailed and thoughtful contribution, and
encourage people to discuss the comments.

There may be some great points in here that we can track, or use to add to
and refine our use cases.

--
Alistair


Spaceport’s goals for audio on the web

User-driven disabling/permission

Like the geolocation, fullscreen, and pointer lock APIs, the Web Audio API
must account for user and browser vendor concerns such as:

   - Spoofing
   - Resource limitations (bandwidth, CPU, memory, etc.)
   - Annoyance/embarrassment


The API should make it obvious that the browser may cancel sound playback
at any time (including sound loading), and the API should make the program
aware when/if this occurs (e.g. in the form of error events to complement
success events).
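
As a purely hypothetical illustration (none of these event or function
names exist in the current draft), the kind of signalling we have in mind
looks roughly like this:

var context = new AudioContext();

// Hypothetical: the browser tells the page that playback was cancelled
// or suspended (e.g. by the user, the OS, or resource limits).
context.onplaybackinterrupted = function (event) {
  pauseGameAudioState();           // hypothetical application function
};

// Hypothetical: loading/decoding failed, complementing the success path.
context.onresourceerror = function (event) {
  showMutedIndicator();            // hypothetical application function
};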

One of our main goals is for Apple to include support for the HTML5 audio
spec in Mobile Safari.  Our assumption is that, until the user has the
ability to grant a page permission to play audio, Apple will find this to
be a poor-quality user experience and will continue to disable sound until
a user intentionally clicks a “play” button.  We presume that they are
concerned about advertisements on webpages that may begin playing annoying
sounds to a user who has not authorized sound playback.

A confounding factor here is that there may be multiple parties responsible
for authoring content on a page.  For example, in the Facebook canvas there
are advertisements on the right-hand side, with games running within an
iframe in the center.  How can a user grant the game permission to play
sound, and Facebook permission to play a “ding” sound when a new
notification appears, while denying sound permissions to advertisements?

As an internet user, I would personally be very annoyed to visit a webpage
with my iPad just to hear it suddenly start playing sound.  This problem is
more severe since mobile devices are often used in public places where you
might not want others to hear noises coming from your device.

Audio performance measurement

For benchmarking and for optimizing software, audio latency (and other
performance metrics) must be measurable.

See our “perf-marks” report which we periodically release to measure the
speed at which each browser implements the HTML5 spec.
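
As a minimal sketch of the kind of measurement we mean (assuming the
draft's decodeAudioData signature, a hypothetical reportMetric hook, and an
ArrayBuffer named encodedBytes fetched elsewhere), one can at least time
how long decoding takes:

var context = new AudioContext();
var decodeStart = Date.now();

// encodedBytes is assumed to be an ArrayBuffer fetched via XHR or similar.
context.decodeAudioData(encodedBytes, function (audioBuffer) {
  var decodeMillis = Date.now() - decodeStart;
  reportMetric('decode-ms', decodeMillis);   // hypothetical reporting hook
}, function () {
  reportMetric('decode-error', 1);           // hypothetical reporting hook
});

Output latency, by contrast, has no obvious hook in the draft, which is
exactly the gap we are pointing at.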

Synchronization of audio and video

(Video here can refer to dynamic animations, not strictly <video>
elements.  Audio here can refer to synthesized samples, not strictly
<audio> elements.)  For example, imagine a game like “Talking Tom” where a
user touches locations on the screen and sees a short animated sequence
paired with an audio clip.  In short, syncing programmatic animations
(perhaps using canvas / WebGL / DOM manipulation) with audio sequences is
very important.

Given that video is the authority, audio should be able to be triggered
(seeked, started, resumed) as soon as possible (with a latency of less than
17 milliseconds).  The video should be notified of any audio problems (e.g.
latency, high CPU usage) in order for the program to perform correction (of
either audio or video).

Given that audio is the authority, the program should be notified on
specific audio events (sound started or stopped playing, time/sample
reached) in order for the program to perform correction (of either audio or
video).

Example applications:

   - Voice acting (lip syncing)
   - Rhythm games
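
As a rough sketch of the video-as-authority case (assuming the draft's
noteOn() and an unprefixed requestAnimationFrame for brevity; the
animation-state functions and hitSoundBuffer are hypothetical), the best we
can do today is trigger the sound from inside the animation loop:

var context = new AudioContext();

function onAnimationFrame() {
  updateAnimation();                       // hypothetical animation step

  if (animationJustReachedImpactFrame()) { // hypothetical game predicate
    var source = context.createBufferSource();
    source.buffer = hitSoundBuffer;        // assumed pre-decoded AudioBuffer
    source.connect(context.destination);
    source.noteOn(0);                      // "as soon as possible"
  }

  window.requestAnimationFrame(onAnimationFrame);
}

window.requestAnimationFrame(onAnimationFrame);

There is no way for the program to find out whether that noteOn(0) actually
started within our 17 millisecond budget, which is the kind of feedback we
are asking for.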

API draft commentary

In general, the API feels like the fixed-function pipeline of olde OpenGL.
Developers are given a small set of functionality with which they can
transform sound.  Programmatic capabilities are very limiting; developers
cannot implement many of the existing AudioNode types using the
JavaScriptAudioNode API.  Developers will find it difficult to output audio
to buffers, files, and the network.

While an initial review DOES seem to indicate that most common pieces of
functionality will be possible to implement, it lacks the elegance and
extensibility of an OpenGL 2.0-like approach.  Perhaps in a later release
web developers could write “audio shaders” (filters are probably a better
name) in some DSL that could be compiled and run on the sound card, in the
same way that shaders are written in GLSL and passed down to the graphics
card.  In that type of approach one would not need to provide node types
like AudioGain, AudioPanner, etc., and instead would only need to support a
single AudioNode type that could implement all of the above.
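
To make “programmatic capabilities” concrete: about the only way to write a
custom filter today is a JavaScriptAudioNode, e.g. a hand-rolled gain (a
minimal sketch, assuming a mono stream for brevity):

var context = new AudioContext();

// 1 input channel, 1 output channel, 1024-sample processing blocks.
var gainInJs = context.createJavaScriptNode(1024, 1, 1);
var gainValue = 0.5;

gainInJs.onaudioprocess = function (event) {
  var input  = event.inputBuffer.getChannelData(0);
  var output = event.outputBuffer.getChannelData(0);
  for (var i = 0; i < input.length; i++) {
    output[i] = input[i] * gainValue;      // the whole "filter"
  }
};

Anything more ambitious (convolution, HRTF panning, resampling) quickly
runs into the performance and threading limits noted under 4.12 below.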

4.1. The AudioContext Interface

Where is it stated that there is a global AudioContext constructor?

It seems odd that the sample rate is a read-only property of the
AudioContext.  It seems like each node should be able to specify its
desired sample rate of processing, and the AudioDestinationNode controls
the default (because it is the native sound device (?)).  Can I not
resample?  Can I not mix a 22 kHz .wav, a 44.1 kHz .mp3, and a 48 kHz
generated sample?  Why can I create AudioBuffer objects with a different
sample rate if it is explicitly not going to work properly?
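
For example (a sketch of the question, not behaviour we have verified), the
draft lets us write the following but does not say what the mixer will do
with the mismatched rate:

var context = new AudioContext();          // context.sampleRate is read-only,
                                           // e.g. 44100 on typical hardware

// A one-second mono buffer created at 22050 Hz, not the context's rate.
var lowRateBuffer = context.createBuffer(1, 22050, 22050);

var source = context.createBufferSource();
source.buffer = lowRateBuffer;
source.connect(context.destination);
source.noteOn(0);                          // resampled?  played at the wrong
                                           // pitch?  an error?  unspecified.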

Why is there no currentSample getter?  One of the goals in 1.1. Features is
"Sample-accurate scheduled sound playback", and relying on floating-point
time values makes it harder to reason about sample-accurate calculations.
From minimal testing in C with double floating-point precision, this
doesn't seem to be much of a problem, but I still feel it is something to
be concerned about.

Ben brought up a good point about createBuffer(ArrayBuffer, boolean): how
is the data in the buffer interpreted?  Is it read as an array of Float32s,
interleaved for channels?  How is the number of channels determined?  It's
very unclear what exactly the buffer data is or should be.

createBuffer(ArrayBuffer, boolean) and decodeAudioData seem to be very
similar (i.e. they accomplish the same thing).  Ben suggests that
createBuffer(ArrayBuffer, boolean) be renamed to decodeAudioDataSync to
show this similarity.  (It then becomes clear that mixToMono is present in
the synchronous version but not in the asynchronous version.)
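
For reference, the two call sites look like this (a sketch assuming an
ArrayBuffer named encodedBytes fetched via XHR with responseType set to
"arraybuffer"):

var context = new AudioContext();

// Synchronous: decode (and optionally mix to mono) on the calling thread.
var bufferSync = context.createBuffer(encodedBytes, /* mixToMono */ false);

// Asynchronous: the same decode, but with callbacks and no mixToMono option.
context.decodeAudioData(encodedBytes, function (bufferAsync) {
  // use bufferAsync
}, function () {
  // decode failed
});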

Is it possible to have more than one AudioContext?  It's stated that this
is not common, but what is the expected behaviour if you do have more than
one?  What if an ad provider has sound in their advert, I have sound in my
game (within a Facebook frame), and Facebook has a "ding" sound in their
code notifying me of incoming chat messages?  Presumably all three will
separately instantiate an AudioContext.  How will/should these interact?

Because the number of inputs and outputs is constant for each node, what
happens if there is an unfilled input when output is needed, or an unfilled
output when input is given?

For createJavaScriptNode, both Ben and I think it makes sense that you
should be able to add source and destination nodes as needed, and that the
number of inputs and outputs does not need to be specified at the time of
creation.  See our comments on 4.12. The JavaScriptAudioNode Interface for
more information.

4.2. The AudioNode Interface

Are AudioNodes forced to form a directed acyclic graph?  What happens if
there is a cycle in the graph?

Based upon my reading of the “connect” method, there is a concept of
“ordering” with respect to the inputs and outputs of an AudioNode.  These
appear to use a zero-based index.  It seems to me that if this is the case,
there ought to be a good way to iterate over the inputs / outputs, for
example:

for (var i = 0; i < audioNode.inputs.length; i++) {
  var input = audioNode.inputs[i];
  if (input) {
    // do I need to check and see if this input is valid?
    // I wouldn't have to do this if it wasn't a sparse array
    // do something with input
  }
}

for (var i = 0; i < audioNode.outputs.length; i++) {
  var output = audioNode.outputs[i];
  if (output) {
    // do I need to check and see if this output is valid?
    // I wouldn't have to do this if it wasn't a sparse array
    // do something with output
  }
}

I’m certain that at some point I will want to do something like this - and
there doesn’t seem to be an explicit way.

Also, if I remove an input / output do the other inputs / outputs shift
downward?  None of this is explicit in this interface.

Is this a sparse array?  If I were to create an audio node with no inputs
or outputs and then call connect with the output set to something other
than zero would that be valid?  Is there a “max-inputs” or “max-outputs”
value?

Is numberOfInputs / numberOfOutputs constant, or variable as I call connect
/ disconnect?  If it is constant, how can I determine which slots have
valid inputs / outputs?  Perhaps one could add audioNode.hasInputAt(i) and
hasOutputAt(i) methods?

Why can’t I disconnect an input?  disconnect() only seems to work on
outputs.  Finding the input node and the appropriate index of the output
from that node could be a bear.  Now think about this: “It is possible to
connect an AudioNode output to more than one input with multiple calls to
connect().  Thus, "fanout" is supported.”  If an output feeding my
audioNode fans out to more than one destination, then with this spec I
cannot disconnect that one connection without also disconnecting all of the
other connections.
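
Concretely (a sketch using only connect() and disconnect() as specified):

var context = new AudioContext();
var source  = context.createBufferSource();
var gainA   = context.createGainNode();
var gainB   = context.createGainNode();

source.connect(gainA);        // output 0 -> gainA
source.connect(gainB);        // output 0 -> gainB ("fanout")

// I only want to detach gainB, but disconnect() takes an output index,
// not a destination, so this tears down *both* connections:
source.disconnect(0);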

Matt suggests that the inputs and outputs be encapsulated in collections,
which can be iterated upon and manipulated.  Alternatively, filters can be
made distinct from fusers, splitters, sources, and sinks.

4.4. The AudioDestinationNode Interface

I assume there is only one implementation of AudioDestinationNode (a
speaker), and it is created when you create a new AudioContext.  This does
make sense to me, but I don't see why it was done this way.

I would expect there to be a global list of sound devices (connected
microphones, sound outputs (headphones, speakers), etc.) which I manually
route.  The example in 1.2. Modular Routing could read:

var context = new AudioContext();
assert(context.destination == null);
context.destination = AudioDevices.defaultOutput;

function playSound() {
  var source = context.createBufferSource();
  source.buffer = dogBarkingBuffer;
  source.connect(context.destination);
  source.noteOn(0);
}

In other words, can I mix into something other than a speaker?  What if I
just want to mix sound and then ship it off to the internet, or to disk, or
to another buffer?

Is it an error to not specify a destination for a node?

4.5. The AudioParam Interface

It seems AudioParam is a prime candidate for JavaScript control.  It
doesn't seem I can easily implement tremolo (or vibrato) using the current
API.

In short, people will never be happy and will always demand more and more
methods.  Linear and Exponential are great, but why not just let someone
specify an arbitrary function that returns a value?  This allows a far
better level of control.
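
Something like the following (entirely hypothetical; no such setter exists
in the draft) would cover tremolo, vibrato, and whatever comes next:

// Hypothetical API: let the application supply the automation curve itself.
gainNode.gain.setValueFunction(function (time) {
  // 5 Hz tremolo between 0.5 and 1.0
  return 0.75 + 0.25 * Math.sin(2 * Math.PI * 5 * time);
});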

4.6. AudioGain

“The nominal maxValue is 1.0.”  What does that mean?  Is the “default
value” 1.0?  Do values greater than 1 increase volume, or just clamp to
1.0?  The “no exception thrown” doesn’t tell me what behaviour to expect.

4.9. The AudioBuffer Interface

Shouldn't audio gain be a separate AudioNode type?  What makes gain special
here?  I think I know the answer (range limited in the buffer for
efficiency), but it's not explained at all.

What happens when getChannelData is called with an out-of-range channel
(e.g. channel 1 where numberOfChannels === 1)?

4.10. The AudioBufferSourceNode Interface

It seems noteOn(x) is identical to noteGrainOn(x, 0, Infinity).  Can the
two functions be merged?
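
That is, as we read the draft, these two calls ought to do the same thing
(sketch only):

source.noteOn(startTime);

// Whole buffer as a single "grain": offset 0, unbounded duration.
source.noteGrainOn(startTime, 0, Infinity);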

How does the loop property interact with noteGrainOn?  Is only the selected
“grain” looping?

What happens if I change the loop property while data is playing?  What
about if the buffer changes?

4.12. The JavaScriptAudioNode Interface

The text:

   numberOfInputs  : 1
   numberOfOutputs : 1

is confusing, especially because a few paragraphs later the spec says the
number of inputs and outputs can be variable:

"numberOfInputChannels and numberOfOutputChannels determine the number of
input and output channels. It is invalid for both numberOfInputChannels and
numberOfOutputChannels to be zero."

AudioProcessingEvent does not give you multiple input or output buffers, so
it seems the JavaScriptAudioNode API requires exactly one input and one
output channel.

Can JavaScriptAudioNode not work with other threads (i.e. WebWorkers)?
 Using a WebWorker would be the first thing I’d try with
JavaScriptAudioNode.
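
The naive attempt (sketched below, with a hypothetical worker script
mixer.js) does not really work, because onaudioprocess must fill
outputBuffer synchronously while postMessage round-trips are asynchronous:

var worker = new Worker('mixer.js');       // hypothetical worker script
var node = context.createJavaScriptNode(1024, 1, 1);

node.onaudioprocess = function (event) {
  // We can ship the input off to the worker...
  worker.postMessage(event.inputBuffer.getChannelData(0));
  // ...but the worker's reply arrives after this handler has returned,
  // so there is nothing sensible to write into event.outputBuffer here.
};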

4.15. The AudioListener Interface

Why is AudioListener not an AudioNode?  It seems odd to special-case this
type.

9. Channel up-mixing and down-mixing

The following paths of up-mixing have different results:

1 -> 5.1
1 -> 2 -> 5.1

Is this intended behaviour?  (As we read the draft, 1 -> 5.1 places the
mono signal in the center channel, while 1 -> 2 -> 5.1 duplicates it into
the front left and right channels with a silent center.)

No paths for down-mixing a 5.1 layout are defined.

10. Event Scheduling

It’s very concerning that this section is incomplete, especially because
two of our goals (performance measurement and synchronization) demand
events.

-- 
Alistair MacDonald
SignedOn, Inc - W3C Audio WG
Boston, MA, (707) 701-3730
al@signedon.com - http://signedon.com

Received on Tuesday, 27 March 2012 21:19:29 UTC