RE: Web Audio API Feedback from Spaceport.io from Wei, James on 2012-03-28 (public-audio@w3.org from January to March 2012)

From: Wei, James <james.wei@intel.com>
Date: Wed, 28 Mar 2012 01:48:44 +0000
To: Alistair MacDonald <al@signedon.com>, "public-audio@w3.org" <public-audio@w3.org>
CC: "matt@spaceport.io" <matt@spaceport.io>, "ben@sibblingz.com" <ben@sibblingz.com>
Message-ID: <668CBE60026FE04AB5523C7C3CCB00F80742BB@SHSMSX101.ccr.corp.intel.com>
For the 4.4 AudioDestinationNode issue, an offline audio context (OfflineAudioContext) should be able to help.


Best Regards

James


From: Alistair MacDonald [mailto:al@signedon.com]
Sent: Wednesday, March 28, 2012 5:19 AM
To: public-audio@w3.org
Cc: matt@spaceport.io; ben@sibblingz.com
Subject: Web Audio API Feedback from Spaceport.io

About a month ago, I reached out to Ben and Matt from http://spaceport.io/. Spaceport is a mobile gaming company providing an SDK that allows people to build games with JavaScript, using Flash for the audio.

I asked them if they would be kind enough to review one of the specs. They wanted to review the Web Audio API, and took the time to go through it in some detail.

I wanted to thank Matt for the detailed and thoughtful contribution, and encourage people to discuss the comments.

There may be some great points in here that we can track or add/refine our use-cases.

--
Alistair


Spaceport's goals for audio on the web
User-driven disabling/permission
Like the geolocation, fullscreen, and pointer lock APIs, the Web Audio API must account for user and browser vendor concerns such as:

  *   Spoofing
  *   Resource limitations (bandwidth, CPU, memory, etc.)
  *   Annoyance/embarrassment

The API should make it obvious that the browser may cancel sound playback at any time (including sound loading), and the API should make the program aware when/if this occurs (e.g. in the form of error events to complement success events).

One of our main goals is for Apple to include support for the HTML5 audio spec in Mobile Safari.  Our assumption is that until the user has the ability to grant a page permission to play audio, Apple will consider this a poor-quality user experience and continue to disable sound until a user intentionally clicks a "play" button.  We presume that they are concerned about advertisements on webpages that may begin playing annoying sounds to a user who has not authorized sound playback.

A confounding factor here is that there may be multiple parties responsible for authoring content on a page.  For example - in the Facebook canvas - there are advertisements to the right hand side, with games running within an iFrame in the center.  How can a user grant permission to the game to play sound, and Facebook to play a "ding" sound when a new notification appears, but to disable sound permissions from advertisements?

As an internet user, I would personally be very annoyed to visit a webpage with my iPad just to hear it suddenly start playing sound.  This problem is more severe since mobile devices are often used in public places where you might not want others to hear noises coming from your device.
Audio performance measurement
For benchmarking and for optimizing software, audio latency (and other performance metrics) must be measurable.

See our "perf-marks" report which we periodically release to measure the speed at which each browser implements the HTML5 spec.
Synchronization of audio and video
(Video here can refer to dynamic animations, not strictly <video> elements.  Audio here can refer to synthesized samples, not strictly <audio> elements.)  For example - imagine a game like "Talking Tom" where a user touches locations on the screen and sees a short animated sequence paired with an audio clip.  In short - syncing the programmatic animations (perhaps using canvas / WebGL / DOM manipulation) with audio sequences is very important.

Given that video is the authority, audio should be able to be triggered (seeked, started, resumed) as soon as possible (with a latency of less than 17 milliseconds).  The video should be notified of any audio problems (e.g. latency, high CPU usage) in order for the program to perform correction (of either audio or video).

Given that audio is the authority, the program should be notified on specific audio events (sound started or stopped playing, time/sample reached) in order for the program to perform correction (of either audio or video).

Example applications:

  *   Voice acting (lip syncing)
  *   Rhythm games

API draft commentary
In general, the API feels like the fixed-function pipeline of olde OpenGL.  Developers are given a small set of functionality with which they can transform sound.  Programmatic capabilities are very limiting; developers cannot implement many of the existing AudioNode types using the JavaScriptAudioNode API.  Developers will find it difficult to output audio to buffers, files, and the network.

While an initial review DOES seem to indicate that most common pieces of functionality will be possible to implement, it lacks the elegance and extensibility of an OpenGL 2.0-like approach.  Perhaps in a later release web developers could write "audio-shaders" (filters are probably a better name) in some DSL that could be compiled and run on the sound card in the same way that shaders are written in GLSL and passed down to the graphics card.  With this type of approach one would perhaps not need to provide separate AudioNode types such as AudioGain and AudioPanner, and instead would only need to support a single AudioNode that could implement all of the above.
4.1. The AudioContext Interface
Where is it stated that there is a global AudioContext constructor?

It seems odd that the sample rate is a read-only property of the AudioContext.  It seems like each node should be able to specify its desired sample rate of processing, and the AudioDestinationNode controls the default (because it is the native sound device (?)).  Can I not resample?  Can I not mix a 22 kHz .wav, a 44.1 kHz .mp3, and a 48 kHz generated sample?  Why can I create AudioBuffer objects with a different sample rate if it is explicitly not going to work properly?
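
To make the mixing question concrete, here is a minimal sketch of what a program would have to do by hand if the context does not resample for it.  It uses linear interpolation, which is a deliberately crude technique chosen for brevity; `resampleLinear` and `mix` are hypothetical helpers operating on plain Float32Arrays, not part of the draft API.

```javascript
// Hypothetical helper: resample by linear interpolation between neighbours.
function resampleLinear(samples, fromRate, toRate) {
  const outLength = Math.floor((samples.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = (i * fromRate) / toRate;       // position in the source signal
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, samples.length - 1); // clamp at the last sample
    const frac = pos - i0;
    out[i] = samples[i0] * (1 - frac) + samples[i1] * frac;
  }
  return out;
}

// Hypothetical helper: sum two signals that already share a sample rate.
function mix(a, b) {
  const out = new Float32Array(Math.max(a.length, b.length));
  for (let i = 0; i < out.length; i++) {
    out[i] = (i < a.length ? a[i] : 0) + (i < b.length ? b[i] : 0);
  }
  return out;
}

const low = new Float32Array([0, 1, 2, 3]); // pretend this is 22050 Hz data
const high = new Float32Array(8).fill(1);   // pretend this is 44100 Hz data
const mixed = mix(resampleLinear(low, 22050, 44100), high);
console.log(Array.from(mixed));
```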

Why is there no currentSample getter?  One of the goals in 1.1. Features is "Sample-accurate scheduled sound playback".  Dealing with floating-point arithmetic can lead to sample-inaccurate calculations.  From minimal testing in C with double floating-point precision, this doesn't seem to be much of a problem, but I still feel it is something to be concerned about.
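
A minimal sketch of the concern, assuming a 44.1 kHz context: converting between seconds and sample indices has to round, and accumulating a per-sample time step drifts slightly.  `sampleToTime`, `timeToSample`, and `accumulateSeconds` are hypothetical helpers; the draft exposes only a time in seconds.

```javascript
// Hypothetical helpers; not part of the draft API.
const SAMPLE_RATE = 44100; // assumed context sample rate

function sampleToTime(sample) {
  return sample / SAMPLE_RATE;
}

function timeToSample(seconds) {
  // Round rather than truncate: (sample / rate) * rate may land
  // a hair below the intended integer.
  return Math.round(seconds * SAMPLE_RATE);
}

// Adding a per-sample time step repeatedly accumulates rounding error.
function accumulateSeconds(sampleCount) {
  let t = 0;
  for (let i = 0; i < sampleCount; i++) {
    t += 1 / SAMPLE_RATE;
  }
  return t;
}

const drift = Math.abs(accumulateSeconds(SAMPLE_RATE) - 1);
console.log("drift after one second of samples:", drift);
```

With doubles the drift is tiny, which is consistent with the observation above that it is not much of a problem in practice; an integer sample counter would avoid the question entirely.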

Ben brought up a good point about createBuffer(ArrayBuffer, boolean): how is the data in the buffer interpreted?  Is it read as an array of Float32's which are interleaved for channels?  How is the number of channels determined?  It's very unclear what exactly the buffer data is or should be.
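
To show why the question matters, here is a sketch of one possible reading, in which the buffer holds interleaved 32-bit float frames.  This is purely an assumption for illustration (the draft does not say so), and `deinterleave` is a hypothetical helper.  Note that the channel count has to be supplied from outside, because nothing in the raw data carries it.

```javascript
// Assumption: interleaved Float32 frames (L0, R0, L1, R1, ...).
// The draft does not state this; it is only one possible interpretation.
function deinterleave(arrayBuffer, numberOfChannels) {
  const interleaved = new Float32Array(arrayBuffer);
  const frames = Math.floor(interleaved.length / numberOfChannels);
  const channels = [];
  for (let c = 0; c < numberOfChannels; c++) {
    const data = new Float32Array(frames);
    for (let i = 0; i < frames; i++) {
      data[i] = interleaved[i * numberOfChannels + c];
    }
    channels.push(data);
  }
  return channels;
}

const raw = new Float32Array([0.5, -0.5, 0.25, -0.25]).buffer;
const [left, right] = deinterleave(raw, 2);
console.log(Array.from(left), Array.from(right));
```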

createBuffer(ArrayBuffer, boolean) and decodeAudioData seem to be very similar (i.e. they accomplish the same thing).  Ben suggests that createBuffer(ArrayBuffer, boolean) be renamed to decodeAudioDataSync to show this similarity.  (It then becomes clear that mixToMono is present in the synchronous version but not in the asynchronous version.)

Is it possible to have more than one AudioContext?  It's stated that this is not common - but what is the expected behaviour if you do have more than 1?  What if an ad-provider has sound in their advert, and I have sound in my game (within Facebook frame) and Facebook has a "ding" sound in their code notifying me of incoming chat messages.  Presumably all 3 will separately instantiate an AudioContext.  How will/should these interact?

Because the number of inputs and outputs is constant for each node, what happens if there is an unfilled input and output is needed, or if there is an unfilled output and input is given?

For createJavaScriptNode, both Ben and I think it makes sense that you should be able to add source and destination nodes as needed, and that the number of inputs and outputs does not need to be specified at the time of creation.  See our comments on 4.12. The JavaScriptAudioNode Interface for more information.
4.2 The AudioNode Interface
Are AudioNodes forced to form a directed acyclic graph?  What happens if there is a cycle in the graph?
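
If cycles are disallowed, an implementation (or a program) would need a check along these lines.  This is a sketch using a depth-first search over a hypothetical representation, a Map from node name to the names it connects to; the draft does not expose the graph this way.

```javascript
// Hypothetical graph representation: Map<nodeName, Array<nodeName>>.
function hasCycle(graph) {
  const VISITING = 1;
  const DONE = 2;
  const state = new Map();

  function visit(node) {
    if (state.get(node) === VISITING) return true; // back edge: cycle found
    if (state.get(node) === DONE) return false;    // already fully explored
    state.set(node, VISITING);
    for (const next of graph.get(node) || []) {
      if (visit(next)) return true;
    }
    state.set(node, DONE);
    return false;
  }

  for (const node of graph.keys()) {
    if (visit(node)) return true;
  }
  return false;
}

const acyclic = new Map([
  ["source", ["gain"]],
  ["gain", ["destination"]],
  ["destination", []],
]);
const feedback = new Map([
  ["source", ["gain"]],
  ["gain", ["delay"]],
  ["delay", ["gain"]], // feeds back into gain
]);
console.log(hasCycle(acyclic), hasCycle(feedback)); // false true
```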

Based upon my reading of the "connect" method there is a concept of "ordering" with respect to the inputs and outputs of an AudioNode.  These appear to be a zero-based index system.  It seems to me that if this is the case there ought to be a good way to iterate over the inputs / outputs.

for( var i = 0; i < audioNode.inputs.length; i++ ){
    var input = audioNode.inputs[i];
    if( input ){
        // do I need to check and see if this input is valid?
        // I wouldn't have to do this if it wasn't a sparse array
        // do something with input
    }
}

for( var i = 0; i < audioNode.outputs.length; i++ ){
    var output = audioNode.outputs[i];
    if( output ){
        // do I need to check and see if this output is valid?
        // I wouldn't have to do this if it wasn't a sparse array
        // do something with output
    }
}

I'm certain that at some point I will want to do something like this - and there doesn't seem to be an explicit way.

Also, if I remove an input / output do the other inputs / outputs shift downward?  None of this is explicit in this interface.

Is this a sparse array?  If I were to create an audio node with no inputs or outputs and then call connect with the output set to something other than zero would that be valid?  Is there a "max-inputs" or "max-outputs" value?

Is numberOfInputs / numberOfOutputs constant, or variable as I call connect / disconnect?  If it is constant, how can I determine which slots have valid inputs / outputs?  Perhaps one could add an audioNode.hasInputAt(i) and hasOutputAt(i) pair of methods?

Why can't I disconnect an input?  Disconnect only seems to work on outputs.  Finding the input node and the appropriate index of the output from that node could be a bear.  Now think about this: "It is possible to connect an AudioNode output to more than one input with multiple calls to connect(). Thus, "fanout" is supported."  If the output feeding my audioNode is also connected to other inputs (it's a fanout), then under this spec I cannot remove my connection without disconnecting all of the other connections.

Matt suggests that the inputs and outputs be encapsulated in collections, which can be iterated upon and manipulated.  Alternatively, filters can be made distinct from fusers, splitters, sources, and sinks.
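
A rough sketch of what the collection-based suggestion could look like.  Everything here is hypothetical (`SketchNode`, `inputs`, `outputs`, `disconnectInput`), and it ignores input/output indices for brevity; it only shows that tracking connections on both sides lets one edge of a fanout be removed on its own.

```javascript
// Hypothetical sketch, not the draft API.
class SketchNode {
  constructor(name) {
    this.name = name;
    this.inputs = new Set();  // nodes feeding this node
    this.outputs = new Set(); // nodes this node feeds
  }

  connect(destination) {
    this.outputs.add(destination);
    destination.inputs.add(this);
  }

  // Remove a single connection, identified from the receiving side.
  disconnectInput(source) {
    this.inputs.delete(source);
    source.outputs.delete(this);
  }
}

const source = new SketchNode("source");
const reverb = new SketchNode("reverb");
const speaker = new SketchNode("speaker");

source.connect(reverb);  // fanout: one source, two destinations
source.connect(speaker);

reverb.disconnectInput(source); // only the source -> reverb edge goes away

// Both collections can be iterated directly, with no sparse slots to check.
console.log([...source.outputs].map((n) => n.name)); // [ 'speaker' ]
```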
4.4. The AudioDestinationNode Interface
I assume there is only one implementation of AudioDestinationNode (a speaker), and it is created when you create a new AudioContext.  This does make sense to me, but I don't see why it was done this way.

I would expect there to be a global list of sound devices (connected microphones, sound outputs (headphones, speakers), etc.) which I manually route.  The example in 1.2. Modular Routing could read:

var context = new AudioContext();
assert(context.destination == null);
context.destination = AudioDevices.defaultOutput;

function playSound() {
    var source = context.createBufferSource();
    source.buffer = dogBarkingBuffer;
    source.connect(context.destination);
    source.noteOn(0);
}

In other words, can I mix into something other than a speaker?  What if I just want to mix sound and then ship it off to the internet, or to disk, or to another buffer?

Is it an error to not specify a destination for a node?
4.5. The AudioParam Interface
It seems AudioParam is a prime candidate for JavaScript control.  It doesn't seem I can easily implement tremolo (or vibrato) using the current API.

In short - people will never be happy and will always demand more and more methods.  Linear and Exponential are great, but why not just let someone specify an arbitrary function pointer that returns a value?  This allows a far better level of control.
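
As an illustration of the idea, here is a sketch of a function-driven parameter producing tremolo.  It operates on a plain Float32Array rather than a real AudioParam, and `applyGainCurve` and `tremolo` are hypothetical names; the sample rate is an arbitrary small value to keep the example short.

```javascript
// Hypothetical sketch: the gain at each sample comes from an arbitrary
// function of time in seconds.
function applyGainCurve(samples, sampleRate, gainAt) {
  const out = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    out[i] = samples[i] * gainAt(i / sampleRate);
  }
  return out;
}

// Tremolo: gain oscillates between 1 and (1 - depth) at rateHz.
function tremolo(rateHz, depth) {
  return (t) => 1 - depth * (0.5 - 0.5 * Math.cos(2 * Math.PI * rateHz * t));
}

const sampleRate = 8000;                            // assumed, for brevity
const input = new Float32Array(sampleRate).fill(1); // one second of constant 1.0
const output = applyGainCurve(input, sampleRate, tremolo(5, 0.5));
console.log(output[0], output[800]); // gain at t = 0 and at half a tremolo period
```

Vibrato would follow the same pattern with the function driving a playback-rate or delay parameter instead of gain.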
4.6 AudioGain
"the nominal maxValue is 1.0".  What?  Does that mean the "default value" is 1.0?  Do values greater than 1 increase volume, or just clamp to 1.0?  The "no exception thrown" doesn't tell me what behaviour to expect.
4.9. The AudioBuffer Interface
Shouldn't audio gain be a separate AudioNode type?  What makes gain special here?  I think I know the answer (range limited in the buffer for efficiency), but it's not explained at all.

What happens when getChannelData is called with an out-of-range channel (e.g. channel 1 where numberOfChannels === 1)?
4.10. The AudioBufferSourceNode Interface
It seems noteOn(x) is identical to noteGrainOn(x, 0, Infinity).  Can the two functions be merged?
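
A sketch of how a merged call could behave, using default arguments so that the two-argument-free form covers noteOn.  `resolvePlayback` is a hypothetical helper that only computes which part of the buffer would play; it is not the draft API.

```javascript
// Hypothetical merged signature: omitting the grain arguments plays the
// whole buffer, which is what noteOn(when) does today.
function resolvePlayback(bufferDuration, when, grainOffset = 0, grainDuration = Infinity) {
  const start = Math.min(grainOffset, bufferDuration);
  const end = Math.min(start + grainDuration, bufferDuration);
  return { when, start, end };
}

const whole = resolvePlayback(3, 0);         // like noteOn(0) on a 3 s buffer
const grain = resolvePlayback(3, 1, 0.5, 1); // like noteGrainOn(1, 0.5, 1)
console.log(whole, grain);
```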

How does the loop property interact with noteGrainOn?  Is only the selected "grain" looping?

What happens if I change the loop property while data is playing?  What about if buffer changes?
4.12. The JavaScriptAudioNode Interface
The text:

   numberOfInputs  : 1
   numberOfOutputs : 1

is confusing, especially because a few paragraphs later the spec says the number of inputs and outputs can be variable:

numberOfInputChannels and numberOfOutputChannels determine the number of input and output channels. It is invalid for both numberOfInputChannels and numberOfOutputChannels to be zero.

AudioProcessingEvent does not give you multiple input or output buffers, so it seems the API for JavaScriptAudioNode requires only one input and one output channel (exactly).

Can JavaScriptAudioNode not work with other threads (i.e. WebWorkers)?  Using a WebWorker would be the first thing I'd try with JavaScriptAudioNode.
4.15. The AudioListener Interface
Why is AudioListener not an AudioNode?  It seems odd to special-case this type.
9. Channel up-mixing and down-mixing
The following paths of up-mixing have different results:

1 -> 5.1
1 -> 2 -> 5.1

Is this intended behaviour?
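
A minimal sketch of the difference, assuming the draft's up-mix rules are: mono to stereo copies the input to L and R; mono to 5.1 places the input in the centre channel only; stereo to 5.1 copies L and R and leaves the rest silent.  The function names and the single-frame object representation are hypothetical.

```javascript
// One frame of audio, represented as an object of channel values.
const silent51 = () => ({ L: 0, R: 0, C: 0, LFE: 0, SL: 0, SR: 0 });

// Up-mix rules as assumed in the paragraph above.
function monoToStereo(m) {
  return { L: m, R: m };
}
function monoTo51(m) {
  return { ...silent51(), C: m };
}
function stereoTo51(s) {
  return { ...silent51(), L: s.L, R: s.R };
}

const direct = monoTo51(1);                    // 1 -> 5.1
const viaStereo = stereoTo51(monoToStereo(1)); // 1 -> 2 -> 5.1
console.log(direct);    // signal only in C
console.log(viaStereo); // signal in L and R, nothing in C
```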

No paths for down-mixing a 5.1 layout are defined.
10. Event Scheduling
It's very concerning that this section is incomplete, especially because two of our goals (performance measurement and synchronization) demand events.






--
Alistair MacDonald
SignedOn, Inc - W3C Audio WG
Boston, MA, (707) 701-3730
al@signedon.com - http://signedon.com
Received on Wednesday, 28 March 2012 01:49:22 UTC