Re: Web Audio API Feedback from Spaceport.io

Hi Al, thanks for your feedback.  I'll try to answer the comments inline
below:

On Tue, Mar 27, 2012 at 2:18 PM, Alistair MacDonald <al@signedon.com> wrote:

> About a month ago, I reached out to Ben and Matt from http://spaceport.io/.
> Spaceport is a mobile gaming company providing an SDK that allows people to
> build games with JavaScript, using Flash for the audio.
>
> I asked them if they would be kind enough to review one of the specs. They
> wanted to review the Web Audio API, and took the time to go through it in
> some detail.
>
> I wanted to thank Matt for the detailed and thoughtful contribution, and
> encourage people to discuss the comments.
>
> There may be some great points in here that we can track, or use to
> add/refine our use-cases.
>
> --
> Alistair
>
>
> * Spaceport’s goals for audio on the web: API draft commentary
>
> In general,
> the API feels like the fixed-function pipeline of olde OpenGL.  Developers
> are given a small set of functionality with which they can transform sound.
>  Programmatic capabilities are very limiting; developers cannot implement
> many of the existing AudioNode types using the JavaScriptAudioNode API.*
>

The JavaScriptAudioNode allows arbitrary DSP algorithms to be implemented,
so I would consider JavaScript itself to be the "shader language".  There
are some performance limitations there compared with native code, and
that's why common audio processing algorithms are provided by the other
AudioNodes.  You may consider that these nodes provide a "small set of
functionality", but in practice they can be quite versatile when used in
combination.  For example, together they can implement advanced game-engine
features typically seen in native desktop or console games.
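
To make this concrete, here is a rough sketch of a custom effect (a simple
bit-crusher) written entirely in JavaScript, using the createJavaScriptNode()
factory and onaudioprocess event as they stand in the current draft, with
"source" standing in for whatever node you already have:

  var context = new AudioContext();
  var crusher = context.createJavaScriptNode(1024, 1, 1);  // bufferSize, input channels, output channels
  crusher.onaudioprocess = function(event) {
    var input  = event.inputBuffer.getChannelData(0);
    var output = event.outputBuffer.getChannelData(0);
    var step = 1 / 16;  // quantize to roughly 4 bits for a lo-fi sound
    for (var i = 0; i < input.length; i++) {
      output[i] = Math.round(input[i] / step) * step;
    }
  };
  source.connect(crusher);
  crusher.connect(context.destination);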


> *  Developers will find it difficult to output audio to buffers, files,
> and the network.*
>

I don't agree.  Tomas Senart from SoundCloud implemented a basic audio
editing tool which can download audio files from SoundCloud, edit them,
then upload a finished mix:

http://audiojedit.herokuapp.com/

You can type "H" for help and search for any track on SoundCloud (for
example, search for "risk of djinxing").

People are doing some really interesting collaborative (network-based)
applications:
http://www.technitone.com/
http://labs.dinahmoe.com/plink/
http://www.multiplayerpiano.com/






> *
>
> While an initial review DOES seem to indicate that most common pieces of
> functionality will be possible to implement, it lacks the elegance and
> extensibility of an OpenGL 2.0-like approach.  Perhaps in a later release
> web-developers could write “audio-shaders” (filters are probably a better
> name) in some DSL that could be compiled and run on the sound card in the
> same way that shaders are written in GLSL and passed down to the graphics
> card.  Perhaps in this type of approach one would not need to provide
> AudioNodes like the AudioGain, and AudioPanner, etc AudioNodes, and instead
> would only need to support a single AudioNode that could implement all of
> the above.*
>

I'm sorry you don't like the "elegance" :)  As I mention above, you can
consider that JavaScript itself is the shader language.  In the general
case, "sound cards" on machines these days don't have general-purpose
standardised DSP capabilities, so this idea is not feasible.  One idea which
seems interesting is the ability to download arbitrary native code modules
(like a VST or AudioUnit plugin).  But we've discussed previously on this
list that there are serious security implications (think ActiveX), so this
won't be possible.



> *
> 4.1. The AudioContext Interface
>
> Where is it stated that there is a global
> AudioContext constructor?*
>

Should be added to the spec.


> *
>
> It seems odd that the sample rate is a read-only property of the
> AudioContext.  It seems like each node should be able to specify its
> desired sample rate of processing, and the AudioDestinationNode controls
> the default (because it is the native sound device (?)).  Can I not
> resample?  Can I not mix a 22kHz .wav and a 44.1kHz .mp3 and a 48kHz
> generated sample?*
>

You're mixing up a few concepts here.  Yes, you can mix a 22kHz .wav and a
44.1kHz .mp3 and a 48kHz generated sample.  Audio assets are automatically
sample-rate converted to the context's sample-rate, so the developer does
not have to worry about these low-level details.  They can simply mix and
match assets of different types and sample-rates.  For some very
specialized use cases, people like Jussi have requested being able to
create an AudioContext at a very specific sample-rate for use with the
JavaScriptAudioNode.  This has been discussed on-list and in the
tele-conference calls and I think it's understood that this can be useful.
 But the vast majority of use cases will not need this type of control.
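
As a concrete illustration (just a sketch, with "context" being an
AudioContext and "decodedWavBuffer22k" standing in for a 22kHz asset you
decoded elsewhere), a 48kHz generated buffer and a 22kHz asset can simply
be played into the same context and will be rate-converted for you:

  var sr = 48000;                           // generate at 48kHz regardless of context.sampleRate
  var toneBuffer = context.createBuffer(1, sr * 0.5, sr);  // 0.5 seconds, mono
  var data = toneBuffer.getChannelData(0);
  for (var i = 0; i < data.length; i++) {
    data[i] = Math.sin(2 * Math.PI * 440 * i / sr);        // 440Hz sine
  }

  var tone = context.createBufferSource();
  tone.buffer = toneBuffer;                 // converted to the context's rate on playback
  tone.connect(context.destination);

  var wav = context.createBufferSource();
  wav.buffer = decodedWavBuffer22k;         // the 22kHz asset
  wav.connect(context.destination);

  tone.noteOn(0);
  wav.noteOn(0);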



> *  Why can I create AudioBuffer objects with a different sample rate if
> it is explicitly not going to work properly?*
>

I'm not sure what you mean.  The sample-rate *is* taken into account when
used with an AudioBufferSourceNode.


> *
>
> Why is there no currentSample getter?  One of the goals in 1.1. Features
> is "Sample-accurate scheduled sound playback".  Dealing with floating-point
> arithmetic can lead to sample-inaccurate calculations.  From minimal testing
> in C with double floating-point precision, this doesn't seem to be much of
> a problem, but I still feel it is something to be concerned about.*
>

Are you talking about "float" vs. "double" precision?  I agree that most of
the APIs involving times should be upgraded to "double".


> *
>
> Ben brought up a good point about createBuffer(ArrayBuffer, boolean): how
> is the data in the buffer interpreted?  Is it read as an array of Float32's
> which are interleaved for channels?*
>

No, the data is interpreted as encoded audio file data, such as from a .wav
or .mp3 file.



> *  How is the number of channels determined?  It's very unclear what
> exactly the buffer data is or should be.*
>

The number of channels is encoded in the audio file data.  For example, if
the data comes from a .wav file then it might be mono or stereo (or have
more channels).  The buffer data is ordinary audio file data which can be
fetched with an XHR2 request, read using the File API, received via Web
Sockets, etc.
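
For example, here's a rough sketch (the URL is just a placeholder) of
fetching a file with XHR2 and decoding it synchronously with
createBuffer(ArrayBuffer, mixToMono); the channel count and duration come
from the file itself:

  var xhr = new XMLHttpRequest();
  xhr.open('GET', 'sounds/explosion.wav', true);
  xhr.responseType = 'arraybuffer';
  xhr.onload = function() {
    // synchronous decode of the encoded file data
    var buffer = context.createBuffer(xhr.response, false /* mixToMono */);
    console.log(buffer.numberOfChannels, buffer.duration);
  };
  xhr.send();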


> *
>
> createBuffer(ArrayBuffer, boolean) and decodeAudioData seem to be very
> similar (i.e. they accomplish the same thing).  Ben suggests that
> createBuffer(ArrayBuffer, boolean) be renamed to decodeAudioDataSync to
> show this similarity.  (It then becomes clear that mixToMono is present in
> the synchronous version but not in the asynchronous version.)*
>

That may be a good name-change.  Or, we could just deprecate the older
createBuffer() API, since synchronous blocking APIs like this are not as
desirable now that we have an async method.
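
For comparison, here is a rough sketch of the asynchronous path ("playSound"
standing in for whatever you do with the decoded buffer):

  context.decodeAudioData(xhr.response,
    function(buffer) {
      // decoded asynchronously, without blocking the main thread
      playSound(buffer);
    },
    function() {
      // decoding failed (unsupported or corrupt data)
    });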


> *
>
> Is it possible to have more than one AudioContext?  It's stated that this
> is not common - but what is the expected behaviour if you do have more than
> 1?  What if an ad-provider has sound in their advert, and I have sound in
> my game (within a Facebook frame), and Facebook has a "ding" sound in their
> code notifying me of incoming chat messages.  Presumably all 3 will
> separately instantiate an AudioContext.  How will/should these interact?*
>

They all play/mix sound, each from its own context.


> *
>
> Because the number of inputs and outputs is constant for each node, what
> happens if there is an unfilled input and output is needed, or if there is
> an unfilled output and input is given?*
>

I'm not sure I understand the specific question.  A specific example would
be helpful.


> *
>
> For createJavaScriptNode, both Ben and I think it makes sense that you
> should be able to add source and destination nodes as needed, and that the
> number of inputs and outputs does not need to be specified at the time of
> creation.  See our comments on 4.12. The JavaScriptAudioNode Interface for
> more information.
> 4.2 The AudioNode Interface
>
> Are AudioNodes forced to form a directed
> acyclic graph?  What happens if there is a cycle in the graph?*
>

Cycles are allowed, but are only really useful when used with a DelayNode.
 In this way delays with effects in the feedback loop are possible.
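
For instance, here is a sketch of a feedback echo, using the
createDelayNode() and createGainNode() factory names from the current
draft, with "source" standing in for any node:

  var delay = context.createDelayNode();
  delay.delayTime.value = 0.3;             // 300ms between echoes
  var feedback = context.createGainNode();
  feedback.gain.value = 0.4;               // less than 1.0 so the echoes decay

  source.connect(context.destination);     // dry signal
  source.connect(delay);
  delay.connect(feedback);
  feedback.connect(delay);                 // the cycle: delay -> gain -> delay
  delay.connect(context.destination);      // wet (echoed) signal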



> *
>
> Based upon my reading of the “connect” method there is a concept of
> “ordering” with respect to the inputs and outputs of an AudioNode.  These
> appear to be a zero-based index system.  It seems to me that if this is the
> case there ought to be a good way to iterate over the inputs / outputs.
>
> for( var i = 0; i < audioNode.inputs.length; i++ ){
>   var input = audioNode.inputs[i];
>   if( input ){
>     // do I need to check and see if this input is valid?
>     // I wouldn’t have to do this if it wasn’t a sparse array
>     // do something with input
>   }
> }
>
> for( var i = 0; i < audioNode.outputs.length; i++ ){
>   var output = audioNode.outputs[i];
>   if( output ){
>     // do I need to check and see if this output is valid?
>     // I wouldn’t have to do this if it wasn’t a sparse array
>     // do something with output
>   }
> }
>
> I’m certain that at some point I will want to do something like this - and
> there doesn’t seem to be an explicit way.*
>

I'm not so sure that it will be that useful.  It would be good to see some
strong use cases for this.



> *
>
> Also, if I remove an input / output do the other inputs / outputs shift
> downward?  None of this is explicit in this interface.*
>

There is no "removing" an input or output.


> *
>
> Is this a sparse array?  If I were to create an audio node with no inputs
> or outputs and then call connect with the output set to something other
> than zero would that be valid?  Is there a “max-inputs” or “max-outputs”
> value?*
>

They are not sparse arrays.  This really doesn't come up in practice.  Most
of the nodes have a known number of inputs and outputs (usually 0 or 1).
 The AudioChannelSplitter and AudioChannelMerger are a little different,
but have well-defined behavior (I'm happy to improve the text in the spec
to make it more clear).


> *
>
> Is numberOfInputs / numberOfOutputs constant, or variable as I call
> connect / disconnect?  If it is constant, how can I determine which slots
> have valid inputs / outputs?  Perhaps one could add audioNode.hasInputAt(i)
> and hasOutputAt(i) methods?*
>

This really only comes up with AudioChannelSplitter and AudioChannelMerger,
where I think it won't be necessary to add such extra methods.  But we can
discuss this in more detail in another thread...

> *
>
> Why can’t I disconnect an input?  Disconnect only seems to work on
> outputs.  Finding the input node and the appropriate index of the output
> from that node could be a bear.  Now think about this: “It is possible to
> connect an AudioNode output to more than one input with multiple calls to
> connect(). Thus, "fanout" is supported.”  If the input to my audioNode has
> more than one output (it’s a fanout) then I cannot, with this spec,
> disconnect this connection without disconnecting all of the other
> connections.*
>

Disconnecting an input *might* be useful to have, and something worth
considering.



> *
>
> Matt suggests that the inputs and outputs be encapsulated in collections,
> which can be iterated upon and manipulated.  Alternatively, filters can be
> made distinct from fusers, splitters, sources, and sinks.
> 4.4. The AudioDestinationNode Interface
>
> I assume there is only one
> implementation of AudioDestinationNode (a speaker), and it is created when
> you create a new AudioContext.  This does make sense to me, but I don't see
> why it was done this way.
>
> I would expect there to be a global list of sound devices (connected
> microphones, sound outputs (headphones, speakers), etc.) which I manually
> route. *
>

Device enumeration/selection, etc. is something we'll definitely have to
consider...



> * The example in 1.2. Modular Routing could read:
>
> var context = new AudioContext();
> assert(context.destination == null);
> context.destination = AudioDevices.defaultOutput;
>
> function playSound() {
>    var source = context.createBufferSource();
>    source.buffer = dogBarkingBuffer;
>    source.connect(context.destination);
>    source.noteOn(0);
> }
>
> In other words, can I mix into something other than a speaker?  What if I
> just want to mix sound and then ship it off to the internet, or to disk, or
> to another buffer?*
>

These are all different cases worth considering separately, but in one way
or another these things are already possible:

1. "Ship it off to the internet":
a) WebSockets: I've seen demos developers have written sending audio data
via WebSockets (using a JavaScriptAudioNode)
b) WebRTC: here's an early provisional spec which has been discussed on
this list and TPAC:
https://dvcs.w3.org/hg/audio/raw-file/tip/webaudio/webrtc-integration.html

2) Writing to a file: today it's possible to intercept the stream using a
JavaScriptAudioNode and write asynchronously to a file.  It might be
interesting to consider a more full-fledged AudioFileRecorder node.  For
anything technical such as implementing DAW software it will be important
to be able to have precise control over *exactly* what time the recording
starts.

3. To another buffer:  WebKit has something (internally used for testing)
called an OfflineAudioContext which writes (faster than real-time) to an
AudioBuffer.
There's been some recent discussion of adding this to the spec since it
would be of more general use.
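
For the JavaScriptAudioNode interception mentioned in (2), a rough sketch
might look like this ("mixBus" standing in for whatever node carries your
final mix, with the encode/upload step left out):

  var recorded = [];
  var tap = context.createJavaScriptNode(4096, 1, 1);
  tap.onaudioprocess = function(event) {
    var input = event.inputBuffer.getChannelData(0);
    recorded.push(new Float32Array(input));           // copy, since the buffer is reused
    event.outputBuffer.getChannelData(0).set(input);  // pass the audio through unchanged
  };
  mixBus.connect(tap);
  tap.connect(context.destination);
  // later: encode 'recorded' to a .wav in JavaScript, then upload it via
  // XHR/WebSockets or save it with the File API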



> *
>
> Is it an error to not specify a destination for a node?*
>

I'm not clear what this means.


> *
> 4.5. The AudioParam Interface
>
> It seems AudioParam is a prime candidate for
> JavaScript control.  It doesn't seem I can easily implement tremolo (or
> vibrato) using the current API.*
>

There's already quite a lot of fine-grained control of an AudioParam.
Tremolo and vibrato are possible today, but not as elegantly as when it
becomes possible to modulate an AudioParam with an audio-rate signal.


> *
>
> In short - people will never be happy and will always demand more and more
> methods.  Linear and Exponential are great, but why not just let someone
> specify an arbitrary function pointer that returns a value?  This allows a
> far better level of control.*
>

Arbitrary sample-accurate parameter changes can be specified with:

partial interface AudioParam {
    void setValueCurveAtTime(in Float32Array values, in float time,
                             in float duration);
}

So this level of control is already possible.
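
For example, a tremolo could be sketched roughly like this (createGainNode()
per the current draft, "source" being any node, and the curve values and
timing purely illustrative):

  var curve = new Float32Array(4096);
  var rate = 5;                                        // 5Hz tremolo
  for (var i = 0; i < curve.length; i++) {
    var t = (i / curve.length) * 2.0;                  // time within a 2 second window
    curve[i] = 0.6 + 0.4 * Math.sin(2 * Math.PI * rate * t);
  }

  var tremolo = context.createGainNode();
  tremolo.gain.setValueCurveAtTime(curve, context.currentTime, 2.0);
  source.connect(tremolo);
  tremolo.connect(context.destination);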



> *
> 4.6 AudioGain
>
> “the nominal maxValue is 1.0”.  What? Does that mean the
> “default value” is 1.0?  Do values greater than 1 increase volume, or just
> clamp to 1.0?  The “no exception thrown” doesn’t tell me what behaviour to
> expect.*
>

Values greater than 1.0 *do* increase volume.  Chris Lowis has proposed
that values be allowed to go negative (for phase inversion) and I think
this is a good idea.  There's probably some language in this part of the
spec that can be made more precise.


> *
> 4.9. The AudioBuffer Interface
>
> Shouldn't audio gain be a separate
> AudioNode type?  What makes gain special here?  I think I know the answer
> (range limited in the buffer for efficiency), but it's not explained at all.
> *
>

Yes, I think the .gain attribute of AudioBuffer should be removed from the
spec, as James Wei has proposed, since the gain change can just be handled
separately in an AudioGainNode.
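
In other words, instead of setting a gain on the buffer itself, you would
simply do something like this (a sketch, with createGainNode() per the
current draft):

  var gainNode = context.createGainNode();
  gainNode.gain.value = 0.5;               // attenuate here rather than via buffer.gain
  source.connect(gainNode);
  gainNode.connect(context.destination);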


> *
>
> What happens when getChannelData is called with an out-of-range channel
> (e.g. channel 1 where numberOfChannels === 1)?*
>

It should throw an exception.  This should be spelled out in the spec.


> *
> 4.10. The AudioBufferSourceNode Interface
>
> It seems noteOn(x) is identical
> to noteGrainOn(x, 0, Infinity).  Can the two functions be merged?*
>

That might be a good suggestion.  But, we should think carefully about
that...


> *
>
> How does the loop property interact with noteGrainOn?  Is only the
> selected “grain” looping?*
>

It should be noted in the spec that the loop property doesn't apply to
noteGrainOn().


> *
>
> What happens if I change the loop property while data is playing?  What
> about if buffer changes?*
>

All good questions, and should be spelled out precisely.  I think the loop
property should be ignored after noteOn() calls.
Buffers should be allowed to change at any time.



> *
> 4.12. The JavaScriptAudioNode Interface
>
> The text:
>
>    numberOfInputs  : 1
>     numberOfOutputs : 1
>
> is confusing, especially because a few paragraphs later the spec says the
> number of inputs and outputs can be variable:*
>

Yes, I've already made one pass of cleanup in this section.  But I should
probably do some more work here.


> *
>
> numberOfInputChannels and numberOfOutputChannels determine the number of
> input and output channels. It is invalid for both numberOfInputChannels and
> numberOfOutputChannels to be zero.
>
> AudioProcessingEvent does not give you multiple input or output buffers,
> so it seems the API for JavaScriptAudioNode requires only one input and one
> output channel (exactly).
>
> Can JavaScriptAudioNode not work with other threads (i.e. WebWorkers)?
>  Using a WebWorker would be the first thing I’d try with
> JavaScriptAudioNode.*
>

There's been a discussion thread (or two) in the past month or two about
Web Workers and how this could work with JavaScriptAudioNode.



> *
> 4.15. The AudioListener Interface
>
> Why is AudioListener not an AudioNode?
>  It seems odd to special-case this type.*
>

Because it is not a source, processor, or sink of audio.  This is similar
to AudioParam, which is also a distinct type of object (not an AudioNode).
By the way, the notion of AudioListener (and aspects of AudioPannerNode)
has been taken practically verbatim from the OpenAL specification.
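
Roughly, the two are used together like this (a sketch; "source" is any node
and the positions are arbitrary):

  var panner = context.createPanner();
  panner.setPosition(10, 0, 5);            // position of the sound in 3D space
  source.connect(panner);
  panner.connect(context.destination);

  // the single listener hangs off the context, much as in OpenAL
  context.listener.setPosition(0, 0, 0);
  context.listener.setOrientation(0, 0, -1, 0, 1, 0);  // front vector, up vector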


> *
> 9. Channel up-mixing and down-mixing
>
> The following paths of up-mixing have
> different results:
>
> 1 -> 5.1
> 1 -> 2 -> 5.1
>
> Is this intended behaviour?*
>

Yes, once a mono channel has been mixed into a stereo stream (with other
stereo sources) then its "mono-ness" is gone.

We can quibble about whether 1 -> 5.1 should mix into the center channel
(as currently in the spec) or just directly do a stereo up-mix and place it
in the L/R of the 5.1 channels.  These up-mix algorithms are meant to
handle what happens when sources with different numbers of channels are
connected and "we just want it to do the right thing".  If a developer
wants more exact control, then arbitrary matrix mixes from N --> M channels
can be done with splitters, mergers, and gain nodes.
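
For example, a simple 2 --> 2 matrix mix with each channel scaled
independently might be sketched like this (factory names per the current
draft, with "source" being a stereo node):

  var splitter  = context.createChannelSplitter();
  var merger    = context.createChannelMerger();
  var leftGain  = context.createGainNode();
  var rightGain = context.createGainNode();
  leftGain.gain.value  = 1.0;
  rightGain.gain.value = 0.5;

  source.connect(splitter);
  splitter.connect(leftGain, 0);           // splitter output 0 = left channel
  splitter.connect(rightGain, 1);          // splitter output 1 = right channel
  leftGain.connect(merger, 0, 0);          // into merger input 0 (left)
  rightGain.connect(merger, 0, 1);         // into merger input 1 (right)
  merger.connect(context.destination);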


> *
>
> No paths for down-mixing a 5.1 layout are defined.*
>

Yes, I know.  We're looking into appropriate mix-down coefficients here.


> *
> 10. Event Scheduling
>
> It’s very concerning that this section is
> incomplete, especially because two of our goals (performance measurement
> and synchronization) demand events.*
>

I will do my best to improve this section.  In the meantime, please look at
the AudioParam section where the "scheduled" parameter change APIs are
found.  This is essentially all this section is talking about.

Thanks again for your comments.

Cheers,
Chris

Received on Tuesday, 27 March 2012 23:47:12 UTC