Some general feedback on the Web Audio API spec and suggestions for improvements

I've been trying to use the Web Audio API for over a year now to support
end users' attempts to port games that make use of native audio APIs. The
following are spec deficiencies/bugs that I think should be addressed,
based on problems I've encountered and that my users have encountered.

1. channelCount &c on AudioNodes
AudioNode is specced as having these properties and they are described as
applying to all nodes. They do not.
StackOverflow answers by cwilson (and some manual testing on my end)
indicate that AudioBufferSourceNode ignores these properties, and that it
should because it has no 'input' and they only affect 'inputs'. It also
appears that channel splitters/mergers ignore these properties as well, and
I find it hard to justify this particular behavior.

1a. If a given AudioNode does not implement these properties, attempts to
set them should throw so that end users are able to easily identify which
particular nodes are 'special' and lack support for channel count control.
This is an important enough feature that having to try and blindly debug it
by listening to your speakers is not an acceptable scenario.
1b. I also suggest that the spec be updated to explicitly state for each
given node that it does not support channelCount and kin if the node does
not support them.
1c. I also believe that the AudioBufferSourceNode behavior in this case is
kind of irrational: even if it doesn't have an input node, it has an
'input' in semantic terms, in that it's reading samples from a buffer. But
I understand if it is too complicated or weird to implement channelCount on
source nodes, and it's not the end of the world to have to put in a gain
node in order to convert mono up to stereo.
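For illustration, the gain-node workaround from 1c could look something like this. The helper name is hypothetical, and it assumes channelCount/channelCountMode behave as specced on GainNode:

```javascript
// Sketch: work around AudioBufferSourceNode ignoring channelCount by
// routing through a GainNode whose input channel handling is forced to
// stereo. Helper name is hypothetical.
function upmixToStereo(ctx, sourceNode) {
  var gain = ctx.createGain();
  // Force the gain node's input to be mixed to exactly 2 channels.
  gain.channelCount = 2;
  gain.channelCountMode = "explicit";
  gain.channelInterpretation = "speakers";
  sourceNode.connect(gain);
  return gain; // connect this to the destination (or further processing)
}
```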

2. playbackRate on AudioBufferSourceNode
This property's behavior is effectively unspecified.

2a. Please specify the behavior. Without knowing what it does, it's not
possible to use it to achieve particular audio goals.
2b. The spec should also be updated to make it clear that you can use
playbackRate to adjust the pitch of audio being played back. All mentions
of 'pitch' in the spec merely refer to the panner node's doppler effect
support, which makes it appear as if that is the only way to accomplish
pitch shifting.  (I understand that 'pitch shifting' is not what this
property actually does, and that it instead adjusts the sampling rate of
playback in some fashion, either through an FFT or something else.)
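To illustrate 2b, here is the behavior most developers assume (a rate of 2.0 plays twice as fast and sounds an octave higher). The helper names are hypothetical and nothing in the current spec text guarantees these semantics:

```javascript
// Sketch of the commonly assumed playbackRate semantics (resampling).
// Helper names are hypothetical; the spec does not currently state this.
function rateForSemitones(semitones) {
  // A pitch shift of n semitones corresponds to a rate of 2^(n/12).
  return Math.pow(2, semitones / 12);
}
function effectiveDuration(bufferDuration, playbackRate) {
  // Under resampling semantics, a rate of 2.0 halves the play time.
  return bufferDuration / playbackRate;
}
```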

3. Stereo panning is incredibly complicated and error-prone
At present, the only way to do stereo panning in the Web Audio API involves
3 gain nodes, a channel splitter and a channel merger. This is easy to get
wrong, in particular because issue #1 makes the most obvious implementation
not work correctly for mono sources but work correctly for stereo sources,
so you can end up with broken code out in the wild. I also consider it a
problem if playing individual samples with panning (say, in an Impulse
Tracker player) requires the creation of 5 nodes for every single active
sound instance. This seems like it would implicitly create a lot of
mixing/filtering overhead, use a lot more memory, and increase GC pressure.

3a. If possible, a simple mechanism for stereo panning should be
introduced. Ideally this could be exposed by PannerNode, or by a new
2DPannerNode type. Another option would be a variant of GainNode that
allows per-channel gain (but I dislike this option since it overlaps
ChannelSplitter/ChannelMerger too much).
3b. If a new node is not possible, the correct way to do this should be
clearly specified, in particular because channelsplitter/channelmerger
explicitly avoid specifying which channel is 'left' and which channel is
'right' in a stereo source.
3c. One other option is to clearly specify the behavior of the existing
PannerNode so that it is possible to use it to achieve 2D panning. I don't
know anyone who has done this successfully (a couple of my users tried and
failed; they claim that the PannerNode never adjusts per-channel volume
based on the source's position).
4. createBuffer is synchronous
The spec still does not clearly communicate anywhere to end users that one
of createBuffer's overloads does a synchronous audio decode. Current
implementations in the wild thus cause the browser to hang for multiple
seconds, unresponsive, when you call the overload that causes a synchronous
decode. Worse still, the profiler in Chrome does not record samples for
this operation, so it is very difficult to identify the problem. If an end
user simply looks over the spec's list of methods, they will almost always
choose createBuffer over decodeAudioData (it's simpler, and it has the
mixToMono parameter, so it's more powerful), and end up with an app that is
subtly broken.

4a. The steps in the spec should explicitly require a synchronous decode.
As currently written, the described steps could easily be performed
asynchronously on a mixer thread and still produce a valid result (as long
as the decoding finished before the first time the sound was actually
played).
4b. The spec should be painfully, obviously clear that using this overload
of createBuffer will hang your browser.
4c. If possible, this overload should be disabled unless running in a web
worker. But I can imagine that there may be particular use cases where a
synchronous decode on the browser's UI thread is desired.
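To illustrate the asynchronous path that 4b should steer people toward, something like this (the helper name is hypothetical):

```javascript
// Sketch: load and decode a sound without blocking the UI thread, using
// XMLHttpRequest plus decodeAudioData instead of the synchronous
// createBuffer(ArrayBuffer, mixToMono) overload.
function loadSound(ctx, url, onReady, onError) {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", url, true);
  xhr.responseType = "arraybuffer";
  xhr.onload = function () {
    // Asynchronous decode: onReady fires when decoding completes.
    ctx.decodeAudioData(xhr.response, onReady, onError);
  };
  xhr.onerror = onError;
  xhr.send();
}
```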

5. It is unclear which audio formats can be decoded by an implementation
At present the spec appears to have no opinion about what can be decoded by
an implementation, or how you should detect the correct audio format to
use. This has already led to subtle bugs in implementations that were not
caught until I ran end users' games in browsers with implementations that
defied expectations.

5a. Update the spec to state that Audio.canPlayType should return
information that matches the behavior of the Web Audio API.
5b. Or, expose a way to query the web audio API about which mime types it
can decode.
5c. Or, explicitly state that the way you are supposed to format detect is
by downloading the entire mp3/ogg/etc versions of your sounds and trying to
decode them one at a time. I consider this an unacceptable solution, but it
would be better than the current unspecified state.
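As a sketch of the 5a/5b direction, here is selection logic over canPlayType-style answers ("", "maybe", "probably"). The helper is hypothetical; in a browser you would pass a wrapper around Audio.canPlayType as canPlay, and hope its answers match what decodeAudioData actually supports:

```javascript
// Sketch: pick the best candidate format given a canPlayType-style
// predicate. Helper name is hypothetical.
function pickFormat(canPlay, candidates) {
  var best = null;
  for (var i = 0; i < candidates.length; i++) {
    var answer = canPlay(candidates[i].mime);
    if (answer === "probably") return candidates[i];
    if (answer === "maybe" && best === null) best = candidates[i];
  }
  return best; // null if nothing is reportedly playable
}
```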

6. Pausing playback is not built into the API and workarounds have issues
At present the API exposes no way to pause playback of an
AudioBufferSourceNode. Workarounds have been proposed on StackOverflow and
in other forums, but these workarounds have issues (primarily that they
involve a race condition between JS and the mixer, but also, they are
needlessly
a race condition between JS and the mixer, but also, they are needlessly
complex and difficult to implement). Pausing is also near nightmare status
when looping is involved. The interaction between the current workaround
and playbackRate is also unspecified.

6a. Add pause(optional double when) and resume(optional double when)
methods to AudioBufferSourceNode.
6b. If not 6a, clearly specify the intended workaround and describe a
solution to the race condition between JS and the mixer.
6c. If not 6a, clearly describe how to implement pausing correctly with
looping active. This has never been stated and seems incredibly
dependent on the exact implementation of the mixer (i.e. is looping
gapless, etc.).
6d. Clearly specify the interaction between the offset/duration arguments
to AudioBufferSourceNode.start and AudioBufferSourceNode.playbackRate so
that it is possible to correctly implement the pause workaround when
playbackRate is used.

To clearly state the race, the current workaround (advocated by cwilson,
iirc) is this:

- When calling start(), record AudioContext.currentTime as the 'playback
  start time'.
- To pause, call stop(), record AudioContext.currentTime as the 'playback
  stop time', and throw away your current AudioBufferSourceNode.
- To resume, create a new AudioBufferSourceNode, and call start() with an
  offset equal to ('playback stop time' - 'playback start time').
The problem is that AudioContext.currentTime is specified as 'always moving
forward' and increasing in real-time. It cannot be paused or re-positioned.
This means that the currentTime can change between the call to stop() and
the retrieval of the currentTime attribute; furthermore, an unknown amount
of time can elapse between the call to start() and the actual beginning of
audio playback. So your recorded start time/stop time can end up off by
some unknown number of milliseconds.

As noted above, this workaround has other deficiencies as well. Even if
this workaround did not have multiple deficiencies, I believe it is
unacceptably complex for such a simple, common audio operation. Pausing and
resuming playback happens all the time. It should not be this complex and
it should not produce GC pressure.
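To make the complexity concrete, the workaround sketched as code looks roughly like this. The class name is hypothetical, and the comments mark where the race bites:

```javascript
// Sketch of the stop()/re-create pause workaround. Class name is
// hypothetical. Note that both timestamps are sampled from JS, so the
// saved offset can drift from what the mixer actually played.
function PausableSource(ctx, buffer) {
  this.ctx = ctx;
  this.buffer = buffer;
  this.node = null;
  this.startedAt = 0; // context time when playback (re)started
  this.offset = 0;    // buffer offset accumulated across pauses
}
PausableSource.prototype.play = function () {
  this.node = this.ctx.createBufferSource();
  this.node.buffer = this.buffer;
  this.node.connect(this.ctx.destination);
  this.startedAt = this.ctx.currentTime; // race: audio may begin later
  this.node.start(0, this.offset);
};
PausableSource.prototype.pause = function () {
  this.node.stop(0);
  // race: currentTime may have advanced since stop() took effect
  this.offset += this.ctx.currentTime - this.startedAt;
  this.node = null; // the old node must be thrown away
};
```

Note that every pause/resume cycle allocates a new AudioBufferSourceNode, which is where the GC pressure comes from.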

7. Playback state of AudioBufferSourceNodes is needlessly difficult to
track
Related to 6 and 2 - I have a ton of code written to perform simple
operations like figure out whether a given AudioBufferSourceNode is
currently playing audio. No sane audio API I have ever used makes it this
hard to do something this simple. Adding in features like playbackRate and
loop makes this non-trivial to do in JS and easy to get wrong.

7a. Add an attribute to AudioBufferSourceNode, hypothetically called
isPlaying, which returns true if the node is currently playing and false if
it is not.
7b. Add an attribute to AudioBufferSourceNode, hypothetically called
playbackOffset, which returns the current playback offset of the node if it
is playing (and, given the presence of a pausing mechanism from 6a, returns
the most recent playback offset if it is paused).
7c. If pausing is added as a mechanism, expose an attribute that returns
the paused state (hypothetically called isPaused)
7d. If polling is not preferable, expose some sort of event handler or
callback (like the Audio element's 'ended' event) that can be used to get
notifications about the state of an AudioBufferSourceNode without polling.
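To illustrate how much bookkeeping 7a/7b would replace, here is the kind of JS you have to write today. The function names are hypothetical, and they assume playbackRate linearly scales buffer consumption, which the spec does not state:

```javascript
// Sketch of the state tracking currently required in JS. startTime is
// the context time recorded when start() was called; duration is the
// buffer's duration in seconds.
function playbackOffset(startTime, currentTime, playbackRate, duration, loop) {
  var elapsed = (currentTime - startTime) * playbackRate;
  if (loop) return elapsed % duration; // wraps around on each loop
  return Math.min(elapsed, duration);  // clamps once playback ends
}
function isPlaying(startTime, currentTime, playbackRate, duration, loop) {
  if (currentTime < startTime) return false; // not started yet
  if (loop) return true;                     // loops never end on their own
  return (currentTime - startTime) * playbackRate < duration;
}
```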

If it is helpful, you can see my current Web Audio API backend
implementation here:
Some of this feedback is based on older versions of the backend or feedback
from users, though.


Received on Wednesday, 1 May 2013 19:44:45 UTC