W3C home > Mailing lists > Public > public-audio@w3.org > April to June 2012

Re: Reviewing the Web Audio API (from webrtc)

From: Chris Rogers <crogers@google.com>
Date: Fri, 20 Apr 2012 15:26:30 -0700
Message-ID: <CA+EzO0kn7D9NZQDYZ+u1GaMoE6MD4pyf+rrn2Jh0f_VzUa_pPw@mail.gmail.com>
To: robert@ocallahan.org
Cc: public-audio@w3.org
On Thu, Apr 19, 2012 at 4:39 AM, Robert O'Callahan <robert@ocallahan.org> wrote:

> On Thu, Apr 19, 2012 at 10:36 AM, Robert O'Callahan <robert@ocallahan.org> wrote:
>> On Wed, Apr 18, 2012 at 12:23 PM, Randell Jesup <randell-ietf@jesup.org> wrote:
>>> So it sounds like to modify audio in a MediaStream you'll need to:
>>> * Extract each track from a MediaStream
>>> * Turn each track into a source (might be combined with previous step)
>>> * Attach each source to a graph
>>> * Extract tracks from the destination of the graphs
>>> * Extract the video stream(s) from the MediaStream source
>>> * Combine all the tracks back into a new MediaStream
>> And one of the downsides of doing it this way is that you lose sync
>> between the audio and video streams. Usually not by much, but more for
>> certain kinds of processing. Given there's a way to not lose sync at all,
>> why not use it? Sorry to harp on this :-).
> Offline, Chris and I discussed some ways in which Web Audio could
> propagate latency information to solve this problem. It's probably better
> for him to explain his ideas since I'm not completely sure how they would
> work.
> It gets more complicated if we have the ability to pause media streams
> (which I think we will, e.g. for live broadcasts that aren't interactive,
> it makes sense to pause instead of just dropping data). Since Web Audio
> can't pause, the paused time has to be accounted for and essentially a time
> slice of the Web Audio output corresponding to the pause interval has to be
> clipped out. And that's going to be annoying if your filter is something
> like an echo ... some echo would be lost, I think.
> Maybe all these issues can be solved, or deemed not worth addressing, but
> supporting pause seems appealing to me.

Hi Rob, it was good talking with you the other day.  It was nice to have a
technical discussion.

In terms of pausing, we already have this notion with the HTMLMediaElement
and MediaController APIs.  They both have a pause() method.
There's more than one way to think of the concept of "pause".  In other
words, it's not just one specific behavior.  For example, it might be
desirable in some cases for a reverb tail to continue playing after someone
pauses a particular <audio> element.  Other times, that might not be the
desired behavior.  I talked about this in more detail in this thread:

So it's probably worth writing up very specific, distinct pause scenarios,
each with its desired behavior clearly stated.  Then we can talk about how
we might approach each one.

Latency/Synchronization is a very complex topic.  I'll try my best to bring
out some ideas about this:

Some audio processing algorithms incur latency (delay) such that the
processed audio stream can become misaligned with other audio streams or
video streams.  Depending on the particular case, there are different ways
of handling this misalignment.  In some cases it's best to *not* apply
*any* kind of compensation: for example, when processing a live real-time
audio stream (a person playing guitar through processing effects), it would
not be desirable to add a compensating delay when mixing and sharing
effects with other audio sources which have latency.  Similarly, it may not
be desirable to compensate for latency with sound effects triggered by game
play.  In your example of video synchronization, though, compensation would
be much more desirable, especially if the latency were large (as in your
example).

So, we're faced with:
1. How can the system determine how much latency is caused by a particular
processing node, or a chain of processing nodes between a particular source
and destination?
2. What strategies are available for latency compensation (thus restoring
relative synchronization)?
3. What scenarios face potential latency and sync problems, and in
what scenarios do we want to apply a given compensation strategy from (2),
or apply no compensation whatsoever?
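To make (1) concrete: if each node reported its latency in seconds (as AudioUnit plugins do), the total latency of a simple non-branching chain is just the sum along the chain.  A minimal sketch; the .latency field here is an assumption, not part of the 2012 spec:

```javascript
// Hypothetical sketch: sum the reported per-node latencies along a
// simple (non-branching) chain between a source and a destination.
// The .latency field is assumed here; it is not an existing attribute.
function totalLatency(chain) {
  return chain.reduce(function (sum, node) {
    return sum + (node.latency || 0); // nodes without latency contribute 0
  }, 0);
}

// Example: a gain node (no latency), a look-ahead compressor (30 ms),
// and a convolver (no latency) in series.
var chain = [{}, { latency: 0.03 }, {}];
totalLatency(chain); // → 0.03 seconds
```

Branching graphs would instead take the maximum over the parallel paths, but the per-node report is the primitive everything else builds on.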

I'll start by explaining how this is usually accomplished in pro-audio
applications.  Although my example uses Apple's Logic Audio and
AudioUnit plugins, this technique is widely used by other pro-audio
applications on different platforms using different plugin formats.  As a
side note, the Web Audio API analog of an AudioUnit plugin is an AudioNode.

First of all, an individual processing node (an AudioUnit plugin in this
case) reports its latency via a property.

Then a host (a digital audio workstation application) loads this plugin and
may query its latency and make use of this information.  For example, Logic
Audio has a detailed tech page about this and the strategies available for
dealing with latency/synchronization:

So, how does this apply in the Web Audio API?  Jer Noble has added some
smarts in our implementation to determine the latency for each AudioNode
using a virtual method called "latencyTime()":

One thing to note is that we have not *yet* implemented full support for
this using JavaScriptAudioNode.  I anticipate it would be useful for a
developer implementing custom JS code using a JavaScriptAudioNode to have a
way to report latency by setting a .latency attribute.  So with your
ducking example the developer would set this value to 1 (for 1 second of
latency).
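As a sketch of what that could look like (createJavaScriptNode() was the WebKit method name at the time; the .latency attribute is the proposal under discussion, not an existing API):

```javascript
// Hypothetical sketch: a custom ducker built on JavaScriptAudioNode
// reports its 1 second of look-ahead via a proposed .latency attribute.
// The attribute is the suggestion discussed here, not part of the spec.
function makeDuckingNode(context) {
  var node = context.createJavaScriptNode(4096, 1, 1); // bufferSize, inputs, outputs
  node.latency = 1.0; // seconds of delay introduced by the look-ahead ducking
  node.onaudioprocess = function (e) {
    // custom ducking DSP would process e.inputBuffer into e.outputBuffer here
  };
  return node;
}
```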

So based on this information it's possible to calculate latency from any
point to another, so that a compensation strategy (if any) may be chosen.

First I'll consider your ducking example as case (a):
a.  For synchronizing video, the strategy would be to compensate the video
frame presentation by an equivalent delay.  This technique can be applied
automatically by the implementation since it knows the exact amount of
latency (from info about each node's latency).

The Logic Audio tech page goes into some detail about two particular
strategies for aligning audio streams, and it's worth reading:

We can summarize these two techniques as (b) and (c) below.
b. Scheduling compensation: For sounds having a latency, compensate by
scheduling sound events to happen earlier than normal by an equivalent
amount of the total latency.  Because the sounds are triggered earlier than
normal, once they are processed through effects with latency, they will
sound at the correct time.  This technique is not available to "live"
sounds such as playing guitar live, receiving live streams from WebRTC, or
playing a MIDI keyboard live.
c. Delay compensation: For sounds having less latency than other sounds,
insert a delay node with equivalent delay to make up the difference.
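The arithmetic behind technique (c) can be sketched with a hypothetical helper: each parallel path is padded out to match the slowest one.

```javascript
// Hypothetical sketch: given the latency of each parallel path, compute
// the delay to insert on each path so all paths align with the slowest.
function compensationDelays(pathLatencies) {
  var max = Math.max.apply(null, pathLatencies);
  return pathLatencies.map(function (latency) {
    return max - latency; // the slowest path gets a zero-length delay
  });
}

// Paths with 30 ms, 0 ms, and 10 ms of processing latency:
compensationDelays([0.03, 0, 0.01]); // → roughly [0, 0.03, 0.02] seconds
// Each nonzero entry would then become a delay node (createDelayNode()
// in 2012 WebKit) with delayTime.value set accordingly.
```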

Now applying these strategies to some concrete scenarios:

a. Audio from a <video> element is processed with latency (as in your
ducking example).  In this case the strategy of delaying video frame
presentation would be used, and could be accomplished automatically by the
implementation because it has all necessary information available to apply
this strategy.
b. Several synthesizer instruments (Drums, Bass, Piano) are playing notes
via a sequencer (pre-determined sequence of notes).  One or more of the
instruments have in-line effects with latency.  This is one of the exact
cases in the Logic Audio tech page.  The compensation strategy is to offset
scheduled times to play earlier than normal, thus notes which will play
with synthesizers having 30ms delay can be scheduled exactly 30ms early.
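Technique (b) is then just arithmetic on the schedule: trigger each note earlier by the chain's latency, clamped so nothing is scheduled into the past.  The helper name below is mine; in the 2012 API the computed time would be passed to noteOn():

```javascript
// Hypothetical sketch of scheduling compensation: shift each scheduled
// event earlier by the known chain latency so it is *heard* on time.
function compensatedStartTime(intendedTime, chainLatency, currentTime) {
  // Never schedule before "now": live/immediate events cannot be moved
  // into the past, which is why this technique only works for
  // pre-determined sequences.
  return Math.max(currentTime, intendedTime - chainLatency);
}

// A note meant to sound at t = 1.0 s through a 30 ms latency chain
// (current time 0) is scheduled roughly 30 ms early, near t = 0.97 s.
compensatedStartTime(1.0, 0.03, 0);
```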
c. Consider the (b) case above, but additionally the user is playing a MIDI
keyboard along with the sequenced music, triggering synthesizer notes
(Space Synth) using effects which have no latency.  In this case, it would
be highly undesirable to use *any* latency compensation for "Space
Synth", because the user wishes to hear the notes played on the MIDI
keyboard immediately, with no delay, and will be playing in time with the
other synthesizers, which are already relatively synchronized with each
other.
d. A music track is played and processed by an effect which has latency.
 But we wish to hear the effected "wet" sound mixed with the original "dry"
signal to achieve an appropriate dry/wet blend.  The strategy used is to
insert a delay on the dry (unprocessed/original) signal having latency
equivalent to the effect's; then the two may be mixed and blended in
alignment.
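A wiring sketch of scenario (d), using the 2012 WebKit method name createDelayNode() (later renamed createDelay()); the effect node's .latency field is an assumption as above:

```javascript
// Hypothetical sketch: split a source into a wet path (effect with
// latency) and a dry path delayed by the same amount, then mix both
// into the destination so they stay time-aligned.
function wireDryWet(context, source, effect, destination) {
  var dryDelay = context.createDelayNode();
  dryDelay.delayTime.value = effect.latency || 0; // match the wet path's latency

  source.connect(effect);        // wet path: source -> effect -> destination
  effect.connect(destination);
  source.connect(dryDelay);      // dry path: source -> delay -> destination
  dryDelay.connect(destination);
  return dryDelay;
}
```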



> Rob
> --
> “You have heard that it was said, ‘Love your neighbor and hate your
> enemy.’ But I tell you, love your enemies and pray for those who persecute
> you, that you may be children of your Father in heaven. ... If you love
> those who love you, what reward will you get? Are not even the tax
> collectors doing that? And if you greet only your own people, what are you
> doing more than others?" [Matthew 5:43-47]
Received on Friday, 20 April 2012 22:26:59 UTC
