Re: Reviewing the Web Audio API (from webrtc) from Chris Rogers on 2012-04-18 (public-audio@w3.org from April to June 2012)

From: Chris Rogers <crogers@google.com>
Date: Tue, 17 Apr 2012 18:51:00 -0700
To: Randell Jesup <randell-ietf@jesup.org>
Cc: public-audio@w3.org
Message-ID: <CA+EzO0kZeX14kDMkS1F5KdVBArFnQceuL-hHUvKqd2oFD4ZLWg@mail.gmail.com>
On Tue, Apr 17, 2012 at 5:23 PM, Randell Jesup <randell-ietf@jesup.org> wrote:

>
>>> I'm not sure I understand the question.  The
>>> MediaElementAudioSourceNode
>>> is used to gain access to an <audio> or <video> element for streaming
>>> file content and not to gain access to a MediaStream.  For WebRTC
>>> purposes I believe something like createMediaStreamSource()
>>> and createMediaStreamDestination() (or their track-based versions) will
>>> be necessary.
>>>
>>
>
>>    * perhaps createMediaStreamSource / Destination should work on track
>>    level instead (as you seem to indicate as well); a MediaStream is
>>    really just a collection of tracks, and those can be audio or video
>>    tracks. If you work on track level you can do processing that
>>    results in an audio track and combine that with a video track into a
>>    MediaStream
>>
>>
>> Yes, I think that, based on previous discussions we've had, we'll
>> need more track-based versions of createMediaStreamSource / Destination.
>>  Although perhaps we could have both.  For simple uses,
>> createMediaStreamSource() would grab the first audio track from the
>> stream and use that by default.  How does that sound, given that a
>> MediaStream would often contain only a single audio track?
>>
>
>>  That sounds reasonable. I think in many cases there will only be a
>> single audio track.
>>
>>
> So it sounds like to modify audio in a MediaStream you'll need to:
>
> * Extract each track from a MediaStream
> * Turn each track into a source (might be combined with previous step)
> * Attach each source to a graph
> * Extract tracks from the destination of the graphs
> * Extract the video stream(s) from the MediaStream source
> * Combine all the tracks back into a new MediaStream
>
> This is a lot of decomposition and recomposition, and a bunch of code to
> add in almost every instance where we're doing anything more complex than
> a volume change to a MediaStream.
>

It sounds like a few lines of JavaScript even for this case of multiple
audio tracks per stream.  And I would expect it to be desirable to split
out the separate tracks anyway for individual processing.  But is it
usually the case that a MediaStream will contain multiple audio tracks?  It
was my understanding that in many cases there would be a single audio track
per stream.  Surely this will be the usual case with local input from
getUserMedia().  And in a tele-conference scenario wouldn't there be
multiple MediaStreams coming from different peers, each usually having a
single audio track?  Perhaps I misunderstand the most common cases here.
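
As a rough sketch of the single-audio-track case, the steps listed above
might look like the following.  It assumes createMediaStreamSource() and
createMediaStreamDestination() behave as proposed in this thread, and that
the MediaStream API offers getAudioTracks(), getVideoTracks(), and a
constructor that accepts a list of tracks; those last three are assumptions
here, not something this thread settles.  processAudioInStream and
buildGraph are hypothetical names used only for illustration.

```javascript
// Sketch only: processAudioInStream and buildGraph are hypothetical names.
// MediaStreamCtor is passed in so the function can be tried outside a browser;
// in a browser it would be the MediaStream constructor (assumed to take tracks).
function processAudioInStream(audioContext, stream, buildGraph, MediaStreamCtor) {
  // Turn the stream's audio into a graph source (assumed to use the
  // first audio track by default, as suggested in the thread).
  const source = audioContext.createMediaStreamSource(stream);

  // Node whose .stream carries the processed audio back out of the graph.
  const destination = audioContext.createMediaStreamDestination();

  // Caller wires source -> (any processing nodes) -> destination.
  buildGraph(audioContext, source, destination);

  // Recombine the processed audio with the untouched video tracks.
  const tracks = [
    ...destination.stream.getAudioTracks(),
    ...stream.getVideoTracks(),
  ];
  return new MediaStreamCtor(tracks);
}

// Possible browser usage (not executed here):
//   const processed = processAudioInStream(ctx, localStream,
//     (ctx, src, dst) => { src.connect(dst); }, MediaStream);
```

A stream with several audio tracks would need one source per track, which
is where the track-based variants discussed above would come in.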



>
> On a separate note, while not directly applicable to Audio, I'll toss my
> personal opinion in that we want a unified framework to process media in
> (audio or video).  We've already seen lots of people modifying the video
> from WebRTC and from getUserMedia() (from silly antlers to Instagram-like
> effects, etc), and we know they'll want to do more (face tracking, visual
> ID, QR code recognizers, etc), and running everything through a <canvas> is
> not a great solution (laggy, low performance, stalls the main thread, etc).
>

I'm sure we can make performance improvements to our graphics/video
presentation APIs and implementation, but this need not be shoehorned
together into our audio processing architecture which has its own unique
set of stringent real-time constraints for games and interactive
applications.  Well-designed and well-factored APIs can be used together in
powerful ways without creating a monolithic architecture that overly
generalizes concepts unique to specific media types.

Although we're still in the very early days of demos for WebRTC, here's a
really interesting one illustrating how these APIs can be combined:
http://www.soundstep.com/blog/2012/03/22/javascript-motion-detection/?utm_source=rss&utm_medium=rss&utm_campaign=javascript-motion-detection



>
> My thought is that
> a) we should have an easier way to process data sourced from or going to a
> MediaStream
> b) we need a framework we can cleanly apply to processing video
> c) Main-thread JS is of very limited utility in practice because of
> GC/CC/UI/pageloads/etc, but the ability to process audio in JS gives us a
> huge escape valve for functionality that isn't built in.
>
> I noted in the archives Chris indicated that adding support for JS Workers
> was in the works (Feb 1):
>
>  Jussi Kalliokoski has asked about adding web workers to the
> JavaScriptAudioNode on this list a little while back.  We also discussed
> this at the W3C face-to-face meeting very recently and agreed that this
> should be added to the JavaScriptAudioNode spec.  It will amount to a very
> small API change, so I'll try to update the specification document soon.  I
> want to make clear that simply moving JavaScript to a worker thread doesn't
> solve every problem.  Garbage collection stalls are still very much an
> issue, and these are quite irksome to deal with in a real-time system,
> where we would like to achieve low latency without glitches or stutters.
>
>  Has there been any progress on this?
>

I believe there have been some API discussions on this list about this, but
no resolution yet.  As I mentioned before I thought that Robert's general
approach seemed reasonable.



> I should note that an audio (or video) processing worker would typically
> throw no garbage (and so avoid GC), and even if there is garbage, there
> would be almost no live roots and GC/CC would be very fast.
>

I'm sure this would vary greatly depending on the particular JS code
running in the worker and the particular JS engine implementation.
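
For illustration, a processing routine written to "throw no garbage" might
look like the sketch below: all buffers are allocated once up front and the
per-block function only reads and writes typed arrays.  The names
(createOnePoleLowpass, the 128-sample block size, the 0.5 coefficient) are
hypothetical, and the one-pole low-pass is just a stand-in effect.  Whether
an engine still allocates internally for code like this is
implementation-dependent.

```javascript
// Sketch only: all names and values here are hypothetical.
function createOnePoleLowpass(coefficient) {
  // Filter state is a plain number held in the closure.
  let previous = 0;

  // Filters `input` into `output` (Float32Arrays of equal length).
  // The loop creates no objects or arrays of its own.
  return function process(input, output) {
    for (let i = 0; i < input.length; i++) {
      previous += coefficient * (input[i] - previous);
      output[i] = previous;
    }
  };
}

// Buffers are allocated once and reused for every block.
const input = new Float32Array(128);
const output = new Float32Array(128);
const lowpass = createOnePoleLowpass(0.5);

// Per block: fill `input` with samples, then call lowpass(input, output).
```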


> Audio processing in JS on the main thread is virtually a non-starter due
> to lag/jerk/etc.
>

You're right that the problems on the main thread are worse, but
nevertheless some people have expressed the desire to be able to process on
the main thread.  It's much simpler to deal with in terms of sharing JS
context/variables, and running JS code in a worker brings in its own set of
complications.  I think both could be useful, depending on the application.


>
> Chris also wrote in that message:
>
> > Chris, in the Web Audio API, you have some kind of predefined effects and
> > also a way to define custom processing in JavaScript (this could also be
> > done at a low level with C implementations; might there be a way to load
> > such a 'C audio plugin' in the browser?).
>
> It would be great to be able to load custom C/C++ plugins (like VST or
> AudioUnits), where a single AudioNode corresponds to a loaded code module.
>  But there are very serious security implications with this idea, so
> unfortunately it's not so simple (using either my or Robert's approach).
>
>  In either approach it might be possible to load an emscripten-compiled
> C/C++ filter; the performance would likely be no better than a well
> hand-coded native JS filter (circa 1/3 raw C/C++ speed, YMMV), but there
> are plenty of existing C filters available.  Also, emscripten doesn't
> produce garbage when running, which is good.
>

People can certainly try that approach, and we should do nothing to stop
them, but it can hardly be called user-friendly.  I think you might be
underestimating the complexity of defining a "plugin" format similar to VST
or AudioUnits and wrapping it up in emscripten-compiled C/C++.  Debugging
could also prove to be a nightmare.  It certainly should not be the
starting point for how we expect people to process and synthesize
sophisticated audio effects on the web.

Chris
Received on Wednesday, 18 April 2012 01:51:34 UTC