Re: Reviewing the Web Audio API (from webrtc) from Randell Jesup on 2012-04-18 (public-audio@w3.org from April to June 2012)

From: Randell Jesup <randell-ietf@jesup.org>
Date: Tue, 17 Apr 2012 23:50:42 -0400
To: public-audio@w3.org
Message-ID: <4F8E3A12.1030704@jesup.org>

On 4/17/2012 9:51 PM, Chris Rogers wrote:
>
> On Tue, Apr 17, 2012 at 5:23 PM, Randell Jesup <randell-ietf@jesup.org> wrote:
>
>
>     So it sounds like to modify audio in a MediaStream you'll need to:
>
>     * Extract each track from a MediaStream
>     * Turn each track into a source (might be combined with previous step)
>     * Attach each source to a graph
>     * Extract tracks from the destination of the graphs
>     * Extract the video stream(s) from the MediaStream source
>     * Combine all the tracks back into a new MediaStream
>
>     This is a lot of decomposition and recomposition, and a bunch of
>     code to add in almost every instance where we're doing anything
>     more complex than volume to a MediaStream.
>
>
> It sounds like a few lines of JavaScript even for this case of 
> multiple audio tracks per stream.  And I would expect it to be 
> desirable to split out the separate tracks anyway for individual 
> processing.

I doubt you'd apply different filters to the different tracks very often.

> But is it usually the case that a MediaStream will contain multiple 
> audio tracks?  It was my understanding that in many cases there would 
> be a single audio track per stream.  Surely this will be the usual 
> case with local input from getUserMedia().  And in a tele-conference 
> scenario wouldn't there be multiple MediaStreams coming from different 
> peers, each usually having a single audio track?  Perhaps I 
> misunderstand the most common cases here.

No, you have it right: a single audio track per stream is the common 
case.  But you'll need to write the code for the general case anyway, or 
it will break whenever a stream has multiple tracks, which would make 
multiple tracks an effectively unusable feature.
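
For illustration, the general-case code discussed above might look 
roughly like the sketch below.  It is only a sketch under assumptions: 
the method names (createMediaStreamSource, createMediaStreamDestination, 
getAudioTracks, getVideoTracks) are taken from the present-day Web Audio 
and MediaStream APIs, which postdate this message, and the *Like 
interfaces plus processEveryAudioTrack, makeStream and buildGraph are 
hypothetical stand-ins so the example is self-contained.

```typescript
// Structural stand-ins for the browser types (assumptions, not the real DOM types).
interface TrackLike { readonly kind: string; }
interface StreamLike {
  getAudioTracks(): TrackLike[];
  getVideoTracks(): TrackLike[];
}
interface NodeLike { connect(destination: NodeLike): unknown; }
interface DestinationLike extends NodeLike { readonly stream: StreamLike; }
interface ContextLike {
  createMediaStreamSource(stream: StreamLike): NodeLike;
  createMediaStreamDestination(): DestinationLike;
}

// Hypothetical helper: run every audio track of `input` through its own copy
// of a processing graph, then rebuild a stream with the untouched video tracks.
function processEveryAudioTrack(
  input: StreamLike,
  ctx: ContextLike,
  makeStream: (tracks: TrackLike[]) => StreamLike, // builds a stream from tracks
  buildGraph: (source: NodeLike) => NodeLike,      // returns the last node of the graph
): StreamLike {
  const processed: TrackLike[] = [];
  for (const track of input.getAudioTracks()) {
    // Wrap the single track in its own stream so it can become a source.
    const source = ctx.createMediaStreamSource(makeStream([track]));
    const destination = ctx.createMediaStreamDestination();
    buildGraph(source).connect(destination);
    // Pull the processed track(s) back out of the graph's destination.
    processed.push(...destination.stream.getAudioTracks());
  }
  // Recombine the processed audio with the original video tracks.
  return makeStream([...processed, ...input.getVideoTracks()]);
}
```

In a browser, ctx would presumably be an AudioContext and makeStream a 
thin wrapper around the MediaStream constructor.  Even in this compact 
form, every step from the list above shows up, which is the overhead 
being objected to.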

>
>     On a separate note, while not directly applicable to Audio, I'll
>     toss my personal opinion in that we want a unified framework to
>     process media in (audio or video).  We've already seen lots of
>     people modifying the video from WebRTC and from getUserMedia()
>     (from silly antlers to instagram-like effects, etc), and we know
>     they'll want to do more (face tracking, visual ID, QR code
>     recognizers, etc), and running everything through a <canvas> is
>     not a great solution (laggy, low performance, stalls main-thread,
>     etc).
>
>
> I'm sure we can make performance improvements to our graphics/video 
> presentation APIs and implementation, but this need not be shoehorned 
> together into our audio processing architecture which has its own 
> unique set of stringent real-time constraints for games and 
> interactive applications.  Well designed and well-factored APIs can be 
> used together in powerful ways without creating monolithic 
> architecture which can overly generalize concepts unique to specific 
> media types.

Sure, but I have to say a unified framework seems like a very powerful 
and logically consistent approach.  And honestly we need an API for 
processing video Real Soon Now, and I see no other pathway to getting one.

>
> Although we're still in the very early days of demos for WebRTC, 
> here's a really interesting one illustrating how these APIs can be 
> combined in a very interesting way:
> http://www.soundstep.com/blog/2012/03/22/javascript-motion-detection/?utm_source=rss&utm_medium=rss&utm_campaign=javascript-motion-detection

This is very reminiscent of Amiga Live!, a program demoed at the launch 
of the Amiga in 1985 (with Andy Warhol and Debbie Harry, IIRC) that 
leveraged a genlock/digitizer to let you interact with elements on the 
screen (bells, xylophone, drums, etc).

>     I should note that an audio (or video) processing worker would
>     typically throw no garbage (and so avoid GC), and even if there is
>     garbage, there would be almost no live roots and GC/CC would be
>     very fast.
>
>
> I'm sure this would vary greatly depending on the particular JS code 
> running in the worker and the particular JS engine implementation.

In general, yes, but if the code throws no garbage, there should be no GC.
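
To make "throws no garbage" concrete, here is a minimal sketch of a 
per-block processing routine that allocates nothing after setup.  The 
names (LowPass, makeLowPass, processBlock) are hypothetical and not part 
of any proposed API, and the one-pole low-pass filter is just a 
convenient example; whether a given engine allocates internally is 
still implementation-dependent.

```typescript
// Hypothetical one-pole low-pass filter; all state is allocated once, up front.
interface LowPass {
  readonly coeff: number; // smoothing coefficient in (0, 1]
  last: number;           // previous output sample, carried across blocks
}

function makeLowPass(coeff: number): LowPass {
  return { coeff, last: 0 };
}

// Filters `input` into `output` (same length; may be the same array).
// The loop only reads and writes numbers in existing typed arrays, so the
// code itself creates no new objects for the collector to trace.
function processBlock(f: LowPass, input: Float32Array, output: Float32Array): void {
  let y = f.last;
  const a = f.coeff;
  for (let i = 0; i < input.length; i++) {
    y += a * (input[i] - y);
    output[i] = y;
  }
  f.last = y;
}
```

The caller would create the filter and the Float32Array buffers once 
and reuse them for every block, so the steady-state path does no 
allocation.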

>     Audio processing in JS on the main thread is virtually a
>     non-starter due to lag/jerk/etc.
>
>
> You're right that the problems on the main thread are worse, but 
> nevertheless some people have expressed the desire to be able to 
> process on the main thread.  It's much simpler to deal with in terms 
> of sharing JS context/variables, and running JS code in a worker 
> brings in its own set of complications.  I think both could be useful, 
> depending on the application.
>
>
>     Chris also wrote in that message:
>
>         >  Chris, in the Audio Web API, you have some kind of predefined effects and
>         >  also a way to define custom processings in Javascript (this could also be
>         >  done at low level with C implementations, and may be a way to load this 'C
>         >  audio plugin' in browser ?).
>
>         It would be great to be able to load custom C/C++ plugins (like VST or
>         AudioUnits), where a single AudioNode corresponds to a loaded code module.
>           But there are very serious security implications with this idea, so
>         unfortunately it's not so simple (using either my or Robert's approach).
>
>     In either it might be possible to load an emscripten-compiled
>     C/C++ filter; the performance likely would be no better than a
>     well-hand-coded native JS filter (circa 1/3 raw C/C++ speed, YMMV)
>     - but there are plenty of existing C filters available.  Also,
>     emscripten doesn't produce garbage when running, which is good.
>
>
> People can certainly try that approach, and we should do nothing to 
> stop them, but it can hardly be called user-friendly.  I think you 
> might be underestimating the complexity of defining a "plugin" format 
> similar to VST or AudioUnits and wrapping it up in emscripten-compiled 
> C/C++.  Debugging could also prove to be a nightmare.  It certainly 
> should not be the starting point for how we expect people to process 
> and synthesize sophisticated audio effects on the web.

I wasn't advocating that anyone do this, but it is a way to do it, and 
to do it without the same sort of security concerns.


-- 
Randell Jesup
randell-ietf@jesup.org
Received on Wednesday, 18 April 2012 03:51:39 UTC