Re: Reviewing the Web Audio API (from webrtc) from Randell Jesup on 2012-04-18 (public-audio@w3.org from April to June 2012)

From: Randell Jesup <randell-ietf@jesup.org>
Date: Tue, 17 Apr 2012 20:23:13 -0400
To: public-audio@w3.org
Message-ID: <4F8E0971.6080809@jesup.org>
>         I'm not sure I understand the question.  The
>         MediaElementAudioSourceNode is used to gain access to an <audio>
>         or <video> element for streaming file content and not to gain
>         access to a MediaStream.  For WebRTC purposes I believe something
>         like createMediaStreamSource() and createMediaStreamDestination()
>         (or their track-based versions) will be necessary.
>

>
>        * perhaps createMediaStreamSource / Destination should work on
>        track level instead (as you seem to indicate as well); a
>        MediaStream is really just a collection of tracks, and those can
>        be audio or video tracks. If you work on track level you can do
>        processing that results in an audio track and combine that with
>        a video track into a MediaStream
>
>
>     Yes, I think that based on previous discussions we've had that we'll
>     need more track-based versions of createMediaStreamSource /
>     Destination.  Although perhaps we could have both.  For a simple
>     use, if createMediaStreamSource() were used, then it would grab the
>     first audio track from the stream and use that by default.  How
>     does that sound?  Because often a MediaStream would contain only a
>     single audio track?
>
>
>     That sounds reasonable. I think in many cases there will only be a
>     single audio track.
>

So it sounds like to modify audio in a MediaStream you'll need to:

* Extract each track from a MediaStream
* Turn each track into a source (might be combined with previous step)
* Attach each source to a graph
* Extract tracks from the destination of the graphs
* Extract the video stream(s) from the MediaStream source
* Combine all the tracks back into a new MediaStream

This is a lot of decomposition and recomposition, and a bunch of code to
add in almost every case where we're doing anything more complex than a
volume change on a MediaStream.
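
To make the amount of plumbing concrete, here is a rough sketch of the
steps listed above. The Track and Stream shapes, the AudioGraph type, and
processStream() are all hypothetical stand-ins I made up for illustration;
they are not the MediaStream or Web Audio interfaces, and the graph is
reduced to a function from one audio track to one processed audio track.

```typescript
// Hypothetical minimal shapes standing in for MediaStream tracks.
// Illustration only: these are NOT the real browser interfaces.
interface Track {
  kind: "audio" | "video";
  id: string;
}

interface Stream {
  tracks: Track[];
}

// Hypothetical per-track processing graph: one audio track in,
// one processed audio track out of the graph's destination.
type AudioGraph = (input: Track) => Track;

function processStream(source: Stream, graph: AudioGraph): Stream {
  // Extract each audio track, turn it into a source, attach it to a
  // graph, and take the track that comes out of the destination.
  const processedAudio = source.tracks
    .filter((t) => t.kind === "audio")
    .map((t) => graph(t));

  // Extract the video track(s) from the original stream untouched.
  const video = source.tracks.filter((t) => t.kind === "video");

  // Combine all the tracks back into a new stream.
  return { tracks: [...processedAudio, ...video] };
}

// Usage: one microphone track and one camera track.
const input: Stream = {
  tracks: [
    { kind: "audio", id: "mic" },
    { kind: "video", id: "cam" },
  ],
};
const output = processStream(input, (t) => ({
  kind: "audio",
  id: t.id + "-processed",
}));
```

Even in this stripped-down form the caller has to split the stream, route
each audio track, and reassemble the result by hand, which is the
boilerplate I'd like typical uses to avoid.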

On a separate note, while not directly applicable to Audio, I'll toss in
my personal opinion: we want a unified framework for processing media
(audio or video).  We've already seen lots of people modifying the video
from WebRTC and from getUserMedia() (from silly antlers to
Instagram-like effects, etc), and we know they'll want to do more (face
tracking, visual ID, QR code recognizers, etc), and running everything
through a <canvas> is not a great solution (laggy, low performance,
stalls the main thread, etc).

My thoughts are:
a) we should have an easier way to process data sourced from or going to
a MediaStream
b) we need a framework we can cleanly apply to processing video
c) main-thread JS is of very limited utility in practice because of
GC/CC/UI/pageloads/etc, but the ability to process audio in JS gives us
a huge escape valve for functionality that isn't built in.

I noted in the archives that Chris indicated (Feb 1) that support for JS
Workers was in the works:

    Jussi Kalliokoski has asked about adding web workers to the
    JavaScriptAudioNode on this list a little while back.  We also discussed
    this at the W3C face-to-face meeting very recently and agreed that this
    should be added to the JavaScriptAudioNode spec.  It will amount to a very
    small API change, so I'll try to update the specification document soon.  I
    want to make clear that simply moving JavaScript to a worker thread doesn't
    solve every problem.  Garbage collection stalls are still very much an
    issue, and these are quite irksome to deal with in a real-time system,
    where we would like to achieve low latency without glitches or stutters.

Has there been any progress on this?  I should note that an audio (or
video) processing worker would typically produce no garbage (and so
avoid GC), and even if there is garbage, there would be almost no live
roots and GC/CC would be very fast.  Audio processing in JS on the main
thread is virtually a non-starter due to lag/jerk/etc.
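
As a rough illustration of what I mean by a processing routine that
produces no garbage, here is a minimal sketch of a one-pole low-pass
filter working on preallocated Float32Array blocks. The names
(OnePoleState, createLowPass, processBlock) are hypothetical and are not
taken from either spec; how a worker would actually be handed its buffers
is left out.

```typescript
// Filter state is allocated once, up front, and then reused.
interface OnePoleState {
  alpha: number; // smoothing coefficient in (0, 1]
  last: number;  // previous output sample
}

function createLowPass(alpha: number): OnePoleState {
  return { alpha, last: 0 };
}

// Filters `input` into `output` sample by sample. The loop creates no
// objects, arrays, or closures, so steady-state calls allocate nothing
// for the collector to clean up.
function processBlock(
  state: OnePoleState,
  input: Float32Array,
  output: Float32Array
): void {
  const a = state.alpha;
  let y = state.last;
  for (let i = 0; i < input.length; i++) {
    y += a * (input[i] - y); // y[n] = y[n-1] + alpha * (x[n] - y[n-1])
    output[i] = y;
  }
  state.last = y; // carry the filter state over to the next block
}

// Usage: buffers are created once and reused for every block.
const lowPass = createLowPass(0.5);
const inBlock = new Float32Array(128);
const outBlock = new Float32Array(128);
processBlock(lowPass, inBlock, outBlock);
```

The only allocations happen at setup time; each per-block call touches
just the two typed arrays and a couple of numbers.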

Chris also wrote in that message:

    >  Chris, in the Audio Web API, you have some kind of predefined effects and
    >  also a way to define custom processings in Javascript (this could also be
    >  done at low level with C implementations, and may be a way to load this 'C
    >  audio plugin' in browser ?).

    It would be great to be able to load custom C/C++ plugins (like VST or
    AudioUnits), where a single AudioNode corresponds to a loaded code module.
    But there are very serious security implications with this idea, so
    unfortunately it's not so simple (using either my or Robert's approach).

In either design it might be possible to load an Emscripten-compiled
C/C++ filter.  The performance likely would be no better than a
well-written, hand-coded JS filter (circa 1/3 raw C/C++ speed, YMMV),
but there are plenty of existing C filters available.  Also,
Emscripten-compiled code doesn't produce garbage when running, which is
good.

In my mind, many of the differences between the specs are resolvable.
Rob has said his design doesn't preclude predefining native processing
filters, and it sounds like Chris is open to JS Workers.  I believe we
need something that integrates better with MediaStreams and gives a
framework for video processing (which would speak to something closer to
MediaStream Processing for a source/destination API), and I think we
need to be able to easily leverage some pre-defined processing nodes
(from Chris' spec).  With a design like this, typical uses wouldn't need
any sample-by-sample JS processing, but whenever that is needed it can
run smoothly.

-- 
Randell Jesup
randell-ietf@jesup.org
Received on Wednesday, 18 April 2012 00:24:10 UTC