Re: DSP API proposal/draft from Jussi Kalliokoski on 2012-07-26 (public-audio@w3.org from July to September 2012)

From: Jussi Kalliokoski <jussi.kalliokoski@gmail.com>
Date: Fri, 27 Jul 2012 01:24:41 +0300
To: Marcus Geelnard <mage@opera.com>
Cc: public-audio@w3.org
Message-ID: <CAJhzemXi1uh7txj_PdKZCdaVKGxaeRnpZVg6ueTQYTsfjCFWRA@mail.gmail.com>
On Tue, Jul 24, 2012 at 12:00 PM, Marcus Geelnard <mage@opera.com> wrote:

> Well, I looked at both interleaved and non-interleaved, and I'm not really
> in strong favor of either of them. I'm currently looking for a strong
> argument for going either way.
>
> For instance, most C++ FFT libraries use interleaved data, which is only
> natural because the language supports array of structs / complex types
> natively, so for the FFT interface it might be better to use interleaved
> data (not really sure yet what the penalty for non-interleaved data is,
> though).
>

Real men use C FFT libraries. ;)


> On the other hand, from an API perspective, interleaved data easily
> becomes awkward. For instance, how would you differentiate between a
> complex and a real-valued array? With non-interleaved data it's quite clear:
> either you have an imaginary component (=complex) or you don't (=real).
>

You pass in a channel count / stride argument.
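
A minimal sketch of that idea, assuming a hypothetical `magnitudes` helper
with a `stride` parameter (neither is part of the draft): the same flat
array is read as real data with stride 1 and as interleaved complex data
with stride 2.

```javascript
// Hypothetical helper, not part of the draft API.
// stride = 1: data holds real values [x0, x1, ...]
// stride = 2: data holds interleaved complex values [re0, im0, re1, im1, ...]
function magnitudes(data, stride) {
  if (stride !== 1 && stride !== 2) {
    throw new RangeError('stride must be 1 (real) or 2 (complex)');
  }
  const count = Math.floor(data.length / stride);
  const out = new Float32Array(count);
  for (let i = 0; i < count; i++) {
    const re = data[i * stride];
    const im = stride === 2 ? data[i * stride + 1] : 0;
    out[i] = Math.hypot(re, im);
  }
  return out;
}

const samples = new Float32Array([3, 4, -6, 8]);
const asReal = magnitudes(samples, 1);    // four real values
const asComplex = magnitudes(samples, 2); // two complex values: 3+4i and -6+8i
```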


> One of the reasons for going with non-interleaved data was that the audio
> API uses non-interleaved data (for multi-channel buffers). Though a bit
> unrelated, it made me lean towards non-interleaved.


Yes, I don't really agree with that choice either. Maybe Chris can share
the rationale behind it when he's back.


> I suppose by imag multiply, you mean complex multiply ;)
>

Touché. :)


> True, complex multiplication is very useful, especially for FFT-based
> filtering (such as your example). Though, I suppose that your example could
> have been implemented using convolution instead?
>

It could be, but that doesn't really help: a convolution module gets its
advantage from storing the FFT of the impulse response, whereas here you
have two varying signals.
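
With two varying signals, both spectra have to be recomputed for every
block and then multiplied element by element. A rough sketch of that
multiply on interleaved data, assuming a hypothetical `complexMultiply`
helper (the name and signature are illustrative, not the draft's):

```javascript
// Hypothetical helper: element-wise complex multiply of two interleaved
// spectra laid out as [re0, im0, re1, im1, ...].
function complexMultiply(a, b) {
  if (a.length !== b.length || a.length % 2 !== 0) {
    throw new RangeError('inputs must be equal-length interleaved complex arrays');
  }
  const out = new Float32Array(a.length);
  for (let i = 0; i < a.length; i += 2) {
    const ar = a[i], ai = a[i + 1];
    const br = b[i], bi = b[i + 1];
    out[i] = ar * br - ai * bi;     // real part
    out[i + 1] = ar * bi + ai * br; // imaginary part
  }
  return out;
}

// (1 + 2i) * (3 + 4i) = -5 + 10i, and i * i = -1
const product = complexMultiply(
  new Float32Array([1, 2, 0, 1]),
  new Float32Array([3, 4, 0, 1])
);
```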


> I added complex multiplication and division to the draft.


Excellent!


> Here, I think that the easiest solution is the best solution ;)
>
> We can just use plain 1:1 mapping, and rely on WEBIDL/TypedArray format
> conversion rules. Any scaling (e.g. *32767 for going to Int16) can be done
> manually before interleaving or after deinterleaving.
>
> I added interleave() and deinterleave() to the draft, though I used a
> maximum of four components, since that's what most SIMD instruction sets
> support well anyway (and most data of interest has <= 4 components), and
> with the offset & stride arguments you can always pack/unpack more
> components in several passes.


Agreed. Yes, makes sense.
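
For illustration, a sketch of the offset & stride approach with manual
scaling. The helper names below are hypothetical and handle one component
per pass; the draft's actual interleave()/deinterleave() signatures may
differ.

```javascript
// Hypothetical helpers, not the draft's signatures.

// Copy every stride-th element of src, starting at offset, into a new array.
function deinterleaveOne(src, offset, stride) {
  const count = Math.ceil((src.length - offset) / stride);
  const dst = new Float32Array(Math.max(count, 0));
  for (let i = 0; i < dst.length; i++) {
    dst[i] = src[offset + i * stride];
  }
  return dst;
}

// Write component into dst at offset, offset + stride, offset + 2 * stride, ...
function interleaveOne(dst, component, offset, stride) {
  for (let i = 0; i < component.length; i++) {
    dst[offset + i * stride] = component[i];
  }
}

// Plain 1:1 conversion from Int16 stereo frames, then manual scaling
// after deinterleaving.
const pcm = new Int16Array([32767, -32767, 16384, 0]);
const frames = Float32Array.from(pcm); // value-preserving, no scaling
const left = deinterleaveOne(frames, 0, 2).map(v => v / 32767);
const right = deinterleaveOne(frames, 1, 2).map(v => v / 32767);

// Packing back takes one pass per component; more components just
// mean more passes with a larger stride.
const packed = new Float32Array(4);
interleaveOne(packed, left, 0, 2);
interleaveOne(packed, right, 1, 2);
```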


>
>
>>> Well, filtering, convolution, FFT and interpolation are all 1D. If we
>>> could agree on some common way to interpret 2D matrices in terms of typed
>>> arrays, we could add 2D versions of these methods (similar to the Matlab
>>> methods [1], [2] and [3], for instance).
>>>
>>>
>> There's some heavy precedent already: like I said, WebGL matrices and
>> Canvas ImageData are all interleaved and just in series, with the matrix
>> shape stored elsewhere, and I hardly think that it's going to change
>> any time soon. So my suggestion is expanding the signature of the
>> methods, for example FFT:
>>
>> FFT (uint size0, uint size1, uint size2, ... uint sizeX)
>>
>> This would already provide all that you need for running FFTs on the
>> existing data types on the web, for example for an imagedata:
>>
>> var fft = new FFT(imgdata.width, imgdata.height)
>> fft.forward(imgdata.data)
>>
>
> Yes, extending the FFT interface for doing 2D would be quite easy.
>
> IIR filtering, as you said, does not make much sense in 2D (well, it's
> probably useful, but you'd have to define a filtering direction, and it's
> not at all as commonly used as 2D FIR filters anyway). Also, the notion of
> continuous intra-buffer operation is not as commonly used for 2D signals
> (you typically have bounded images).
>

Agreed, defining an IIR filter interface that works well in 2D would be
hard. And probably not a priority.


> I think a new interface would be needed for 2D convolution. 3D (and
> higher?) convolution MIGHT be of interest, but then I think we're stepping
> into the realm of non-realtime applications, and you might actually get
> away with current JS performance.
>

At least 3D convolution is pretty neat stuff, and maybe not real-time yet,
but I think soon it might be. You could do interesting optimizations with
physics engines by recording 3D impulse responses. For example record what
happens to a brick wall when it's smashed with a huge hammer.


> For 2D interpolation, I think we have to come up with actual use-cases,
> because there are quite a few different ways of doing interpolation
> (uniform, non-uniform along x/y independently, non-uniform with x/y
> specified per output-element,...).


Hmm, true, it's a hard nut to crack.


>
>> Or a WebGL 3D matrix:
>>
>> var fft = new FFT(matrixWidth, matrixHeight, matrixDepth)
>> fft.forward(matrix)
>>
>
> I think that 3D FFT (or higher-order FFTs) is of quite limited use, but I
> could be wrong. For complicated data analysis, yes, for WebGL, hmmm... I
> have a hard time imagining someone wanting to do the FFT of a 3D texture,
> for instance, especially in real time. That's typically something you'd do
> off line I guess (if at all).
>

See above. ^^ My example might be far-fetched though.


> All in all, I'm starting to think: What real time operations could you do
> with 2D DSP functions that you can't do with WebGL/GLSL? Filtering and
> interpolation can easily be done in GLSL, e.g. for implementing funky video
> effects for your camera app.
>
> I think we need to gather more use cases for >= 2D applications to be able
> to design proper interfaces. The only use cases that I can think of right
> now that can't be done in WebGL but still need to have real time
> performance are:
>

WebGL isn't ideal for any kind of analysis, so for example calculating
the convolution kernels for (smart) deblurring would be a prime target.


> 1) Real time camera input analysis (face recognition, feature detection,
> ..?)
>

This +1. Also the same for 3D: people are already doing stuff with Kinect
and JS, not to mention what's going to happen when I get my hands on a
LeapMotion. ;)


> 2) Real time video decoding of codecs that are not natively supported.
>
> I'd say that 2) isn't really a practical solution, but more of a tech
> demo. 1) on the other hand might be useful for things like RTC, augmented
> reality, etc, but I have no clue what the critical operations or interface
> requirements for them would be.
>

I'm no expert in this, but I'd imagine filtering is the most crucial part
of any kind of recognition, be it face, gesture, feature, motion or
whatever. Probably the process is something like: filter out stuff like the
background, produce a heatmap based on resemblance to skin color (I
wouldn't bet on this), try to discern shapes like eyebrows, mouths and
eyes, then, if there's a time component, detect human face motions. For
most of those, my first bet would be a filter. :)
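
As a sketch of the kind of 2D FIR filtering step meant here, assuming a
grayscale image stored as a flat row-major Float32Array with its width and
height kept separately. The `filter2d` helper is hypothetical, clamps at
the image borders, and applies the kernel without flipping it (which makes
no difference for symmetric kernels such as a box blur).

```javascript
// Hypothetical helper, not part of the draft API.
// src: flat row-major image, kernel: flat row-major ksize x ksize weights.
function filter2d(src, width, height, kernel, ksize) {
  const half = Math.floor(ksize / 2);
  const out = new Float32Array(width * height);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      let sum = 0;
      for (let ky = 0; ky < ksize; ky++) {
        for (let kx = 0; kx < ksize; kx++) {
          // Clamp coordinates so border pixels reuse the nearest edge value.
          const sy = Math.min(height - 1, Math.max(0, y + ky - half));
          const sx = Math.min(width - 1, Math.max(0, x + kx - half));
          sum += src[sy * width + sx] * kernel[ky * ksize + kx];
        }
      }
      out[y * width + x] = sum;
    }
  }
  return out;
}

// A 3x3 image with one bright pixel in the middle, smoothed by a box blur.
const image = new Float32Array([0, 0, 0, 0, 9, 0, 0, 0, 0]);
const boxBlur = new Float32Array(9).fill(1 / 9);
const blurred = filter2d(image, 3, 3, boxBlur, 3);
```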

Cheers,
Jussi
Received on Thursday, 26 July 2012 22:25:09 UTC