Re: TAG feedback on Web Audio from Noah Mendelsohn on 2013-08-07 (public-audio@w3.org from July to September 2013)

From: Noah Mendelsohn <nrm@arcanedomain.com>
Date: Wed, 07 Aug 2013 16:11:31 -0400
To: Jer Noble <jer.noble@apple.com>
CC: "K. Gadd" <kg@luminance.org>, Srikumar Karaikudi Subramanian <srikumarks@gmail.com>, Chris Wilson <cwilso@google.com>, Marcus Geelnard <mage@opera.com>, Alex Russell <slightlyoff@google.com>, Anne van Kesteren <annevk@annevk.nl>, Olivier Thereaux <Olivier.Thereaux@bbc.co.uk>, "robert@ocallahan.org" <robert@ocallahan.org>, "public-audio@w3.org" <public-audio@w3.org>, www-tag@w3.org
Message-ID: <5202A9F3.8080008@arcanedomain.com>

I am not an expert in the details of the proposed interfaces, but I do have 
experience with research and development work relating to the performance 
implications of memory-to-memory copying of large data structures, and I'd 
like to make a suggestion based on that experience.

Latency and unpredictable latencies are surely a concern for audio, as 
described below, but my main point would be to suggest that you do a >> 
quantitative analysis of the CPU overhead for memory-to-memory copies<<.

In the case of audio or video, this would likely be done by running 
benchmarks of toy kernels that are realistic not just in terms of data 
rates, but also in terms of likely memory access patterns (a loop copying 
data between two small buffers will likely have higher performance and 
much less time spent with the CPU waiting for memory than a loop working 
through streaming buffers).

In my experience, you can easily estimate from such benchmarks the data 
copying rate that will saturate the CPU. This will vary from machine model 
to machine model, but you can take a typical machine, say 2 GHz, and put at 
least one core in a loop reading and writing buffers in a pattern typical 
of use of your API.

Now you know that doing that much access will leave no CPU time for 
anything else. Copying at half that rate will leave 50% of your CPU free. 
Yes, with multi-core systems you have to figure out how much memory access 
from one core stalls the other cores, which is also not hard to measure 
roughly.
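
A minimal sketch of the kind of toy benchmark described above, in Python. 
The function name, buffer sizes, and byte totals are assumptions chosen for 
illustration, not anything taken from the Web Audio implementations. It 
times bulk copies between two buffers, once with a small pair that should 
stay in cache and once with a large pair assumed to exceed the cache. 
Interpreter overhead inflates the cost of the small-buffer case, so a C 
version of the same loop would give tighter numbers; treat the results as 
rough estimates only.

```python
import time


def copy_rate_bytes_per_sec(buffer_bytes, total_bytes=256 * 1024 * 1024):
    """Estimate copy throughput by repeatedly copying src into dst.

    buffer_bytes: size of each of the two buffers (hypothetical parameter).
    total_bytes:  approximate number of bytes to copy in total.
    """
    src = bytearray(buffer_bytes)
    dst = bytearray(buffer_bytes)
    passes = max(1, total_bytes // buffer_bytes)

    start = time.perf_counter()
    for _ in range(passes):
        dst[:] = src  # one bulk memory-to-memory copy of the whole buffer
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero

    return passes * buffer_bytes / elapsed


if __name__ == "__main__":
    # Assumed sizes: 64 KiB should fit in cache on most machines,
    # 64 MiB is assumed to be larger than the last-level cache.
    small = copy_rate_bytes_per_sec(64 * 1024)
    large = copy_rate_bytes_per_sec(64 * 1024 * 1024, total_bytes=512 * 1024 * 1024)
    print(f"small buffers: {small / 1e9:.2f} GB/s copied")
    print(f"large buffers: {large / 1e9:.2f} GB/s copied")
```

The large-buffer figure is the more relevant one for streaming workloads. 
Each copied byte is both read and written, so the memory traffic is roughly 
twice the reported rate.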

Now ask questions like: how many bytes per second will be copied in 
aggressive usage scenarios for your API? Presumably the answer is much 
higher for video than for audio, and likely higher for multichannel audio 
(24 track mixing) than for simpler scenarios.

With this in hand, you can do at least very rough quick calculations of how 
much of a typical computer's capability will be tied up doing 
memory-to-memory copies in various scenarios. Then let the chips fall where 
they may. If you can afford it, fine, do the copies. If you can't, this 
should be a useful way to find that out without spending months building and 
debugging the whole API. My guess is that with just a few audio tracks and 
today's fast machines a little copying isn't fatal, but I'd guess that for 
multiple high-quality 24-bit uncompressed audio streams or for video, the 
overhead of copying will likely prove significant.
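
The rough calculation described above can be sketched as follows. The 
scenario parameters (sample rates, sample widths, frame size) and the 
saturation rate are placeholder assumptions, and the function names are 
hypothetical; the saturation rate in particular should be replaced with a 
value measured on the target machine.

```python
def audio_bytes_per_sec(channels, sample_rate_hz, bytes_per_sample, copies=1):
    """Bytes copied per second for uncompressed audio, per the given copy count."""
    return channels * sample_rate_hz * bytes_per_sample * copies


def video_bytes_per_sec(width, height, bytes_per_pixel, frames_per_sec, copies=1):
    """Bytes copied per second for uncompressed video frames."""
    return width * height * bytes_per_pixel * frames_per_sec * copies


def cpu_fraction(copy_rate, saturation_rate):
    """Fraction of one core consumed, given the rate that saturates it."""
    return copy_rate / saturation_rate


if __name__ == "__main__":
    # Placeholder only: substitute a measured saturation rate here.
    ASSUMED_SATURATION_RATE = 1_000_000_000  # bytes/s, hypothetical

    scenarios = {
        # Stereo, 44.1 kHz, 32-bit float samples: 352,800 bytes/s
        "stereo float32": audio_bytes_per_sec(2, 44_100, 4),
        # 24 tracks, 48 kHz, 24-bit samples: 3,456,000 bytes/s
        "24-track 24-bit": audio_bytes_per_sec(24, 48_000, 3),
        # 1920x1080, 4 bytes/pixel, 30 frames/s: 248,832,000 bytes/s
        "1080p30 video": video_bytes_per_sec(1920, 1080, 4, 30),
    }
    for name, rate in scenarios.items():
        share = cpu_fraction(rate, ASSUMED_SATURATION_RATE)
        print(f"{name}: {rate:,} bytes/s, {share:.2%} of one core")
```

The `copies` argument accounts for designs in which the same data is copied 
more than once on its way through the API.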

BTW: the work I did was not on audio, but on XML Parsing. The paper is here 
[1], and you'll see that the nature of the analysis is as described above. 
The benchmarks suggested above proved to predict quite well the performance 
of the parsing system. The short version of the conclusion: many of these 
applications are performance-limited by the speed of RAM; caches don't help 
much for streaming applications, and except where you have SMT (multiple 
threads per core), a given core will be stalled doing nothing while waiting for 
memory.

Noah

[1] http://www2006.org/programme/files/pdf/5011.pdf

On 8/7/2013 12:24 PM, Jer Noble wrote:
>
> On Aug 6, 2013, at 6:56 PM, K. Gadd <kg@luminance.org> wrote:
>
>> Given flash's continued prevalence for streaming realtime audio/video and
>> use cases like video chat and game rendering, whatever works for the
>> pepper flash plugin should probably work okay for web audio (though it
>> certainly isn't necessarily *needed*).
>
> I believe this is a faulty premise.
>
> Video, especially 24 FPS video, is much more immune to jitter and delays
> than is audio.  In the video case, frames are 30 to 40 ms apart, and if a
> frame is delayed by a few ms, that delay is almost imperceptible to the
> human eye.
>
> With 44.1 kHz audio and a small frame size (128 samples in Blink & WebKit
> Web Audio, usually), each batch of samples is only 3 ms apart, and if those
> samples are delayed even by less than a millisecond, there will be an
> audible ‘pop’ in the rendered audio.
>
> Additionally, I would assume that in Chrome the video is piped back to the
> main process for rendering, but the audio is rendered locally in the plugin
> process.
>
> -Jer

Received on Wednesday, 7 August 2013 20:11:52 UTC