Re: TAG feedback on Web Audio from Srikumar Karaikudi Subramanian on 2013-08-08 (public-audio@w3.org from July to September 2013)

From: Srikumar Karaikudi Subramanian <srikumarks@gmail.com>
Date: Thu, 8 Aug 2013 09:34:49 +0530
To: Noah Mendelsohn <nrm@arcanedomain.com>
Cc: robert@ocallahan.org, Jer Noble <jer.noble@apple.com>, "K. Gadd" <kg@luminance.org>, Chris Wilson <cwilso@google.com>, Marcus Geelnard <mage@opera.com>, Alex Russell <slightlyoff@google.com>, Anne van Kesteren <annevk@annevk.nl>, Olivier Thereaux <Olivier.Thereaux@bbc.co.uk>, "public-audio@w3.org" <public-audio@w3.org>, "www-tag@w3.org List" <www-tag@w3.org>
Message-Id: <5FF16133-F84D-4256-B9C0-5771C4CAFE90@gmail.com>
> The benchmark I was proposing was strictly copying bytes, without context switches or reliance on OS piping services, to see how fast the hardware can do that (and making sure the patterns are likely to use the cache in about the same way as the API). Such a measurement sets a bound on the overhead from >copying<, which I thought was the question on the table?

You're right that I'm not measuring what you suggested we measure. I read your above paragraph to mean that you intended to measure something close to the *upper* bound on copying performance. I tend to be skeptical of combining multiple "subsystem peak performance" benchmarks to estimate what might occur when mixing subsystems, since in today's computers subsystem interaction can easily lead to radically different behaviours (a context switch may page a buffer to the disk, for instance). So I tend to measure possible *lower* bounds on performance instead ... particularly if a quick estimate is possible.

As for the fairly simple floating point operations I introduced, they have a non-trivial impact on performance. For buffers of 128 up to about 512 samples, removing them improves throughput by about 20%. For 4096-sample buffers, the throughput nearly doubles if you remove the floating point ops, while removing the malloc changes the figures by barely 5%. So, you see, if you design based on peak copy bandwidth figures, a "floating point ops won't affect this much" mindset could throw up a few surprises. (The program has command line parameters to turn the floating point ops and the malloc on or off.)

This way I surely don't get an accurate picture, but it's useful in a happy sort of way: if the lower bound performance turns out to be good enough, anything better than that is only cause for joy.
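
For comparison, the copy-only measurement described in the quoted paragraph could be sketched roughly as below. This is not the program in the gist: the function names copy_throughput_mb_s and realtime_streams are hypothetical, the 0.5f scale factor is an arbitrary stand-in for the floating point work, and the stream estimate assumes float32 samples and ignores the round trip. Treat it as a sketch under those assumptions rather than a calibrated benchmark.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not from the gist): copies one float32 buffer of
 * `samples` samples `iterations` times inside a single process, so no
 * pipes or context switches are involved. When `with_fops` is non-zero,
 * each sample is scaled during the copy as a stand-in for floating point
 * work. Returns throughput in MB/s (1 MB = 1e6 bytes), or 0 if the run
 * was too short for clock() to resolve. */
double copy_throughput_mb_s(size_t samples, size_t iterations, int with_fops)
{
    assert(samples > 0);
    float *src = malloc(samples * sizeof *src);
    float *dst = malloc(samples * sizeof *dst);
    assert(src != NULL && dst != NULL);

    for (size_t i = 0; i < samples; ++i)
        src[i] = (float)i / (float)samples;

    volatile float sink = 0.0f; /* discourages the compiler from dropping the copies */
    clock_t start = clock();
    for (size_t n = 0; n < iterations; ++n) {
        if (with_fops) {
            for (size_t i = 0; i < samples; ++i)
                dst[i] = src[i] * 0.5f;
        } else {
            memcpy(dst, src, samples * sizeof *src);
        }
        sink += dst[n % samples];
    }
    clock_t end = clock();

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    double bytes = (double)samples * sizeof(float) * (double)iterations;
    free(src);
    free(dst);
    return seconds > 0.0 ? bytes / seconds / 1e6 : 0.0;
}

/* Rough conversion (an assumption, not the gist's method): how many
 * real-time float32 streams a given throughput could carry one way. */
long realtime_streams(double throughput_mb_s, int channels, double sample_rate_hz)
{
    double bytes_per_second = channels * sample_rate_hz * sizeof(float);
    return (long)(throughput_mb_s * 1e6 / bytes_per_second);
}
```

Calling copy_throughput_mb_s(128, 1000000, 0) and copy_throughput_mb_s(4096, 1000000, 1) from a small main would give the two ends of the range discussed here; the buffer should be made much larger to miss the first and second level caches. Process CPU time from clock() is coarse, so short runs are not meaningful.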

-Kumar

On 8 Aug, 2013, at 8:55 AM, Noah Mendelsohn <nrm@arcanedomain.com> wrote:

> I do really appreciate your quick effort to respond to my suggestion. I'm a little concerned, though, as to whether what you're measuring is actually in the spirit of what I suggested.
> 
> You're using Unix pipes with read and write. Do we know enough about the implementation of the pipes to be sure you're measuring patterns that match what some particular audio API would do? I suspect not. I do agree that your results suggest that in your implementation context switch overhead is significant, and that may well be the reason your throughput rises with packet size.
> 
> The benchmark I was proposing was strictly copying bytes, without context switches or reliance on OS piping services, to see how fast the hardware can do that (and making sure the patterns are likely to use the cache in about the same way as the API). Such a measurement sets a bound on the overhead from >copying<, which I thought was the question on the table?
> 
> I also note that you have some floating point operations in there. They are likely swamped for small buffers by your context switch overhead, but if you get rid of the context switches I wouldn't be surprised that those floating point operations would prove significant.
> 
> Indeed, years ago when we were building our parser someone on our team was playing around and happened to include a mod (% operator) in much the same way you're using that floating point conversion/multiply (which your compiler may or may not be optimizing out). It took us a while to realize why our results were anomalous: % tends to involve an integer divide, and on many machines divide times are significant relative to word access times.
> 
> If what's to be benchmarked is memory copy time in buffers sufficiently large to miss in 1st/2nd level cache (which seems a reasonable approximation to the audio case), then that's what should be benchmarked. I'd be suspicious of anything that involves context switching, floating point ops, pipes, etc.
> 
> Of course, if the audio APIs will necessarily involve OS-level context switches, that should be evaluated too, but I'd suggest decoupling the context-switch benchmarks from the memory copy benchmarks.
> 
> Noah
> 
> On 8/7/2013 10:01 PM, Srikumar Karaikudi Subramanian wrote:
>> I did a quick test to see what's possible on my laptop (MacBook Air 1.7GHz,
>> core i5).
>> 
>> https://gist.github.com/srikumarks/6180450
>> 
>> The C program forks off a child process and the two keep sending one
>> float32 buffer of a given size back and forth.  The interesting thing that
>> came up in my trial runs is that the data throughput is severely affected
>> by the buffer size and not as much (relatively) by whether the buffer is
>> malloced fresh and filled for every send. Using a 128 sample buffer, I got
>> a throughput around 45MB/s, but with a 4096 sample buffer, I got about
>> 400MB/s. Both measurements done with a fresh malloc and fill for every send.
>> 
>> These numbers suggest that in the case of audio, the data throughput is not
>> a bottleneck, but the process switching overhead is. However, even with the
>> 128 case, > 200 such mono streams can be sent back and forth. This number
>> is relevant when you have N number of script nodes in a chain before
>> hitting the audio destination node.
>> 
>> When considering 5.1/48 kHz audio, the length of the buffer in each
>> send/recv is 768 samples (128 frames x 6 channels), and I got about 150
>> such streams possible in such a chain. The data throughput in this case
>> was about 160MB/sec.
>> 
>> (All throughput numbers are "pessimized" values. See gist for real figures.
>> I did not exit any of my other running applications to run this test.)
>> 
>> -Kumar
>> 
>> 
>> On 8 Aug, 2013, at 3:14 AM, "Robert O'Callahan" <robert@ocallahan.org
>> <mailto:robert@ocallahan.org>> wrote:
>> 
>>> On Thu, Aug 8, 2013 at 8:11 AM, Noah Mendelsohn <nrm@arcanedomain.com
>>> <mailto:nrm@arcanedomain.com>> wrote:
>>> 
>>>    Now ask questions like: how many bytes per second will be copied in
>>>    aggressive usage scenarios for your API? Presumably the answer is
>>>    much higher for video than for audio, and likely higher for
>>>    multichannel audio (24 track mixing) than for simpler scenarios.
>>> 
>>> 
>>> For this we need concrete, realistic test cases. We need people who are
>>> concerned about copying overhead to identify test cases that they're
>>> willing to draw conclusions from. (I.e., test cases where, if we
>>> demonstrate low overhead, they won't just turn around and say "OK I'll
>>> look for a better testcase" :-).)
>>> 
>>> Rob
>>> --
>>> Jtehsauts  tshaei dS,o n" Wohfy  Mdaon  yhoaus  eanuttehrotraiitny  eovni
>>> le atrhtohu gthot sf oirng iyvoeu rs ihnesa.r"t sS?o  Whhei csha iids
>>> teoa stiheer :p atroa lsyazye,d  'mYaonu,r  "sGients  uapr,e  tfaokreg
>>> iyvoeunr, 'm aotr  atnod  sgaoy ,h o'mGee.t"  uTph eann dt hwea lmka'n?
>>> gBoutt  uIp  waanndt  wyeonut  thoo mken.o w *
>>> *
>>
Received on Thursday, 8 August 2013 04:05:28 UTC