Re: Integer PCM sample formats to Web Audio API?

On Tue, Jan 14, 2014 at 2:48 PM, Chris Wilson <cwilso@google.com> wrote:

> If int16 buffers don't offer something approximating actual guarantees,
>> you haven't fixed anything - that native port will still have to assume the
>> worst (i.e. using 2x as much memory) and be rewritten to work with a tiny
>> address space, making your int16 buffer optimization nearly meaningless -
>> sure, the mixer might be slightly faster/slower and the process's resident
>> memory use will be lower, but it won't enable any new use cases and certain
>> ports will still be out of the question.
>>
>
> What's a "guarantee"?  Even if we mandated, with a MUST, that
> implementations MUST use native 16-bit storage when requested,
> implementations might choose not to do that as a performance/battery
> optimization.  They wouldn't be conforming, but they would work.
>

The most obvious analogue here is the way textures work in OpenGL and
Direct3D. You allocate a texture of a particular size in a particular
format, and that's what you get. The GPU is certainly free to take
liberties with the actual arrangement of the texture in video memory (and
in fact, most do), but the format you ask for is (IIRC) always the format
you get. This is important because format conversions introduced at the
driver's discretion could have unpredictable performance consequences or
even behavioral differences (due to too much or too little precision). I
don't see how audio really differs dramatically in this area, unless I've
overlooked something important. I'd love to see examples of how audio is
somehow special in this regard.


>
> The AudioContext's sampleRate is not set to a defined number, but in
> practice the sampleRate is set to the audio output sample rate - that is,
> the AudioDestinationNode's native rate - since that's where the clock is
> coming from.  The point is that the entire audio context is run in a single
> rate, to minimize resampling.
>

Arguably the choice to mix in 32-bit float should be equivalent to a choice
to mix in 44.1 kHz or 48 kHz. It shouldn't have to influence the source format
of audio data any more than the output rate would require source audio to
be stored at that rate. This is the point I'm trying to make: both bitness
and sample rate are important controls to have over source audio.
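To put rough numbers on this, here is a back-of-the-envelope sketch for one
typical decoded sound asset (the durations and rates are illustrative, not
taken from any particular application):

```javascript
// Back-of-the-envelope: one 30-second stereo sound effect at 44.1 kHz.
const frames = 30 * 44100;        // sample frames
const samples = frames * 2;       // two channels, interleaved
const float32Bytes = samples * 4; // Web Audio's internal float32 storage
const int16Bytes = samples * 2;   // the source file's native int16 format

console.log(float32Bytes); // 10584000 bytes (~10.1 MiB)
console.log(int16Bytes);   //  5292000 bytes (~5.0 MiB)
```

Multiply that 2x factor across the hundreds of sounds a game keeps resident
and it quickly becomes the difference between fitting in memory and not.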


>
> Having such an option in the API gives the implementation an opportunity
>>> to save memory when memory is scarce, but it's not necessarily forced to do
>>> so.
>>>
>>
>> The whole point is to force the implementation to save memory. An
>> application that runs out of memory 80% of the time is not appreciably
>> better than one that does so 100% of the time - end users will consider
>> both unusable.
>>
>
> Given all the other factors that may change memory usage in the web
> platform, I'm not sure why this one feature will solve that problem.  Or
> even come close.  Again, I'm not saying I see no reason to look closely at
> this; I'm just saying that I don't think this is as big a slam dunk as you
> appear to, and I think there are notable situations when it is better to
> NOT store that data in int16, and there will be
>

What situations are these? I find it hard to imagine a scenario where
software playback is going to benefit tremendously from using 2x the memory
to store sample data. Certainly there are huge advantages to *mixing* in
floating-point; are you arguing that making the mixer slightly faster
merits using double the memory (and thus, double the memory bandwidth, if
not more - memory bandwidth being especially precious on mobile platforms)?
Must the floating-point version of said buffer be the de facto storage
format even though it is merely a minor mixing-efficiency optimization?

I am also dubious about the tremendous cost implied by converting from int16
to float32 in the mixer. It's a trivial, common operation, and depending
on architecture I would expect it could pay for itself in reduced memory
bandwidth usage and more efficient use of L1/L2/L3 caches. Have you
benchmarked this? Do you have test cases that demonstrate a tremendous
performance win by using float32 for everything versus int16 or int8?
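For reference, the conversion in question is one multiply per sample. A
sketch using JS typed arrays (not any browser's actual mixer code, obviously):

```javascript
// Expand int16 PCM to float32 in [-1, 1) -- one multiply per sample.
// This loop is the entire "cost" being weighed against doubling storage.
function int16ToFloat32(src) {
  const dst = new Float32Array(src.length);
  const scale = 1 / 32768; // int16 range is [-32768, 32767]
  for (let i = 0; i < src.length; i++) {
    dst[i] = src[i] * scale;
  }
  return dst;
}

console.log(int16ToFloat32(new Int16Array([-32768, 0, 16384])));
// Float32Array [ -1, 0, 0.5 ]
```

A native mixer would vectorize this, at which point the extra work per
sample is dwarfed by the halved memory traffic feeding it.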


>
>
>> On this whole subject it is important to realize that when talking about
>> developers porting games and multimedia software from other native
>> platforms, it is usually not wise to assume they are idiots that will shoot
>> themselves in the foot.
>>
>
> That was not the intent, and I was certainly not making that assumption.
>  However, those aren't the only developers that would have this API
> available - and I would venture some of them would choose to make this
> decision without understanding how it may affect them on other devices or
> browsers, now and in the future.  Mostly because that's pretty much
> impossible to know.
>

Sacrificing current-day usability in favor of some hypothetical future
platform is not a wise decision when we are dealing with existing software.
Furthermore, I would argue that we have no proven ability to anticipate
future hardware configurations any better than the developers of these
games and multimedia applications. To a large extent, these applications
have been doing mixing and playback the same way for over a decade, and
numerous changes/improvements to hardware have not significantly impacted
things, other than cases where buffers moved to/from hardware and
filtering/mixing got slightly more sophisticated. The underlying model has
not significantly changed. The same is true for graphics rendering: We
still basically have buffers containing vertex data, index data, and texel
data - as we have since the early days of modern-era OpenGL/Direct3D.

I simply cannot imagine a realistic scenario in which locking buffers to
float32 is an effective optimization for some future hardware configuration,
let alone a configuration that cannot easily cope with int16 (or even
benefit from it).
Such an architecture would have significant problems running pretty much
any modern software; modern software loves integers. Hell, the JS runtime
still freely mixes floats & ints and does frequent conversions.

I will agree that we cannot know how this will affect future
devices/browsers, especially ones with odd architectures. However, is there
really a good reason to compromise usability and performance on current
architectures - ones used by the vast majority of living, breathing, paying
customers - in favor of customers that don't exist because the devices
haven't been made or bought yet?


>
> Yes, developers make mistakes, and they ship broken software that relies
>> on bugs in browser implementations - I can understand the reluctance to
>> give developers more ways to make mistakes.
>>
>
> It's not "reluctance to give developers more ways to make mistakes" at
> all.  It's "caution in exposing low-level platform implementation details
> unless you are absolutely, positively certain it can be made a net win
> overall."  Every low-level implementation detail that's exposed makes it
> that much harder for the web platform to scale across devices, and puts
> more onus on the developer to own that scalability; that begs for caution.
>

The source format (bitness & sample rate) of audio is not 'a low level
platform implementation detail' any more than the pixel format of a source
image is a low level platform implementation detail. The file formats audio
is loaded from and rendered to contain this information; authors select it
explicitly, given particular tradeoffs (e.g. recording at high sample rates,
then mixing down to lower rates). You cannot simply
hide it behind a wall and pretend it doesn't exist. We're not talking about
abstractions like those in 3D rendering, where the exact mechanics of
fragment rendering and vertex layout are left up to the vendor (as long as
they satisfy the requirements of the spec); we are talking about
foundational details. As I mentioned before, such abstraction would
not be tolerated for textures in rendering (though you could certainly
offer it as an 'opt-in' way to somehow save on memory and texture
bandwidth).


>
>
>> In these scenarios, we have working applications that do interesting
>> things on native platforms, and if you significantly undermine the Web
>> platform's ability to deliver parity in these scenarios, you're not
>> protecting native app developers from anything, all you're doing is keeping
>> them off the Web and stuck in walled garden App Stores.
>>
>
> All I'm saying is "parity does not mean do it the same way," and pointing
> out that the Web platform is supposed to scale across different hardware
> and devices better, I think, than previous platforms have done.
>
> Again, I would point out that making a change that would allow developers
> to force the integer storage of buffers would have negative side effects,
> and all I'm cautioning is those should be carefully examined and weighed.
>  I would postulate a set of developers would say "well of course, my data
> is 16-bit 22kHz, of course I want to force the data to be stored that way
> to save memory!" without considering that by doing so, they are going to be
> burning battery life (aka CPU time).  That's not always the right tradeoff.
>

I'm not advocating that everything must be done the same way. I'm
advocating for an actual solution to this problem instead of continued
hand-waving that cites (at least in my history following this list) wholly
unstated hypothetical future use cases as justification. You
don't have to rearchitect the whole Web Audio pipeline or introduce a
sweeping set of new features, just provide a real-world solution for
controlling the (already extreme) memory usage of AudioBuffers.

P.S. in graphics scenarios we've been relying heavily on compressed storage
of texel data in memory for over a decade, because it turns out we never
have enough memory to store all our data. Given that the size of these
float32 AudioBuffers is a real problem for existing game demos, perhaps it
would be worthwhile to use efficient in-memory compression for audio? It is
certainly the case that many real-world games do streaming decompression for
some of their audio (e.g. music and voiced dialogue) instead of decoding it
up front into enormous buffers. Note that I am not advocating streaming
*from storage*; I am advocating streaming *from memory*. The Xbox 360
actually has hardware support for this in the southbridge, if memory serves.
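A minimal sketch of the shape I have in mind (all names here are
hypothetical, and plain int16 PCM stands in for a real compressed format):

```javascript
// "Streaming from memory": keep the *compressed* bytes resident and decode
// short windows on demand, instead of expanding every asset to float32 up
// front. CompressedSoundBank and decodeInt16Window are illustrative names.
class CompressedSoundBank {
  constructor(decodeWindow) {
    this.decodeWindow = decodeWindow; // (bytes, offset, frames) -> Float32Array
    this.sounds = new Map();          // name -> compressed bytes (stays small)
  }
  add(name, compressedBytes) {
    this.sounds.set(name, compressedBytes);
  }
  // The mixer asks only for the window it needs right now; the fully
  // decoded float32 buffer never exists in memory all at once.
  window(name, offset, frames) {
    return this.decodeWindow(this.sounds.get(name), offset, frames);
  }
}

// Toy "codec": treat int16 PCM as the compressed form (already half the
// size of float32) and expand one window at a time for mixing.
function decodeInt16Window(int16, offset, frames) {
  const out = new Float32Array(frames);
  for (let i = 0; i < frames; i++) out[i] = int16[offset + i] / 32768;
  return out;
}

const bank = new CompressedSoundBank(decodeInt16Window);
bank.add("explosion", new Int16Array([0, 16384, -16384, 32767]));
console.log(bank.window("explosion", 1, 2)); // Float32Array [ 0.5, -0.5 ]
```

Swap the toy codec for a real one (ADPCM, or whatever the engine already
ships) and the peak decoded footprint is bounded by the window size, not by
the total duration of the audio.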

Received on Wednesday, 15 January 2014 04:03:05 UTC