Re: More thoughts on buffer mapping (no improvement on mapAsync sadly)

On Fri, Feb 28, 2020 at 8:35 AM Corentin Wallez <cwallez@google.com> wrote:

> On Fri, Feb 28, 2020 at 1:32 AM Ken Russell <kbr@google.com> wrote:
>
>> On Thu, Feb 27, 2020 at 5:41 AM Corentin Wallez <cwallez@google.com>
>> wrote:
>>
>>> Whoops, meant to send this to public-gpu.
>>>
>>> On Thu, Feb 27, 2020 at 2:40 PM Corentin Wallez <cwallez@google.com>
>>> wrote:
>>>
>>>> Hey all,
>>>>
>>>> I spent quite some time yesterday trying to understand how alternatives
>>>> like Synchronous mapping #506
>>>> <https://github.com/gpuweb/gpuweb/pull/506> or Failable mapping #511
>>>> <https://github.com/gpuweb/gpuweb/pull/511> could be implemented, and
>>>> what modifications they would need in order to be implementable. At the
>>>> end of the day an important property became clear for implementations
>>>> looking to minimize copies by using shared memory between the content
>>>> process and the GPU process. Below I'm using terminology from the
>>>> buffer mapping proposals, but the same is true for proposals that would
>>>> introduce new staging buffer objects.
>>>>
>>>> *Accurate tracking on the content process of which mappable buffers are
>>>> in use is not tractable.*
>>>>
>>>> We want to avoid races, so that the data written by JS at the time it
>>>> calls unmap is what the GPU sees until the next mapping operation
>>>> (either mapSync, or the resolution of a mapAsync promise). Without some
>>>> asynchronous ownership transfer (mapAsync-like), I claim it is not
>>>> tractable to know accurately on the content-process side whether it is
>>>> safe to write to the shared memory region.
>>>>
>>>> The first reason it is not tractable is that we don't know accurately
>>>> whether the queue submits that use the mappable resource have finished.
>>>> While it is "easy" to follow the progress of the queues themselves,
>>>> knowing which submits use the resource would require duplicating
>>>> content-side tracking in bind groups, encoders, command buffers, and
>>>> bundles to know which mappable resources they hold. This would be a
>>>> many-fold increase in the amount of tracking that WebGPU requires on
>>>> the content side.
>>>>
>>>
>> Is this true for all WebGPU implementations, or Chrome's implementation
>> specifically? If such tracking were implemented, would it impose an
>> unacceptable performance overhead? From my relatively uninformed
>> perspective (I'm not a WebGPU or Dawn expert), conservatively gathering
>> all of the buffers or other mappable resources referenced by these data
>> structures during command encoding and queue submission seems tractable.
>> The tree of data structures that would have to be traversed doesn't seem
>> that deep. The area that would have to be most optimized would be
>> communication from the GPU process back to the content process that those
>> buffers are no longer referenced. Most likely that would have to be handled
>> with atomic operations in shared memory rather than via IPCs.
>>
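To sketch the atomics-over-IPC idea (a hypothetical model, not real WebGPU
or Dawn code; Node's SharedArrayBuffer and Atomics stand in for the
cross-process shared memory):

```javascript
// Hypothetical model: the GPU process publishes the serial of the last
// completed submit in shared memory; the content process reads it with
// an atomic load instead of an IPC round trip.
const shared = new SharedArrayBuffer(4);
const completedSerial = new Int32Array(shared);

// GPU-process side: called when a submit finishes on the GPU timeline.
function gpuMarkCompleted(serial) {
  Atomics.store(completedSerial, 0, serial);
}

// Content-process side: a buffer whose last use was in submit N is free
// once the completed serial reaches N -- no IPC needed to check.
function isBufferFree(lastUseSerial) {
  return Atomics.load(completedSerial, 0) >= lastUseSerial;
}

gpuMarkCompleted(41);
console.log(isBufferFree(41)); // true
console.log(isBufferFree(42)); // false
```

In a real implementation the two halves would live in different processes;
the atomic only answers "free or not", and the ownership-transfer race
would still need care.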
>
> This would be true to varying degrees for all multiprocess
> implementations. For example, at the moment in Chromium the content-side
> data for a GPUBindGroup is just a pointer to a handle that itself contains
> two uint32_t handles, and that's the case for most other WebGPU Blink-side
> objects. It would be nice to be able to keep that if possible.
>

Understood that it would be nice to keep things simple.

The question I'd like answered is: if this tracking were implemented, and
if it weren't too costly, would the "Consequences for the existing
proposals" below change? It seems to me that failable mapping, for example,
would be a viable possibility in that scenario. The analyses below assumed
a priori that client-side tracking was intractable.

A follow-on question - how difficult would it really be to add conservative
client-side tracking? Could other WebGPU implementers provide their
opinions too?
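To make the question concrete, the kind of conservative gathering I have in
mind would look roughly like this (all names hypothetical; a sketch of the
idea, not actual implementation code):

```javascript
// Hypothetical sketch of conservative content-side tracking: each
// container remembers which mappable buffers it can reach, and a queue
// submit unions the sets. Over-approximating (treating a buffer as in
// use when it is not) is fine, so no validation is duplicated.
class TrackedBindGroup {
  constructor(mappableBufferIds) {
    this.mappable = new Set(mappableBufferIds);
  }
}

class TrackedCommandEncoder {
  constructor() { this.mappable = new Set(); }
  setBindGroup(bindGroup) {
    for (const id of bindGroup.mappable) this.mappable.add(id);
  }
  finish() { return { mappable: this.mappable }; } // "command buffer"
}

// Mark every reachable mappable buffer as in use until this submit's
// serial completes -- even if the submit later turns out to be an error.
const inUseUntil = new Map(); // buffer id -> serial of last use
let nextSerial = 1;
function submit(commandBuffers) {
  const serial = nextSerial++;
  for (const cb of commandBuffers)
    for (const id of cb.mappable) inUseUntil.set(id, serial);
  return serial;
}

const encoder = new TrackedCommandEncoder();
encoder.setBindGroup(new TrackedBindGroup(["bufA"]));
console.log(submit([encoder.finish()])); // 1
console.log(inUseUntil.get("bufA"));     // 1
```

The traversal stays shallow: bind group to encoder to command buffer to
submit, all simple set unions.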

-Ken


> The content side can already see progress on the GPU, without the need
> for atomics, through GPUFence signaling. That would also help make the
> effects of a buffer no longer being in use take effect when the
> application can observe the completion of the GPU work.
>
>>
>>>> Second, even doing the tracking above is not enough for accurate
>>>> tracking, because it doesn't take error propagation into account. What
>>>> if one of the commands in the submit that uses the mappable resource is
>>>> an error? Is the resource still considered in use on the content side
>>>> when the GPU-process side does nothing? One solution would be to
>>>> duplicate all the validation on the content side, but that's way too
>>>> expensive, and it can't help with error objects caused by OOM.
>>>>
>>>
>> Conservatively assuming that the resources are referenced even if the
>> command will be an error seems fine from the standpoint of writing portable
>> content. I agree wholeheartedly that the validation should not be
>> duplicated on the content side, but it seems to me that a conservative
>> approximation would work well enough. Errors are neither the common nor
>> expected case.
>>
>> I'm hoping that if we can iterate a bit on this discussion, some of the
>> analyses and conclusions below can be revisited.
>>
> It's correct that the only path forward that's different from mapAsync is
> one where there is a conservative estimate of whether a buffer is mapped
> or not, one that only has false negatives (i.e. the buffer can be
> considered in use when it is not, but not the contrary).
>
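For concreteness, a failable map built on such a conservative,
false-negative-only estimate might look like this (hypothetical sketch;
serials stand in for real GPU-process progress tracking):

```javascript
// Hypothetical sketch: failable mapping on top of a conservative
// estimate. The estimate only errs toward "in use" (returning null for
// a free buffer), never toward handing out memory the GPU may still be
// reading -- the false-negative-only constraint described above.
const lastUseSerial = new Map(); // buffer id -> serial of last submit
let completedSerial = 0;         // conservatively behind the real GPU

function maybeInUse(bufferId) {
  const last = lastUseSerial.get(bufferId);
  return last !== undefined && last > completedSerial;
}

// Returns a writable staging ArrayBuffer, or null if the buffer may
// still be in use by the GPU.
function tryMapSync(bufferId, sizeBytes) {
  return maybeInUse(bufferId) ? null : new ArrayBuffer(sizeBytes);
}

lastUseSerial.set("bufA", 3);
console.log(tryMapSync("bufA", 256)); // null: submit 3 not yet complete
completedSerial = 3;
console.log(tryMapSync("bufA", 256) !== null); // true
```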
>
>> -Ken
>>
>>
>> *Consequences for the existing proposals.*
>>>>
>>>> Synchronous mapping #506 <https://github.com/gpuweb/gpuweb/pull/506>
>>>> has two alternatives: either the mapping operation blocks when the
>>>> buffer is in use, or a new staging area is returned that will be copied
>>>> into the buffer at a later time. Since accurate tracking isn't
>>>> possible, WebGPU will either most often do a roundtrip to the GPU
>>>> process (i.e. block) or most often introduce an extra copy.
>>>>
>>>> Failable mapping #511 <https://github.com/gpuweb/gpuweb/pull/511> has
>>>> the mapping operation return null when the buffer is in use. This
>>>> requires knowing accurately when the buffer is in use, which is not
>>>> tractable.
>>>>
>>>> In Myles' comment about mapAsync
>>>> <https://github.com/gpuweb/gpuweb/pull/506#issuecomment-590199697>, a
>>>> desirable property would be that the mapAsync promise resolves
>>>> immediately if the buffer is currently not in use. Guaranteeing this at
>>>> the spec level requires accurate tracking of which buffers are in use,
>>>> so it isn't possible: there needs to be at least one roundtrip to the
>>>> GPU process to "lock" the resource.
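To spell out the roundtrip constraint, mapAsync can be modeled like this
(hypothetical sketch; a resolved promise stands in for the IPC to the GPU
process):

```javascript
// Hypothetical model of why mapAsync cannot resolve synchronously: even
// when the buffer is idle, the content process must round-trip to the
// GPU process to "lock" the buffer before exposing memory to JS.
function gpuProcessLockBuffer(bufferId) {
  // Simulated IPC: only the GPU process knows the accurate in-use
  // state, so it performs the reservation.
  return Promise.resolve({ bufferId, locked: true });
}

async function mapAsync(bufferId, sizeBytes) {
  const reply = await gpuProcessLockBuffer(bufferId); // >= 1 roundtrip
  if (!reply.locked) throw new Error("buffer still in use");
  return new ArrayBuffer(sizeBytes); // stand-in for the mapped range
}

mapAsync("bufA", 256).then((range) => {
  console.log(range.byteLength); // 256
});
```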
>>>>
>>>> *Another direction using APIs that allow wrapping CPU memory into a
>>>> buffer*
>>>>
>>>> The following APIs allow taking an OS memory object, or just a pointer,
>>>> and turning it into a GPU resource:
>>>>
>>>>    - ID3D12Device3::OpenExistingHeapFromAddress
>>>>    <https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12device3-openexistingheapfromaddress>
>>>>    - MTLDevice newBufferWithBytesNoCopy:length:options:deallocator:
>>>>    <https://developer.apple.com/documentation/metal/mtldevice/1433382-newbufferwithbytesnocopy?language=objc>
>>>>    - VK_EXT_external_memory_host
>>>>    <https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/vkspec.html#VK_EXT_external_memory_host>
>>>>
>>>> These APIs would allow true zero-copy between JS and the GPU, and
>>>> allocating staging memory in the content process synchronously, via the
>>>> following mechanism:
>>>>
>>>>    - In the content process, allocate shared-memory between the
>>>>    content process and GPU process (at least in Chromium that's possible).
>>>>    - Send the shared memory to the GPU process.
>>>>    - In the GPU process create a new resource by wrapping that shared
>>>>    memory region (or allocating a new resource if it's not possible).
>>>>    - In the content process, write to the memory then send an unmap
>>>>    signal to the GPU process.
>>>>    - On the GPU process, if wrapping was not possible, copy from the
>>>>    shmem to the GPU resource.
>>>>    - Profit!
>>>>
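The unmap step of that mechanism boils down to a wrap-or-copy decision,
roughly (hypothetical sketch; typed arrays stand in for the shmem and the
native resource):

```javascript
// Hypothetical sketch of the GPU-process side of unmap: if the shared
// memory could be wrapped directly into a native resource (via
// OpenExistingHeapFromAddress, newBufferWithBytesNoCopy:..., or
// VK_EXT_external_memory_host), unmap is free; otherwise fall back to
// one copy from the shmem into the resource.
function onUnmap(shmem, resource, wrappedDirectly) {
  if (wrappedDirectly) return resource; // zero-copy: same memory
  resource.set(shmem);                  // fallback: one extra copy
  return resource;
}

const shmem = Uint8Array.from([1, 2, 3, 4]);
const resource = new Uint8Array(4);
onUnmap(shmem, resource, false);
console.log(resource[0]); // 1
```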
>>>> An idea I was exploring is something like mapSync that replaces the
>>>> allocation backing a GPUBuffer with a new native buffer via the
>>>> mechanism described above. However, a design constraint we have been
>>>> operating under is that a WebGPU resource is exactly one native API
>>>> resource, so that doesn't work either. (Imagine we baked bind groups
>>>> with the address of the buffer; we want to avoid needing dirtying
>>>> mechanisms.)
>>>>
>>>> *Conclusion*
>>>>
>>>> Like the other times I tried, I wasn't able to come up with a better
>>>> solution than mapAsync. It's the only one that works so far, but the
>>>> asynchrony makes it a bit difficult for people to use, so it would be
>>>> nice to have an alternative.
>>>>
>>>> At least I learned an important design constraint, and discovered that
>>>> it is possible to wrap CPU memory in a GPU resource to optimize things.
>>>> I also started a HackMD to discuss tradeoffs again
>>>> <https://hackmd.io/qWmMfnFVRtyR0Q2HVSagOw?both>. It doesn't have
>>>> content yet, but at least it has links to all the proposals if you want
>>>> to keep it as a set of bookmarks.
>>>>
>>>> Cheers,
>>>>
>>>> Corentin
>>>>
>>>

Received on Friday, 28 February 2020 20:06:17 UTC