Re: More thoughts on buffer mapping (no improvement on mapAsync sadly)

On Fri, Feb 28, 2020 at 1:32 AM Ken Russell <kbr@google.com> wrote:

> On Thu, Feb 27, 2020 at 5:41 AM Corentin Wallez <cwallez@google.com>
> wrote:
>
>> Whoops, meant to send this to public-gpu.
>>
>> On Thu, Feb 27, 2020 at 2:40 PM Corentin Wallez <cwallez@google.com>
>> wrote:
>>
>>> Hey all,
>>>
>>> I spent quite some time yesterday trying to understand how alternatives
>>> like Synchronous mapping #506
>>> <https://github.com/gpuweb/gpuweb/pull/506> or Failable mapping #511
>>> <https://github.com/gpuweb/gpuweb/pull/511> could be implemented and
>>> which modifications they would need to be implementable. At the end of the
>>> day an important property became clear for implementations looking to
>>> minimize copies using shared memory between the content process and the GPU
>>> process. Below I'm using terminology from buffer mapping proposals but the
>>> same is true for proposals that would introduce new staging buffer objects.
>>>
>>> *Accurate tracking, on the content-process side, of which mappable
>>> buffers are in use is not tractable.*
>>>
>>> We want to avoid races, so that the data written by JS at the time it
>>> calls unmap is what gets seen by the GPU until the next mapping
>>> operation (either a mapSync call or a mapAsync promise resolution).
>>> Without some asynchronous ownership transfer (mapAsync-like), I claim it
>>> is not tractable to know accurately on the content-process side whether
>>> it is safe to write to the shared memory region.
>>>
>>> The first reason it is not tractable is that we don't know accurately
>>> whether queue submits that use the mappable resource have finished.
>>> While it is "easy" to follow the progress of queues themselves, knowing
>>> which submits use the resource would require adding duplicated
>>> content-side tracking in bind groups, all encoders, command buffers, and
>>> bundles to record which mappable resources they hold. This would be a
>>> many-fold increase in the amount of tracking that WebGPU requires on the
>>> content side.
>>>
>>
> Is this true for all WebGPU implementations, or Chrome's implementation
> specifically? If such tracking were implemented, would it impose an
> unacceptable performance overhead? From my relatively uninformed
> perspective (I'm not a WebGPU or Dawn expert), conservatively gathering
> all of the buffers or other mappable resources referenced by these data
> structures during command encoding and queue submission seems tractable.
> The tree of data structures that would have to be traversed doesn't seem
> that deep. The area that would have to be most optimized would be
> communication from the GPU process back to the content process that those
> buffers are no longer referenced. Most likely that would have to be handled
> with atomic operations in shared memory rather than via IPCs.
>

>
This would be true to varying degrees for all multiprocess implementations.
For example, at the moment in Chromium the content-side data for a
GPUBindGroup is just a pointer handle that itself contains two uint32_t
handles, and that's the case for most other WebGPU Blink-side objects. It
would be nice to be able to keep that if possible.

The content side can already observe progress on the GPU, without the need
for atomics, through GPUFence signaling. That would also help make the
effects of a buffer no longer being in use become visible at the point
where the application can observe the completion of the GPU work.
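
For example, with the GPUFence API as it is specced today, the
application-visible signal would look roughly like this (the queue and
command buffers here are just placeholders):

    // After submitting work that reads from a mappable buffer, signal a
    // fence; once onCompletion() resolves, the content side knows the GPU
    // has passed that submit, without extra atomics or an extra IPC
    // roundtrip just for bookkeeping.
    async function waitForSubmit(queue: any /* GPUQueue */,
                                 commandBuffers: any[]) {
      queue.submit(commandBuffers);
      const fence = queue.createFence();  // starts at value 0
      queue.signal(fence, 1);
      await fence.onCompletion(1);
      // Here the buffers used by the submit above could conservatively be
      // marked as no longer in use.
    }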

>
>>> Second, even the tracking above is not enough for accuracy, because it
>>> doesn't take error propagation into account. What if one of the commands
>>> in the submit that uses the mappable resource is an error? Is the
>>> resource still considered in use by the content side when the GPU
>>> process side does nothing? One solution would be to duplicate all the
>>> validation on the content side, but that's way too expensive, and it
>>> couldn't account for error objects caused by OOM anyway.
>>>
>>
> Conservatively assuming that the resources are referenced even if the
> command will be an error seems fine from the standpoint of writing portable
> content. I agree wholeheartedly that the validation should not be
> duplicated on the content side, but it seems to me that a conservative
> approximation would work well enough. Errors are neither the common nor
> expected case.
>
> Hoping that if we can iterate a bit on this discussion, some of the
> analyses and conclusions below might be revisited.
>
It's correct that the only path forward that's different from mapAsync is
one where there is a conservative estimate of whether a buffer is in use,
one that only has false negatives (i.e. the buffer can be considered in
use when it is not, but not the contrary).
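
To make that concrete, here is a rough sketch of what such a conservative,
false-negative-only estimate could look like on the content side
(hypothetical names, not a concrete proposal):

    // Hypothetical content-side bookkeeping, for illustration only.
    class MappableBufferState {
      // Highest submit serial that *might* use this buffer (bumped
      // conservatively, so false "in use" answers are possible).
      private lastPossibleUse = 0;
      // Highest submit serial known to have completed on the GPU.
      private lastCompleted = 0;

      // Called on every queue submit that could reference the buffer.
      markPossiblyUsed(submitSerial: number) {
        this.lastPossibleUse = Math.max(this.lastPossibleUse, submitSerial);
      }

      // Driven by completion signals coming back from the GPU process.
      onSubmitCompleted(submitSerial: number) {
        this.lastCompleted = Math.max(this.lastCompleted, submitSerial);
      }

      // May report "in use" for a buffer that is actually free, but never
      // the contrary, which is the property described above.
      maybeInUse(): boolean {
        return this.lastCompleted < this.lastPossibleUse;
      }
    }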


> -Ken
>
>
>>> *Consequences for the existing proposals.*
>>>
>>> Synchronous mapping #506 <https://github.com/gpuweb/gpuweb/pull/506> has
>>> two alternatives: either the mapping operation blocks when the buffer is
>>> in use, or a new staging area is returned that will be copied into the
>>> buffer at a later time. Since accurate tracking isn't possible, either
>>> WebGPU will most often do a roundtrip to the GPU process (aka block) or
>>> it will most often introduce an extra copy.
>>>
>>> Failable mapping #511 <https://github.com/gpuweb/gpuweb/pull/511> has
>>> the mapping operation return null when the buffer is in use. This
>>> requires knowing accurately when the buffer is in use, which is not
>>> tractable.
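>>>
>>> For illustration only (the method name below is made up, not necessarily
>>> the exact shape in #511), this is what the failable shape looks like from
>>> the application's point of view; returning null correctly is exactly
>>> where the accurate tracking would be needed:
>>>
>>>     // Hypothetical failable-mapping shape, for illustration only.
>>>     function uploadVertices(buffer: any /* GPUBuffer-like */,
>>>                             vertexData: Float32Array) {
>>>       const mapping: ArrayBuffer | null = buffer.mapWriteSync();
>>>       if (mapping !== null) {
>>>         new Float32Array(mapping).set(vertexData);
>>>         buffer.unmap();
>>>       } else {
>>>         // The buffer is still (conservatively) considered in use: the
>>>         // app has to retry later, or go through a separate staging
>>>         // buffer and an extra copy.
>>>       }
>>>     }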
>>>
>>> Myles' comment about mapAsync
>>> <https://github.com/gpuweb/gpuweb/pull/506#issuecomment-590199697>
>>> suggests that a desirable property would be for the mapAsync promise to
>>> resolve immediately if the buffer is currently not in use. Guaranteeing
>>> this at the spec level requires accurate tracking of which buffers are
>>> in use, so it isn't possible. There needs to be at least one roundtrip
>>> to the GPU process to "lock" the resource.
>>>
>>> *Another direction using APIs that allow wrapping CPU memory into a
>>> buffer*
>>>
>>> The following APIs allow taking an OS memory object, or just a pointer,
>>> and turning it into a GPU resource:
>>>
>>>    - ID3D12Device3::OpenExistingHeapFromAddress
>>>    <https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12device3-openexistingheapfromaddress>
>>>    - MTLDevice newBufferWithBytesNoCopy:length:options:deallocator:
>>>    <https://developer.apple.com/documentation/metal/mtldevice/1433382-newbufferwithbytesnocopy?language=objc>
>>>    - VK_EXT_external_memory_host
>>>    <https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/vkspec.html#VK_EXT_external_memory_host>
>>>
>>> These APIs would allow true zero-copy between JS and the GPU, and would
>>> allow staging memory to be allocated synchronously in the content
>>> process, via the following mechanism:
>>>
>>>    - In the content process, allocate shared-memory between the content
>>>    process and GPU process (at least in Chromium that's possible).
>>>    - Send the shared memory to the GPU process.
>>>    - In the GPU process, create a new resource by wrapping that shared
>>>    memory region (or allocate a new resource if wrapping isn't possible).
>>>    - In the content process, write to the memory then send an unmap
>>>    signal to the GPU process.
>>>    - On the GPU process, if wrapping was not possible, copy from the
>>>    shmem to the GPU resource.
>>>    - Profit!
>>>
>>> An idea I was exploring is having something like mapSync that replaces
>>> the allocation backing a GPUBuffer with a new native buffer via the
>>> mechanism described above. However, a design constraint we have been
>>> operating with is that a WebGPU resource is exactly a native API
>>> resource, so that doesn't work either. (Imagine we baked bind groups
>>> with the address of the buffer; we want to avoid needing dirtying
>>> mechanisms.)
>>>
>>> *Conclusion*
>>>
>>> Like the other times I tried, I wasn't able to come up with a better
>>> solution than mapAsync. It's the only one that works so far, but the
>>> asynchrony makes it a bit difficult for people to use, so it'd be nice
>>> to have an alternative.
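>>>
>>> For reference, this is roughly what the asynchrony looks like for
>>> applications, using the mapWriteAsync() flavor currently in the spec as
>>> a stand-in for whatever mapAsync ends up being (the buffer and data
>>> names are made up):
>>>
>>>     // Per-frame upload with async mapping: the app cannot just write,
>>>     // it has to wait for the promise before touching the memory.
>>>     async function writeFrameUniforms(uniformBuffer: any /* GPUBuffer */,
>>>                                       frameUniforms: Float32Array) {
>>>       const arrayBuffer: ArrayBuffer = await uniformBuffer.mapWriteAsync();
>>>       new Float32Array(arrayBuffer).set(frameUniforms);
>>>       uniformBuffer.unmap();
>>>       // Only now can commands that read the buffer be submitted.
>>>     }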
>>>
>>> At least I learnt an important design constraint, and discovered that
>>> it is possible to wrap CPU memory in a GPU resource to optimize things.
>>> Also, I started a HackMD to discuss tradeoffs again
>>> <https://hackmd.io/qWmMfnFVRtyR0Q2HVSagOw?both>. It doesn't have content
>>> yet, but at least it has links to all the proposals if you want to keep
>>> it as a set of bookmarks.
>>>
>>> Cheers,
>>>
>>> Corentin
>>>
>>

Received on Friday, 28 February 2020 16:35:27 UTC