Re: More thoughts on buffer mapping (no improvement on mapAsync sadly)

Whoops, meant to send this to public-gpu.

On Thu, Feb 27, 2020 at 2:40 PM Corentin Wallez <cwallez@google.com> wrote:

> Hey all,
>
> I spent quite some time yesterday trying to understand how alternatives
> like Synchronous mapping #506 <https://github.com/gpuweb/gpuweb/pull/506>
> or Failable mapping #511 <https://github.com/gpuweb/gpuweb/pull/511> could
> be implemented, and which modifications they would need in order to be
> implementable. At the end of the day, an important property became clear
> for implementations that want to minimize copies by sharing memory between
> the content process and the GPU process. Below I'm using terminology from
> the buffer mapping proposals, but the same is true for proposals that
> would introduce new staging buffer objects.
>
> *Accurate tracking in the content process of which mappable buffers are in
> use is not tractable.*
>
> We want to avoid races, so that the data written by JS at the time it
> calls unmap is what gets seen by the GPU until the next mapping operation
> (either a mapSync call or the resolution of a mapAsync promise). Without
> some asynchronous ownership transfer (mapAsync-like), I claim it is not
> tractable to know accurately on the content-process side whether it is
> safe to write to the shared memory region.
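>
> A minimal sketch of this asynchronous handoff, written against roughly the
> shape mapAsync ended up taking in the WebGPU spec (assuming a GPUDevice
> named device and an async context; exact names may differ from the
> proposals as they were written at the time):
>
>     // Writable mapping: the promise resolves only once the GPU process has
>     // handed ownership of the shared memory back to the content process.
>     const buffer = device.createBuffer({
>       size: 256,
>       usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
>     });
>     await buffer.mapAsync(GPUMapMode.WRITE);
>     new Float32Array(buffer.getMappedRange()).set([1, 2, 3, 4]);
>     buffer.unmap(); // hands the written contents back to the GPU process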
>
> The first reason it is not tractable is that we don't know accurately
> whether the queue submits that use the mappable resource have finished.
> While it is "easy" to follow the progress of queues themselves, knowing
> which submits use the resource would require adding duplicated
> content-side tracking to bind groups, encoders, command buffers, and
> bundles so that they know which mappable resources they hold. This would
> be a many-fold increase in the amount of tracking that WebGPU requires on
> the content side.
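>
> For illustration only, a sketch of the kind of duplicated bookkeeping this
> would imply on the content side (all names here are hypothetical and not
> part of any proposal):
>
>     // Every object that can hold a mappable buffer (bind groups, encoders,
>     // command buffers, bundles) would need to carry a set like this one,
>     // and every submit would have to union them and tie them to queue
>     // progress reported back by the GPU process.
>     class TrackedCommandBuffer {
>       readonly mappableBuffers = new Set<GPUBuffer>();
>     }
>
>     class ContentSideTracker {
>       private inUse = new Map<GPUBuffer, number>(); // buffer -> last submit serial
>       private lastSubmitSerial = 0;
>
>       submit(commandBuffers: TrackedCommandBuffer[]): void {
>         const serial = ++this.lastSubmitSerial;
>         for (const cb of commandBuffers) {
>           for (const buffer of cb.mappableBuffers) {
>             this.inUse.set(buffer, serial);
>           }
>         }
>       }
>
>       // Called when the GPU process reports that submits up to `serial` finished.
>       onSerialCompleted(serial: number): void {
>         for (const [buffer, s] of this.inUse) {
>           if (s <= serial) this.inUse.delete(buffer);
>         }
>       }
>
>       isInUse(buffer: GPUBuffer): boolean {
>         return this.inUse.has(buffer);
>       }
>     }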
>
> Second, even the tracking above is not enough for accuracy, because it
> doesn't take error propagation into account. What if one of the commands
> in the submit that uses the mappable resource is an error? Is the resource
> still considered in use by the content side even though the GPU process
> ends up doing nothing? One solution would be to duplicate all the
> validation on the content side, but that's way too expensive, and it still
> can't account for objects that become errors because of OOM.
>
> *Consequences for the existing proposals.*
>
> Synchronous mapping #506 <https://github.com/gpuweb/gpuweb/pull/506> has
> two variants: either the mapping operation blocks when the buffer is in
> use, or a new staging area is returned that will be copied into the buffer
> at a later time. Since accurate tracking isn't possible, WebGPU would
> either do a roundtrip to the GPU process most of the time (i.e. block) or
> introduce an extra copy most of the time.
>
> Failable mapping #511 <https://github.com/gpuweb/gpuweb/pull/511> has the
> mapping operation return null when the buffer is in use. This requires
> knowing accurately when the buffer is in use, which, as argued above, is
> not tractable.
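>
> Again purely for illustration, a hypothetical shape for such an API (the
> method name is invented; see #511 for the actual proposal text):
>
>     // Hypothetical failable mapping: null means "still in use", which the
>     // content process cannot determine accurately on its own.
>     interface FailableMapBuffer extends GPUBuffer {
>       mapWriteSync(): ArrayBuffer | null;
>     }
>
>     declare const uploadBuffer: FailableMapBuffer;
>     const mapping = uploadBuffer.mapWriteSync();
>     if (mapping !== null) {
>       new Uint8Array(mapping).set([1, 2, 3, 4]);
>       uploadBuffer.unmap();
>     } else {
>       // Caller must retry later or fall back to another upload path.
>     }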
>
> In Myles' comment about mapAsync
> <https://github.com/gpuweb/gpuweb/pull/506#issuecomment-590199697>, a
> desirable property would be that the mapAsync promise resolves immediately
> if the buffer is currently not in use. Guaranteeing this at the spec level
> would require accurate tracking of which buffers are in use, so it isn't
> possible: there needs to be at least one roundtrip to the GPU process to
> "lock" the resource.
>
> *Another direction: using APIs that allow wrapping CPU memory into a buffer*
>
> The following APIs can take an OS memory object, or just a pointer, and
> turn it into a GPU resource:
>
>    - ID3D12Device3::OpenExistingHeapFromAddress
>    <https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12device3-openexistingheapfromaddress>
>    - MTLDevice newBufferWithBytesNoCopy:length:options:deallocator:
>    <https://developer.apple.com/documentation/metal/mtldevice/1433382-newbufferwithbytesnocopy?language=objc>
>    - VK_EXT_external_memory_host
>    <https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/vkspec.html#VK_EXT_external_memory_host>
>
> These APIs would allow true zero-copy between JS and the GPU, with staging
> memory allocated synchronously in the content process, via the following
> mechanism (sketched in code after the list):
>
>    - In the content process, allocate memory that is shared between the
>    content process and the GPU process (at least in Chromium that's
>    possible).
>    - Send the shared memory to the GPU process.
>    - In the GPU process, create a new resource by wrapping that shared
>    memory region (or allocate a separate resource if wrapping isn't
>    possible).
>    - In the content process, write the data to the memory, then send an
>    unmap signal to the GPU process.
>    - In the GPU process, if wrapping was not possible, copy from the shmem
>    to the GPU resource.
>    - Profit!
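>
> As a rough sketch of the content-process side of that flow (the IPC and
> shared-memory primitives below are stand-ins, not real browser interfaces):
>
>     // Stand-in for whatever channel the content process uses to talk to
>     // the GPU process.
>     interface GpuProcessChannel {
>       sendSharedMemory(bufferId: number, shmem: SharedArrayBuffer): void;
>       sendUnmap(bufferId: number): void;
>     }
>
>     function writeThroughSharedMemory(
>       channel: GpuProcessChannel,
>       bufferId: number,
>       size: number,
>       fill: (view: Uint8Array) => void,
>     ): void {
>       // 1. Allocate memory shared with the GPU process
>       //    (SharedArrayBuffer used here as a stand-in).
>       const shmem = new SharedArrayBuffer(size);
>       // 2. Send it over; the GPU process wraps it in a native buffer if the
>       //    backend supports host-pointer import, or allocates a separate
>       //    resource otherwise.
>       channel.sendSharedMemory(bufferId, shmem);
>       // 3. Write the data directly into the shared region.
>       fill(new Uint8Array(shmem));
>       // 4. Signal unmap; the GPU process copies only if wrapping failed.
>       channel.sendUnmap(bufferId);
>     }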
>
> An idea I was exploring is having something like mapSync that replaces the
> allocation of a GPUBuffer with a new native buffer via the mechanism
> described above. However, a design constraint we have been operating under
> is that a WebGPU resource is exactly one native API resource, so that
> doesn't work either. (Imagine bind groups baked with the address of the
> buffer; we want to avoid needing dirtying mechanisms.)
>
> *Conclusion*
>
> Like the other times I tried, I wasn't able to come up with a better
> solution than mapAsync. It's the only one that works so far, but the
> asynchrony makes it a bit difficult for people to use, so it would be nice
> to have an alternative.
>
> At least I learnt an important design constraint, and discovered that it
> is possible to wrap CPU memory in a GPU resource to optimize things. Also I
> started a HackMD to discuss tradeoffs again
> <https://hackmd.io/qWmMfnFVRtyR0Q2HVSagOw?both>. It doesn't have content
> but at least it has links to all the proposals if you want to keep it as a
> set of bookmarks.
>
> Cheers,
>
> Corentin
>

Received on Thursday, 27 February 2020 13:41:37 UTC