Re: More thoughts on buffer mapping (no improvement on mapAsync sadly)

On Thu, Feb 27, 2020 at 5:41 AM Corentin Wallez <cwallez@google.com> wrote:

> Whoops, meant to send this to public-gpu.
>
> On Thu, Feb 27, 2020 at 2:40 PM Corentin Wallez <cwallez@google.com>
> wrote:
>
>> Hey all,
>>
>> I spent quite some time yesterday trying to understand how alternatives
>> like Synchronous mapping #506 <https://github.com/gpuweb/gpuweb/pull/506>
>> or Failable mapping #511 <https://github.com/gpuweb/gpuweb/pull/511> could
>> be implemented and which modifications they would need to be implementable.
>> At the end of the day, an important property became clear for
>> implementations looking to minimize copies by using shared memory between
>> the content process and the GPU process. Below I'm using terminology from
>> the buffer mapping proposals, but the same is true for proposals that would
>> introduce new staging buffer objects.
>>
>> *Accurate tracking on the content process of which mappable buffers are
>> in use is not tractable.*
>>
>> We want to avoid races, so that the data written by JS at the time it
>> calls unmap is what gets seen by the GPU until the next mapping operation
>> (either mapSync, or mapAsync promise resolution). Without some asynchronous
>> ownership transfer (mapAsync-like), I claim it is not tractable to know
>> accurately on the content-process side whether it is safe to write to the
>> shared memory region.
>>
>> The first reason why it is not tractable is that we don't know
>> accurately whether queue submits that use the mappable resource are
>> finished. While it is "easy" to follow the progress of queues themselves,
>> knowing which submits use the resource would require adding duplicated
>> content-side tracking in bind groups, encoders, command buffers, and
>> bundles to know which mappable resources they hold. This would be a
>> many-fold increase in the amount of tracking that WebGPU requires on the
>> content side.
>>
>
Is this true for all WebGPU implementations, or Chrome's implementation
specifically? If such tracking were implemented, would it impose an
unacceptable performance overhead? From my relatively uninformed
perspective (I'm not a WebGPU or Dawn expert), conservatively gathering all
of the buffers or other mappable resources referenced by these data
structures during command encoding and queue submission seems tractable.
The tree of data structures that would have to be traversed doesn't seem
that deep. The area that would have to be most optimized would be
communication from the GPU process back to the content process that those
buffers are no longer referenced. Most likely that would have to be handled
with atomic operations in shared memory rather than via IPCs.
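
To make that concrete, here is a minimal sketch of what such conservative
tracking plus a shared-memory release signal might look like. Everything in
it (the names, the fixed-size slot array, one atomic counter per mappable
buffer) is hypothetical and is not how Dawn or any browser actually works;
it only assumes a memory region mapped into both the content and GPU
processes.

// Hypothetical sketch: conservative in-flight tracking for mappable buffers.
#include <atomic>
#include <cstdint>
#include <vector>

struct SharedSlots {
    // Lives in memory shared between the content and GPU processes.
    std::atomic<uint32_t> inFlightSubmits[1024];
};

class MappableBufferClient {
  public:
    MappableBufferClient(SharedSlots* slots, uint32_t slot)
        : slots_(slots), slot_(slot) {}

    // Content process, at queue submit: called once for every mappable
    // buffer conservatively gathered from the command buffers (directly or
    // via bind groups and bundles), whether or not the submit later fails
    // validation.
    void MarkSubmitted() {
        slots_->inFlightSubmits[slot_].fetch_add(1, std::memory_order_relaxed);
    }

    // Content process: conservative "might the GPU still be using this?"
    // check. A single atomic load, no IPC.
    bool MaybeInUse() const {
        return slots_->inFlightSubmits[slot_].load(std::memory_order_acquire) != 0;
    }

  private:
    SharedSlots* slots_;
    uint32_t slot_;
};

// GPU process: when a submit retires (completes on the GPU, or is rejected
// by validation -- the decrement has to happen either way so the
// conservative count converges back to zero).
void OnSubmitRetired(SharedSlots* slots, const std::vector<uint32_t>& slotsUsed) {
    for (uint32_t slot : slotsUsed) {
        slots->inFlightSubmits[slot].fetch_sub(1, std::memory_order_release);
    }
}

The open question to me is whether gathering and marking every referenced
mappable buffer on every submit is cheap enough in practice.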


>> Second, even doing the tracking above is not enough to be accurate,
>> because it doesn't take error propagation into account. What if one of the
>> commands in the submit that uses the mappable resource is an error? Is the
>> resource still considered in use by the content side when the GPU process
>> side does nothing? One solution would be to duplicate all the validation on
>> the content side, but that's way too expensive, and it still couldn't
>> account for objects that are errors because of OOM.
>>
>
Conservatively assuming that the resources are referenced even if the
command will be an error seems fine from the standpoint of writing portable
content. I agree wholeheartedly that the validation should not be
duplicated on the content side, but it seems to me that a conservative
approximation would work well enough. Errors are neither the common nor
expected case.

Hoping that if we can iterate a bit on this discussion, some of the
analyses and conclusions below might be revisited.

-Ken


>> *Consequences for the existing proposals.*
>>
>> Synchronous mapping #506 <https://github.com/gpuweb/gpuweb/pull/506> has
>> two alternatives: either the mapping operation blocks when the buffer is in
>> use, or a new staging area is returned that will be copied into the buffer
>> at a later time. Since accurate tracking isn't possible, either WebGPU will
>> most often do a roundtrip to the GPU process (i.e. block) or it will most
>> often introduce an extra copy.
>>
>> Failable mapping #511 <https://github.com/gpuweb/gpuweb/pull/511> has
>> the mapping operation return null when the buffer is in use. This requires
>> knowing accurately when it is in use, which is not tractable.
>>
>> Myles' comment about mapAsync
>> <https://github.com/gpuweb/gpuweb/pull/506#issuecomment-590199697> suggests
>> that a desirable property would be for the mapAsync promise to resolve
>> immediately if the buffer is currently not in use. Guaranteeing this at the
>> spec level requires accurate tracking of which buffers are in use, so it
>> isn't possible. There needs to be at least one roundtrip to the GPU process
>> to "lock" the resource.
>>
>> *Another direction using APIs that allow wrapping CPU memory into a
>> buffer*
>>
>> The following APIs allow taking an OS memory object, or just a pointer,
>> and turning it into a GPU resource:
>>
>>    - ID3D12Device3::OpenExistingHeapFromAddress
>>    <https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12device3-openexistingheapfromaddress>
>>    - MTLDevice newBufferWithBytesNoCopy:length:options:deallocator:
>>    <https://developer.apple.com/documentation/metal/mtldevice/1433382-newbufferwithbytesnocopy?language=objc>
>>    - VK_EXT_external_memory_host
>>    <https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/vkspec.html#VK_EXT_external_memory_host>
>>
>> These APIs would allow true zero-copy between JS and the GPU, and would
>> allow allocating staging memory in the content process synchronously, via
>> the following mechanism:
>>
>>    - In the content process, allocate shared memory between the content
>>    process and the GPU process (at least in Chromium that's possible).
>>    - Send the shared memory to the GPU process.
>>    - In the GPU process, create a new resource by wrapping that shared
>>    memory region, or allocate a new resource if wrapping isn't possible
>>    (see the Vulkan sketch after this list).
>>    - In the content process, write to the memory, then send an unmap
>>    signal to the GPU process.
>>    - In the GPU process, if wrapping was not possible, copy from the
>>    shmem to the GPU resource.
>>    - Profit!
>>
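
A minimal sketch of the wrapping step from the list above on a Vulkan
backend, assuming VK_EXT_external_memory_host is available and enabled, and
that shmemPtr and size satisfy the implementation's
minImportedHostPointerAlignment. Device setup, memory-type selection via
vkGetMemoryHostPointerPropertiesEXT, and error handling are omitted, and the
function name is made up for illustration.

#include <vulkan/vulkan.h>

VkBuffer WrapSharedMemory(VkDevice device, void* shmemPtr, VkDeviceSize size,
                          uint32_t memoryTypeIndex) {
    // Import the host pointer (the shared memory region) as device memory.
    VkImportMemoryHostPointerInfoEXT importInfo = {};
    importInfo.sType = VK_STRUCTURE_TYPE_IMPORT_MEMORY_HOST_POINTER_INFO_EXT;
    importInfo.handleType =
        VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT;
    importInfo.pHostPointer = shmemPtr;

    VkMemoryAllocateInfo allocInfo = {};
    allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    allocInfo.pNext = &importInfo;
    allocInfo.allocationSize = size;
    // Must be a memory type allowed by
    // vkGetMemoryHostPointerPropertiesEXT(...).memoryTypeBits.
    allocInfo.memoryTypeIndex = memoryTypeIndex;

    VkDeviceMemory memory = VK_NULL_HANDLE;
    vkAllocateMemory(device, &allocInfo, nullptr, &memory);

    // Create a buffer that may be bound to imported host memory.
    VkExternalMemoryBufferCreateInfo externalInfo = {};
    externalInfo.sType = VK_STRUCTURE_TYPE_EXTERNAL_MEMORY_BUFFER_CREATE_INFO;
    externalInfo.handleTypes =
        VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT;

    VkBufferCreateInfo bufferInfo = {};
    bufferInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
    bufferInfo.pNext = &externalInfo;
    bufferInfo.size = size;
    bufferInfo.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT |
                       VK_BUFFER_USAGE_TRANSFER_DST_BIT;
    bufferInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

    VkBuffer buffer = VK_NULL_HANDLE;
    vkCreateBuffer(device, &bufferInfo, nullptr, &buffer);
    vkBindBufferMemory(device, buffer, memory, 0);
    return buffer;
}

If the import fails, or the alignment requirements can't be met, the
fallback is the extra copy from the shmem into an ordinary buffer, as in the
last step of the list.
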
>> An idea I was exploring is having something like mapSync that replaces
>> the allocation of a GPUBuffer with a new native buffer via the mechanism
>> described above. However, a design constraint we have been operating under
>> is that a WebGPU resource is exactly one native API resource, so that
>> doesn't work either. (Imagine we baked bind groups with the address of the
>> buffer; we want to avoid needing dirtying mechanisms.)
>>
>> *Conclusion*
>>
>> Like the other times I tried, I wasn't able to come up with a better
>> solution than mapAsync. It's the only one that works so far, but the
>> asynchrony makes it a bit difficult for people to use, so it'd be nice to
>> have an alternative.
>>
>> At least I learnt an important design constraint, and discovered that it
>> is possible to wrap CPU memory in a GPU resource to optimize things. Also I
>> started a HackMD to discuss tradeoffs again
>> <https://hackmd.io/qWmMfnFVRtyR0Q2HVSagOw?both>. It doesn't have content
>> but at least it has links to all the proposals if you want to keep it as a
>> set of bookmarks.
>>
>> Cheers,
>>
>> Corentin
>>
>

Received on Friday, 28 February 2020 00:33:02 UTC