- From: Corentin Wallez <cwallez@google.com>
- Date: Fri, 28 Feb 2020 17:35:02 +0100
- To: Ken Russell <kbr@google.com>
- Cc: internal-gpu <internal-gpu@w3.org>, public-gpu <public-gpu@w3.org>
- Message-ID: <CAGdfWNM1NztAA8o=Zduus3-+++jcbVJeUY1VQQbgBUgEe5ba5g@mail.gmail.com>
On Fri, Feb 28, 2020 at 1:32 AM Ken Russell <kbr@google.com> wrote:

> On Thu, Feb 27, 2020 at 5:41 AM Corentin Wallez <cwallez@google.com> wrote:

>> Whoops, meant to send this to public-gpu.

>> On Thu, Feb 27, 2020 at 2:40 PM Corentin Wallez <cwallez@google.com> wrote:

>>> Hey all,

>>> I spent quite some time yesterday trying to understand how alternatives like Synchronous mapping #506 <https://github.com/gpuweb/gpuweb/pull/506> or Failable mapping #511 <https://github.com/gpuweb/gpuweb/pull/511> could be implemented and which modifications they would need in order to be implementable. At the end of the day an important property became clear for implementations looking to minimize copies by using shared memory between the content process and the GPU process. Below I'm using terminology from the buffer mapping proposals, but the same is true for proposals that would introduce new staging buffer objects.

>>> *Accurate tracking on the content process of which mappable buffers are in use is not tractable.*

>>> We want to avoid races, so that the data written by JS at the time it calls unmap is what gets seen by the GPU until the next mapping operation (either mapSync, or mapAsync promise resolution). Without some asynchronous ownership transfer (mapAsync-like), I claim it is not tractable to know accurately on the content-process side whether it is safe to write to the shared memory region.

>>> The first reason it is not tractable is that we don't know accurately whether queue submits that use the mappable resource have finished. While it is "easy" to follow the progress of queues themselves, knowing which submits use the resource would require adding duplicated content-side tracking in bind groups, all encoders, command buffers, and bundles to know which mappable resources they hold. This would be a many-fold increase in the amount of tracking that WebGPU requires on the content side.

> Is this true for all WebGPU implementations, or Chrome's implementation specifically? If such tracking were implemented, would it impose an unacceptable performance overhead? From my relatively uninformed perspective (I'm not a WebGPU or Dawn expert), conservatively gathering all of the buffers or other mappable resources referenced by these data structures during command encoding and queue submission seems tractable. The tree of data structures that would have to be traversed doesn't seem that deep. The area that would have to be most optimized would be the communication from the GPU process back to the content process that those buffers are no longer referenced. Most likely that would have to be handled with atomic operations in shared memory rather than via IPCs.

This would be true to varying degrees for all multiprocess implementations. For example, at the moment in Chromium the content-side data for a GPUBindGroup is just a pointer handle that itself contains two uint32_t handles, and that's the case for most other WebGPU Blink-side objects. It's nice to be able to keep that if possible. The content side can already see progress on the GPU, without the need for atomics, through GPUFence signaling, which would also help make the effects of a buffer no longer being in use happen at the point where the application can observe the completion of the GPU work.
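For concreteness, here is a minimal sketch of what such conservative content-side tracking could look like, assuming each submit is tagged with a monotonically increasing fence value whose completion eventually becomes observable on the content side (e.g. through GPUFence signaling). The class and its methods are hypothetical illustrations, not how Dawn, Chromium, or any other implementation actually works.

```ts
// Hypothetical sketch only: conservative content-side tracking of "possibly in use"
// mappable buffers, released when a fence value is known to have completed.
class MappableBufferTracker {
  // buffer -> highest fence value of a submit that may still be using it
  private inUseUntil = new Map<GPUBuffer, number>();
  private completedValue = 0; // highest fence value known to be complete

  // Called at queue.submit() time with the conservatively gathered set of mappable
  // buffers referenced by the submitted command buffers (bind groups, copies, ...).
  markSubmitted(buffers: Iterable<GPUBuffer>, fenceValue: number): void {
    for (const buffer of buffers) {
      this.inUseUntil.set(buffer, fenceValue);
    }
  }

  // Called when completion of `fenceValue` becomes observable on the content side.
  onFenceCompleted(fenceValue: number): void {
    this.completedValue = Math.max(this.completedValue, fenceValue);
    for (const [buffer, lastUse] of this.inUseUntil) {
      if (lastUse <= this.completedValue) {
        this.inUseUntil.delete(buffer);
      }
    }
  }

  // Conservative: may report "in use" for a buffer the GPU has already finished with
  // (a false negative for mappability), but never the contrary.
  maybeInUse(buffer: GPUBuffer): boolean {
    return this.inUseUntil.has(buffer);
  }
}
```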
>>> Second, even doing the tracking above is not enough for accurate tracking, because it doesn't take error propagation into account. What if one of the commands in the submit that uses the mappable resource is an error? Is the resource still considered in use by the content side when the GPU-process side does nothing? One solution would be to duplicate all the validation on the content side, but that's way too expensive, and it can't help with error objects that come from OOM.

> Conservatively assuming that the resources are referenced even if the command will be an error seems fine from the standpoint of writing portable content. I agree wholeheartedly that the validation should not be duplicated on the content side, but it seems to me that a conservative approximation would work well enough. Errors are neither the common nor the expected case.

> Hoping that if we can iterate a bit on this discussion, some of the analyses and conclusions below might be able to be revisited.

It's correct that the only path forward that's different from mapAsync is one where there is a conservative estimation of whether a buffer can be mapped, one that only has false negatives (i.e. the buffer can be considered in use when it is not, but not the contrary).

> -Ken

>>> *Consequences for the existing proposals.*

>>> Synchronous mapping #506 <https://github.com/gpuweb/gpuweb/pull/506> has two alternatives: either the mapping operation blocks when the buffer is in use, or a new staging area is returned that will be copied into the buffer at a later time. Since accurate tracking isn't possible, WebGPU would either most often do a roundtrip to the GPU process (i.e. block) or most often introduce an extra copy.

>>> Failable mapping #511 <https://github.com/gpuweb/gpuweb/pull/511> has the mapping operation return null when the buffer is in use. This requires knowing accurately when it is in use, which is not tractable.

>>> Myles' comment about mapAsync <https://github.com/gpuweb/gpuweb/pull/506#issuecomment-590199697> suggests that a desirable property would be for the mapAsync promise to resolve immediately if the buffer is currently not in use. Guaranteeing this at the spec level requires accurate tracking of which buffers are in use, so it isn't possible. There needs to be at least one roundtrip to the GPU process to "lock" the resource.
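To make the asynchronous ownership transfer concrete, here is a minimal sketch of the flow mapAsync implies for an application, written against roughly the shape the API later converged on (mapAsync / getMappedRange / unmap); the exact names and signatures were still under discussion at the time of this thread.

```ts
// Minimal sketch of the mapAsync ownership-transfer flow, assuming a staging buffer
// created with MAP_WRITE | COPY_SRC usage. Names are illustrative for this timeframe.
async function uploadOnce(
  device: GPUDevice,
  queue: GPUQueue,
  staging: GPUBuffer,
  data: Float32Array, // assumed to exactly fill the staging buffer
): Promise<void> {
  // Ownership of the memory moves to JS only when this promise resolves. If previously
  // submitted work still uses the buffer, resolution waits for it (and for the roundtrip
  // to the GPU process that "locks" the resource).
  await staging.mapAsync(GPUMapMode.WRITE);

  // Safe to write: nothing else can observe this memory until unmap().
  new Float32Array(staging.getMappedRange()).set(data);

  // Ownership moves back to the GPU: whatever is in the mapping at unmap() time is
  // exactly what subsequent submits see, which is the race-freedom property above.
  staging.unmap();

  const encoder = device.createCommandEncoder();
  // ... encode copies or passes that read from `staging` ...
  queue.submit([encoder.finish()]);
}
```

The point of the model is that the only write window is between promise resolution and unmap, so the content process needs no accurate tracking of GPU progress beyond resolving the promise.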
>>> *Another direction using APIs that allow wrapping CPU memory into a buffer*

>>> The following APIs allow taking an OS memory object, or just a pointer, and turning it into a GPU resource:

>>>    - ID3D12Device3::OpenExistingHeapFromAddress <https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12device3-openexistingheapfromaddress>
>>>    - MTLDevice newBufferWithBytesNoCopy:length:options:deallocator: <https://developer.apple.com/documentation/metal/mtldevice/1433382-newbufferwithbytesnocopy?language=objc>
>>>    - VK_EXT_external_memory_host <https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/vkspec.html#VK_EXT_external_memory_host>

>>> These APIs would allow true zero-copy between JS and the GPU, and would allow allocating staging memory in the content process synchronously, via the following mechanism:

>>>    - In the content process, allocate shared memory between the content process and the GPU process (at least in Chromium that's possible).
>>>    - Send the shared memory to the GPU process.
>>>    - In the GPU process, create a new resource by wrapping that shared memory region (or allocate a new resource if that's not possible).
>>>    - In the content process, write to the memory, then send an unmap signal to the GPU process.
>>>    - In the GPU process, if wrapping was not possible, copy from the shmem to the GPU resource.
>>>    - Profit!

>>> An idea I was exploring is having something like mapSync that replaces the allocation of a GPUBuffer with a new native buffer via the mechanism described above. However, a design constraint we have been operating with is that a WebGPU resource is exactly one native API resource, so that doesn't work either. (Imagine we baked bind groups with the address of the buffer; we want to avoid needing dirtying mechanisms.)

>>> *Conclusion*

>>> Like the other times I tried, I wasn't able to come up with a better solution than mapAsync. It's the only one that works so far, but the asynchrony makes it a bit difficult for people to use, so it would be nice to have an alternative.

>>> At least I learnt an important design constraint, and discovered that it is possible to wrap CPU memory in a GPU resource to optimize things. I also started a HackMD to discuss tradeoffs again <https://hackmd.io/qWmMfnFVRtyR0Q2HVSagOw?both>. It doesn't have content yet, but at least it has links to all the proposals if you want to keep it as a set of bookmarks.

>>> Cheers,

>>> Corentin
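For reference, the shared-memory mechanism in the bullet list above can be pictured roughly as follows. Every interface here (SharedMemoryRegion, GpuProcessIpc) is a hypothetical stand-in for a browser's real shared-memory and IPC/wire layers; only the message flow is shown, and the native wrapping APIs are referenced in comments only.

```ts
// Schematic of the message flow only; the interfaces are entirely hypothetical.
interface SharedMemoryRegion {
  bytes: Uint8Array;       // content-process view of the shared memory
  handleForIpc(): number;  // transferable handle for the GPU process
}

interface GpuProcessIpc {
  // Steps 2-3: send the shmem and ask the GPU process to back `bufferId` with it,
  // wrapping it as a native resource when the backend allows it
  // (OpenExistingHeapFromAddress / newBufferWithBytesNoCopy / VK_EXT_external_memory_host),
  // or allocating a regular resource otherwise.
  bindStagingMemory(bufferId: number, shmemHandle: number): void;

  // Step 5: on unmap, the GPU process copies shmem -> resource only if wrapping
  // was not possible; otherwise the write is already zero-copy.
  unmap(bufferId: number): void;
}

function writeThroughSharedMemory(
  ipc: GpuProcessIpc,
  shmem: SharedMemoryRegion, // step 1: allocated synchronously in the content process
  bufferId: number,
  data: Uint8Array,
): void {
  ipc.bindStagingMemory(bufferId, shmem.handleForIpc());
  shmem.bytes.set(data);     // step 4: write directly in the content process...
  ipc.unmap(bufferId);       // ...then signal unmap to the GPU process
}
```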
Received on Friday, 28 February 2020 16:35:27 UTC