- From: Ken Russell <kbr@google.com>
- Date: Fri, 28 Feb 2020 12:05:48 -0800
- To: Corentin Wallez <cwallez@google.com>, Jeff Gilbert <jgilbert@mozilla.com>, Dzmitry Malyshau <dmalyshau@mozilla.com>
- Cc: internal-gpu <internal-gpu@w3.org>, public-gpu <public-gpu@w3.org>
- Message-ID: <CAMYvS2fYeYYUuMtxx_MSGagZc=8c7TRmJPYEvdfVEYfBy9cjxg@mail.gmail.com>
On Fri, Feb 28, 2020 at 8:35 AM Corentin Wallez <cwallez@google.com> wrote:

> On Fri, Feb 28, 2020 at 1:32 AM Ken Russell <kbr@google.com> wrote:
>
>> On Thu, Feb 27, 2020 at 5:41 AM Corentin Wallez <cwallez@google.com> wrote:
>>
>>> Whoops, meant to send this to public-gpu.
>>>
>>> On Thu, Feb 27, 2020 at 2:40 PM Corentin Wallez <cwallez@google.com> wrote:
>>>
>>>> Hey all,
>>>>
>>>> I spent quite some time yesterday trying to understand how alternatives like Synchronous mapping #506 <https://github.com/gpuweb/gpuweb/pull/506> or Failable mapping #511 <https://github.com/gpuweb/gpuweb/pull/511> could be implemented and which modifications they would need to be implementable. At the end of the day, an important property became clear for implementations looking to minimize copies by using shared memory between the content process and the GPU process. Below I'm using terminology from the buffer mapping proposals, but the same is true for proposals that would introduce new staging buffer objects.
>>>>
>>>> *Accurate tracking on the content process of which mappable buffers are in use is not tractable.*
>>>>
>>>> We want to avoid races so that the data written by JS at the time it calls unmap is what gets seen by the GPU until the next mapping operation (either mapSync, or mapAsync promise resolution). Without some asynchronous ownership transfer (mapAsync-like), I claim it is not tractable to know accurately on the content-process side whether it is safe to write to the shared memory region.
>>>>
>>>> The first reason why it is not tractable is that we don't know accurately whether queue submits that use the mappable resource are finished. While it is "easy" to follow progress of the queues themselves, knowing which submits use the resource would require adding duplicated content-side tracking in bind groups, all encoders, command buffers, and bundles to know which mappable resources they hold. This would be a many-fold increase in the amount of tracking that WebGPU requires on the content side.
>>>
>> Is this true for all WebGPU implementations, or Chrome's implementation specifically? If such tracking were implemented, would it impose an unacceptable performance overhead? From my relatively uninformed perspective (I'm not a WebGPU or Dawn expert), conservatively gathering all of the buffers or other mappable resources referenced by these data structures during command encoding and queue submission seems tractable. The tree of data structures that would have to be traversed doesn't seem that deep. The area that would have to be most optimized would be communication from the GPU process back to the content process that those buffers are no longer referenced. Most likely that would have to be handled with atomic operations in shared memory rather than via IPCs.
>
> This would be true to varying degrees for all multiprocess implementations. For example, at the moment in Chromium the content-side data for a GPUBindGroup is just a pointer handle that itself contains two uint32_t handles, and that's the case for most other WebGPU Blink-side objects. It's nice to be able to keep that if possible.

Understood that it would be nice to keep things simple. The question I'd like answered is: if this tracking were implemented, and if it weren't too costly, would the "Consequences for the existing proposals" below change?
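To make the conservative-tracking suggestion concrete, here's a rough sketch of the shape I have in mind. To be clear, this is not Dawn's (or anyone's) actual code: every name in it (QueueCompletionShm, ContentSideBuffer, OnSubmit) is invented for illustration, and it assumes the GPU process advances a per-queue completion serial in shared memory as submits retire.

```cpp
// Hypothetical content-side tracking; all names are invented for illustration.
#include <atomic>
#include <cstdint>
#include <vector>

// One word of shared memory per queue, written by the GPU process as it
// retires submits and read by the content process without an IPC roundtrip.
struct QueueCompletionShm {
  std::atomic<uint64_t> last_completed_serial{0};
};

struct ContentSideBuffer {
  // Serial of the most recent submit that (conservatively) referenced this
  // buffer; 0 means it has never been submitted.
  uint64_t last_usage_serial = 0;

  bool IsPossiblyInUse(const QueueCompletionShm& queue) const {
    // False negatives only: we may say "in use" for a buffer whose submit
    // failed validation in the GPU process, but never "idle" for a buffer
    // the GPU could still touch.
    return last_usage_serial >
           queue.last_completed_serial.load(std::memory_order_acquire);
  }
};

struct ContentSideCommandBuffer {
  // Filled in during encoding: every mappable buffer referenced directly or
  // through a bind group. Duplicates are fine; this is only a conservative set.
  std::vector<ContentSideBuffer*> referenced_mappable_buffers;
};

// Called on the content side when the application calls queue.submit().
void OnSubmit(uint64_t submit_serial,
              const std::vector<ContentSideCommandBuffer*>& command_buffers) {
  for (ContentSideCommandBuffer* cb : command_buffers) {
    for (ContentSideBuffer* buffer : cb->referenced_mappable_buffers) {
      buffer->last_usage_serial = submit_serial;
    }
  }
}
```

The content side only ever compares two integers per buffer, and the only information flowing back from the GPU process is a monotonically increasing serial, so no per-buffer IPC would be needed.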
It seems to me that, for example, fallible mapping would be a viable possibility in that scenario (a rough sketch of what I mean follows the quoted "Consequences" section below). The analyses below assumed a priori that client-side tracking was intractable.

A follow-on question: how difficult would it really be to add conservative client-side tracking? Could other WebGPU implementers provide their opinions too?

-Ken

> The content side can already see progress on the GPU without the need for atomics, via GPUFence signaling, which would also help make the effects of a buffer being no longer in use happen when the application can observe the completion of the GPU work.
>
>>>> Second, even doing the tracking above is not enough for accurate tracking, because it doesn't take into account error propagation. What if one of the commands in the submit that uses the mappable resource is an error? Is the resource still considered in use by the content side when the GPU process side does nothing? One solution would be to duplicate all the validation on the content side, but that's way too expensive, and it can't help with error objects due to OOM.
>>>
>> Conservatively assuming that the resources are referenced even if the command will be an error seems fine from the standpoint of writing portable content. I agree wholeheartedly that the validation should not be duplicated on the content side, but it seems to me that a conservative approximation would work well enough. Errors are neither the common nor the expected case.
>>
>> Hoping that if we can iterate a bit on this discussion, some of the analyses and conclusions below might be able to be revisited.
>>
>> -Ken
>
> It's correct that the only path forward that's different from mapAsync is one where there is a conservative estimation of whether a buffer is mapped or not, one that only has false negatives (i.e. the buffer can be considered in use when it is not, but not the contrary).
>
>>>> *Consequences for the existing proposals.*
>>>>
>>>> Synchronous mapping #506 <https://github.com/gpuweb/gpuweb/pull/506> has two alternatives: either the mapping operation blocks when the buffer is in use, or a new staging area is returned that will be copied into the buffer at a later time. Since accurate tracking isn't possible, WebGPU will either most often do a roundtrip to the GPU process (i.e. block) or most often introduce an extra copy.
>>>>
>>>> Failable mapping #511 <https://github.com/gpuweb/gpuweb/pull/511> has the mapping operation return null when the buffer is in use. This requires knowing accurately when it is in use, which is not tractable.
>>>>
>>>> Myles' comment about mapAsync <https://github.com/gpuweb/gpuweb/pull/506#issuecomment-590199697> suggests that a desirable property would be for the mapAsync promise to resolve immediately if the buffer is currently not in use. Guaranteeing this at the spec level requires accurate tracking of which buffers are in use, so it isn't possible. There needs to be at least one roundtrip to the GPU process to "lock" the resource.
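[Interjecting here to illustrate my point above about fallible mapping: if the conservative tracking sketched earlier existed, a failable map could simply refuse whenever the buffer might still be in use. As before, every name is invented for illustration; this is not proposal text.]

```cpp
// Hypothetical failable mapping on top of conservative content-side tracking;
// all names are invented for illustration.
#include <atomic>
#include <cstddef>
#include <cstdint>

struct FallibleMappableBuffer {
  void* shmem = nullptr;   // content/GPU shared memory backing the buffer
  size_t size = 0;
  uint64_t last_usage_serial = 0;  // set at submit time, as in the earlier sketch

  // Shared-memory word the GPU process advances as submits retire.
  const std::atomic<uint64_t>* last_completed_serial = nullptr;

  // Returns nullptr whenever the buffer *might* still be in use. Because the
  // tracking only has false negatives, a failure may be spurious, but a
  // success is always safe: the GPU is done with this memory.
  void* TryMapSyncForWriting() const {
    bool possibly_in_use =
        last_usage_serial >
        last_completed_serial->load(std::memory_order_acquire);
    return possibly_in_use ? nullptr : shmem;
  }
};
```

[A spurious failure only costs the application a retry or a fallback to mapAsync; a success never races with the GPU. -Ken]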
>>>> *Another direction using APIs that allow wrapping CPU memory into a buffer*
>>>>
>>>> The following APIs allow taking an OS memory object, or just a pointer, and turning it into a GPU resource:
>>>>
>>>> - ID3D12Device3::OpenExistingHeapFromAddress <https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12device3-openexistingheapfromaddress>
>>>> - MTLDevice newBufferWithBytesNoCopy:length:options:deallocator: <https://developer.apple.com/documentation/metal/mtldevice/1433382-newbufferwithbytesnocopy?language=objc>
>>>> - VK_EXT_external_memory_host <https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/vkspec.html#VK_EXT_external_memory_host>
>>>>
>>>> These APIs would allow true zero-copy between JS and the GPU, and allocating staging memory in the content process synchronously, via the following mechanism:
>>>>
>>>> - In the content process, allocate shared memory between the content process and the GPU process (at least in Chromium that's possible).
>>>> - Send the shared memory to the GPU process.
>>>> - In the GPU process, create a new resource by wrapping that shared memory region (or allocating a new resource if wrapping isn't possible).
>>>> - In the content process, write to the memory, then send an unmap signal to the GPU process.
>>>> - In the GPU process, if wrapping was not possible, copy from the shmem to the GPU resource.
>>>> - Profit!
>>>>
>>>> An idea I was exploring is having something like mapSync that replaces the allocation of a GPUBuffer with a new native buffer via the mechanism described above. However, a design constraint we have been operating with is that a WebGPU resource is exactly one native API resource, so that doesn't work either. (Imagine we baked bind groups with the address of the buffer; we want to avoid needing dirtying mechanisms.)
>>>>
>>>> *Conclusion*
>>>>
>>>> Like the other times I tried, I wasn't able to come up with a better solution than mapAsync. It's the only one that works so far, but the asynchrony makes it a bit difficult for people to use, so it'd be nice to have an alternative.
>>>>
>>>> At least I learnt an important design constraint, and discovered that it is possible to wrap CPU memory in a GPU resource to optimize things. Also, I started a HackMD to discuss the tradeoffs again <https://hackmd.io/qWmMfnFVRtyR0Q2HVSagOw?both>. It doesn't have content yet, but at least it has links to all the proposals if you want to keep it as a set of bookmarks.
>>>>
>>>> Cheers,
>>>>
>>>> Corentin
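P.S. For anyone less familiar with the host-memory import paths Corentin lists above, here is roughly what the Vulkan flavor (VK_EXT_external_memory_host) looks like. This is only an illustrative sketch: the wrapper function name is made up, and error handling, the alignment requirement from minImportedHostPointerAlignment, and the memory-type query via vkGetMemoryHostPointerPropertiesEXT are all elided. -Ken

```cpp
// Illustrative sketch: import an existing host (shared-memory) allocation as
// the backing store of a VkBuffer using VK_EXT_external_memory_host.
#include <vulkan/vulkan.h>

VkBuffer WrapSharedMemoryAsBuffer(VkDevice device, void* shmem_ptr,
                                  VkDeviceSize size, uint32_t memory_type_index,
                                  VkDeviceMemory* out_memory) {
  // The buffer itself is ordinary; the external-memory part lives on the
  // allocation it will be bound to.
  VkExternalMemoryBufferCreateInfo external_info = {
      VK_STRUCTURE_TYPE_EXTERNAL_MEMORY_BUFFER_CREATE_INFO};
  external_info.handleTypes =
      VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT;

  VkBufferCreateInfo buffer_info = {VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO};
  buffer_info.pNext = &external_info;
  buffer_info.size = size;
  buffer_info.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;
  buffer_info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

  VkBuffer buffer = VK_NULL_HANDLE;
  vkCreateBuffer(device, &buffer_info, nullptr, &buffer);

  // Import the host pointer instead of allocating fresh device memory.
  // shmem_ptr must satisfy minImportedHostPointerAlignment.
  VkImportMemoryHostPointerInfoEXT import_info = {
      VK_STRUCTURE_TYPE_IMPORT_MEMORY_HOST_POINTER_INFO_EXT};
  import_info.handleType =
      VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT;
  import_info.pHostPointer = shmem_ptr;

  VkMemoryAllocateInfo alloc_info = {VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO};
  alloc_info.pNext = &import_info;
  alloc_info.allocationSize = size;
  alloc_info.memoryTypeIndex = memory_type_index;

  vkAllocateMemory(device, &alloc_info, nullptr, out_memory);
  vkBindBufferMemory(device, buffer, *out_memory, 0);
  return buffer;
}
```

If the import isn't supported (or the alignment can't be met), the GPU process falls back to an ordinary allocation plus a copy at unmap time, which is exactly the "if wrapping was not possible" step in the list above.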
Received on Friday, 28 February 2020 20:06:17 UTC