Re: More thoughts on buffer mapping (no improvement on mapAsync sadly)

I agree with Ken that relating resources to queue submits in the web process adds only a trivial amount of complexity.

However, after thinking about this for a while, I believe:

- any architecture which uses a GPU process cannot have truly synchronous maps, because the mapping operation has to be performed by the GPU process, which requires a round trip. Making that round trip synchronous would hang the main thread.

- if map operations are going to be asynchronous, they can't use fences to encapsulate the asynchronicity, because fences are the wrong level of granularity. Consider a world where we have multiple queues. Which queue(s) would you issue the fence on to know whether the resource is ready for mapping? It can't be "all the queues" because that's crazy, it can't be "none of the queues" because that's not asynchronous, and it can't be "the queues which are using the resource" because that might be 0 queues. There is no good answer; the sketch just below makes the ambiguity concrete.
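With a hypothetical multi-queue API (queueA, queueB, and fence are all invented for illustration), the ambiguity looks like this:

    // Assumes WebGPU typings of the era that still include GPUFence.
    declare const queueA: GPUQueue, queueB: GPUQueue, fence: GPUFence;
    // Which signal is supposed to tell us the buffer is mappable?
    queueA.signal(fence, 1); // the buffer may never touch queueA...
    queueB.signal(fence, 1); // ...or queueB;
    // and if no queue is using the buffer at all, there is nothing to
    // signal on and therefore nothing to wait for.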

Instead, the asynchronicity needs to be handled at the resource granularity. A method on the resource which returns a promise (which is what we have today in mapWriteAsync()) is the right granularity.
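For illustration, here is roughly what that shape looks like from JS (a minimal sketch, assuming the mapWriteAsync() signature we have today):

    declare const device: GPUDevice;
    const buffer = device.createBuffer({
      size: 256,
      usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
    });
    // The promise is scoped to this one resource: it resolves once the
    // GPU process says this buffer is safe to map, with no fence or
    // queue in sight.
    const mapping = await buffer.mapWriteAsync();
    new Float32Array(mapping).set([1, 2, 3]);
    buffer.unmap();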

So I think I agree that, for the problem of resource mapping, we can’t do better than what we have today.

—Myles

> On Mar 1, 2020, at 11:57 PM, Myles C. Maxfield <mmaxfield@apple.com> wrote:
> 
>>> On Feb 28, 2020, at 12:06 PM, Ken Russell <kbr@google.com> wrote:
>>> 
>> 
>>> On Fri, Feb 28, 2020 at 8:35 AM Corentin Wallez <cwallez@google.com> wrote:
>> 
>>>> On Fri, Feb 28, 2020 at 1:32 AM Ken Russell <kbr@google.com> wrote:
>>> 
>>>>> On Thu, Feb 27, 2020 at 5:41 AM Corentin Wallez <cwallez@google.com> wrote:
>>>> 
>>>>> Whoops, meant to send this to public-gpu.
>>>>> 
>>>>>> On Thu, Feb 27, 2020 at 2:40 PM Corentin Wallez <cwallez@google.com> wrote:
>>>>>> Hey all,
>>>>>> 
>>>>>> I spent quite some time yesterday trying to understand how alternatives like Synchronous mapping #506 or Failable mapping #511 could be implemented, and what modifications they would need to be implementable. At the end of the day, an important property became clear for implementations looking to minimize copies by using shared memory between the content process and the GPU process. Below I'm using terminology from the buffer mapping proposals, but the same is true for proposals that would introduce new staging buffer objects.
>>>>>> 
>>>>>> Accurate tracking on the content process of which mappable buffers are in use is not tractable.
>>>>>> 
>>>>>> We want to avoid races, so that the data written by JS at the time it calls unmap is what gets seen by the GPU until the next mapping operation (either mapSync, or mapAsync promise resolution). Without some asynchronous ownership transfer (mapAsync-like), I claim it is not tractable to know accurately on the content-process side whether it is safe to write to the shared memory region.
>>>>>> 
>>>>>> The first reason it is not tractable is that we don't accurately know whether queue submits that use the mappable resource have finished. While it is "easy" to follow the progress of queues themselves, knowing which submits use the resource would require adding duplicated content-side tracking to bind groups, all encoders, command buffers, and bundles, so we know which mappable resources they hold. This would be a many-fold increase in the amount of tracking that WebGPU requires on the content side.
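>>>>>> To make that concrete, here is a sketch (illustrative names only, not real Chromium code) of the duplicated bookkeeping this would mean on the content side:
>>>>>> 
>>>>>>     // Every container object would have to remember its mappable buffers...
>>>>>>     class TrackedBindGroup {
>>>>>>       mappableBuffers: Set<GPUBuffer> = new Set();
>>>>>>     }
>>>>>>     class TrackedCommandBuffer {
>>>>>>       mappableBuffers: Set<GPUBuffer> = new Set();
>>>>>>     }
>>>>>>     // ...and every submit would have to merge those sets to mark
>>>>>>     // buffers as in use (and something must clear them later).
>>>>>>     function trackSubmit(inUse: Set<GPUBuffer>,
>>>>>>                          commandBuffers: TrackedCommandBuffer[]) {
>>>>>>       for (const cb of commandBuffers) {
>>>>>>         for (const buffer of cb.mappableBuffers) {
>>>>>>           inUse.add(buffer);
>>>>>>         }
>>>>>>       }
>>>>>>     }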
>>>> 
>>>> Is this true for all WebGPU implementations, or Chrome's implementation specifically? If such tracking were implemented, would it impose an unacceptable performance overhead? From my relatively uninformed perspective (I'm not a WebGPU or Dawn expert), conservatively gathering all of the buffers or other mappable resources referenced by these data structures during command encoding and queue submission seems tractable. The tree of data structures that would have to be traversed doesn't seem that deep. The area that would have to be most optimized is communication from the GPU process back to the content process that those buffers are no longer referenced. Most likely that would have to be handled with atomic operations in shared memory rather than via IPCs.
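>>>> For example (purely illustrative, none of these names are real), the completion signal could be a per-buffer slot in shared memory that the GPU process clears once the last referencing submit finishes:
>>>> 
>>>>     // Hypothetical region shared between the content and GPU processes.
>>>>     declare const sharedRegion: SharedArrayBuffer;
>>>>     const inUseFlags = new Int32Array(sharedRegion);
>>>>     // Content process: check a buffer's slot without any IPC round trip.
>>>>     function isStillReferenced(slot: number): boolean {
>>>>       return Atomics.load(inUseFlags, slot) !== 0;
>>>>     }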
>>>>  
>>> This would be true to varying degrees for all multiprocess implementations. For example, at the moment in Chromium the content-side data for a GPUBindGroup is just a pointer handle that itself contains two uint32_t handles
> 
> I understand that this is an elegant design. However, “theoretical purity” is literally the lowest priority in the Priority of Constituencies https://www.w3.org/TR/html-design-principles/#priority-of-constituencies
> 
>>> , and that's the case for most other WebGPU Blink side objects. It's nice to be able to keep that if possible.
>> 
>> Understood that it would be nice to keep things simple.
>> 
>> The question I'd like answered is - if this tracking were implemented, and if it wasn't too costly - then would the "Consequences for the existing proposals" below change? It seems to me that, for example, fallible mapping would be a viable possibility in that scenario. The analyses below assumed a priori that client-side tracking was intractable.
>> 
>> A follow-on question - how difficult would it really be to add conservative client-side tracking? Could other WebGPU implementers provide their opinions too?
>> 
>> -Ken
>> 
>> 
>>> The content side can already observe progress on the GPU without atomics, via GPUFence signaling; this would also help the effects of a buffer no longer being in use become visible at the point where the application observes completion of the GPU work.
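>>> A sketch with the GPUFence API as it stands (assuming the onCompletion() shape of the current draft):
>>> 
>>>     declare const queue: GPUQueue, commandBuffer: GPUCommandBuffer;
>>>     const fence = queue.createFence();
>>>     queue.submit([commandBuffer]);
>>>     queue.signal(fence, 1);
>>>     // When this resolves, every submit made before the signal has
>>>     // completed, so buffers used only by those submits are idle again.
>>>     await fence.onCompletion(1);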
>>>> 
>>>>>> Second, even the tracking above is not enough to be accurate, because it doesn't take error propagation into account. What if one of the commands in the submit that uses the mappable resource is an error? Is the resource still considered in use by the content side when the GPU process side does nothing? One solution would be to duplicate all the validation on the content side, but that's far too expensive, and it can't help with error objects caused by OOM.
>>>> 
>>>> Conservatively assuming that the resources are referenced even if the command will be an error seems fine from the standpoint of writing portable content. I agree wholeheartedly that the validation should not be duplicated on the content side, but it seems to me that a conservative approximation would work well enough. Errors are neither the common nor expected case.
>>>> 
>>>> I'm hoping that if we can iterate a bit on this discussion, some of the analyses and conclusions below can be revisited.
>>>> 
>>> It's correct that the only path forward that differs from mapAsync is one with a conservative estimate of whether a buffer is in use, erring only on the safe side (i.e. the buffer can be considered in use when it is not, but never the contrary).
>>>  
>>>> -Ken
>>>> 
>>>> 
>>>>>> Consequences for the existing proposals.
>>>>>> 
>>>>>> Synchronous mapping #506 has two alternatives: either the mapping operation blocks while the buffer is in use, or a new staging area is returned that will be copied into the buffer at a later time. Since accurate tracking isn't possible, WebGPU would either most often do a round trip to the GPU process (i.e. block) or most often introduce an extra copy.
>>>>>> 
>>>>>> Failable mapping #511 has the mapping operation return null when the buffer is in use. This requires accurately knowing when the buffer is in use, which is not tractable.
>>>>>> 
>>>>>> Myles commented on mapAsync that a desirable property would be for the mapAsync promise to resolve immediately when the buffer is not currently in use. Guaranteeing this at the spec level requires accurate tracking of which buffers are in use, so it isn't possible: there needs to be at least one round trip to the GPU process to "lock" the resource.
>>>>>> 
>>>>>> Another direction using APIs that allow wrapping CPU memory into a buffer
>>>>>> 
>>>>>> The following APIs allow taking an OS memory object, or just a pointer, and turning it into a GPU resource:
>>>>>> - ID3D12Device3::OpenExistingHeapFromAddress
>>>>>> - MTLDevice newBufferWithBytesNoCopy:length:options:deallocator:
>>>>>> - VK_EXT_external_memory_host
>>>>>> These APIs would allow true zero-copy between JS and the GPU, and synchronous allocation of staging memory in the content process, via the following mechanism (a content-side sketch follows the list):
>>>>>> 1. In the content process, allocate memory shared between the content process and the GPU process (at least in Chromium that's possible).
>>>>>> 2. Send the shared memory to the GPU process.
>>>>>> 3. In the GPU process, create a new resource by wrapping that shared memory region (or allocate a new resource if wrapping isn't possible).
>>>>>> 4. In the content process, write to the memory, then send an unmap signal to the GPU process.
>>>>>> 5. In the GPU process, if wrapping was not possible, copy from the shmem to the GPU resource.
>>>>>> 6. Profit!
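>>>>>> Here is the content-process side of that flow with hypothetical helpers (allocateSharedMemory, sendToGpuProcess, and the message shapes are all invented for illustration):
>>>>>> 
>>>>>>     declare function allocateSharedMemory(size: number): SharedArrayBuffer;
>>>>>>     declare function sendToGpuProcess(message: object): void;
>>>>>>     declare const data: Uint8Array;
>>>>>> 
>>>>>>     // Steps 1-3: allocate shmem and ask the GPU process to wrap it.
>>>>>>     const shmem = allocateSharedMemory(data.byteLength);
>>>>>>     sendToGpuProcess({ op: "wrapAsBuffer", shmem });
>>>>>>     // Step 4: write the data, then signal unmap.
>>>>>>     new Uint8Array(shmem).set(data);
>>>>>>     sendToGpuProcess({ op: "unmap" });
>>>>>>     // Step 5 happens in the GPU process: if wrapping was not
>>>>>>     // possible, it copies from the shmem into the real resource.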
>>>>>> An idea I was exploring is having something like mapSync that replaces the allocation of a GPUBuffer with a new native buffer via the mechanism described above. However, a design constraint we have been operating under is that a WebGPU resource is exactly one native API resource, so that doesn't work either (imagine we baked bind groups with the address of the buffer; we want to avoid needing dirtying mechanisms).
>>>>>> 
>>>>>> Conclusion
>>>>>> 
>>>>>> Like the other times I tried, I wasn't able to come up with a better solution than mapAsync. It's the only one that works so far, but the asynchrony makes it a bit difficult for people to use, so it would be nice to have an alternative.
>>>>>> 
>>>>>> At least I learnt an important design constraint, and discovered that it is possible to wrap CPU memory in a GPU resource to optimize things. I also started a HackMD to discuss tradeoffs again. It doesn't have content yet, but at least it has links to all the proposals if you want to keep it as a set of bookmarks.
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Corentin

Received on Monday, 2 March 2020 08:13:01 UTC