Re: More thoughts on buffer mapping (no improvement on mapAsync sadly) from Myles C. Maxfield on 2020-03-03 (public-gpu@w3.org from March 2020)

From: Myles C. Maxfield <mmaxfield@apple.com>
Date: Tue, 03 Mar 2020 10:23:26 -0800
To: Dzmitry Malyshau <dmalyshau@mozilla.com>
Cc: Ken Russell <kbr@google.com>, Corentin Wallez <cwallez@google.com>, Jeff Gilbert <jgilbert@mozilla.com>, internal-gpu <internal-gpu@w3.org>, public-gpu <public-gpu@w3.org>
Message-id: <0471B88C-9763-40D1-BD3D-B1751ED96778@apple.com>
> On Mar 2, 2020, at 8:59 AM, Dzmitry Malyshau <dmalyshau@mozilla.com> wrote:
> 
> 
> Myles,
> 
> (comment is inlined)
> 
> 
> 
> On 3/2/20 11:46 AM, Myles C. Maxfield wrote:
>> 
>> 
>>> On Mar 2, 2020, at 7:35 AM, Dzmitry Malyshau <dmalyshau@mozilla.com> wrote:
>>> 
>>> 
>>> Hi Myles,
>>> 
>>> It's an important point to see how multi-queue will interact with it, and I agree relying on fences makes it complicated.
>>> 
>>> What I think is missing from the discussion is some concrete use-cases that would be solved by one or the other solution. One such use-case we've been seeing in multiple ISV code bases is progressively filling up a uniform buffer with data from the CPU. Supposing the buffer is big enough, the user would go through the entities, for each they would append the uniform data to the mapped buffer and bind it with the dynamic offset in the command buffer. For the next frame, the user would use the same buffer, just continue writing data as if the buffer was circular.
>>> 
>>> Current mapWriteAsync doesn't provide a good solution to this case. First, because the user would have to keep the whole buffer mapped, and thus our implementations would need to consider the whole buffer to be changed each frame. Second, because the user would need to rotate multiple buffers, and thus would duplicate any bind groups that contain them, if they want minimal number of copies. Alternatively, they can rotate only staging buffers and issue copyBufferToBuffer into the actual GPU uniform buffer each frame, but this approach has the extra copy, more memory, and all the complexity of buffer updates in it.
>>> 
>>> The "guaranteed" mapping proposals appear to address this case better. The user would map the uniform buffer either for each entity, or for some chunks of memory. The implementations would have to guarantee that using specifically aligned offsets and checking for fences (in case of single queue) guarantees the mapping to be resolved. What's not clear to me is how many JS objects would be required to be created. For example, https://github.com/gpuweb/gpuweb/pull/511 returns an ArrayBuffer for every map(), and in practice they would have to create some typed views on top in order to fill in the contents. I suppose the user could map larger chunks at a time, theoretically, but it's certainly not the pit of success here.
>>> 
>>> Finally, the `Queue.writeToBuffer` would naturally have the user providing sub-slices of their array buffers to update the relevant uniform buffer data for each entity. I don't think there is necessarily even an extra copy involved: the user could keep the uniform data around per entity on the CPU (in some ArrayBuffer objects), assuming that not all the data for all the objects change per frame. The arguable downside here is that the user wouldn't know whether the data is getting written to the destination buffer directly, or placed in the staging area internally.
>>> 
>>> > any architecture which utilizes a GPU process must necessarily not have truly synchronous maps, because the mapping operation has to be done by the GPU process, which necessitates a round trip.
>>> 
>>> This isn't necessarily true. Unless you are running on D3D11, the GPU resource can be permanently mapped, so no preparation is required for using the map.
>>> 
>> 
>> Mapped buffers are populated with “pinned” memory pages in the kernel, which represents pages which the kernel can’t manage by relocating, etc. It would be a bad idea to unconditionally pin all pages from all buffers for all time. In general, WebKit tries to minimize the amount of pinned pages it uses.
> 
> We are only talking about mappable buffers, not *all* the buffers. Moreover, createBufferMapped(), if it's here to stay, could be specified as one-time mappable, so you'd unpin the relevant memory pages after the initial mapping. So this leaves just the frequently mapped buffers to be pinned.
> 
I think this is still “all buffers with a mapping usage flag” which is still too much memory to unconditionally pin for the lifetime of the entire document.

Also, there’s some context which may have been lost here: WebKit has started work on a GPU process (in the open source repository; this is public information).

>> 
>>> Saying that, I'm also recalling VK_ACCESS_HOST_WRITE_BIT and VK_ACCESS_HOST_READ_BIT, which need to be set in the access flags within pipeline barriers in order for the CPU to safely read from the buffer or write to it. We could always set them for the mappable buffers in the pipeline barriers at the end of submits. It's something to keep in mind.
>>> 
>>> Thanks,
>>> 
>>> Dzmitry
>>> 
>>> On 3/2/20 3:20 AM, Myles C. Maxfield wrote:
>>>> 
>>>> 
>>>>> On Mar 2, 2020, at 12:12 AM, Myles C. Maxfield <mmaxfield@apple.com> wrote:
>>>>> 
>>>>> 
>>>>> I agree with Ken that relating resources to queue submits in the web process is roughly a trivial amount of complexity.
>>>>> 
>>>>> However, after thinking about this for a while, I believe:
>>>>> 
>>>>> - any architecture which utilizes a GPU process must necessarily not have truly synchronous maps, because the mapping operation has to be done by the GPU process, which necessitates a round trip. This round trip shouldn’t be synchronous because that would hang the main thread.
>>>>> 
>>>>> - if map operations are going to be asynchronous, they can’t use fences to encapsulate the asynchronicity, because they are the wrong level of granularity. Consider a world where we have multi queue. Which queue(s) would you issue the fence to to know whether the resource is ready for mapping? It can’t be “all the queues” because that’s crazy, it can’t be “none of the queues” because that’s not asynchronous, and it can’t be “the queues which are using the resource” because that might be 0 queues. There is no good answer.
>>>> 
>>>> And putting “map()” on the queue itself makes no sense, because how does the queue know when the mapping operation is safe to perform? The resource might be in use on other queues. We shouldn’t make one queue operation block on the progress of every other queue.
>>>> 
>>>>> 
>>>>> Instead, the asynchronicity needs to be handled at the resource granularity. A method on the resource which returns a promise (which is what we have today in mapWriteAsync()) is the right granularity.
>>>>> 
>>>>> So I think I agree that, for the problem of resource mapping, we can’t do better than what we have today.
>>>>> 
>>>>> —Myles
>>>>> 
>>>>>> On Mar 1, 2020, at 11:57 PM, Myles C. Maxfield <mmaxfield@apple.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Feb 28, 2020, at 12:06 PM, Ken Russell <kbr@google.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Feb 28, 2020 at 8:35 AM Corentin Wallez <cwallez@google.com> wrote:
>>>>>>>> On Fri, Feb 28, 2020 at 1:32 AM Ken Russell <kbr@google.com> wrote:
>>>>>>>>>> On Thu, Feb 27, 2020 at 5:41 AM Corentin Wallez <cwallez@google.com> wrote:
>>>>>>>>>>> Whoops, meant to send this to public-gpu.
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Feb 27, 2020 at 2:40 PM Corentin Wallez <cwallez@google.com> wrote:
>>>>>>>>>>>> Hey all,
>>>>>>>>>>>> 
>>>>>>>>>>>> I spent quite some time yesterday trying to understand how alternatives like Synchronous mapping #506 or Failable mapping #511 could be implemented and which modifications they would need to be implementable. At the end of the day an important property became clear for implementations looking to minimize copies using shared memory between the content process and the GPU process. Below I'm using terminology from buffer mapping proposals but the same is true for proposals that would introduce new staging buffer objects.
>>>>>>>>>>>> 
>>>>>>>>>>>> Accurate tracking on the content process of which mappable buffers are in use is not tractable.
>>>>>>>>>>>> 
>>>>>>>>>>>> We want to avoid races so that the data written by JS at the time it calls unmap is what gets seen by the GPU until the next mapping operation (either mapSync, or mapAsync promise resolution). Without some asynchronous ownership transfer (mapAsync like) I claim it is not tractable to know accurately on the content-process side whether it is safe to write to the shared memory region.
>>>>>>>>>>>> 
>>>>>>>>>>>> The first reason why it is not tractable, is that we don't know accurately if queue submits that use the mappable resource are finished. While it is "easy" to follow progress of queues themselves, knowing which submits use the resource would require adding duplicated content-side tracking in bindgroups, all encoders, command buffers, bundles to know what mappable resources they hold. This would be a many-fold increase in the amount of tracking that WebGPU requires on the content side.
>>>>>>>>>> 
>>>>>>>>>> Is this true for all WebGPU implementations, or Chrome's implementation specifically? If such tracking were implemented, would it impose an unacceptable performance overhead? From my relatively uninformed perspective (I'm not a WebGPU or Dawn expert), conservatively gathering all of the buffers or other mappable resources referenced by these data structures during command encoding and queue submission seems tractable. The tree of data structures that would have to be traversed doesn't seem that deep. The area that would have to be most optimized would be communication from the GPU process back to the content process that those buffers are no longer referenced. Most likely that would have to be handled with atomic operations in shared memory rather than via IPCs.
>>>>>>>>>  
>>>>>>>> This would be true to varying degrees for all multiprocess implementations. For example at the moment in Chromium the content side data for a GPUBindGroup is just a pointer handle that itself contains two uint32_t handles
>>>>>> 
>>>>>> I understand that this is an elegant design. However, “theoretical purity” is literally the lowest priority in the Priority of Constituencies https://www.w3.org/TR/html-design-principles/#priority-of-constituencies
>>>>>> 
>>>>>>>> , and that's the case for most other WebGPU Blink side objects. It's nice to be able to keep that if possible.
>>>>>>> 
>>>>>>> Understood that it would be nice to keep things simple.
>>>>>>> 
>>>>>>> The question I'd like answered is - if this tracking were implemented, and if it wasn't too costly - then would the "Consequences for the existing proposals" below change? It seems to me that, for example, fallible mapping would be a viable possibility in that scenario. The analyses below assumed a priori that client-side tracking was intractable.
>>>>>>> 
>>>>>>> A follow-on question - how difficult would it really be to add conservative client-side tracking? Could other WebGPU implementers provide their opinions too?
>>>>>>> 
>>>>>>> -Ken
>>>>>>> 
>>>>>>> 
>>>>>>>> The content-side can already see progress on the GPU without the need for atomics with GPUFence signaling which would also help make the effects of a buffer being no-longer in use happen when the application can observe the completion of the GPU work.
>>>>>>>>> 
>>>>>>>>>>> Second, even doing the tracking above is not enough for accurate tracking because it doesn't take into account error propagation. What if one of the commands in the submit that uses the mappable resource is an error? Is the resource still considered in use by the content side when the GPU process side does nothing? One solution would be to duplicate all the validation on the content side, but that's way too expensive, and can't help with error objects due to OOM.
>>>>>>>>> 
>>>>>>>>> Conservatively assuming that the resources are referenced even if the command will be an error seems fine from the standpoint of writing portable content. I agree wholeheartedly that the validation should not be duplicated on the content side, but it seems to me that a conservative approximation would work well enough. Errors are neither the common nor expected case.
>>>>>>>>> 
>>>>>>>>> Hoping that if we can iterate a bit on this discussion, that some of the analyses and conclusions below might be able to be revisited.
>>>>>>>>> 
>>>>>>>> It's correct that the only path forward that's different than mapAsync is one where there is a conservative estimation of whether a buffer is mapped or not, that only has false negatives (i.e. the buffer can be considered in use when it is not, but not the contrary).
>>>>>>>>  
>>>>>>>>> -Ken
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>>> Consequences for the existing proposals.
>>>>>>>>>>> 
>>>>>>>>>>> Synchronous mapping #506 has two alternatives: either the mapping operation blocks when the buffer is in use, or a new staging area is returned that will be copied into the buffer at a later time. Since accurate tracking isn't possible, either WebGPU will most often do a roundtrip to the GPU process (aka block) or it will most often introduce an extra copy.
>>>>>>>>>>> 
>>>>>>>>>>> Failable mapping #511 has the mapping operation return null when the buffer is in use. This requires knowing accurately when it is in use and is not tractable.
>>>>>>>>>>> 
>>>>>>>>>>> Myles' comment about mapAsync suggested that a desirable property would be for the mapAsync promise to resolve immediately if the buffer is currently not in use. Guaranteeing this at the spec level requires accurate tracking of which buffers are in use, so it isn't possible. There needs to be at least one roundtrip to the GPU process to "lock" the resource.
>>>>>>>>>>> 
>>>>>>>>>>> Another direction using APIs that allow wrapping CPU memory into a buffer
>>>>>>>>>>> 
>>>>>>>>>>> The following APIs allow taking an OS memory object, or just a pointer, and turning it into a GPU resource:
>>>>>>>>>>> - ID3D12Device3::OpenExistingHeapFromAddress
>>>>>>>>>>> - MTLDevice newBufferWithBytesNoCopy:length:options:deallocator:
>>>>>>>>>>> - VK_EXT_external_memory_host
>>>>>>>>>>> 
>>>>>>>>>>> These APIs would allow true zero-copy between JS and the GPU and allocating staging memory in the content process synchronously via the following mechanisms:
>>>>>>>>>>> 1. In the content process, allocate shared-memory between the content process and GPU process (at least in Chromium that's possible).
>>>>>>>>>>> 2. Send the shared memory to the GPU process.
>>>>>>>>>>> 3. In the GPU process create a new resource by wrapping that shared memory region (or allocating a new resource if it's not possible).
>>>>>>>>>>> 4. In the content process, write to the memory then send an unmap signal to the GPU process.
>>>>>>>>>>> 5. On the GPU process, if wrapping was not possible, copy from the shmem to the GPU resource.
>>>>>>>>>>> 6. Profit!
>>>>>>>>>>> 
>>>>>>>>>>> An idea I was exploring is having something like mapSync that can replace the allocation of a GPUBuffer with a new native buffer via the mechanism described above. However, a design constraint we have been operating with is that a WebGPU resource is exactly a native API resource, so that doesn't work either (imagine we baked bindgroups with the address of the buffer; we want to avoid needing dirtying mechanisms).
>>>>>>>>>>> 
>>>>>>>>>>> Conclusion
>>>>>>>>>>> 
>>>>>>>>>>> Like the other times I tried, I wasn't able to come up with a better solution than mapAsync. It's the only one that works so far but the asynchrony makes it a bit difficult for people to use so it'd be nice to have an alternative.
>>>>>>>>>>> 
>>>>>>>>>>> At least I learnt an important design constraint, and discovered that it is possible to wrap CPU memory in a GPU resource to optimize things. Also I started a HackMD to discuss tradeoffs again. It doesn't have content but at least it has links to all the proposals if you want to keep it as a set of bookmarks.
>>>>>>>>>>> 
>>>>>>>>>>> Cheers,
>>>>>>>>>>> 
>>>>>>>>>>> Corentin
Received on Tuesday, 3 March 2020 18:23:52 UTC
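
The conservative content-side tracking debated in the thread (an estimate that may report a buffer as in use when it is idle, but never the contrary) can be sketched roughly as below. This is only an illustration under stated assumptions: a single queue whose submits complete in order, and a content process that learns of completion by observing fence signals. Every name here (ConservativeUseTracker, recordSubmit, onSerialCompleted, mightBeInUse) is hypothetical and not part of WebGPU or any implementation.

```typescript
// Hypothetical content-side tracker; none of these names are WebGPU API.
// Assumes one queue whose submits complete in submission order.
type BufferId = number;

class ConservativeUseTracker {
  private nextSerial = 1;      // serial assigned to the next submit
  private completedSerial = 0; // highest serial observed as finished
  private lastUse = new Map<BufferId, number>();

  // Record a submit that references the given mappable buffers.
  // Buffers are counted even if the submit later turns out to be an
  // error on the GPU-process side, which keeps the estimate conservative.
  recordSubmit(buffers: Iterable<BufferId>): number {
    const serial = this.nextSerial++;
    for (const id of buffers) {
      this.lastUse.set(id, serial);
    }
    return serial;
  }

  // Called when the content side observes a fence signal for `serial`.
  onSerialCompleted(serial: number): void {
    this.completedSerial = Math.max(this.completedSerial, serial);
  }

  // True if the last recorded submit using the buffer has not yet been
  // observed as completed. May be true for a buffer that is already idle.
  mightBeInUse(id: BufferId): boolean {
    const serial = this.lastUse.get(id);
    return serial !== undefined && serial > this.completedSerial;
  }
}
```

The sketch sidesteps the harder parts raised in the thread: gathering which buffers a submit references (through bind groups, encoders, command buffers and bundles) and the multi-queue case, where a single completed serial is no longer enough.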