- From: Dzmitry Malyshau <dmalyshau@mozilla.com>
- Date: Mon, 2 Mar 2020 11:58:57 -0500
- To: "Myles C. Maxfield" <mmaxfield@apple.com>
- Cc: Ken Russell <kbr@google.com>, Corentin Wallez <cwallez@google.com>, Jeff Gilbert <jgilbert@mozilla.com>, internal-gpu <internal-gpu@w3.org>, public-gpu <public-gpu@w3.org>
- Message-ID: <72ecbc51-7877-5593-35a3-fac5a3de257a@mozilla.com>
Myles, (comment is inlined)

On 3/2/20 11:46 AM, Myles C. Maxfield wrote:

>> On Mar 2, 2020, at 7:35 AM, Dzmitry Malyshau <dmalyshau@mozilla.com> wrote:
>>
>> Hi Myles,
>>
>> It's an important point to see how multi-queue will interact with it, and I agree relying on fences makes it complicated.
>>
>> What I think is missing from the discussion is some concrete use-cases that would be solved by one or the other solution. One such use-case we've been seeing in multiple ISV code bases is progressively filling up a uniform buffer with data from the CPU. Supposing the buffer is big enough, the user would go through the entities; for each one, they would append its uniform data to the mapped buffer and bind it with a dynamic offset in the command buffer. For the next frame, the user would use the same buffer, continuing to write data as if the buffer were circular.
>>
>> The current mapWriteAsync doesn't provide a good solution to this case. First, because the user would have to keep the whole buffer mapped, and thus our implementations would need to consider the whole buffer to be changed each frame. Second, because the user would need to rotate multiple buffers, and thus would duplicate any bind groups that contain them, if they want a minimal number of copies. Alternatively, they can rotate only staging buffers and issue copyBufferToBuffer into the actual GPU uniform buffer each frame, but this approach has the extra copy, more memory, and all the complexity of buffer updates in it.
>>
>> The "guaranteed" mapping proposals appear to address this case better. The user would map the uniform buffer either for each entity, or for some chunks of memory. The implementations would have to guarantee that using specifically aligned offsets and checking for fences (in the case of a single queue) results in the mapping being resolved. What's not clear to me is how many JS objects would be required to be created. For example, https://github.com/gpuweb/gpuweb/pull/511 returns an ArrayBuffer for every map(), and in practice users would have to create some typed views on top in order to fill in the contents. I suppose the user could map larger chunks at a time, theoretically, but it's certainly not the pit of success here.
>>
>> Finally, `Queue.writeToBuffer` would naturally have the user providing sub-slices of their array buffers to update the relevant uniform buffer data for each entity. I don't think there is necessarily even an extra copy involved: the user could keep the uniform data around per entity on the CPU (in some ArrayBuffer objects), assuming that not all the data for all the objects changes per frame. The arguable downside here is that the user wouldn't know whether the data is getting written to the destination buffer directly, or placed in a staging area internally.
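For illustration, a minimal sketch of the circular uniform-buffer pattern described above, written against the proposed `writeToBuffer` shape. The method name and signature are taken from the proposal and are not settled API, the bind-group field names follow the current draft and may change, and the 256-byte stride is an assumed dynamic-offset alignment:

```ts
// Sketch only: one large uniform buffer used as a ring, one bind group,
// and a per-entity dynamic offset. `writeToBuffer` is the proposed queue
// upload; it is cast through `any` because it is not in the draft IDL.
const UNIFORM_STRIDE = 256;                 // assumed dynamic-offset alignment
const RING_SIZE = UNIFORM_STRIDE * 4096;    // room for 4096 entities in flight

function makeUniformRing(device: GPUDevice, layout: GPUBindGroupLayout) {
  const buffer = device.createBuffer({
    size: RING_SIZE,
    usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
  });
  // A single bind group covers the whole ring; only the dynamic offset
  // changes per draw, so nothing needs to be duplicated per frame.
  const bindGroup = device.createBindGroup({
    layout, // assumed to declare hasDynamicOffset: true for binding 0
    entries: [{ binding: 0, resource: { buffer, offset: 0, size: UNIFORM_STRIDE } }],
  });

  let cursor = 0;
  return {
    // Append one entity's uniforms and bind them at the current offset.
    bindNext(queue: GPUQueue, pass: GPURenderPassEncoder, data: ArrayBuffer) {
      const offset = cursor;
      cursor = (cursor + UNIFORM_STRIDE) % RING_SIZE;     // treat the buffer as circular
      (queue as any).writeToBuffer(buffer, offset, data); // proposed API, hypothetical here
      pass.setBindGroup(0, bindGroup, [offset]);
    },
  };
}
```

The point of the sketch is that the destination buffer and its bind group never rotate; only the offset and the small per-entity upload change each frame.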
>>
>> > any architecture which utilizes a GPU process must necessarily not have truly synchronous maps, because the mapping operation has to be done by the GPU process, which necessitates a round trip.
>>
>> This isn't necessarily true. Unless you are running on D3D11, the GPU resource can be permanently mapped, so no preparation is required for using the map.

> Mapped buffers are populated with “pinned” memory pages in the kernel, which represent pages that the kernel can't manage by relocating, etc. It would be a bad idea to unconditionally pin all pages from all buffers for all time. In general, WebKit tries to minimize the amount of pinned pages it uses.

We are only talking about mappable buffers, not *all* the buffers. Moreover, createBufferMapped(), if it's here to stay, could be specified as one-time mappable, so you'd unpin the relevant memory pages after the initial mapping. So this leaves just the frequently mapped buffers to be pinned.

>> Saying that, I'm also recalling VK_ACCESS_HOST_WRITE_BIT and VK_ACCESS_HOST_READ_BIT, which need to be set in the access flags within pipeline barriers in order for the CPU to safely read from the buffer or write to it. We could always set them for the mappable buffers in the pipeline barriers at the end of submits. It's something to keep in mind.
>>
>> Thanks,
>>
>> Dzmitry
>>
>> On 3/2/20 3:20 AM, Myles C. Maxfield wrote:
>>>
>>>> On Mar 2, 2020, at 12:12 AM, Myles C. Maxfield <mmaxfield@apple.com> wrote:
>>>>
>>>> I agree with Ken that relating resources to queue submits in the web process is roughly a trivial amount of complexity.
>>>>
>>>> However, after thinking about this for a while, I believe:
>>>>
>>>> - any architecture which utilizes a GPU process must necessarily not have truly synchronous maps, because the mapping operation has to be done by the GPU process, which necessitates a round trip. This round trip shouldn't be synchronous because that would hang the main thread.
>>>>
>>>> - if map operations are going to be asynchronous, they can't use fences to encapsulate the asynchronicity, because they are the wrong level of granularity. Consider a world where we have multi-queue. Which queue(s) would you issue the fence to, to know whether the resource is ready for mapping? It can't be “all the queues” because that's crazy, it can't be “none of the queues” because that's not asynchronous, and it can't be “the queues which are using the resource” because that might be 0 queues. There is no good answer.
>>>
>>> And putting “map()” on the queue itself makes no sense, because how does the queue know when the mapping operation is safe to perform? The resource might be in use on other queues. We shouldn't make one queue operation block on the progress of every other queue.
>>>
>>>> Instead, the asynchronicity needs to be handled at the resource granularity. A method on the resource which returns a promise (which is what we have today in mapWriteAsync()) is the right granularity.
>>>>
>>>> So I think I agree that, for the problem of resource mapping, we can't do better than what we have today.
>>>>
>>>> —Myles
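As an inline illustration of the resource-granularity shape being described, this is roughly how the current draft mapWriteAsync is used; the method name and return type follow the early-2020 draft and are accessed through `any` since they may differ from whatever the final IDL ends up being:

```ts
// Rough shape of the existing resource-granularity async mapping: a promise
// on the buffer itself, independent of which queues (if any) last used it.
async function uploadWhenFree(buffer: GPUBuffer, data: Float32Array): Promise<void> {
  // Resolves once the implementation knows the buffer is safe to write.
  const mapping: ArrayBuffer = await (buffer as any).mapWriteAsync(); // draft-era name
  new Float32Array(mapping).set(data);
  buffer.unmap(); // hand ownership of the contents back to the GPU side
}
```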
>>>>> On Mar 1, 2020, at 11:57 PM, Myles C. Maxfield <mmaxfield@apple.com> wrote:
>>>>>
>>>>>> On Feb 28, 2020, at 12:06 PM, Ken Russell <kbr@google.com> wrote:
>>>>>>
>>>>>> On Fri, Feb 28, 2020 at 8:35 AM Corentin Wallez <cwallez@google.com> wrote:
>>>>>>
>>>>>> On Fri, Feb 28, 2020 at 1:32 AM Ken Russell <kbr@google.com> wrote:
>>>>>>
>>>>>> On Thu, Feb 27, 2020 at 5:41 AM Corentin Wallez <cwallez@google.com> wrote:
>>>>>>
>>>>>> Whoops, meant to send this to public-gpu.
>>>>>>
>>>>>> On Thu, Feb 27, 2020 at 2:40 PM Corentin Wallez <cwallez@google.com> wrote:
>>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> I spent quite some time yesterday trying to understand how alternatives like Synchronous mapping #506 <https://github.com/gpuweb/gpuweb/pull/506> or Failable mapping #511 <https://github.com/gpuweb/gpuweb/pull/511> could be implemented, and which modifications they would need to be implementable. At the end of the day, an important property became clear for implementations looking to minimize copies using shared memory between the content process and the GPU process. Below I'm using terminology from the buffer mapping proposals, but the same is true for proposals that would introduce new staging buffer objects.
>>>>>>
>>>>>> *Accurate tracking on the content process of which mappable buffers are in use is not tractable.*
>>>>>>
>>>>>> We want to avoid races, so that the data written by JS at the time it calls unmap is what gets seen by the GPU until the next mapping operation (either mapSync, or mapAsync promise resolution). Without some asynchronous ownership transfer (mapAsync-like), I claim it is not tractable to know accurately on the content-process side whether it is safe to write to the shared memory region.
>>>>>>
>>>>>> The first reason why it is not tractable is that we don't know accurately whether queue submits that use the mappable resource are finished. While it is "easy" to follow the progress of the queues themselves, knowing which submits use the resource would require adding duplicated content-side tracking in bind groups, all encoders, command buffers, and bundles to know what mappable resources they hold. This would be a many-fold increase in the amount of tracking that WebGPU requires on the content side.
>>>>>>
>>>>>> Is this true for all WebGPU implementations, or Chrome's implementation specifically? If such tracking were implemented, would it impose an unacceptable performance overhead? From my relatively uninformed perspective (I'm not a WebGPU or Dawn expert), conservatively gathering all of the buffers or other mappable resources referenced by these data structures during command encoding and queue submission seems tractable. The tree of data structures that would have to be traversed doesn't seem that deep. The area that would have to be most optimized would be communication from the GPU process back to the content process that those buffers are no longer referenced. Most likely that would have to be handled with atomic operations in shared memory rather than via IPCs.
>>>>>>
>>>>>> This would be true to varying degrees for all multiprocess implementations. For example, at the moment in Chromium the content-side data for a GPUBindGroup is just a pointer handle that itself contains two uint32_t handles
>>>>>
>>>>> I understand that this is an elegant design. However, “theoretical purity” is literally the lowest priority in the Priority of Constituencies: https://www.w3.org/TR/html-design-principles/#priority-of-constituencies
>>>>>
>>>>>> , and that's the case for most other WebGPU Blink side objects. It's nice to be able to keep that if possible.
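As a purely hypothetical sketch of the conservative tracking being discussed here (browser-internal bookkeeping, not WebGPU API; every name below is made up for illustration): the content process would mark a mappable buffer as busy at submit time and clear it only once the GPU process reports that the corresponding submit finished, for example via a value read out of shared memory.

```ts
// Hypothetical content-process bookkeeping; no real browser API is implied.
type SubmitSerial = number;

class MappableBufferTracker {
  private lastUse = new Map<GPUBuffer, SubmitSerial>(); // buffer -> last submit referencing it
  private completed: SubmitSerial = 0;                  // highest serial known finished on the GPU

  // Called during queue.submit() with the conservatively gathered set of
  // mappable buffers reachable from the submitted command buffers,
  // bind groups, and bundles.
  onSubmit(serial: SubmitSerial, mappableBuffers: Iterable<GPUBuffer>): void {
    for (const buffer of mappableBuffers) this.lastUse.set(buffer, serial);
  }

  // Called when the GPU process reports progress (a shared-memory value,
  // a fence signal, or similar).
  onGpuProgress(serial: SubmitSerial): void {
    if (serial > this.completed) this.completed = serial;
  }

  // Conservative: may report a buffer as busy after it is actually done
  // (e.g. if the submit turned out to be an error on the GPU process side),
  // but never reports it free while still in use.
  maybeInUse(buffer: GPUBuffer): boolean {
    const last = this.lastUse.get(buffer);
    return last !== undefined && last > this.completed;
  }
}
```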
>>>>>>
>>>>>> Understood that it would be nice to keep things simple.
>>>>>>
>>>>>> The question I'd like answered is: if this tracking were implemented, and if it wasn't too costly, then would the "Consequences for the existing proposals" below change? It seems to me that, for example, fallible mapping would be a viable possibility in that scenario. The analyses below assumed a priori that client-side tracking was intractable.
>>>>>>
>>>>>> A follow-on question: how difficult would it really be to add conservative client-side tracking? Could other WebGPU implementers provide their opinions too?
>>>>>>
>>>>>> -Ken
>>>>>>
>>>>>> The content side can already see progress on the GPU, without the need for atomics, with GPUFence signaling, which would also help make the effects of a buffer being no longer in use happen when the application can observe the completion of the GPU work.
>>>>>>
>>>>>> Second, even doing the tracking above is not enough for accurate tracking, because it doesn't take into account error propagation. What if one of the commands in the submit that uses the mappable resource is an error? Is the resource still considered in use by the content side when the GPU process side does nothing? One solution would be to duplicate all the validation on the content side, but that's way too expensive, and can't help with error objects due to OOM.
>>>>>>
>>>>>> Conservatively assuming that the resources are referenced even if the command will be an error seems fine from the standpoint of writing portable content. I agree wholeheartedly that the validation should not be duplicated on the content side, but it seems to me that a conservative approximation would work well enough. Errors are neither the common nor the expected case.
>>>>>>
>>>>>> Hoping that if we can iterate a bit on this discussion, some of the analyses and conclusions below might be revisited.
>>>>>>
>>>>>> It's correct that the only path forward that's different from mapAsync is one where there is a conservative estimation of whether a buffer is mapped or not, one that only has false negatives (i.e. the buffer can be considered in use when it is not, but not the contrary).
>>>>>>
>>>>>> -Ken
>>>>>>
>>>>>> *Consequences for the existing proposals.*
>>>>>>
>>>>>> Synchronous mapping #506 <https://github.com/gpuweb/gpuweb/pull/506> has two alternatives: either the mapping operation blocks when the buffer is in use, or a new staging area is returned that will be copied into the buffer at a later time. Since accurate tracking isn't possible, either WebGPU will most often do a roundtrip to the GPU process (aka block), or it will most often introduce an extra copy.
>>>>>>
>>>>>> Failable mapping #511 <https://github.com/gpuweb/gpuweb/pull/511> has the mapping operation return null when the buffer is in use. This requires knowing accurately when it is in use, and is not tractable.
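To make that trade-off concrete, here is an illustrative sketch of what failable mapping would imply on the application side; the method name `mapWriteSync` and the null-on-busy behavior are one reading of the proposal's shape, not actual API:

```ts
// Illustrative only: with failable mapping, a null return forces the
// application to keep a second upload path (retry later, or stage and copy).
function tryDirectUpload(buffer: GPUBuffer, data: Uint8Array): boolean {
  const mapping = (buffer as any).mapWriteSync() as ArrayBuffer | null; // hypothetical call
  if (mapping === null) {
    return false; // reported as in use; caller must fall back to a staging path
  }
  new Uint8Array(mapping).set(data);
  buffer.unmap();
  return true;
}
```

If the content side can only answer "is it in use?" conservatively, the null branch becomes the common one and the fallback path dominates, which is the concern raised above.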
>>>>>>
>>>>>> Myles' comment about mapAsync <https://github.com/gpuweb/gpuweb/pull/506#issuecomment-590199697> suggests that a desirable property would be for the mapAsync promise to resolve immediately if the buffer is currently not in use. Guaranteeing this at the spec level requires accurate tracking of which buffers are in use, so it isn't possible. There needs to be at least one roundtrip to the GPU process to "lock" the resource.
>>>>>>
>>>>>> *Another direction using APIs that allow wrapping CPU memory into a buffer*
>>>>>>
>>>>>> The following APIs allow taking an OS memory object, or just a pointer, and turning it into a GPU resource:
>>>>>>
>>>>>> * ID3D12Device3::OpenExistingHeapFromAddress <https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12device3-openexistingheapfromaddress>
>>>>>> * MTLDevice newBufferWithBytesNoCopy:length:options:deallocator: <https://developer.apple.com/documentation/metal/mtldevice/1433382-newbufferwithbytesnocopy?language=objc>
>>>>>> * VK_EXT_external_memory_host <https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/vkspec.html#VK_EXT_external_memory_host>
>>>>>>
>>>>>> These APIs would allow true zero-copy between JS and the GPU, and allocating staging memory in the content process synchronously, via the following mechanism:
>>>>>>
>>>>>> * In the content process, allocate shared memory between the content process and the GPU process (at least in Chromium that's possible).
>>>>>> * Send the shared memory to the GPU process.
>>>>>> * In the GPU process, create a new resource by wrapping that shared memory region (or allocating a new resource if that's not possible).
>>>>>> * In the content process, write to the memory, then send an unmap signal to the GPU process.
>>>>>> * On the GPU process, if wrapping was not possible, copy from the shmem to the GPU resource.
>>>>>> * Profit!
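A very rough sketch of that flow from the content process's point of view; every identifier here (SharedRegion, GpuProcessIpc, and so on) is hypothetical internal plumbing rather than WebGPU API, and the real shared-memory and IPC primitives differ per browser:

```ts
// Hypothetical internal plumbing, only to illustrate the steps listed above.
interface SharedRegion { bytes: Uint8Array; id: number; }

interface GpuProcessIpc {
  // Ask the GPU process to wrap the region via OpenExistingHeapFromAddress /
  // newBufferWithBytesNoCopy / VK_EXT_external_memory_host. Resolves to false
  // if wrapping failed and a separate native buffer had to be allocated.
  wrapAsBuffer(bufferId: number, region: SharedRegion): Promise<boolean>;
  // Unmap signal; if wrapping failed, the GPU process copies shmem -> buffer.
  unmap(bufferId: number): void;
}

async function writeThroughSharedMemory(
  ipc: GpuProcessIpc, region: SharedRegion, bufferId: number, data: Uint8Array,
): Promise<void> {
  await ipc.wrapAsBuffer(bufferId, region); // one-time setup per buffer, shown inline for brevity
  region.bytes.set(data);                   // the content process writes directly into shmem
  ipc.unmap(bufferId);                      // GPU process copies only if wrapping was not possible
}
```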
>>>>>>
>>>>>> An idea I was exploring is having something like mapSync that can replace the allocation of a GPUBuffer with a new native buffer via the mechanism described above. However, a design constraint we have been operating with is that a WebGPU resource is exactly a native API resource, so that doesn't work either. (Imagine we baked bind groups with the address of the buffer; we want to avoid needing dirtying mechanisms.)
>>>>>>
>>>>>> *Conclusion*
>>>>>>
>>>>>> Like the other times I tried, I wasn't able to come up with a better solution than mapAsync. It's the only one that works so far, but the asynchrony makes it a bit difficult for people to use, so it'd be nice to have an alternative.
>>>>>>
>>>>>> At least I learnt an important design constraint, and discovered that it is possible to wrap CPU memory in a GPU resource to optimize things. Also, I started a HackMD to discuss tradeoffs again <https://hackmd.io/qWmMfnFVRtyR0Q2HVSagOw?both>. It doesn't have content yet, but at least it has links to all the proposals if you want to keep it as a set of bookmarks.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Corentin

Received on Monday, 2 March 2020 16:59:10 UTC