- From: Dzmitry Malyshau <dmalyshau@mozilla.com>
- Date: Mon, 2 Mar 2020 11:58:57 -0500
- To: "Myles C. Maxfield" <mmaxfield@apple.com>
- Cc: Ken Russell <kbr@google.com>, Corentin Wallez <cwallez@google.com>, Jeff Gilbert <jgilbert@mozilla.com>, internal-gpu <internal-gpu@w3.org>, public-gpu <public-gpu@w3.org>
- Message-ID: <72ecbc51-7877-5593-35a3-fac5a3de257a@mozilla.com>
Myles, (comment is inlined)

On 3/2/20 11:46 AM, Myles C. Maxfield wrote:

>> On Mar 2, 2020, at 7:35 AM, Dzmitry Malyshau <dmalyshau@mozilla.com> wrote:
>>
>> Hi Myles,
>>
>> It's an important point to see how multi-queue will interact with it, and I agree relying on fences makes it complicated.
>>
>> What I think is missing from the discussion is some concrete use-cases that would be solved by one or the other solution. One such use-case we've been seeing in multiple ISV code bases is progressively filling up a uniform buffer with data from the CPU. Supposing the buffer is big enough, the user would go through the entities; for each one, they would append its uniform data to the mapped buffer and bind it with a dynamic offset in the command buffer. For the next frame, the user would use the same buffer, continuing to write data as if the buffer were circular.
>>
>> The current mapWriteAsync doesn't provide a good solution to this case. First, because the user would have to keep the whole buffer mapped, and thus our implementations would need to consider the whole buffer to be changed each frame. Second, because the user would need to rotate multiple buffers, and thus would duplicate any bind groups that contain them, if they want a minimal number of copies. Alternatively, they can rotate only staging buffers and issue copyBufferToBuffer into the actual GPU uniform buffer each frame, but this approach has the extra copy, more memory, and all the complexity of buffer updates in it.
>>
>> The "guaranteed" mapping proposals appear to address this case better. The user would map the uniform buffer either for each entity, or for some chunks of memory. The implementations would have to guarantee that using specifically aligned offsets and checking for fences (in the case of a single queue) results in the mapping being resolved. What's not clear to me is how many JS objects would be required to be created. For example, https://github.com/gpuweb/gpuweb/pull/511 returns an ArrayBuffer for every map(), and in practice users would have to create some typed views on top in order to fill in the contents. I suppose the user could map larger chunks at a time, theoretically, but it's certainly not the pit of success here.
>>
>> Finally, `Queue.writeToBuffer` would naturally have the user providing sub-slices of their array buffers to update the relevant uniform buffer data for each entity. I don't think there is necessarily even an extra copy involved: the user could keep the uniform data around per entity on the CPU (in some ArrayBuffer objects), assuming that not all the data for all the objects changes per frame. The arguable downside here is that the user wouldn't know whether the data is getting written to the destination buffer directly, or placed in a staging area internally.
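For illustration, a minimal sketch of the circular uniform-buffer pattern described above, written against the proposed `writeToBuffer` shape. The method name and signature are taken from the proposal and are not settled API, the bind-group field names follow the current draft and may change, and the 256-byte stride is an assumed dynamic-offset alignment:

```ts
// Sketch only: one large uniform buffer used as a ring, one bind group,
// and a per-entity dynamic offset. `writeToBuffer` is the proposed queue
// upload; it is cast through `any` because it is not in the draft IDL.
const UNIFORM_STRIDE = 256;                 // assumed dynamic-offset alignment
const RING_SIZE = UNIFORM_STRIDE * 4096;    // room for 4096 entities in flight

function makeUniformRing(device: GPUDevice, layout: GPUBindGroupLayout) {
  const buffer = device.createBuffer({
    size: RING_SIZE,
    usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
  });
  // A single bind group covers the whole ring; only the dynamic offset
  // changes per draw, so nothing needs to be duplicated per frame.
  const bindGroup = device.createBindGroup({
    layout, // assumed to declare hasDynamicOffset: true for binding 0
    entries: [{ binding: 0, resource: { buffer, offset: 0, size: UNIFORM_STRIDE } }],
  });

  let cursor = 0;
  return {
    // Append one entity's uniforms and bind them at the current offset.
    bindNext(queue: GPUQueue, pass: GPURenderPassEncoder, data: ArrayBuffer) {
      const offset = cursor;
      cursor = (cursor + UNIFORM_STRIDE) % RING_SIZE;     // treat the buffer as circular
      (queue as any).writeToBuffer(buffer, offset, data); // proposed API, hypothetical here
      pass.setBindGroup(0, bindGroup, [offset]);
    },
  };
}
```

The point of the sketch is that the destination buffer and its bind group never rotate; only the offset and the small per-entity upload change each frame.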
>>
>> > any architecture which utilizes a GPU process must necessarily not have truly synchronous maps, because the mapping operation has to be done by the GPU process, which necessitates a round trip.
>>
>> This isn't necessarily true. Unless you are running on D3D11, the GPU resource can be permanently mapped, so no preparation is required for using the map.

> Mapped buffers are populated with “pinned” memory pages in the kernel, which represent pages that the kernel can't manage by relocating, etc. It would be a bad idea to unconditionally pin all pages from all buffers for all time. In general, WebKit tries to minimize the amount of pinned pages it uses.

We are only talking about mappable buffers, not *all* the buffers. Moreover, createBufferMapped(), if it's here to stay, could be specified as one-time mappable, so you'd unpin the relevant memory pages after the initial mapping. So this leaves just the frequently mapped buffers to be pinned.

>> Saying that, I'm also recalling VK_ACCESS_HOST_WRITE_BIT and VK_ACCESS_HOST_READ_BIT, which need to be set in the access flags within pipeline barriers in order for the CPU to safely read from the buffer or write to it. We could always set them for the mappable buffers in the pipeline barriers at the end of submits. It's something to keep in mind.
>>
>> Thanks,
>>
>> Dzmitry
>>
>> On 3/2/20 3:20 AM, Myles C. Maxfield wrote:
>>>
>>>> On Mar 2, 2020, at 12:12 AM, Myles C. Maxfield <mmaxfield@apple.com> wrote:
>>>>
>>>> I agree with Ken that relating resources to queue submits in the web process is roughly a trivial amount of complexity.
>>>>
>>>> However, after thinking about this for a while, I believe:
>>>>
>>>> - any architecture which utilizes a GPU process must necessarily not have truly synchronous maps, because the mapping operation has to be done by the GPU process, which necessitates a round trip. This round trip shouldn't be synchronous because that would hang the main thread.
>>>>
>>>> - if map operations are going to be asynchronous, they can't use fences to encapsulate the asynchronicity, because they are the wrong level of granularity. Consider a world where we have multi-queue. Which queue(s) would you issue the fence to, to know whether the resource is ready for mapping? It can't be “all the queues” because that's crazy, it can't be “none of the queues” because that's not asynchronous, and it can't be “the queues which are using the resource” because that might be 0 queues. There is no good answer.
>>>
>>> And putting “map()” on the queue itself makes no sense, because how does the queue know when the mapping operation is safe to perform? The resource might be in use on other queues. We shouldn't make one queue operation block on the progress of every other queue.
>>>
>>>> Instead, the asynchronicity needs to be handled at the resource granularity. A method on the resource which returns a promise (which is what we have today in mapWriteAsync()) is the right granularity.
>>>>
>>>> So I think I agree that, for the problem of resource mapping, we can't do better than what we have today.
>>>>
>>>> —Myles
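As an inline illustration of the resource-granularity shape being described, this is roughly how the current draft mapWriteAsync is used; the method name and return type follow the early-2020 draft and are accessed through `any` since they may differ from whatever the final IDL ends up being:

```ts
// Rough shape of the existing resource-granularity async mapping: a promise
// on the buffer itself, independent of which queues (if any) last used it.
async function uploadWhenFree(buffer: GPUBuffer, data: Float32Array): Promise<void> {
  // Resolves once the implementation knows the buffer is safe to write.
  const mapping: ArrayBuffer = await (buffer as any).mapWriteAsync(); // draft-era name
  new Float32Array(mapping).set(data);
  buffer.unmap(); // hand ownership of the contents back to the GPU side
}
```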
>>>>> On Mar 1, 2020, at 11:57 PM, Myles C. Maxfield <mmaxfield@apple.com> wrote:
>>>>>
>>>>>> On Feb 28, 2020, at 12:06 PM, Ken Russell <kbr@google.com> wrote:
>>>>>>
>>>>>> On Fri, Feb 28, 2020 at 8:35 AM Corentin Wallez <cwallez@google.com> wrote:
>>>>>>
>>>>>> On Fri, Feb 28, 2020 at 1:32 AM Ken Russell <kbr@google.com> wrote:
>>>>>>
>>>>>> On Thu, Feb 27, 2020 at 5:41 AM Corentin Wallez <cwallez@google.com> wrote:
>>>>>>
>>>>>> Whoops, meant to send this to public-gpu.
>>>>>>
>>>>>> On Thu, Feb 27, 2020 at 2:40 PM Corentin Wallez <cwallez@google.com> wrote:
>>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> I spent quite some time yesterday trying to understand how alternatives like Synchronous mapping #506 <https://github.com/gpuweb/gpuweb/pull/506> or Failable mapping #511 <https://github.com/gpuweb/gpuweb/pull/511> could be implemented, and which modifications they would need to be implementable. At the end of the day, an important property became clear for implementations looking to minimize copies using shared memory between the content process and the GPU process. Below I'm using terminology from the buffer mapping proposals, but the same is true for proposals that would introduce new staging buffer objects.
>>>>>>
>>>>>> *Accurate tracking on the content process of which mappable buffers are in use is not tractable.*
>>>>>>
>>>>>> We want to avoid races, so that the data written by JS at the time it calls unmap is what gets seen by the GPU until the next mapping operation (either mapSync, or mapAsync promise resolution). Without some asynchronous ownership transfer (mapAsync-like), I claim it is not tractable to know accurately on the content-process side whether it is safe to write to the shared memory region.
>>>>>>
>>>>>> The first reason why it is not tractable is that we don't know accurately whether queue submits that use the mappable resource are finished. While it is "easy" to follow the progress of the queues themselves, knowing which submits use the resource would require adding duplicated content-side tracking in bind groups, all encoders, command buffers, and bundles to know what mappable resources they hold. This would be a many-fold increase in the amount of tracking that WebGPU requires on the content side.
>>>>>>
>>>>>> Is this true for all WebGPU implementations, or Chrome's implementation specifically? If such tracking were implemented, would it impose an unacceptable performance overhead? From my relatively uninformed perspective (I'm not a WebGPU or Dawn expert), conservatively gathering all of the buffers or other mappable resources referenced by these data structures during command encoding and queue submission seems tractable. The tree of data structures that would have to be traversed doesn't seem that deep. The area that would have to be most optimized would be communication from the GPU process back to the content process that those buffers are no longer referenced. Most likely that would have to be handled with atomic operations in shared memory rather than via IPCs.
>>>>>>
>>>>>> This would be true to varying degrees for all multiprocess implementations. For example, at the moment in Chromium the content-side data for a GPUBindGroup is just a pointer handle that itself contains two uint32_t handles
>>>>>
>>>>> I understand that this is an elegant design. However, “theoretical purity” is literally the lowest priority in the Priority of Constituencies: https://www.w3.org/TR/html-design-principles/#priority-of-constituencies
>>>>>
>>>>>> , and that's the case for most other WebGPU Blink side objects. It's nice to be able to keep that if possible.
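As a purely hypothetical sketch of the conservative tracking being discussed here (browser-internal bookkeeping, not WebGPU API; every name below is made up for illustration): the content process would mark a mappable buffer as busy at submit time and clear it only once the GPU process reports that the corresponding submit finished, for example via a value read out of shared memory.

```ts
// Hypothetical content-process bookkeeping; no real browser API is implied.
type SubmitSerial = number;

class MappableBufferTracker {
  private lastUse = new Map<GPUBuffer, SubmitSerial>(); // buffer -> last submit referencing it
  private completed: SubmitSerial = 0;                  // highest serial known finished on the GPU

  // Called during queue.submit() with the conservatively gathered set of
  // mappable buffers reachable from the submitted command buffers,
  // bind groups, and bundles.
  onSubmit(serial: SubmitSerial, mappableBuffers: Iterable<GPUBuffer>): void {
    for (const buffer of mappableBuffers) this.lastUse.set(buffer, serial);
  }

  // Called when the GPU process reports progress (a shared-memory value,
  // a fence signal, or similar).
  onGpuProgress(serial: SubmitSerial): void {
    if (serial > this.completed) this.completed = serial;
  }

  // Conservative: may report a buffer as busy after it is actually done
  // (e.g. if the submit turned out to be an error on the GPU process side),
  // but never reports it free while still in use.
  maybeInUse(buffer: GPUBuffer): boolean {
    const last = this.lastUse.get(buffer);
    return last !== undefined && last > this.completed;
  }
}
```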
>>>>>>
>>>>>> Understood that it would be nice to keep things simple.
>>>>>>
>>>>>> The question I'd like answered is: if this tracking were implemented, and if it wasn't too costly, then would the "Consequences for the existing proposals" below change? It seems to me that, for example, fallible mapping would be a viable possibility in that scenario. The analyses below assumed a priori that client-side tracking was intractable.
>>>>>>
>>>>>> A follow-on question: how difficult would it really be to add conservative client-side tracking? Could other WebGPU implementers provide their opinions too?
>>>>>>
>>>>>> -Ken
>>>>>>
>>>>>> The content side can already see progress on the GPU, without the need for atomics, with GPUFence signaling, which would also help make the effects of a buffer being no longer in use happen when the application can observe the completion of the GPU work.
>>>>>>
>>>>>> Second, even doing the tracking above is not enough for accurate tracking, because it doesn't take into account error propagation. What if one of the commands in the submit that uses the mappable resource is an error? Is the resource still considered in use by the content side when the GPU process side does nothing? One solution would be to duplicate all the validation on the content side, but that's way too expensive, and can't help with error objects due to OOM.
>>>>>>
>>>>>> Conservatively assuming that the resources are referenced even if the command will be an error seems fine from the standpoint of writing portable content. I agree wholeheartedly that the validation should not be duplicated on the content side, but it seems to me that a conservative approximation would work well enough. Errors are neither the common nor the expected case.
>>>>>>
>>>>>> Hoping that if we can iterate a bit on this discussion, some of the analyses and conclusions below might be revisited.
>>>>>>
>>>>>> It's correct that the only path forward that's different from mapAsync is one where there is a conservative estimation of whether a buffer is mapped or not, one that only has false negatives (i.e. the buffer can be considered in use when it is not, but not the contrary).
>>>>>>
>>>>>> -Ken
>>>>>>
>>>>>> *Consequences for the existing proposals.*
>>>>>>
>>>>>> Synchronous mapping #506 <https://github.com/gpuweb/gpuweb/pull/506> has two alternatives: either the mapping operation blocks when the buffer is in use, or a new staging area is returned that will be copied into the buffer at a later time. Since accurate tracking isn't possible, either WebGPU will most often do a roundtrip to the GPU process (aka block), or it will most often introduce an extra copy.
>>>>>>
>>>>>> Failable mapping #511 <https://github.com/gpuweb/gpuweb/pull/511> has the mapping operation return null when the buffer is in use. This requires knowing accurately when it is in use, and is not tractable.
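To make that trade-off concrete, here is an illustrative sketch of what failable mapping would imply on the application side; the method name `mapWriteSync` and the null-on-busy behavior are one reading of the proposal's shape, not actual API:

```ts
// Illustrative only: with failable mapping, a null return forces the
// application to keep a second upload path (retry later, or stage and copy).
function tryDirectUpload(buffer: GPUBuffer, data: Uint8Array): boolean {
  const mapping = (buffer as any).mapWriteSync() as ArrayBuffer | null; // hypothetical call
  if (mapping === null) {
    return false; // reported as in use; caller must fall back to a staging path
  }
  new Uint8Array(mapping).set(data);
  buffer.unmap();
  return true;
}
```

If the content side can only answer "is it in use?" conservatively, the null branch becomes the common one and the fallback path dominates, which is the concern raised above.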
>>>>>>
>>>>>> Myles' comment about mapAsync <https://github.com/gpuweb/gpuweb/pull/506#issuecomment-590199697> suggests that a desirable property would be for the mapAsync promise to resolve immediately if the buffer is currently not in use. Guaranteeing this at the spec level requires accurate tracking of which buffers are in use, so it isn't possible. There needs to be at least one roundtrip to the GPU process to "lock" the resource.
>>>>>>
>>>>>> *Another direction using APIs that allow wrapping CPU memory into a buffer*
>>>>>>
>>>>>> The following APIs allow taking an OS memory object, or just a pointer, and turning it into a GPU resource:
>>>>>>
>>>>>> * ID3D12Device3::OpenExistingHeapFromAddress <https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12device3-openexistingheapfromaddress>
>>>>>> * MTLDevice newBufferWithBytesNoCopy:length:options:deallocator: <https://developer.apple.com/documentation/metal/mtldevice/1433382-newbufferwithbytesnocopy?language=objc>
>>>>>> * VK_EXT_external_memory_host <https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/vkspec.html#VK_EXT_external_memory_host>
>>>>>>
>>>>>> These APIs would allow true zero-copy between JS and the GPU, and allocating staging memory in the content process synchronously, via the following mechanism:
>>>>>>
>>>>>> * In the content process, allocate shared memory between the content process and the GPU process (at least in Chromium that's possible).
>>>>>> * Send the shared memory to the GPU process.
>>>>>> * In the GPU process, create a new resource by wrapping that shared memory region (or allocating a new resource if that's not possible).
>>>>>> * In the content process, write to the memory, then send an unmap signal to the GPU process.
>>>>>> * On the GPU process, if wrapping was not possible, copy from the shmem to the GPU resource.
>>>>>> * Profit!
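A very rough sketch of that flow from the content process's point of view; every identifier here (SharedRegion, GpuProcessIpc, and so on) is hypothetical internal plumbing rather than WebGPU API, and the real shared-memory and IPC primitives differ per browser:

```ts
// Hypothetical internal plumbing, only to illustrate the steps listed above.
interface SharedRegion { bytes: Uint8Array; id: number; }

interface GpuProcessIpc {
  // Ask the GPU process to wrap the region via OpenExistingHeapFromAddress /
  // newBufferWithBytesNoCopy / VK_EXT_external_memory_host. Resolves to false
  // if wrapping failed and a separate native buffer had to be allocated.
  wrapAsBuffer(bufferId: number, region: SharedRegion): Promise<boolean>;
  // Unmap signal; if wrapping failed, the GPU process copies shmem -> buffer.
  unmap(bufferId: number): void;
}

async function writeThroughSharedMemory(
  ipc: GpuProcessIpc, region: SharedRegion, bufferId: number, data: Uint8Array,
): Promise<void> {
  await ipc.wrapAsBuffer(bufferId, region); // one-time setup per buffer, shown inline for brevity
  region.bytes.set(data);                   // the content process writes directly into shmem
  ipc.unmap(bufferId);                      // GPU process copies only if wrapping was not possible
}
```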
>>>>>>
>>>>>> An idea I was exploring is having something like mapSync that can replace the allocation of a GPUBuffer with a new native buffer via the mechanism described above. However, a design constraint we have been operating with is that a WebGPU resource is exactly a native API resource, so that doesn't work either. (Imagine we baked bind groups with the address of the buffer; we want to avoid needing dirtying mechanisms.)
>>>>>>
>>>>>> *Conclusion*
>>>>>>
>>>>>> Like the other times I tried, I wasn't able to come up with a better solution than mapAsync. It's the only one that works so far, but the asynchrony makes it a bit difficult for people to use, so it'd be nice to have an alternative.
>>>>>>
>>>>>> At least I learnt an important design constraint, and discovered that it is possible to wrap CPU memory in a GPU resource to optimize things. Also, I started a HackMD to discuss tradeoffs again <https://hackmd.io/qWmMfnFVRtyR0Q2HVSagOw?both>. It doesn't have content yet, but at least it has links to all the proposals if you want to keep it as a set of bookmarks.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Corentin

Received on Monday, 2 March 2020 16:59:10 UTC