- From: Corentin Wallez <cwallez@google.com>
- Date: Tue, 5 Dec 2017 18:47:03 -0500
- To: Dzmitry Malyshau <dmalyshau@mozilla.com>
- Cc: public-gpu <public-gpu@w3.org>
- Message-ID: <CAGdfWNPhZbah=Hfc4mPd-NB2tM4GiLiR3q1DhsowrfhXQOHvhg@mail.gmail.com>
On Fri, Dec 1, 2017 at 4:00 PM, Dzmitry Malyshau <dmalyshau@mozilla.com> wrote:

> Hi Corentin,
>
> (branching from the root for the comments about the buffer mapping
> document, as opposed to the memory barriers)
>
> > The number of times data is copied before it reaches its final
> > destination is the biggest factor in upload performance.
>
> I see this assumption being rooted deeply in the (amazing) WebGL work the
> group was doing on the subject. It's based on the fact that we don't
> control the GL driver's side, which does more copying, renaming, and
> essentially transferring to the GPU. If we look at the problem from the
> next-gen native APIs' point of view, I'd say that there is a more
> important factor than the number of copies: avoiding CPU/GPU
> synchronization stalls. All in all, I'd see the following issues here:
>
> - stalls
> - copies
> - latency

Agreed. We should have called stalls out more explicitly, but our design already ensures there are no stalls. The next concern is copies (and they're especially costly on mobile).

> > If WebGPU exposes a buffer mapping primitive, the application will be
> > able to decompress directly into either shared memory (for Chrome) or
> > directly into GPU-visible memory, avoiding one copy.
>
> Mapping a shared memory object sounds like a great idea. I believe this
> should be the only way to transfer large amounts of data from CPU to GPU.
> A list of all possible methods to do so, ordered from bigger to smaller
> data sizes, could be:
>
> 1. Mapping a shared memory object. Possibly, persistently.
> 2. Updating buffer contents via command buffers, e.g. `vkCmdUpdateBuffer`,
>    which supports up to 64k. The graphics backend can then manage the
>    staging area and schedule uploads internally, if it's not natively
>    supported.
> 3. Push constants.

+1. `vkCmdUpdateBuffer` doesn't have an equivalent in the other APIs (nothing in D3D12; `setBytes` in Metal, but it replaces the whole buffer).
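As an aside, the three upload methods in that list amount to a dispatch on data size. A minimal sketch of such a selection, assuming common native limits (Vulkan's guaranteed minimum of 128 bytes for push constants and the documented 65536-byte cap on `vkCmdUpdateBuffer`); the function name and enum are hypothetical, not anything WebGPU specifies:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical classification of the three upload paths, ordered by size.
enum class UploadPath { PushConstants, InlineCommandBufferUpdate, MappedSharedMemory };

UploadPath choose_upload_path(std::size_t bytes) {
    constexpr std::size_t kPushConstantLimit = 128;   // Vulkan guaranteed minimum (maxPushConstantsSize)
    constexpr std::size_t kInlineUpdateLimit = 65536; // vkCmdUpdateBuffer dataSize cap
    if (bytes <= kPushConstantLimit)
        return UploadPath::PushConstants;             // tiny, per-draw constants
    if (bytes <= kInlineUpdateLimit)
        return UploadPath::InlineCommandBufferUpdate; // pipelined copy recorded in the command buffer
    return UploadPath::MappedSharedMemory;            // bulk data: map shared/GPU-visible memory
}
```

A backend without a native pipelined-copy primitive could still accept the middle path and implement it with an internally managed staging area, as Dzmitry suggests.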
So if WebGPU is to have a pipelined copy like `vkCmdUpdateBuffer`, I suggest we do it post-MVP.

> > Since we assume the API tries to prevent data races on the GPU, it makes
> > sense to also prevent data races between the CPU and the GPU for the
> > exact same reasons.
>
> As discussed previously, we can't (or rather don't want to) avoid all data
> races on the GPU (example: UAV resource access). I'd argue that the
> browser runtime doesn't necessarily need to enforce the rule here as much
> as it needs to provide means/APIs for the user to implement it in a way
> that avoids data races, with some ability to validate the access.
>
> > This means that the CPU should not be able to read or write buffer
> > memory while the GPU is using the buffer.
>
> This discards one of the most important mapping scenarios: persistent
> mapping, where the user would repeatedly change parts of the mapped region
> and communicate to the driver which ranges need to be invalidated on the
> GPU side. The user can then tell WebGPU when a range of the mapped region
> has changed. This can propagate directly through the code that copies from
> the shared memory into runtime-managed space or GPU memory.

Racy UAV accesses are impossible to validate, and most uses of the API won't have them. GPU-CPU races, however, look to be much more common, which is why we're thinking we should prevent them if we can do so at a tiny performance and/or convenience cost. We think this is doable with what we presented: instead of a single ringbuffer with synchronization done manually by the application, there is a ringbuffer of buffers with safe-guards provided by the API. The range discard you're mentioning can be done with "MapWriteSync".

> > We don't see a compelling use for MapRead | MapWrite buffers
>
> Yes, agree.
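For the record, the "ringbuffer of buffers with safe-guards" idea can be sketched without any GPU API at all. This is a hypothetical illustration, not NXT's actual design: fences are modeled as a monotonically increasing completed-work serial, and the safe-guard is simply that a buffer cannot be re-acquired for CPU writes while the GPU may still be reading it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for a mapped, GPU-visible staging buffer.
struct StagingBuffer {
    std::vector<std::uint8_t> data;      // pretend this is mapped memory
    std::uint64_t last_use_serial = 0;   // serial of the last GPU submission using it
};

class StagingRing {
public:
    explicit StagingRing(std::size_t buffer_size) {
        for (auto& b : buffers_) b.data.resize(buffer_size);
    }

    // Safe-guard: hand out a buffer for CPU writes only if the GPU work that
    // last used it has completed; otherwise refuse instead of racing.
    StagingBuffer* acquire(std::uint64_t completed_serial) {
        StagingBuffer& b = buffers_[next_];
        if (b.last_use_serial > completed_serial) return nullptr; // still in flight
        next_ = (next_ + 1) % buffers_.size();
        return &b;
    }

    // Record that the buffer was consumed by GPU work submitted under `serial`.
    void submit(StagingBuffer* b, std::uint64_t serial) { b->last_use_serial = serial; }

private:
    std::array<StagingBuffer, 3> buffers_{}; // triple-buffered ring
    std::size_t next_ = 0;
};
```

With a single application-managed ring, nothing stops the CPU from scribbling over a region the GPU is still reading; here the refusal in `acquire` is the API-provided guard, and a real implementation would block or return an error rather than a null pointer.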
> Thanks,
> Dzmitry
>
> On Tue, Nov 14, 2017 at 11:51 PM, Corentin Wallez <cwallez@google.com>
> wrote:
>
>> Hey all,
>>
>> We wrote some documents to help everyone reason about NXT's proposals for
>> memory barriers and resource upload/download. Unfortunately we still
>> don't have a fleshed-out proposal that minimizes the number of copies on
>> UMA. Instead the docs focus on explaining our current design for resource
>> upload/download and for memory barriers, since they are closely tied.
>> Eventually we'll have these docs in Markdown in some repo, either
>> WebGPU's or NXT's.
>>
>> - NXT "memory barriers"
>>   <https://docs.google.com/document/d/1k7lPmxP7M7MMQR4g210lNC5TPwmXCMLgKOQWNiuJxzA>
>>   <- Please read this first as buffer mapping depends on it.
>> - NXT buffer mapping
>>   <https://docs.google.com/document/d/1HFzMMvDGHFtTgjNT0j-0SQ1fNU9R7woZ4JuNJdAXBjg>
>>
>> Cheers,
>>
>> Corentin
Received on Tuesday, 5 December 2017 23:47:48 UTC