Re: NXT design for memory barriers and buffer mapping.

Hi Corentin,

(branching from the root of the thread for comments about the buffer
mapping document, as opposed to the memory barriers one)

> The number of times data is copied before it reaches its final
destination is the biggest factor in upload performance.

I see this assumption as deeply rooted in the (amazing) WebGL work the
group has been doing on the subject. It's based on the fact that we don't
control the GL driver's side, which does its own copying, renaming, and
eventual transfer to the GPU. If we look at the problem from a next-gen
native API's point of view, I'd say there is a more important factor than
the number of copies: avoiding CPU/GPU synchronization stalls (a sketch of
how to avoid them follows the list). All in all, I'd see the following
issues here, in order of importance:

   - stalls
   - copies
   - latency
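
To make the stall concern concrete, here is a minimal sketch in Vulkan
terms. Only the vk* calls are real API; `StagingSlot`, `AcquireSlot`, and
the ring setup are hypothetical names for illustration. The idea is to
cycle through a small ring of fenced staging buffers so the CPU only ever
blocks when it runs too far ahead of the GPU:

```cpp
// Stall avoidance with a fenced ring of staging buffers. Each fence must
// be created in the signaled state so the first frames don't block.
#include <vulkan/vulkan.h>
#include <array>
#include <cstdint>

constexpr uint32_t kFramesInFlight = 3;

struct StagingSlot {
    VkBuffer buffer;     // persistently mapped staging buffer
    void*    mappedPtr;  // obtained once via vkMapMemory
    VkFence  fence;      // signaled when the GPU has finished with this slot
};

// Returns a slot that is safe to overwrite. The wait only actually blocks
// when the CPU is more than kFramesInFlight frames ahead of the GPU; the
// common case returns immediately, i.e. no stall.
StagingSlot& AcquireSlot(VkDevice device,
                         std::array<StagingSlot, kFramesInFlight>& ring,
                         uint64_t frameIndex) {
    StagingSlot& slot = ring[frameIndex % kFramesInFlight];
    vkWaitForFences(device, 1, &slot.fence, VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &slot.fence);
    return slot;  // caller writes slot.mappedPtr, then submits with slot.fence
}
```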

> If WebGPU exposes a buffer mapping primitive, the application will be
able to decompress directly into either shared-memory (for Chrome) or
directly into GPU-visible memory, avoiding one copy.

Mapping a shared memory object sounds like a great idea. I believe this
should be the only way to transfer large amounts of data from the CPU to
the GPU. A list of all possible methods to do so, ordered from larger to
smaller data sizes, could be:

   1. Mapping a shared memory object, possibly persistently.
   2. Updating buffer contents via command buffers, e.g.
   `vkCmdUpdateBuffer`, which supports up to 64 KiB (see the sketch after
   this list). If it's not natively supported, the graphics backend can
   manage the staging area and schedule uploads internally.
   3. Push constants.
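
For illustration, a minimal sketch of option 2. `vkCmdUpdateBuffer` is the
real Vulkan entry point; the wrapper name is made up:

```cpp
// Small inline upload recorded directly into a command buffer. The Vulkan
// spec caps dataSize at 65536 bytes and requires the offset and size to be
// multiples of 4; anything bigger goes through a staging buffer that the
// backend manages.
#include <vulkan/vulkan.h>

void UpdateSmallBuffer(VkCommandBuffer cmd, VkBuffer dst,
                       const void* data, VkDeviceSize size) {
    // Caller guarantees size <= 65536 and size % 4 == 0.
    vkCmdUpdateBuffer(cmd, dst, /*dstOffset=*/0, size, data);
}
```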


> Since we assume the API tries to prevent data races on the GPU, it makes
sense to also prevent data races between the CPU and the GPU for the exact
same reasons.

As discussed previously, we can't (or rather don't want to) prevent all
data races on the GPU (example: UAV resource access). I'd argue that the
browser runtime doesn't necessarily need to enforce the rule here so much
as it needs to provide the means/API for the user to implement their own
race-free access patterns, with some ability to validate that access.

> This means that the CPU should not be able to read or write buffer memory
while the GPU is using the buffer.

This discards one of the most important mapping scenarios: persistent
mapping, where the user repeatedly changes parts of the mapped region and
tells the driver which ranges need to be invalidated on the GPU side. The
user can then tell WebGPU when a range of the mapped region has changed,
and this notification can propagate directly to the code that copies from
the shared memory into runtime-managed space or GPU memory (see the sketch
below).
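
To make this concrete, here is what that flow looks like at the Vulkan
level; a WebGPU-level "this range changed" call could lower to exactly
this. The `WriteAndFlush` helper is hypothetical; the vk* calls and
`VkMappedMemoryRange` are real:

```cpp
// Persistent mapping with explicit dirty-range flushes. The memory stays
// mapped for the buffer's lifetime; after each CPU write, the changed
// range is flushed so a non-coherent driver can make it visible to the GPU.
#include <vulkan/vulkan.h>
#include <cstring>

void WriteAndFlush(VkDevice device, VkDeviceMemory memory, void* mappedBase,
                   VkDeviceSize offset, const void* src, VkDeviceSize size) {
    std::memcpy(static_cast<char*>(mappedBase) + offset, src, size);

    VkMappedMemoryRange range{};
    range.sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;
    range.memory = memory;
    // In real code, offset/size must be aligned to nonCoherentAtomSize.
    range.offset = offset;
    range.size   = size;
    vkFlushMappedMemoryRanges(device, 1, &range);
}
```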

> We don’t see a compelling use for MapRead | MapWrite buffers

Yes, agreed.

Thanks,
Dzmitry


On Tue, Nov 14, 2017 at 11:51 PM, Corentin Wallez <cwallez@google.com>
wrote:

> Hey all,
>
> We wrote some documents to help everyone reason about NXT's proposals for
> memory barriers and resource upload/download. Unfortunately we still don't
> have a fleshed-out proposal that minimizes the number of copies on UMA.
> Instead the docs focus on explaining our current design for resource
> upload/download and for memory barriers since they are closely tied.
> Eventually we'll have these docs in Markdown in some repo, either WebGPU's
> or NXT's.
>
>    - NXT "memory barriers"
>    <https://docs.google.com/document/d/1k7lPmxP7M7MMQR4g210lNC5TPwmXCMLgKOQWNiuJxzA>
>    <- Please read this first as buffer mapping depends on it.
>    - NXT buffer mapping
>    <https://docs.google.com/document/d/1HFzMMvDGHFtTgjNT0j-0SQ1fNU9R7woZ4JuNJdAXBjg>
>
> Cheers,
>
> Corentin
>

Received on Friday, 1 December 2017 21:01:14 UTC