- From: Corentin Wallez <cwallez@google.com>
- Date: Tue, 28 Nov 2017 15:23:22 -0500
- To: public-gpu <public-gpu@w3.org>
- Message-ID: <CAGdfWNMm9BK16fi06eFY=qd+Q61-RTxaCtxeB33h6z-FP0t1ig@mail.gmail.com>
GPU Web 2017-11-15
Chair: Dean
Scribe: Ken w/ help from Kai
Location: Google Hangout
Minutes from last meeting
<https://docs.google.com/document/d/1oI7_SbWYs2bO0D4NqjOd8L7wAqNAswZa2nDfE1xmy9M>
TL;DR
-
Recap of the WASM CG meeting.
-
Status updates:
-
Apple: Has codegen from Secure HLSL to SPIR-V
-
Google: Wrote docs about NXT memory barriers and buffer mapping.
-
Resource upload/mapping
-
Goal of NXT’s design is to minimize copies. Explanation of the
transition concept.
-
Having resources being mapped is the best way to minimize copies in
UMA.
-
Discussion of having all resources be stageable vs. requiring a
creation flag. Became a discussion of hints vs. explicit parameters.
-
NXT’s design uses a callback to signal when the mapped buffer is
ready; discussion of whether this is necessary to avoid CPU-GPU races.
-
No meeting next week.
Tentative agenda
-
Administrative stuff (if any)
-
Individual design and prototype status
-
Resource upload / mapping
-
Agenda for next meeting
Attendance
-
Apple
-
Dean Jackson
-
Myles C. Maxfield
-
Google
-
John Kessenich
-
Kai Ninomiya
-
Ken Russell
-
Microsoft
-
Chas Boyd
-
Rafael Cintron
-
Mozilla
-
Dzmitry Malyshau
-
Jeff Gilbert
-
Yandex
-
Kirill Dmitrenko
-
Elviss Strazdiņš
-
Markus Siglreithmaier
Administrative items
-
Later: brief recap of TPAC meeting
-
License: Dean: still no update from legal folks
-
Microsoft: no updates
-
Next week is Thanksgiving in the USA; ~half the group will be on vacation
-
Will skip next week’s meeting
-
Myles: we’ve also been on a particular cadence about shading languages
-
Topic for next discussion (in two weeks) will be shading languages
-
Myles: brief recap of TPAC meeting:
-
~4 broad topics covered:
-
How an object model would look in WebAssembly
-
WebAssembly today has 4 value types: two integer and two
floating-point types. No native support for a “WebGPUDevice”.
-
How would opaque types be handled?
-
Identified to the application by number. When WASM calls a WebGPU
function, it passes the number, and the bindings look it up in a
table and pass the object to an implementation function.
-
Kai: Tables concept exists already. Right now there’s exactly one
table. Idea is to add multiple tables per type for WebGPUTexture,
WebGPUBuffer, etc.
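The per-type table idea can be sketched as follows; all names here are hypothetical illustrations, not part of any WASM or WebGPU proposal:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical opaque object living on the implementation side.
struct WebGPUTexture {
    std::string label;
};

// One table per opaque type; WASM code only ever sees indices into it.
static std::vector<WebGPUTexture> textureTable;

// What the WASM module would call: it receives a plain integer handle.
uint32_t createTexture(const std::string& label) {
    textureTable.push_back(WebGPUTexture{label});
    return static_cast<uint32_t>(textureTable.size() - 1);
}

// The bindings validate the handle and translate it to the real object.
const WebGPUTexture& lookupTexture(uint32_t handle) {
    assert(handle < textureTable.size() && "invalid handle");
    return textureTable[handle];
}
```

The WASM module itself only ever manipulates the `uint32_t`, so no pointer to the real object can leak across the boundary.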
-
How to do threading
-
In general: want the option to have threading in the WebGPU API
-
Kai: they’re close to having (or already have) SharedArrayBuffer
shared among threads, including atomics
-
Kai: WASM group said they haven’t thought much about sharing
opaque objects between threads. SharedArrayBuffer is the only
thing that
can be shared concurrently between web workers. Transferables
allow objects
to be given between web workers.
-
We might want to add more “shareable” objects.
-
Property bags / extensions / expandability
-
Have structs / fields / know where these lie
-
In Chicago, agreed there should be a way to do extensions
-
Probably don’t want a model where a struct is as big as every
possible extension; probably want a more abstract “property bag” or
“dictionary” API where the WASM author specifies they want these few
extensions.
-
WASM doesn’t have any affordances for this right now but would be
happy to investigate
-
Ken: questions about how many primitives really need to be added
or whether much of this can be done in the bindings
-
MM: talked about this. Discussed 2 models, one which was a new
primitive, and the other which was to wrap the property bags
inside opaque
objects. Make calls to set/get things.
-
Kai: Like a C interface to JavaScript API. Emscripten already does
this for WebGL. The main issue is performance.
-
Buffer mapping
-
Would be great if a WASM program could load/store GPU memory using
wasm load/store ops, not function calls. Talked about it,
have some ideas,
but nothing exists yet.
-
Two models. First: a special ArrayBuffer, where Map returns an
ArrayBuffer and loads/stores are function calls. Second: WebAssembly
could mmap() and replace a region of the WASM heap with the mmap’ed
GPU memory.
-
DM: Q about property bags:
-
Vulkan has a standard struct and a pointer at the end to chain
extensions together. If we make the property bag be a field
at the end of
the standard struct, while basics are regular fields, can
defer solution of
property bags until after MVP. Could consider this one of the choices.
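For reference, the Vulkan pattern DM describes looks roughly like this (illustrative structs in the spirit of Vulkan’s sType/pNext convention, not the real Vulkan headers):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative mimic of Vulkan's sType/pNext chaining; not real Vulkan types.
enum StructType : uint32_t {
    STRUCT_TYPE_BUFFER_CREATE_INFO  = 1,
    STRUCT_TYPE_SOME_EXTENSION_INFO = 2,
};

// Every chainable struct starts with a type tag and a pointer to the next
// extension struct, so unknown extensions can be skipped.
struct BaseStruct {
    StructType sType;
    const BaseStruct* pNext;
};

struct SomeExtensionInfo {
    BaseStruct base{STRUCT_TYPE_SOME_EXTENSION_INFO, nullptr};
    uint32_t extraFlag = 0;
};

struct BufferCreateInfo {
    BaseStruct base{STRUCT_TYPE_BUFFER_CREATE_INFO, nullptr};
    uint64_t size = 0;
};

// The implementation walks the chain looking for structs it understands.
const BaseStruct* findInChain(const BaseStruct* chain, StructType wanted) {
    for (; chain != nullptr; chain = chain->pNext)
        if (chain->sType == wanted) return chain;
    return nullptr;
}
```

DM’s suggestion amounts to: the MVP struct holds only regular fields plus this one chaining pointer, so a full property-bag design can be deferred.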
-
In Mozilla’s WebGPU prototype, all objects have an ID; they’re a
property bag on the GPU process side and can’t be represented on the
client side. Would be nice to avoid double indirection of IDs, though
it’s an impl detail from the user’s point of view.
-
MM: correct that this is an impl detail. We shouldn’t impose
restrictions on browsers that don’t use that design. Could be
security issues there, for example if there is one GPU process and
all textures are represented in a single array, because we have to
ensure that one page can’t see other pages’ textures.
Individual design and prototype status
-
Apple
-
MM: now have a SPIR-V codegen phase for JavaScript compiler (from
Secure HLSL)! Not fully featured but has compiled at least one shader.
-
This SPIR-V works in Apple’s example API structure for WebGPU that’s
checked in to the WebKit repo, implemented on top of Vulkan!
-
Now can use the best shading language in the world with the best API
structure in the world. :) :P
-
At some point will implement a Metal backend for this library.
-
Google
-
CW: written two documents that are updated views about memory
barriers and buffer mapping. Wrote them to hopefully use in today’s
discussion.
-
Uploaded a bunch of changes to index formats where they live in the
pipeline instead of the command buffer
-
Microsoft
-
RC: no status updates. Digesting emails and documents.
-
Mozilla
-
DM: nothing substantial
Resource upload / mapping
-
Documents from Google:
-
Memory barriers
https://docs.google.com/document/d/1k7lPmxP7M7MMQR4g210lNC5TPwmXCMLgKOQWNiuJxzA
-
Buffer Mapping
https://docs.google.com/document/d/1HFzMMvDGHFtTgjNT0j-0SQ1fNU9R7woZ4JuNJdAXBjg
-
CW: in NXT the idea for data uploads was to have one path which
minimizes copies. To do this you have some sort of buffer mapping concept.
Want to eliminate copies from browser side into the WebGPU visible side.
-
So questions: how to make mmapping safe? And how to avoid memory races?
-
In NXT, each resource can be in a combination of read-only usages, *or*
in a single writable usage. This restriction *eliminates* data races.
-
When you map a buffer, that only works when it’s in the correct state.
It can’t be used for anything else except being mapped. No reading vertices
from it, etc. This interaction means that NXT is data race free in this
regard, as well as eliminates copies behind the scenes. Also, GPU won’t
read from the buffer until you’re done writing it.
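A minimal sketch of the allowed-usage/current-usage validation described above (hypothetical names and flags, not NXT’s actual API):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical usage flags in the spirit of the NXT design.
enum Usage : uint32_t {
    USAGE_NONE         = 0,
    USAGE_MAP_WRITE    = 1u << 0,
    USAGE_TRANSFER_SRC = 1u << 1,
    USAGE_VERTEX       = 1u << 2,
};

struct Buffer {
    uint32_t allowedUsages;  // fixed at creation
    uint32_t currentUsage;   // changed via explicit transitions

    bool transitionTo(uint32_t usage) {
        // A transition is only valid if the usage was declared at creation.
        if ((usage & allowedUsages) != usage) return false;
        currentUsage = usage;
        return true;
    }

    bool canMapForWrite() const {
        // While mapped for write, the buffer can't be used for anything
        // else, which is what makes the design data-race free.
        return currentUsage == USAGE_MAP_WRITE;
    }
};
```

The key point is that validity is checked against a declaration made at creation time, so the implementation never has to guess how a buffer might be used.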
-
MM: questions:
-
1. Docs sounded like a resource could be in a state where it’s
readable by the CPU and usable by the GPU?
-
A: KN: when you create a resource you specify the allowed
usages. Then later you specify the current usage.
-
If a buffer has the allowed usage MapWrite, the only other valid
allowed usage is TransferSrc. Then, the current usage of the
buffer, at any
given time, can only be either MapWrite or TransferSrc.
-
CW: probably want buffers to be mappable and more than just
transfer sources or destinations. Gets tricky because there
are some other
interactions.
-
MM: that was my thought. If the only way to use the feature of
mapping data on the CPU was to “prepare” a future copy, which
is a command
enqueued in the command stream, you haven’t reduced the copies.
-
CW: there are always two copies. First, when you write the data on
the CPU: uncompress an image, etc. Then, on a discrete GPU, a
copy from CPU
visible memory to GPU local memory for best performance.
-
(When you map the buffer you write as much as possible directly
into the CPU&GPU-visible memory, e.g. decompressing data straight
into the buffer. Then copy from CPU&GPU-visible memory to GPU-local
memory.)
-
This is the minimal number of copies.
-
But: the current design isn’t great for UMA yet. Need to figure
out better how to make that work.
-
MM: Apple’s primary platform is UMA.
-
CW: for UMA: have an extension “you’re on UMA”? App would need to
do the right thing.
-
Second option: two types of mapping. (1) Transfer buffer only, app
has to do a copy; or (2) “staged”: let me map this buffer
even though it’s
GPU local (incurs extra staging buffer and copy on discrete,
but not on
UMA).
-
Think it’s similar to Apple’s approach where all 3 usages are
always allowed. Makes things simpler on GPUWeb impl side
because you know
what things have staging copies or not.
-
MM: why not allow all resources to have staging copies?
-
CW: it’s less explicit. Apps don’t want all resources to be staged;
they want control over this. If we make all resources staged then
it’s blurry.
-
JG: from OpenGL and D3D12, have types of buffers based on heap
type and/or usage hint during BufferData. Provides an
optimization hint
about how the resource will behave. You can do whatever you
want. Can map
upload heaps; it’ll be slow. But it lets the impl optimize
for uploads and
minimize the number of copies.
-
MM: think we should have hints about where things should be.
-
CW: hints aren’t great. The app should say “I’m going to use this
thing this way”. You need to support all paths in the impl
anyway. If the
app says that it’ll only use a resource in a particular way
then it can
optimize things much more.
-
CW: glBufferData(.., GL_STATIC_DRAW). Driver assumes it lives in
GPU local memory. But you can really do whatever you want. Many OpenGL
drivers have heuristics and use the hint to prime the
heuristic. But the
impl still needs to have usage counters, etc, and
optimizations. A hint
that doesn’t enforce anything makes things worse. Still have
to handle all
possible cases as the implementation.
-
CW: in ANGLE we have up to 4 copies because it could be used in 4
different ways. Since it’s just a “hint”, there’s no way to
eliminate this.
-
KR: A huge goal of new explicit API is to remove the possibility
of implicit shadow copies, and give the memory control to the
application.
-
JG: appreciate how it can make things easier if you can reliably
say “I just need a copy of this”, but at the same time the primitive
we need is “I have a resource, and it’s either CPU mappable or
not”. Based on CPU mappable, use a staging buffer or not. A
sophisticated impl might decide more behind the scenes. Also
possibility of polyfills: can copy to a staging buffer and back, and
give you a mmap’ed copy later. But let’s focus on the primitives the
hardware needs to give to us.
-
MM: reducing the API to the lowest level primitives isn’t always
the best way to make an API. Many ways to judge an API.
-
JG: main idea is that you can do whatever you want if you have
these primitives. But if you use a higher level primitive
then you force
this decision on users of the API.
-
MM: there’s a difference between letting someone write a library
and forcing them to do so.
-
CW: hope was that the buffer mapping proposal would seem simple
enough and also provide low level primitives to satisfy both points of
view. Get the best performance but at the same time it’s easy to use.
-
In NXT, we have one extra affordance: sometimes you want to put
the data in the buffer right now, and not have to
implement ring buffers,
etc. Proposing: “put this data in this buffer right now”, too.
-
MM: might be a solution where both are available in the API. Not
sure we’ve done enough research to know. Would like to see something
written about the UMA approach. Think it’s possible to come to a
compromise.
-
CW: AI: write proposal to cover UMA and minimizing number of
copies.
-
JG: question about callback mechanism. Why not just use fences and
signaling?
-
KN: NXT doesn’t have fences and signaling yet. :)
-
CW: also, with a callback, the buffer only becomes valid once the
callback fires. Makes it easier in Chrome where we may need to do
fences, etc., and where memory may or may not become valid in a certain
order. Also complexities with multiple queues.
-
JG: callbacks add another weird lifetime management issue. What
thread do they run on, etc.
-
CW: it’s a Promise. The only reason it’s a callback is that there’s
no Promise handling in our code generators yet. Part of the UI that we
didn’t spend time on.
-
RC: how’s the callback implemented?
-
CW: Implemented in NXT by, on D3D12, incrementing a fence
periodically. Once the counter reaches a certain value (the
GPU stream has
executed past that point), the callback is fired.
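The fence-serial mechanism CW describes can be illustrated like this (a generic sketch of the pattern, not NXT’s actual implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>

// Illustration of the fence-serial callback pattern: callbacks are queued
// with a target serial, and fired once the GPU is known to have executed
// past that point.
class CallbackTracker {
  public:
    // Register a callback to fire once the GPU has passed `serial`.
    void onSerialPassed(uint64_t serial, std::function<void()> callback) {
        pending_.emplace(serial, std::move(callback));
    }

    // Called periodically with the latest fence value the GPU completed.
    void tick(uint64_t completedSerial) {
        while (!pending_.empty() &&
               pending_.begin()->first <= completedSerial) {
            pending_.begin()->second();  // e.g. the buffer is now mappable
            pending_.erase(pending_.begin());
        }
    }

  private:
    std::multimap<uint64_t, std::function<void()>> pending_;
};
```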
-
RC: interesting that we just had the discussion about having a
library vs. making things easier, and here’s a way that NXT’s
making things
easier.
-
RC: D3D12 has a way to set an Event once the GPU has passed a
certain fence value (ID3D12Fence::SetEventOnCompletion).
-
CW: might be able to do this.
-
DM: don’t understand why we’ve made a separate callback mechanism for
each resource. And different usage transitions, outside the command buffer.
-
JG: are callbacks a germane or incidental part of the proposal?
-
CW: our proposal does have a statement “you can now read the
pointer”. But the way it’s signaled isn’t important: set a flag, call
a callback, etc.
-
DM’s question was about why there’s a transition at the device level
/ immediate command when other transitions are in command buffer.
-
It’s just the map operation that can’t be in the command buffer.
-
DM: so you have a separate mechanism for mapping where you could just
put a fence in the command buffer?
-
DM: in D3D12 and Vulkan, you’d wait on a fence that gates executing
the GPU command buffers. When the wait is over, the user makes sure
that they’ve transitioned the resource to a mappable state. Then they
map it, etc. Would need some mechanism through fences to call back
from the GPU anyway. The buffer-specific mechanism seems to duplicate
this.
-
CW: part about transition to MAP_READ state is because we are trying
to prevent data races (single-write-multiple-read pattern like Rust
borrowing). Need to flush CPU cache, etc.
-
CW: When you submit a command buffer that’s transitioned a bunch of
buffers to MAP_READ with this serial number on this queue --
when you map,
we make sure the app can’t see memory before it’s available, for both
security and portability.
-
DM: We are going to have a general mechanism for the GPU to call
back to the CPU. Call map(), etc.
-
CW: how will the browser enforce that the memory is up to date for
the app?
-
CW: think we should enforce portability as long as it doesn’t have
performance impact. This is the same thing the app would do itself, but
enforce that the app can’t see e.g. uninitialized GPU memory.
-
MM: agree with CW: avoiding these data races is important.
-
CW: hope these documents help you agree that the performance is the
same as if the application did it itself.
-
DM: have yet to read all documents. May come with follow-on questions
in email.
-
MM: general philosophy: if the browser can do the same thing the
web author would do, it’s best for the browser to do it.
-
DM: concerned about the redundancy in the API.
-
KR: Very complicated (in multiprocess browser) to avoid synchronous
stalls and provide a mapping primitive. Synchronous MapBuffer is
impossible. Browser either incurs extra copies, or stalls to sync between
GPU, GPU process, and JS process.
-
JG: it’s easier in WebGL because there’s a monotonic queue and a
before/after effect with fences. Less clear in Vulkan. Think there are ways
to do this which preserve the Map style API while still allowing remoting
implementations to do this well, capably and transparently. Relies heavily
on usage hints or attribute settings to indicate whether a buffer read will
happen or not, hold on to a shmem you’re going to copy into, holding on to
staging buffers. Issues about signaling the client. But think it can be
done.
Agenda for next meeting
-
Next week: no meeting (US Thanksgiving).
-
In two weeks: shading languages.
-
AI: Apple to provide a document on Secure HLSL
-
AI: Google to draft a small SPIR-V execution environment for WebGPU;
how it would be embedded, etc.
-
Next non-shader meeting:
-
AI: Google to make a proposal for UMA.
Received on Tuesday, 28 November 2017 20:24:09 UTC