- From: Corentin Wallez <cwallez@google.com>
- Date: Mon, 11 Dec 2017 18:20:43 -0500
- To: public-gpu <public-gpu@w3.org>
- Message-ID: <CAGdfWNPHJxHV94sPQhT5kcBU8dGHM-oRjPf8M3E=s3R=xaTEBA@mail.gmail.com>
GPU Web 2017-12-06
Chair: Corentin
Scribe: Ken
Location: Google Hangout
Minutes from last meeting
<https://docs.google.com/document/d/12JQDw8gNE2MslMWRcK5sjZ3CsafhzIRZ6RZDyX-8MZk>
TL;DR
-
Status updates
-
Apple: posted a Secure HLSL
<https://github.com/gpuweb/gpuweb/wiki/Secure-HLSL-Explainer> doc
-
Google: posted a UMA buffer mapping doc discussed in this meeting
-
Mozilla: specialization constants in gfx-rs
-
Buffer Mapping and UMA doc
<https://docs.google.com/document/d/1HFzMMvDGHFtTgjNT0j-0SQ1fNU9R7woZ4JuNJdAXBjg/edit#heading=h.1qqspik2b22p>
-
Google proposing alternative 2, a UMA extension
-
Discussion of alternatives to callbacks for synchronization:
promises, or synchronous non-blocking queries.
-
The WEBGL_get_buffer_sub_data_async discussions don’t carry over to WebGPU
because the structure of the APIs is different.
-
Should ask ISVs for input to know what’s most useful to them.
-
Persistently mapped buffers (PMBs) are important to applications. The
proposal doesn’t expose them directly.
-
Concern that PMBs wouldn’t be device independent.
-
One more discussion of device independence vs. best performance.
Suggestion we have opt-in device dependent behavior.
-
Do we want to be able to map-write to half a buffer while the
other half is in use?
Tentative agenda
-
Buffer mapping and UMA
-
Agenda for next meeting
Attendance
-
Apple
-
Dean Jackson
-
Myles C. Maxfield
-
Theresa O'Connor
-
Google
-
Corentin Wallez
-
Victor Miura
-
Kai Ninomiya
-
Ken Russell
-
Microsoft
-
Rafael Cintron
-
Mozilla
-
Dzmitry Malyshau
-
Jeff Gilbert
-
Markus Siglreithmaier
-
Yandex
-
Kirill Dmitrenko
-
Elviss Strazdiņš
Buffer mapping and UMA
Google’s buffer mapping doc with UMA alternative
<https://docs.google.com/document/d/1HFzMMvDGHFtTgjNT0j-0SQ1fNU9R7woZ4JuNJdAXBjg/edit#heading=h.1qqspik2b22p>
-
CW: For UMA vs. discrete GPUs:
-
They’re contrary to one another. What’s efficient on UMA is different
from what’s efficient on discrete GPUs.
-
Want one code path that is efficient on both
-
Most efficient: either have two code paths, or have all resources be
mappable
-
Creation flag for whether a resource is mappable or not
-
Either: UMA extension that allows mapping resources, or implicit
staging on the discrete GPU.
-
There was a third alternative: “I want to be able to map this and
maybe it’ll be staged”, but that was strictly inferior.
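For illustration only, a minimal sketch of the creation-flag idea plus a
UMA extension, as discussed above. All names here are invented for this
example and are not from the proposal or any shipping API.

    // All names below are hypothetical.
    const USAGE = { MAP_WRITE: 0x1, TRANSFER_DST: 0x2, VERTEX: 0x4 };

    interface GpuBuffer { readonly size: number; }
    declare function createBuffer(desc: { size: number; usage: number }): GpuBuffer;
    declare function hasUmaMappingExtension(): boolean;

    // Portable path: the vertex buffer itself is never mappable; uploads go
    // through a MAP_WRITE staging buffer plus a copy.
    // With the hypothetical UMA extension, the vertex buffer can be created
    // mappable too, and the staging buffer and copy can be skipped.
    const size = 64 * 1024;
    const vertexUsage = hasUmaMappingExtension()
      ? USAGE.VERTEX | USAGE.MAP_WRITE
      : USAGE.VERTEX | USAGE.TRANSFER_DST;
    const vertexBuffer = createBuffer({ size, usage: vertexUsage });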
-
MM: upon reading the 3 options was wondering: seemed all of the
affordances for avoiding CPU-GPU data races had been removed from the UMA
code, is that true?
-
CW: no, it’s the map ext async mechanism that was presented at the
start of the document.
-
It’s a question of what’s allowed on what type of resource
-
Alternative 3: all resources can do anything. On UMA, can say “Map
Write Async”. Get pointer. On UMA, writes directly to resource. Callback
mechanism ensures no data races. On discrete, gives you back a staging
buffer instead. When you unmap, it does a copy.
-
Is it clear this still prevents data races?
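An illustrative sketch (invented names, not an actual API) of the
callback-based map-write just described: the callback runs once the
implementation can hand out a pointer without racing the GPU.

    interface GpuBuffer { readonly size: number; }
    declare const vertexBuffer: GpuBuffer;
    declare const vertexData: Float32Array;
    // Callback fires once the GPU is no longer using the buffer.
    declare function mapWriteAsync(
      buffer: GpuBuffer,
      callback: (mapped: ArrayBuffer) => void,
    ): void;
    declare function unmap(buffer: GpuBuffer): void;

    mapWriteAsync(vertexBuffer, (mapped) => {
      // UMA: this writes straight into the resource.
      // Discrete: this writes into a staging buffer that is copied into
      // the resource when unmap() is called.
      new Float32Array(mapped).set(vertexData);
      unmap(vertexBuffer);
    });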
-
JG: why are we using async?
-
MM: because the GPU might be using the resource at the time the CPU
wants it
-
CW: in NXT we want to eliminate data races
-
JG: usually have read/write exclusion. Don’t need async for that.
-
CW: yes, we could tell the application that you have to transition to
MAP_READ usage and then wait for the resource to be no longer used, then map
synchronously. It would return an error if it can’t guarantee safe
simultaneous use.
-
Then app wouldn’t have to guess when WebGPU thinks it’s safe to write
the resource. That’s why it’s async.
-
JG: callbacks are traditionally tricky to put into an API because you
don’t know what’s on the call stack when the callback’s invoked.
-
MM: we get to design this
-
JG: it’s simpler if you don’t have these things.
-
CW: what about a Promise?
-
JG: same thing, more steps. Seems unnecessary. If you’re writing data
out to the GPU, you WRITE_DISCARD something, write into it, and
send it off
to be committed into the GPU resources. In particular, why make writes
async?
-
TOC: it’s unusual on the web for things to not be async
-
JG: it’s unusual to do low-level graphics APIs on the web
-
CW: there’s also MapWriteSync. Pipeline the write if necessary,
otherwise schedule the write now.
-
To be race-free, need writeback stuff only when it’s in use by the
GPU. How does the app know the resource isn’t in use by the GPU any more?
-
Can have a polling loop. But the callback tells you when it’s ready.
Understood that callbacks aren’t preferred.
-
JG: think they’re not necessary. In investigation for WebGL’s
GetBufferSubDataAsync, didn’t find any advantage for doing any writes
async. If you want to use the same shmem and a deep enough pipeline that
you want to rewrite into the shmem before it’s committed…
-
MM: this is one of the general discussions we’ve had between ease of
implementability vs. ease of use. From the web’s standpoint it’s better to
not have to ask every frame “is this resource ready?”
-
KD: if you have async then you’re typically forced to wait for the
callback, put the operation into a queue, and schedule it. If
you’re in the
middle of an animation you don’t want to do expensive writes.
There will be
some scheduling around it anyway. Most applications carefully schedule
expensive operations like writes.
-
VM: wouldn’t an application have to use fences/queries to ensure a
buffer is writable?
-
JG: for reading yes, but for writing no. Say you update skinning data
by splatting the data into a buffer. Next frame, you send a new update of
the skinning data to the GPU. If it’s not ready to be written you would
stall.
-
VM: you would use multi-buffer to ensure you don’t stall, and use
some sort of synchronization.
-
RC: think Kirill is on the right track. In a typical web app you
render in response to requestAnimationFrame. When rAF comes, you ask “which
buffers can I use?” and check your own data buffers to figure out which can
be used. Would use multiple buffers and put them on some list when their
callbacks are called. Not much different from exposing to the user “this
buffer is ready/not ready”.
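A sketch of the application-side bookkeeping being described, with
invented names: a small pool of upload buffers, each of which lands on a
“writable” list when its map callback fires, and each rAF picks from that
list.

    interface GpuBuffer { readonly size: number; }
    declare function createUploadBuffer(size: number): GpuBuffer;
    declare function mapWriteAsync(b: GpuBuffer, cb: (m: ArrayBuffer) => void): void;
    declare function unmap(b: GpuBuffer): void;
    declare function writeFrameData(dst: ArrayBuffer): void;
    declare function submitWorkUsing(b: GpuBuffer): void;

    const pool = Array.from({ length: 3 }, () => createUploadBuffer(64 * 1024));
    const writable: { buffer: GpuBuffer; mapped: ArrayBuffer }[] = [];

    // Each buffer joins the writable list once its callback fires,
    // i.e. once the GPU is done with it.
    const requestMapping = (buffer: GpuBuffer) =>
      mapWriteAsync(buffer, (mapped) => writable.push({ buffer, mapped }));
    pool.forEach(requestMapping);

    function frame() {
      const entry = writable.pop();        // "which buffers can I use?"
      if (entry) {
        writeFrameData(entry.mapped);
        unmap(entry.buffer);
        submitWorkUsing(entry.buffer);
        requestMapping(entry.buffer);      // recycle for a later frame
      }
      requestAnimationFrame(frame);
    }
    requestAnimationFrame(frame);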
-
MM: agree with Rafael
-
CW: agree with everything except that the app could do the same with
fences, etc. Question is: if app doesn’t put in fences at all, puts a lot
of pressure on the implementation to order things properly. Guaranteeing
e.g. “all buffers before this fence are now ready”.
-
RC: just saying that a callback thing is worse or the same amount of
effort as them managing their own fences. Or we have to arrange our
own fences so that we have enough bookkeeping to answer the user’s
question. In general, against the use of callbacks. Do enough bookkeeping
to tell users the current state of things.
-
MM: that bookkeeping you describe is exactly the infrastructure in
Promises.
-
JG: would have to track which fences have passed, ... , and is the
buffer in use / ready to write to. Orthogonal to promises/callbacks/etc.
Track which fences have passed. Be able to tell if this buffer is ready.
Can: trigger callback, set a boolean, trigger a promise.
-
MM: this sync problem needs to happen somewhere. The browser can do
it just as well as the app.
-
JG: I found that using a synchronous command structure worked just as
well.
-
KR: This isn’t about getBufferSubDataAsync. We agree your
proposal would be possible to do. We are concerned about how easy it is to
fall off the fast path. For the readback case you need to do the readback
and wait for a fence to pass before you can read the data. We would like
things to be more explicit than that, with an open query for the readback;
we found it makes the tracking easier in our implementation.
-
JG: Appreciate the feedback, but GL is a well-ordered API so for GL
using fences should be OK.
-
KR: In our implementation it makes things harder with the ordering of
fences, and we have to delay things.
-
CW: Let’s move on because a lot of these details are very WebGL
specific.
-
KR: The question is whether we should do async vs. sync for WebGPU.
For developer convenience it seems we should follow the native explicit
APIs and have a synchronous API.
-
VM: may want to talk with ISVs about what they expect. They probably
have a synchronous pipeline. They may use async callbacks...
-
CW: ISVs will probably have strong opinions about this. Was assuming
that async was best for the web, though sync might work. Whether you use
callbacks, syncs, promises, etc. – can solve these differences –
are there
comments on the UMA proposal?
-
RC: to what degree should we hold the developer’s hand and make sure
they can’t screw it up? Or if they screw it up and things don’t
look good,
is that OK? Should we go the extra mile to do validation so they
can’t mess
up?
-
MM: we should try very hard to make it very difficult for users to
shoot themselves in the foot. Unless I’ve misunderstood, should be
impossible for API to block the main thread. A sync API would
cause the API
to block, right?
-
CW: no, the sync version would be “can I map you now”? And if you try
to map while the answer is “no”, you get null back and an error. All
non-blocking.
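An illustrative sketch of that non-blocking synchronous variant (invented
names): the call never waits, it simply returns null when the buffer can’t
be mapped yet.

    interface GpuBuffer { readonly size: number; }
    declare const uploadBuffer: GpuBuffer;
    declare const data: Float32Array;
    // Returns null (and flags an API error) if the buffer is still in use.
    declare function mapWriteSync(buffer: GpuBuffer): ArrayBuffer | null;
    declare function unmap(buffer: GpuBuffer): void;

    const mapped = mapWriteSync(uploadBuffer);
    if (mapped !== null) {
      new Float32Array(mapped).set(data);  // safe: the GPU is not using it
      unmap(uploadBuffer);
    } else {
      // Not mappable this frame; skip the update or use another buffer.
    }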
-
CW: agree the API should prevent shooting yourself in the foot. Can
see the difference between discrete GPU and UMA arch. On
discrete GPU your
app might work b/c your GPU has caches which hide things. On UMA, your
writes are directly visible to the GPU, might be broken on UMA
only. If you
have a persistently mapped buffer that you can write to while the GPU’s
using it, hard or impossible to write a validation layer.
-
DM: can’t we prevent persistent mapping in general?
-
MM: yes, if you can keep something mapped for 1000 frames, you’ll run
into sync problems on some drivers.
-
DM: it’s a really powerful use case for some users. Not something
that can be discarded trivially. Games would like to update parts of the
texture atlas while other parts are in use.
-
MM: there’s a general principle that the web’s device independent. If
you can write an app that works on your device but is broken on all others,
that’s bad.
-
JG: We’ve already made the Web device dependent.
-
KR: Not true, in WebGL we got rid of MapBuffer because in Chromium it
is not possible to have these synchronous APIs. A lot of work in WebGL is
to make the API device independent.
-
JG: For example different limits in WebGL.
-
CW: I think we should learn from WebGL. If you want something above
the limits you should have to ask for it. It’s a different discussion.
-
VM: Strongly opposed to fragmentation. Need to learn from the lessons
we’ve learned. The more we can ensure write-once, run-everywhere across
browsers and devices, the more this API will see adoption.
-
JG: past a certain point WORA doesn’t work for performant
applications. Have to compromise on this a little bit for best
performance.
-
VM: it’s a question of supporting the few or the many. Most
developers can’t test on all devices. Strongly feel we should enable apps
to work by default.
-
JG: yes. And if they use Three.js Version 2.0, they will. Don’t want
to do too much handholding because the ecosystem will do a
better job than
we can.
-
CW: the goal isn’t to make an API that’s easy to use. Whether you’re
an expert or a beginner, if you make an app that doesn’t ask for extensions
or limits, it should work on another machine.
-
JG: agree.
-
CW: not sure that the app will be optimally performant, but it will
work. Goal is to make it opt-in to be device dependent.
-
VM: we’re making user agents here. Try to prevent bad experiences for
users. Shouldn’t be just up to the developer to “not mess it up”.
-
JG: strongly caution against going too far in that regard. Preventing
screw-ups also prevents well-meaning developers from doing a good
engineering job.
-
CW: goal is to have low-level, deterministic primitives.
-
CW: mapping a buffer persistently is scary from this standpoint.
Opting the developer in to races.
-
JG: only if you allow multiple writes at the same time. Can we have a
single-writer API? Read-write mutually exclusive.
-
MM: thought the problem was CPU vs. GPU collisions.
-
VM: this was a use case requested earlier. Writing to half a buffer
while the other half’s in use. Is this something we want?
-
MM: maybe can have a system that makes this well defined. Partition
buffers?
-
KN: think it’s not that difficult to make something that works this
way but might have performance penalties.
-
CW: can make sub-buffers and persistently map a larger buffer. Can
get in the way of robust buffer access checks done by the hardware. Can
round-robin between multiple buffers.
-
MM: sounds like there’s a persistent mapping but it’s in the
implementation?
-
CW: yes. You can only map the whole buffer at once, so you can’t really
do a ring buffer (instead, a ring buffer of buffers). But there’s an API
for “I want to update this sub-part of a buffer now”. Does a copy.
-
RC: so in your implementation the “buffer” the user sees is a window
into a bigger API buffer?
-
CW: no, two mechanisms. Can update a buffer by mapping it (the
discussion earlier); that only maps the whole thing. But, if that annoys
you, you can say “I want to put this data at that offset in this buffer”.
Essentially BufferSubData.
-
RC: does it let you do that while it’s in use in the GPU?
-
CW: yes. To do that you need the buffer in the TRANSFER_DESTINATION
usage.
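A sketch of that second mechanism, with invented names: an explicit “put
this data at that offset” call on a buffer created with the
TRANSFER_DESTINATION usage, where the implementation handles the staging
copy and the synchronization internally.

    interface GpuBuffer { readonly size: number; }
    // Hypothetical helper: the implementation stages the data and schedules
    // the copy on the queue timeline, so no explicit fencing is needed.
    declare function setSubData(
      dst: GpuBuffer,            // must have TRANSFER_DESTINATION usage
      dstOffset: number,
      data: ArrayBufferView,
    ): void;
    declare const skinningBuffer: GpuBuffer;
    declare const skinningMatrices: Float32Array;

    // Updates part of the buffer even if other parts are in use by the GPU.
    setSubData(skinningBuffer, 4096, skinningMatrices);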
-
MM: agree with general tenets of this approach - of preventing you
from writing into buffers that are in use by the GPU. Need to look at
general mechanism in more detail.
-
RC: will our code for doing this automatically do better than code
written by an application that knows what its intent is?
-
MM: depends on what you mean by “better”. Portable? definitely. If
you’re writing for one specific device might be able to do better.
-
RC: meaning that the web developer might need fewer fences, group
things using fewer primitives.
-
VM: interesting presentation from AMD: if you want the optimal thing
might have to present a graph to the API. Neither the driver nor the
developer can do the best thing for each architecture.
-
JG: how much of that graph is what we get from submitting command
buffers with dependencies?
-
CW: let’s not get into a task graph API. The industry hasn’t
converged yet.
-
JG: this is a task graph API already.
-
CW: NXT is an online task graph API. AMD’s proposal was an offline
plus online task graph API.
-
JG: Vulkan’s a task graph API
-
KN: AMD was saying that Vulkan’s task graph API isn’t enough
– they need more, higher-level stuff.
-
VM: need hindsight to do the optimal thing. Vulkan’s very low level.
Have to expose more and more flags to the developer to do the
right thing.
Also can’t express future architectures.
-
JG: please put a link to that in the minutes
-
CW: will try to, not sure it was public.
-
RC: if we generalize this and make it so the web developer can’t
shoot themselves in the foot, will their code be as performant as a
developer doing it themselves?
-
CW: no. The best we can do is the best the developer can do. The
question is how much we lose. I think we’ll lose very little.
The worst you
do is a ring buffer of buffers instead of a single buffer.
That’s probably
the biggest loss. Aside from that, we’re probably doing what any sensible
app would do.
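For concreteness, a sketch of that “ring buffer of buffers” pattern
(invented names): instead of persistently mapping one large ring buffer,
round-robin over a few small buffers, each written only once the GPU has
finished with it.

    interface GpuBuffer { readonly size: number; }
    declare function createUploadBuffer(size: number): GpuBuffer;
    declare function mapWriteAsync(b: GpuBuffer, cb: (m: ArrayBuffer) => void): void;
    declare function unmap(b: GpuBuffer): void;

    const ring = Array.from({ length: 4 }, () => createUploadBuffer(256 * 1024));
    let next = 0;

    // Write `data` into the next buffer in the ring once it is safe to do so,
    // then hand the buffer back to the caller (e.g. to record a copy or draw).
    function upload(data: Float32Array, onReady: (b: GpuBuffer) => void) {
      const buffer = ring[next];
      next = (next + 1) % ring.length;
      mapWriteAsync(buffer, (mapped) => {
        new Float32Array(mapped).set(data);
        unmap(buffer);
        onReady(buffer);
      });
    }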
-
CW: while doing NXT’s Vulkan backend, have been doing this buffer
upload stuff, and have been writing code exactly like what an application
would do. We end up doing per-resource tracking, while the app could
plausibly do this at a coarser granularity.
-
RC / CW: when is this checking done? Queue submits.
-
RC: if we think that we won’t lose much perf, there’s a lot to be
said for making things more portable and getting fewer bug reports from
people when it’s really their own bookkeeping at fault.
-
JG: would like to learn more about when this would be the case and
what guarantees the APIs we’re building on top of have. Would be
good if we
didn’t have to lock things down too badly.
-
KD: can we disallow people from shooting themselves in the foot on
the web, but allow opt-in performance?
-
CW: scary because everyone will opt-in to the high performance unsafe
mode. But, seems OK to add more knobs to apps that request them.
-
CW: goes back to UMA and buffer mapping. That’s why we’d like to have
a UMA extension which provides that functionality. Opting into footguns
that would make apps non-portable doesn’t sound OK.
-
JG: you’re saying, better to have performance footguns than
correctness footguns?
-
CW: yes. If you have correctness footguns that give you 1% more perf,
everyone will use them and the apps will be nonportable.
-
JG: for some developers, performance is part of the minimum viable
product. Let’s not throw it out blindly. Should be trying to avoid
performance cliffs.
-
Google / CW: still somewhat happy with our proposal (async vs. sync is a
problem that we’ll have to decide). Concern about persistent buffers used
for ring buffers. Doesn’t account for sub-buffer usage. Could be a problem
for ISVs but we aren’t sure. Our points: avoid footguns but have no perf
impact from avoiding footguns.
-
KR: think sync uploads, async readbacks.
-
KN: in NXT, have both async and sync uploads in the design. Think
async uploads have uses but not sure.
-
Apple / MM: should be difficult to make an app where uploads/downloads
are non-portable. Should be an async API. Should be one code path for UMA
and discrete. Safety and portability are important and should not be
sacrificed. Some small perf penalty is worth that correctness.
-
Microsoft / RC: if we can make uploads portable and not impact perf then
we should do that. If we do UMA as an extension, let’s do that. They added
it to D3D12 because they thought there was a need for it.
-
Mozilla / JG: struggling to apply WebGL investigation to these APIs. Big
Q is what ordering guarantees look like. Hope there’s a good solution that
keeps these looser ordering guarantees in mind and allows very safe
read/write exclusion that makes sense and prevents footguns. Think there’s
not a strong benefit for async uploads. Think sync uploads are just as fast
and don’t hurt. Readbacks, not so sure, esp. because orderings are looser.
-
Mozilla / DM: explicit memory management is one of the biggest
advantages of the low level APIs. Persistent mapping is a critical feature.
Would prefer to have only two timelines, the app side and one per hardware
queue that executes it. This means a sync based API is preferred. (Can I
map/use this?)
Agenda for next meeting
Next week: shading languages.
Received on Monday, 11 December 2017 23:21:33 UTC