Minutes for the 2017-07-26 meeting from Corentin Wallez on 2017-07-27 (public-gpu@w3.org from July 2017)

From: Corentin Wallez <cwallez@google.com>
Date: Thu, 27 Jul 2017 16:53:40 -0400
To: public-gpu@w3.org
Message-ID: <CAGdfWNOdtqx6kwxZWhatQBjhgCgcQ5gzhDCqAh1h0siQAF0XiQ@mail.gmail.com>

GPU Web 2017-07-26

Chair: Dean

Scribe: Ken

Location: Google Hangout
Minutes from last meeting
<https://docs.google.com/document/d/1ULhowm6p576LOL10DViB4Ct2-0qGJIoffWrqnFNHqU0/>

Tentative agenda

Administrative stuff (if any)

Individual design and prototype status

More on render passes.
-

Memory barriers
-

Agenda for next meeting

Attendance

- Dean Jackson (Apple)
- Julien Chaintron (Apple)
- Myles C. Maxfield (Apple)
- Austin Eng (Google)
- Corentin Wallez (Google)
- Kai Ninomiya (Google)
- Ken Russell (Google)
- Daniel Johnston (Intel)
- Ben Constable (Microsoft)
- Rafael Cintron (Microsoft)
- Dzmitry Malyshau (Mozilla)
- Fernando Serrano (Mozilla)
- Jeff Gilbert (Mozilla)
- Alex Kluge (Vizit Solutions)
- Doug Twilleager (ZSpace)
- Elviss Strazdiņš
- Tyler Larson

Administrative items

Details on the September meeting:
https://lists.w3.org/Archives/Member/internal-gpu/2017Jul/0004.html
-

Reply to Corentin’s email if you’re going to attend
-

CW: Deadline for replying is September 3rd.
-

Software license update
-

DJ: told by Apple’s legal team; meeting next week with Microsoft,
Mozilla and Google. Seems to be going well

Individual design and prototype status

Google
-

CW: Kai has been working on making actual SwapChains in NXT
-

Can later try to port this on the web front and render to multiple
canvases with the same device
-

Austin has been adding a bunch of fixed function state: primitive
topology, … .
-

Trying to add essential features for a demo or game-like thing.
-

DJ: have you done this for WebGL? ImageBitmap, etc.?
-

Yes, Justin Novosad has been working on this:
-

Either render on an offscreen canvas, get an ImageBitmap and
transfer it,
-

or give control to the offscreen canvas and render to it.
This path is currently broken.
-

Not sure what the intended model is for rendering to multiple
GPUWeb canvases.
-

In other words, do we need OffscreenCanvas at all for GPUWeb if we
have SwapChains that can be targeted?
-

CW: Discuss at F2F?
-

Can ask for a context that’s an “NXT Surface” for a canvas and
get a swapchain object from it.
-

KN: create an NXT device out of thin air, and get SwapChains for
multiple canvases from it.
-

Make it so SwapChains can be transferred to workers and used
there. Should be like OffscreenCanvas.
-

DJ: sounds pretty similar to how Apple’s original prototype
worked. Made it look like WebGL way of setting things up
(fetch context, etc.), and then ask for next texture to render
into. Felt a bit weird.
-

Mozilla
-

DM: lot of internal work in the graphical abstraction layer, but not
a lot of new features or milestones to report
-

Microsoft
-

RC: no code contributions yet (to NXT, etc.)
-

Apple
-

DJ: No news, but started internal discussions on shading languages,
trying to get info on the sampleMask issue.

CW: JohnK will be discussing shading languages with us in 2 weeks (August
9), so will get info straight from the source.

Render passes

https://github.com/gpuweb/gpuweb/issues/23

DJ: mostly had consensus on this last week
-

What was left? Inheritance? Subpasses?
-

CW: people mostly agreed on having “multi-pass” render passes
-

Mostly to see if people had more to say after investigating internally
-

Tentatively: let’s do multi-pass render passes.
-

DJ: fine with Apple.
-

RC: did look at Corentin’s presentation. As long as it’s possible to
implement efficiently on D3D12, it’s fine.
-

MM: conceptually render passes are a good idea. Were discussing which
pieces of state should be part of a render pass.
-

DJ: what are next steps?
-

Should Corentin or someone define an API? Or move on with the
assumption that we have an agreement of the overall shape of this part?
-

KR / MM: seems too early to define an API
-

DM: agree, also tied in to memory barriers
-

CW: maybe we can make a comprehensive list of things we want to talk
about before looking at the shape of the API
-

DJ: Could do this on the wiki/github
-

MM: an issue per issue, then close them?
-

DM: Github milestones?
-

KR: spreadsheet?
-

DJ: volunteers?
-

MM: I can start!

Pipeline state objects

DJ: do we have complete consensus on pipeline state objects?
-

CW: still details like sampleMask
-

MM: probably should include depth/stencil state
-

Needs to be the intersection of the three APIs. Metal’s doesn’t
include this, but WebGPU’s probably should since the other APIs
do include it.
-

CW: whatever makes backing APIs do no extra work at run time.
-

All the things that are given at pipeline state creation time in the
underlying APIs should be part of the pipeline state creation in WebGPU.
-

Except for Vulkan’s scissor state which is insane.
-

DJ: Metal team said they had little feedback from developers about
needing sampleMask. Think it is fine to leave it out in version one.
-

CW: Sounds fine to leave it out, can add it easily later.
-

DJ: Trying to get contacts of developers from the Metal team so we can
talk with them.
-

DM: In Vulkan, some pipeline state, like separate blend state, is gated on
device features and might not be available on all hardware. Would
of course require some Vulkan features to stay sane.
-

DJ: any hardware that doesn’t support it?
-

DM: Vulkan has a very rich set of flags for device features
-

CW: can look at the Vulkan hardware database and see if a feature is
available universally, ignoring obsolete hardware
-

For example, if ARM doesn’t support a feature, definitely want to
support it on ARM
-

Agreement from Apple, Mozilla

Memory barriers

Continuation from last week’s meeting

CW: were talking about implicit vs. explicit barriers
-

Implicit barriers make it easier on the app, give more leeway to the
implementation to optimize
-

Explicit: give more control to the application to batch memory
barriers together & have less CPU overhead
-

MM: why do they have less CPU overhead?
-

CW: if you have implicit ones and want to batch them, then if you’re the
driver, it’s somewhat easy. You have the command buffer in memory and can
go backward, say “want this memory barrier to be bigger”.
-

But if you’re doing this in an encoding API, you have to walk
backwards in the command buffer and then walk back forward.
-

Have looked at the Intel and AMD open-source Vulkan drivers.
-

Barriers in these drivers are just a couple of opcodes. Can edit if
you submitted them in the past.
-

But you can’t do this with any of the backing APIs.
-

MM: this is an argument for why the driver’s better at this than the
browser, not that the browser’s better at this than the author.
-

JG: automated batching is not an easy problem.
-

CW: Metal can do this more easily
-

The argument “Metal does this, so it’s possible” is not a good
argument because we don’t have as much control over this as the
driver does.
-

MM: sounds like somebody would need to try it. Having a fundamental
disagreement about how difficult this is.
-

CW: what do we want to know exactly? Build two prototypes, one with
implicit and one with explicit transitions?
-

DJ: it’s difficult to ask a browser to implement implicit transitions
and batching to prove that it works…
-

MM: at some point you have to check all the resources are in the right
state, as late as possible. Why not just issue the barriers there?
-

CW: barriers are “flush the whole L1 cache”, for example. Want to
coalesce them.
-

MM: coalescing would happen as late as possible. Checks you would do are
at the same time that the coalescing would occur.
-

CW: disagree. Example: doing some dynamic mesh with compute. Have
writeable buffers. Then start using as vertex buffers. Barriers have to be
“inside” the command buffer, not at encoder start/end.
-

Analysis sees a draw, looks at needed resources. Sees a vertex
buffer. Issues a barrier.
-

It sees many such draw calls. Can’t do this in the past.
-

MM: you have a command buffer. Add as a writeable attachment. Then issue
a draw. Are these not the times you would both check, and issue a barrier?
-

CW: e.g. using 3 writeable vertex buffers
-

Use buffer 1: have to issue a barrier.
-

Don’t know the future: or have to do an analysis pass first, then
encode.
-

KN:
-

(W)rite (T)ransition (R)ead resource 1/2/3
-

Batched: W1 W2 W3 T1 T2 T3 R1 R2 R3
-

Just-in-time: W1 W2 W3 T1 R1 T2 R2 T3 R3
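
As a rough illustration of KN's two orderings, the sketch below builds both command sequences and counts how many separate barrier points each one needs. The function names are hypothetical and are not taken from NXT or any real API. It also assumes that consecutive transitions can be coalesced into a single barrier, which is the premise of the discussion rather than a measured fact.

```python
# Hypothetical illustration only: none of these names come from NXT or a real API.

def just_in_time(resources):
    """All writes first, then a transition immediately before each read."""
    ops = [f"W{r}" for r in resources]
    for r in resources:
        ops += [f"T{r}", f"R{r}"]
    return ops


def batched(resources):
    """All writes, then all transitions together, then all reads."""
    ops = [f"W{r}" for r in resources]
    ops += [f"T{r}" for r in resources]
    ops += [f"R{r}" for r in resources]
    return ops


def count_barrier_points(ops):
    """Count runs of consecutive transitions.

    Assumption: each run could be issued as one coalesced barrier.
    """
    runs = 0
    prev_was_transition = False
    for op in ops:
        is_transition = op.startswith("T")
        if is_transition and not prev_was_transition:
            runs += 1
        prev_was_transition = is_transition
    return runs


batched_ops = batched([1, 2, 3])
jit_ops = just_in_time([1, 2, 3])
print("Batched:     ", " ".join(batched_ops), "->", count_barrier_points(batched_ops))
print("Just-in-time:", " ".join(jit_ops), "->", count_barrier_points(jit_ops))
```

Under that assumption the batched ordering needs one barrier point and the just-in-time ordering needs three, which is the difference CW and KN are pointing at.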
-

MM: would have to do this analysis at commit time
-

Don’t think this would be as complicated as CW says
-

BC: D3D has had runtimes which did this both implicitly (<= D3D11) and
explicitly (D3D12)
-

The code that does this implicitly is large and complex, in
particular handling multithreading
-

Would have to have basically the same system for Vulkan and D3D12
-

Easy to do for the simplistic case. But Corentin and others have
already pointed out where you’d need a complex system to make
this correct. Not over-adding barriers, and never missing one.
-

D3D has experience doing this: debug layer. This is a significant
chunk of work, and don’t have to care about performance.
-

Alternate proposal of making the barriers explicit would only require
Apple to no-op them.
-

DJ: what if the barriers were forgotten in Corentin’s example?
-

CW: validation error.
-

DJ: what about seeing when the user should have put in a barrier but
didn’t?
-

CW: decided to validate resource usages / memory barriers.
-

Believe it doesn’t add much cost to some other feature -- like
destroying textures but still having them around.
-

On every command buffer, have to check liveness of textures. At the
same time, on the same cache line, can validate states.
-

KN: have to insert / execute some code to do this validation.
-

Think it’s roughly the same amount of time as the just-in-time usage
transition.
-

As we’ve learned from the Intel and AMD drivers, this isn’t efficient.
Have to coalesce. Coalescing is expensive.
-

DJ: is the coalescing benefit in Corentin’s example that you would only
insert a barrier for buffers 1, 2, 3 since you know you’re done with them?
-

CW: yes. Rather than: barrier 1, draw 1, barrier 2, draw 2, barrier
3, draw 3.
-

DJ: b/c you’re doing dependency tracking anyway, have to track that
you’ve put the barrier in for B1, can’t you just do them all?
-

CW: no! you don’t know how they’re going to be used in the future.
And we shouldn’t guess what the hardware does.
-

BC: heard people mentioning cache flushes, etc. But another thing that
GPUs can do is change the compression format based on usage. Can see lots
of thrashing on some GPUs. On all hardware Ben knows about, these
transitions are needed.
-

DM: like to expand on explicit case. Not just barriers. Also: putting
things at the end of the usage as a resource, so that you give the hardware
time to do the transition.
-

AMD: use split barriers. Barrier start -> barrier end. Do
decompression, then flush caches at appropriate time. These sorts of
optimizations are not easy to automate. But: still easy to validate that
the transitions are in place.
-

Validation becomes increasingly easier than automatically putting
transitions in place.
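
To illustrate DM's point that split barriers stay easy to validate even though placing them automatically is hard, here is a small sketch under assumed rules: a resource must not be used between the start and end of its barrier, every started barrier must be ended with the same target usage, and each use must match the current usage. The command kinds (`begin`, `end`, `use`), the usage names and the function itself are hypothetical and do not describe any real driver or API.

```python
class SplitBarrierError(Exception):
    """Raised when split barriers are missing, unbalanced or violated."""


def validate_split_barriers(commands, initial_usage):
    """Validate (kind, resource_name, usage) commands in one forward pass."""
    usage = dict(initial_usage)
    pending = {}  # resource name -> target usage while a barrier is in flight
    for kind, name, arg in commands:
        if kind == "begin":
            if name in pending:
                raise SplitBarrierError(f"{name}: barrier already started")
            pending[name] = arg
        elif kind == "end":
            if pending.get(name) != arg:
                raise SplitBarrierError(f"{name}: no matching barrier start")
            usage[name] = pending.pop(name)
        elif kind == "use":
            if name in pending:
                raise SplitBarrierError(f"{name}: used while in transition")
            if usage[name] != arg:
                raise SplitBarrierError(f"{name}: missing transition to {arg!r}")
        else:
            raise ValueError(f"unknown command kind {kind!r}")
    if pending:
        raise SplitBarrierError(f"unfinished barriers: {sorted(pending)}")
    return usage


initial = {"buf": "storage", "tex": "sampled"}
commands = [
    ("use", "buf", "storage"),   # last write to buf
    ("begin", "buf", "vertex"),  # start the transition early
    ("use", "tex", "sampled"),   # unrelated work overlaps the transition
    ("end", "buf", "vertex"),    # finish just before buf is needed
    ("use", "buf", "vertex"),
]
result = validate_split_barriers(commands, initial)
print(result)
```

The validator is a single forward pass with a small amount of state, whereas choosing where to put the start and end of each barrier would require knowing how resources are used later in the buffer.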
-

DJ: leaning toward what Ben suggested, which is to go forward with
explicit barriers, and Metal will no-op them. Would still like to hear from
major developers using Metal and other APIs what performance issues they’re
hitting with Metal doing it behind the scenes. Dean assumes that Metal may
have some code for this but that it’ll never be as optimal as what the
developer could have done themselves.
-

MM: it’s not a no-op. The validation still has to occur. (Agreement)
-

CW: there’s CPU overhead on Metal, and there’s mental overhead for the
developer.
-

DJ: Metal wouldn’t have to do anything more than any other
implementation, just not actually call the barrier.
-

MM: DM just gave an example of a particular graphics card. Requiring the
web author to know the optimization possibilities of all the cards out
there is not reasonable. Think it would be better for the browser to do a
“good enough” job.
-

DJ: turning it around: it would be terrible if the developer optimized
for one piece of hardware and it behaved poorly on other hardware.
-

BC: have seen developers port things to D3D12. Have seen things they do
to make things performant pay off on multiple kinds of hardware. Don’t see
things behave radically differently on Intel, AMD, NVIDIA, etc. In general
are able to treat all the hardware types roughly the same.
-

Even UMA vs. non-UMA devices: there’s no way to do this in a
performant way with the same set of code.
-

You’ll have to test on: tiled/non-tiled, discrete/UMA.
-

You have to do this with modern graphics implementations.
-

This is why Vulkan/D3D12 exposed the knobs.
-

CW: something written with WebGPU should be portable. But we’re talking
about performance portability right here.
-

MM: seems unreasonable to ask every web author to test on 5 different
platforms.
-

CW: look at WebGL right now. NYTimes cares about it running; not the
last 5% of performance.
-

MM: if we only care about a few percent, the browser can just do it
automatically.
-

CW: one case is using Three.js, and you don’t care about the last ounce
of performance. Compare to Unity/Unreal, which would want coalescing, etc.
If you do implicit barriers, you prevent the engine authors from optimizing.
-

KR: sounds a lot like compiler optimizations like software pipelining,
inserting prefetches, etc. It’s hard. Seems a lot easier to just validate
transitions the developer inserted earlier.
-

BC: barriers *are* some of the harder things to deal with.
-

CW: also, this is an online problem -- affects every command buffer
submission.
-

DJ: have to adjourn discussion now -- continue August 9.

Agenda for next meeting

Start discussion on shaders - August 9th
-

F2F: Swap chain vs. offscreen canvas

Received on Thursday, 27 July 2017 20:54:32 UTC