Minutes for the 2017-07-19 meeting from Corentin Wallez on 2017-07-21 (public-gpu@w3.org from July 2017)

From: Corentin Wallez <cwallez@google.com>
Date: Fri, 21 Jul 2017 16:45:55 -0400
To: public-gpu@w3.org
Message-ID: <CAGdfWNPtZDJ2wYku_x65dHPdvZdhK5+t+LeFK_jX5RCx2c2Fwg@mail.gmail.com>

GPU Web 2017-07-19

Chair: Corentin & Dean

Scribe: Dean with help from Ken

Location: Google Hangout
Minutes from last meeting
<https://docs.google.com/document/d/1FupUhxJL7TfzFSgofShxqYvAlW_zHlZphKB8k7Prwzg/edit>

Tentative agenda

Administrative stuff (if any)

Individual design and prototype status
-

Renderpasses / rendertargets
-

Pipeline state details

Agenda for next meeting

Attendance

Dean Jackson (Apple)
-

Myles C. Maxfield (Apple)
-

Theresa O'Connor (Apple)
-

Warren Moore (Apple)
-

Austin Eng (Google)
-

Corentin Wallez (Google)
-

Kai Ninomiya (Google)
-

Ken Russell (Google)
-

Ricardo Cabello (Google)
-

Rafael Cintron (Microsoft)
-

Dzmitry Malyshau (Mozilla)
-

Jeff Gilbert (Mozilla)
-

Alex Kluge (Vizit Solutions)
-

Kirill Dmitrenko (Yandex)
-

Doug Twilleager (ZSpace)
-

Elviss Strazdiņš
-

Joshua Groves
-

Tyler Larson

Administrative items

CW: I sent email about the Sept Chicago F2F meeting. Please reply to me
if you’re coming, either in person or by hangouts.
-

CW: I’ve also put up an agenda document, that we will fill in before the
meeting.

DJ: TPAC, we will meet with WebASM. I’ll coordinate a time and let
internal-gpu know.

Individual design and prototype status

CW: Google has spent time on texture-to-buffer copies and the D3D12
constraints there. We think we have to add the buffer row pitch to WebGPU,
but in NXT we found a way to avoid adding the buffer offset alignment by
sometimes splitting copies in two.
-

DM: Mozilla has looked at Metal and D3D12 backend. And got Vulkan
descriptor pools mapped to Metal Indirect Argument Buffers. It seems to
work well, but isn’t strongly tested.
-

DJ: IABs will only work in High Sierra and above, so we might need to
fall back to the previous binding model before that.
-

DM: In our prototype it is a compile-time flag that allows choosing
between IAB and “old Metal”.
-

MM: One difference is that IABs can be written by the GPU on a subset of
the hardware. If you are talking about GPU filling then that’s another
approach. Vulkan does not have this, but D3D12 does.
-

CW: I think this is very advanced and we shouldn’t look at it for now in
WebGPU.
-

MM: If we’re talking about CPU binding, then using IABs isn’t always
necessary.
-

CW: We’ve been able to implement the binding model we presented in NXT
using Metal’s older binding model.
-

MM: Yeah either way is fine.
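
Editorial illustration of the row pitch constraint CW mentions above: D3D12
requires each row of buffer data in a texture copy to start on a 256-byte
boundary (D3D12_TEXTURE_DATA_PITCH_ALIGNMENT). The sketch below shows what
that implies for an application laying out a buffer. It is written in
Python for illustration only; the function names are hypothetical and this
is not NXT code. The copy-splitting technique used to avoid the offset
alignment is not described in these minutes and is not shown.

```python
# D3D12 requires rows of buffer data used in texture copies to be
# 256-byte aligned (D3D12_TEXTURE_DATA_PITCH_ALIGNMENT).
D3D12_TEXTURE_DATA_PITCH_ALIGNMENT = 256


def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def padded_row_pitch(width_texels: int, bytes_per_texel: int) -> int:
    """Row pitch in bytes once the tightly packed row is padded for D3D12."""
    tight_row_bytes = width_texels * bytes_per_texel
    return align_up(tight_row_bytes, D3D12_TEXTURE_DATA_PITCH_ALIGNMENT)


def row_offsets(height_texels: int, row_pitch: int) -> list:
    """Byte offset of each row relative to the start of the copy region."""
    return [row * row_pitch for row in range(height_texels)]
```

For example, a 100-texel-wide RGBA8 row is 400 bytes when tightly packed
but has to be laid out with a 512-byte pitch.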

Pipeline state details

DM and JG have worked on the issue in GitHub.
-

https://github.com/gpuweb/gpuweb/issues/26
-

CM: This issue highlights the difference between all three APIs.
-

DM: It seems pretty clear what the overlap is. Vulkan will need its
dynamic state capabilities.
-

DM: Also need to remove some features that D3D12 doesn’t support like
TRIANGLE FAN, separate face stencil and mask,...
-

DM: Instance rate and Sample mask are the difficult ones. Vulkan doesn’t
support an instance rate more than 1. Sample mask is not present in Metal.
-

DM: For the MVP we could support an instance rate of one.
-

CW: What is sample mask?
-

DM: When the sample coverage is computed by the rasterizer, it then uses
the mask to limit the samples you render.
-

MM: Metal doesn’t have that concept. We shouldn’t support it.
-

DM: Could have the device capabilities expose “bool
isSampleMaskSupported”
-

DM: Didn’t look at tessellation and ..
-

MM: Don’t think tessellation is necessary for MVP.
-

CW: Especially since it differs between (D3D12, Vulkan) and Metal

DJ: Did you suggest we remove sampleMask or make the device advertise
whether it supports it?

DM: wants to find some samples using it he can share
-

DJ: do they inject them into the Metal shader?
-

Would be great if we had someone from Unity here…
-

MM: why should this be an API construct and not something the shader
authors put in?
-

CW: Probably supported by fixed function on some hardware
-

DM: Reduce the amount of data written back to VRAM.
-

CW: let’s tag SampleMask as something we need more data on and get back
to it later
-

RC: My understanding is that it only applies to MSAA workflows. I don’t
think we can work around it in the pixel shader.
-

MM: Depends if the pixel shader has access to the right builtins. Know
GLSL has it.
-

DM: There’s a scenario where you want the shader to run at sample
frequency - the pixel shader can be run per-pixel, or per-sample. Setting
the mask would allow the hardware to skip some fragment invocations.

MM: Which backends support pixel shader per sample?
-

DM: All of them, will double check. Vulkan supports very configurable
shading. D3D12 does per-sample shading if the shader uses one of the
relevant builtins.
-

CW: feel we’re ratholing a bit. Either get more data and info on how
people deal with lack of sample mask on Metal, or just exclude it from the
MVP and add it later. Suggest postponing it.
-

DJ: Can ask the Metal team what the reason for leaving it out is.
-

DM: I think other than this, we have a pretty good picture of the states
for the MVP.
-

CW: in Vulkan, the primitive type has to be set on the pipeline state
whereas in other APIs it is just triangles vs. lines vs. points.
-

DM: Yes, I don’t think we have an alternate choice.
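
Editorial illustration of the sample mask DM describes above: the
rasterizer computes a per-pixel coverage bitfield (one bit per MSAA
sample), and the mask from the pipeline state is ANDed with it to limit
which samples are written. A minimal Python sketch, assuming 4x MSAA by
default; the function names are hypothetical and do not come from any of
the APIs discussed.

```python
def apply_sample_mask(coverage: int, sample_mask: int,
                      sample_count: int = 4) -> int:
    """AND the rasterizer coverage with the pipeline's sample mask.

    Both arguments are bitfields with bit i standing for sample i.
    Bits above sample_count are ignored.
    """
    all_samples = (1 << sample_count) - 1
    return coverage & sample_mask & all_samples


def written_samples(coverage: int, sample_mask: int,
                    sample_count: int = 4) -> list:
    """Indices of the samples that survive the mask and get written."""
    effective = apply_sample_mask(coverage, sample_mask, sample_count)
    return [i for i in range(sample_count) if effective & (1 << i)]
```

A mask of zero discards every sample, which is the case where hardware
could skip the fragment work entirely.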

Render targets / Render passes

CW: Have people had a chance to look at the documentation on Vulkan
Render Passes?
-

RC: I have looked at the github issue.
-

https://github.com/gpuweb/gpuweb/issues/23
-

MM: did read the relevant chapter in Graham Sellers’ Vulkan book

CW: think we need something at least like Vulkan’s renderpasses
-

Two additional things in Vulkan:
-

More explicit dependencies between rendering operations
-

Input attachment: say you’re going to sample a texture at the same
location as the pixel location you’re rendering
-

Allows keeping data in tile memory; this is hugely important for
mobile
-

CW: I am a fan of the concept of renderpasses. I’m not sold on
everything that Vulkan does, but there might be good reasons for their
design.
-

RC: As long as we can emulate them on APIs that don’t have them, I’m ok
with it.
-

CW: emulation would be to use it as a texture (it’s free)
-

Instead of making something an input attachment, you’d make it a
target
-

The input attachment operation in the fragment shader would be a
texel fetch
-

RC: So you’d do the pass/for-loop yourself in the implementation?
-

CW: input attachments: you have one rendering pass with an output
attachment
-

Then transition to input attachment (G-Buffer, lighting, …)
-

In D3D: that’s a sampled texture (SRV), use TexelFetch or similar
-

It’s a different function call in SPIR-V that is an offset from the
current texture position (and current hardware only supports a (0, 0)
offset)
-

DM: Downside is that it complicates the API for users and for
specification writers. But it is difficult to put it post MVP because it
affects things like the definition of pipelines.
-

OpenGL has a tiled memory extension too.
EXT_shader_pixel_local_storage
<https://www.khronos.org/registry/OpenGL/extensions/EXT/EXT_shader_pixel_local_storage.txt>
-

KD: If we can design it as an opt-in feature, that can first not be
used, then use render passes to optimize things, it would be great.
-

CW: If you only have one render sub-pass, it’s equivalent to only having
one pass or Metal’s approach. It’s just a bit more verbose.
-

KD: Basically it would be possible to start with one renderpass with one
subpass and then split things?
-

CW: usually would start with a monolithic one and split it up into
smaller ones with dependencies
-

Seems we need more thought
-

DM: Have you emulated this in NXT yet?
-

CW: yes. Transform it into a bunch of Metal-style render encoders.
-

KN: Don’t have input attachments yet. Should be easy.
-

CW: Have RenderPass objects, but they’re only a placeholder for later.
-

MM: design sounds fine to us
-

CW: Great.
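
Editorial illustration of the emulation CW describes above: on a backend
without render passes, each subpass becomes its own Metal-style render
encoder, and an input attachment becomes an ordinary sampled texture that
the fragment shader reads with a texel fetch at the current pixel. The
Python sketch below is only a model of that lowering; Subpass, RenderPass,
EncoderDesc and lower_to_encoders are hypothetical names, not NXT API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subpass:
    color_outputs: List[str]
    # Attachments read at the current pixel location (Vulkan-style).
    input_attachments: List[str] = field(default_factory=list)


@dataclass
class RenderPass:
    subpasses: List[Subpass]


@dataclass
class EncoderDesc:
    """Stand-in for one Metal-style render encoder."""
    render_targets: List[str]
    # Input attachments are emulated as plain sampled textures.
    sampled_textures: List[str]


def lower_to_encoders(render_pass: RenderPass) -> List[EncoderDesc]:
    """Turn each subpass into one encoder, in submission order."""
    return [
        EncoderDesc(
            render_targets=list(subpass.color_outputs),
            sampled_textures=list(subpass.input_attachments),
        )
        for subpass in render_pass.subpasses
    ]


# G-buffer subpass followed by a lighting subpass that reads it back.
deferred = RenderPass([
    Subpass(color_outputs=["albedo", "normal"]),
    Subpass(color_outputs=["backbuffer"],
            input_attachments=["albedo", "normal"]),
])
encoders = lower_to_encoders(deferred)
```

With a single subpass the result is one encoder, which matches CW's point
that the one-subpass case is equivalent to Metal's approach.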

Dependency Tracking and Undefined Behaviour

MM: The reason I raised this in the last meeting is that they seem to
express dependencies and if you get them wrong, then your rendering might
be broken.
-

MM: One of the reasons the Web is good is because it works the same
everywhere (ideally). If you get synchronization wrong on some backends and
devices, it might work in some places and not others. So if you ship
something with UB, then your customer can say “wait this looks wrong”.
-

MM: We need to make it very difficult to create a WebGPU program that
has undefined behaviour.
-

CW: generally agree. D3D, if you use the debug layers, forces you to do
the right barriers.
-

Either an “or” of read state, or 1 bit of write state.
-

Implicitly does the right memory barriers behind the scene.
-

Think that using this kind of resource tracking, can get rid of most
undefined behavior due to memory barriers.
-

This is what we’ve been doing in NXT.
-

Graphics/compute interop sample had usage transitions, and it “just
worked” on D3D. Memory tracking is just 50 lines of code in the
D3D backend.
-

Big fan of usage transitions like this.
-

(D3D doesn’t have a spec, so D3D’s debug mode shows the correct
usage.)
-

DJ: We could have a WebGPU debug mode that tells the content that it’s
done something wrong.
-

CW: we’re saying that we have this sort of (D3D debug mode)-like
tracking already in NXT on all the time, and it’s working fine.
-

MM: what you’re saying is basically what we were about to propose
-

Best way to eliminate this undefined behavior is to have this sort of
state tracking in the browser
-

DJ: Whether it does it via a native state tracking or not is fine.
-

CW: agree. NXT is based around the assumption that doing this state
tracking is fast, and not too constraining for the application. NXT design
doc has a section on this (TODO(cwallez): add link).
-

MM: One other point is: if we are going to do the state tracking, if the
user tries to use a resource that isn’t in the correct state, the browser
transitions it.
-

CW: either we do the transitions implicitly, when user uses resource in
a different way, or do it explicitly. NXT asks user to do it explicitly.
“Command buffer, transition this buffer to this usage”.
-

MM: if you’re going to do all the tracking, why not just issue the
correct barriers?
-

JG: one of the common complaints: sometimes the user doesn’t know what
state tracking they did wrong. Also, forcing the user to say “this is the
state tracking I’m doing”.
-

So we say “we only do the transitions you ask us to do”, so you know
where memory barriers are done.
-

MM: it’s easier for authors if we do the right thing
-

JG: harder for authors to get performant code
-

CW: explicit usage transitions allow you to bulk them together. Results
in only one D3D memory barrier operation, so only one GPU “WaitForIdle”
instead of ten of them.
-

DJ: We should get feedback from the Metal driver team on the impact of
having implicit barriers
-

DM: Not optimistic about automatic tracking if you submit to multiple
queues and there’s synchronization between them with semaphores. Doing
automatic tracking on the CPU side becomes hard.
-

DM: Metal has fewer synchronization features so it makes the CPU tracking
easier in Metal.
-

JG: Amongst complaints about OpenGL is that it is hard to know the
memory barriers that happen because they are implicit. Their being explicit
is an advantage of the explicit APIs.
-

MM: this is the point that Dean just made: two of the APIs expose these
transitions, but Metal’s successful (JG: in its own goals), without doing so
-

CW: Metal has to run on fewer platforms, so either you run on dGPU or
mobile GPU designed by Apple. Point JG’s trying to make: it’s not because
Metal was able to get this working for a limited number of GPUs that we
should be able to do this on D3D or Vulkan.
-

DJ: you’re confident that D3D and Vulkan backends will be able to run
performantly even though they’re doing this state tracking themselves?
-

CW: suggesting that all the usage tracking is done in NXT. Not relying
on the debug mode of the API. CPU cost of doing implicit tracking == CPU
cost of validating things are done in the right order. Might as well be
explicit because it has a performance advantage (not on Metal, though).
-

DJ: think it will simplify the API by doing it for the author.
-

MM: Metal runs on Macs and Macs use off-the-shelf GPUs, and Metal also
runs on phones. GPUs are close enough to GPUs of other APIs that Metal
would have a big advantage on memory barriers.
-

MM: theoretically possible to get an old tower Mac Pro and dual-boot it
into macOS and Windows
-

CW: can MM ask the Metal team how they do the barriers? Do they issue it
at the last moment when they see the resource is being used in a different
way? Or do they parse the command buffer and try to coalesce the memory
barriers?
-

DJ/MM: think it’s the latter, we’ll ask them. But even if they tell us
the answer we might not be able to repeat it. And it might be just an
implementation detail (different on different drivers.)
-

KD: No matter which implicit synchronization the API will do, we’ll need
to specify it clearly in the spec so application can predict where memory
barriers will be inserted.
-

CW: if we have implicit memory barriers, disagree that it should be
specified where they’re inserted because it’s an implementation detail.
Might depend on the backend which way’s the most efficient.
-

DJ: More important that things work consistently vs. having the spec say
where barriers happen. On the Web reproducibility is more important than
perf.
-

KD: if you have interactive content performance is part of the result
-

DJ: agree, but interoperability more important than performance
-

JG: true in a long term sense but in the short term interop is only
guaranteed by using something like WebGL
-

DJ: was lucky with WebGL because behavior could be defined well, and
because of large amount of work in interoperability tests
-

KD: my experience is that WebGL sometimes doesn’t work in some browser
or another
-

JG: think WebGL will behave more consistently than WebGPU for a few years
-

TO: how is that relevant?
-

JG: depends on what we’re going to do with this API
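
Editorial illustration of the usage tracking discussed in this section.
CW's rule is that a resource is either in an “or” of read usages or in
exactly one write usage, and that explicit transitions can be bulked into
a single backend barrier. The Python sketch below models both ideas. The
Usage flags, TrackedResource and CommandBuffer are hypothetical names
chosen for the example, not NXT's API, and resource state is updated at
record time as a simplification.

```python
from enum import Flag


class Usage(Flag):
    NONE = 0
    VERTEX = 1
    UNIFORM = 2
    SAMPLED = 4
    TRANSFER_SRC = 8
    STORAGE = 16
    TRANSFER_DST = 32
    OUTPUT_ATTACHMENT = 64


WRITABLE = Usage.STORAGE | Usage.TRANSFER_DST | Usage.OUTPUT_ATTACHMENT


def is_valid_usage(usage: Usage) -> bool:
    """Either any OR of read-only usages, or exactly one write usage."""
    writes = usage & WRITABLE
    if not writes:
        return True
    single_bit = (writes.value & (writes.value - 1)) == 0
    return usage == writes and single_bit


class TrackedResource:
    def __init__(self, name: str, usage: Usage = Usage.NONE):
        self.name = name
        self.usage = usage


class CommandBuffer:
    def __init__(self):
        self.pending = []          # transitions recorded since last flush
        self.barrier_batches = []  # one entry per backend barrier call

    def transition(self, resource: TrackedResource, new_usage: Usage):
        """Explicit transition requested by the application."""
        if not is_valid_usage(new_usage):
            raise ValueError(f"invalid usage combination: {new_usage}")
        if resource.usage != new_usage:
            self.pending.append((resource.name, resource.usage, new_usage))
            resource.usage = new_usage

    def use(self, resource: TrackedResource, required: Usage):
        """Validate only; nothing is transitioned implicitly."""
        if (resource.usage & required) != required:
            raise ValueError(f"{resource.name} is not in usage {required}")
        if self.pending:
            # All pending transitions go out as one batched barrier.
            self.barrier_batches.append(self.pending)
            self.pending = []
```

The implicit alternative MM argues for would replace the error in use()
with an automatic transition. The validation cost is the same either way,
which is CW's point about CPU cost.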

Agenda for next meeting

Memory barriers
-