Minutes for the 2017-07-19 meeting

GPU Web 2017-07-19

Chair: Corentin & Dean

Scribe: Dean with help from Ken

Location: Google Hangout
Minutes from last meeting
<https://docs.google.com/document/d/1FupUhxJL7TfzFSgofShxqYvAlW_zHlZphKB8k7Prwzg/edit>

Tentative agenda

   -

   Administrative stuff (if any)


   -

   Individual design and prototype status
   -

   Renderpasses / rendertargets
   -

   Pipeline state details


   -

   Agenda for next meeting

Attendance

   -

   Dean Jackson (Apple)
   -

   Myles C. Maxfield (Apple)
   -

   Theresa O'Connor (Apple)
   -

   Warren Moore (Apple)
   -

   Austin Eng (Google)
   -

   Corentin Wallez (Google)
   -

   Kai Ninomiya (Google)
   -

   Ken Russell (Google)
   -

   Ricardo Cabello (Google)
   -

   Rafael Cintron (Microsoft)
   -

   Dzmitry Malyshau (Mozilla)
   -

   Jeff Gilbert (Mozilla)
   -

   Alex Kluge (Vizit Solutions)
   -

   Kirill Dmitrenko (Yandex)
   -

   Doug Twilleager (ZSpace)
   -

   Elviss Strazdiņš
   -

   Joshua Groves
   -

   Tyler Larson

Administrative items

   -

   CW: I sent email about the Sept Chicago F2F meeting. Please reply to me
   if you’re coming, either in person or by hangouts.
   -

   CW: I’ve also put up an agenda document, that we will fill in before the
   meeting.


   -

   DJ: At TPAC, we will meet with WebASM. I’ll coordinate a time and let
   internal-gpu know.

Individual design and prototype status

   -

   CW: Google has spent time on texture-to-buffer copies and the D3D12
   constraints there. We think we have to add the buffer row pitch to WebGPU,
   but in NXT we found a way to avoid adding the buffer offset alignment by
   sometimes splitting copies in two.
   -

   DM: Mozilla has looked at the Metal and D3D12 backends, and got Vulkan
   descriptor pools mapped to Metal Indirect Argument Buffers. It seems to
   work well, but isn’t thoroughly tested.
   -

   DJ: IAB will only work in High Sierra and above, so we might need to
   fall back to the previous binding model before that.
   -

   DM: In our prototype it is a compile-time flag that allows choosing
   between IAB and “old Metal”.
   -

   MM: One difference is that IABs can be written by the GPU on a subset of
   the hardware. If you are talking about GPU filling then that’s another
   approach. Vulkan does not have this, but D3D12 does.
   -

   CW: I think this is very advanced and we shouldn’t look at it for now in
   WebGPU.
   -

   MM: If we’re talking about CPU binding, then using IABs isn’t always
   necessary.
   -

   CW: We’ve been able to implement the binding model we presented in NXT
   using Metal’s older binding model.
   -

   MM: Yeah either way is fine.
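
As a rough illustration of the D3D12 copy constraints CW mentions above: D3D12 requires the row pitch of buffer data used in texture copies to be a multiple of 256 bytes and the placement offset to be a multiple of 512 bytes. The sketch below only computes an aligned row pitch and checks an offset. The helper names are hypothetical, and it does not attempt to show how NXT splits copies, which the minutes do not describe.

```python
# Alignment values from d3d12.h (D3D12_TEXTURE_DATA_PITCH_ALIGNMENT and
# D3D12_TEXTURE_DATA_PLACEMENT_ALIGNMENT).
ROW_PITCH_ALIGNMENT = 256
OFFSET_ALIGNMENT = 512


def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def min_row_pitch(width_texels, bytes_per_texel):
    """Smallest row pitch in bytes that satisfies the D3D12 pitch rule."""
    return align_up(width_texels * bytes_per_texel, ROW_PITCH_ALIGNMENT)


def is_offset_aligned(offset):
    """True if a buffer offset satisfies the D3D12 placement rule."""
    return offset % OFFSET_ALIGNMENT == 0
```

For example, a 100-texel-wide RGBA8 row holds 400 bytes of data but needs a 512-byte pitch, which is why the row pitch would have to be visible in the API.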

Pipeline state details

   -

   DM and JG have worked on the issue in GitHub.
   -

   https://github.com/gpuweb/gpuweb/issues/26
   -

   CM: This issue highlights the difference between all three APIs.
   -

   DM: It seems pretty clear what the overlap is. Vulkan will need its
   dynamic state capabilities.
   -

   DM: Also need to remove some features that D3D12 doesn’t support, like
   TRIANGLE FAN, separate face stencil and mask, ...
   -

   DM: Instance rate and Sample mask are the difficult ones. Vulkan doesn’t
   support an instance rate more than 1. Sample mask is not present in Metal.
   -

   DM: For the MVP we could support an instance rate of one.
   -

   CW: What is sample mask?
   -

   DM: When the sample coverage is computed by the rasterizer, it then uses
   the mask to limit the samples you render.
   -

   MM: Metal doesn’t have that concept. We shouldn’t support it.
   -

   DM: Could have the device capabilities expose “bool
   isSampleMaskSupported”
   -

   DM: Didn’t look at tessellation and ..
   -

   MM: Don’t think tessellation is necessary for MVP.
   -

   CW: Especially since it is different between (D3D12, Vulkan) and Metal


   -

   DJ: Did you suggest we remove sampleMask or make the device advertise
   whether it supports it?


   -

   DM: wants to find some samples using it he can share
   -

   DJ: do they inject them into the Metal shader?
   -

      Would be great if we had someone from Unity here…
      -

   MM: why should this be an API construct and not something the shader
   authors put in?
   -

   CW: Probably supported by fixed function on some hardware
   -

   DM: Reduce the amount of data written back to VRAM.
   -

   CW: let’s tag SampleMask as something we need more data on and get back
   to it later
   -

   RC: My understanding is that it only applies to MSAA workflows. I don’t
   think we can work around it in the pixel shader.
   -

   MM: Depends if the pixel shader has access to the right builtins. Know
   GLSL has it.
   -

   DM: There’s a scenario where you want the shader to run at sample
   frequency - the pixel shader can be run per-pixel, or per-sample. Setting
   the mask would allow the hardware to skip some fragment invocations.


   -

   MM: Which backends support pixel shader per sample?
   -

   DM: All of them, will double check. Vulkan supports very configurable
   shading. D3D12 does per-sample shading if the shader uses one of the
   relevant builtins.
   -

   CW: feel we’re ratholing a bit. Either get more data and info on how
   people deal with lack of sample mask on Metal, or just exclude it from the
   MVP and add it later. Suggest postponing it.
   -

   DJ: Can ask the Metal team what the reason for leaving it out is.
   -

   DM: I think other than this, we have a pretty good picture of the states
   for the MVP.
   -

   CW: in Vulkan, the primitive type has to be set on the pipeline state
   whereas in the other APIs it is just triangles vs. lines vs. points.
   -

   DM: Yes, I don’t think we have an alternate choice.
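
To make DM's description of the sample mask above concrete, here is a minimal sketch: the rasterizer produces a per-pixel coverage bitmask (one bit per MSAA sample), and the sample mask is ANDed with it so that only the surviving samples are written. The function names are hypothetical and the code models the idea only, not any particular API.

```python
def apply_sample_mask(coverage, sample_mask, sample_count):
    """Return the samples that are both covered and enabled by the mask."""
    valid_bits = (1 << sample_count) - 1  # ignore bits beyond the sample count
    return coverage & sample_mask & valid_bits


def enabled_samples(bits, sample_count):
    """List the indices of the samples whose bit is set."""
    return [i for i in range(sample_count) if (bits >> i) & 1]
```

With 4x MSAA, full coverage `0b1111` and a mask of `0b0101` leave only samples 0 and 2 to be written.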

Render targets / Render passes

   -

   CW: Have people had a chance to look at the documentation on Vulkan
   Render Passes?
   -

   RC: I have looked at the github issue.
   -

   https://github.com/gpuweb/gpuweb/issues/23
   -

   MM: did read the relevant chapter in Graham Sellers’ Vulkan book


   -

   CW: think we need something at least like Vulkan’s renderpasses
   -

      Two additional things in Vulkan:
      -

      More explicit dependencies between rendering operations
      -

      Input attachment: say you’re going to sample a texture at the same
      location as the pixel location you’re rendering
      -

         Allows keeping data in tile memory; this is hugely important for
         mobile
         -

   CW: I am a fan of the concept of renderpasses. I’m not sold on
   everything that Vulkan does, but there might be good reasons for their
   design.
   -

   RC: As long as we can emulate them on APIs that don’t have them, I’m ok
   with it.
   -

   CW: emulation would be to use it as a texture (it’s free)
   -

      Instead of making something an input attachment, you’d make it a
      target
      -

      The input attachment operation in the fragment shader would be a
      texel fetch
      -

   RC: So you’d do the pass/for-loop yourself in the implementation?
   -

   CW: input attachments: you have one rendering pass with an output
   attachment
   -

      Then transition to input attachment (G-Buffer, lighting, …)
      -

      In D3D: that’s a sampled texture (SRV), use TexelFetch or similar
      -

      It’s a different function call in SPIR-V that is an offset from the
      current texture position (and current hardware only supports a (0, 0)
      offset)
      -

   DM: Downside is that it complicates the API for users and for
   specification writers. But it is difficult to put it post MVP because it
   affects things like the definition of pipelines.
   -

      OpenGL has a tiled memory extension too.
      EXT_shader_pixel_local_storage
      <https://www.khronos.org/registry/OpenGL/extensions/EXT/EXT_shader_pixel_local_storage.txt>
      -

   KD: If we can design it as an opt-in feature that can be ignored at
   first, and then use render passes to optimize things, it would be great.
   -

   CW: If you only have one render sub-pass, it’s equivalent to only having
   one pass or Metal’s approach. It’s just a bit more verbose.
   -

   KD: Basically it would be possible to start with one renderpass with one
   subpass and then split things?
   -

   CW: usually would start with a monolithic one and split it up into
   smaller ones with dependencies
   -

      Seems we need more thought
      -

   DM: Have you emulated this in NXT yet?
   -

   CW: yes. Transform it into a bunch of Metal-style render encoders.
   -

   KN: Don’t have input attachments yet. Should be easy.
   -

   CW: Have RenderPass objects, but they’re only a placeholder for later.
   -

   MM: design sounds fine to us
   -

   CW: Great.
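
The emulation CW describes above (turning a render pass into a series of Metal-style render encoders, with input attachments read as ordinary textures) could look roughly like the sketch below. All type and function names are hypothetical and are not NXT's actual API. It assumes one encoder per subpass and that an input attachment must have been written by an earlier subpass.

```python
from dataclasses import dataclass, field


@dataclass
class Subpass:
    color_outputs: list                      # attachments rendered to
    input_attachments: list = field(default_factory=list)


@dataclass
class EncoderDesc:
    render_targets: list                     # bound as render targets
    sampled_textures: list                   # read with a texel fetch


def lower_render_pass(subpasses):
    """Emit one Metal-style encoder description per subpass."""
    encoders = []
    written = set()
    for index, subpass in enumerate(subpasses):
        for name in subpass.input_attachments:
            if name not in written:
                raise ValueError(
                    f"subpass {index} reads '{name}' before it is written")
        # On backends without input attachments, the attachment is simply
        # bound as a sampled texture for this encoder.
        encoders.append(EncoderDesc(list(subpass.color_outputs),
                                    list(subpass.input_attachments)))
        written.update(subpass.color_outputs)
    return encoders
```

A G-buffer pass followed by a lighting pass would lower to two encoders, the second sampling the G-buffer attachments written by the first. A render pass with a single subpass lowers to a single encoder, matching CW's point that it is equivalent to Metal's approach.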

Dependency Tracking and Undefined Behaviour

   -

   MM: The reason I raised this in the last meeting is that they seem to
   express dependencies and if you get them wrong, then your rendering might
   be broken.
   -

   MM: One of the reasons the Web is good is because it works the same
   everywhere (ideally). If you get synchronization wrong on some backends and
   devices, it might work in some places and not others. So if you ship
   something with UB, then your customer can say “wait this looks wrong”.
   -

   MM: We need to make it very difficult to create a WebGPU program that
   has undefined behaviour.
   -

   CW: generally agree. D3D, if you use the debug layers, forces you to do
   the right barriers.
   -

      Either an “or” of read state, or 1 bit of write state.
      -

      Implicitly does the right memory barriers behind the scenes.
      -

      Think that using this kind of resource tracking, can get rid of most
      undefined behavior due to memory barriers.
      -

      This is what we’ve been doing in NXT.
      -

      Graphics/compute interop sample had usage transitions, and it “just
      worked” on D3D. Memory tracking is just 50 lines of code in the
      D3D backend.
      -

      Big fan of usage transitions like this.
      -

      (D3D doesn’t have a spec, so D3D’s debug mode shows the correct
      usage.)
      -

   DJ: We could have a WebGPU debug mode that tells the content that it’s
   done something wrong.
   -

   CW: we’re saying that we have this sort of (D3D debug mode)-like
   tracking already in NXT on all the time, and it’s working fine.
   -

   MM: what you’re saying is basically what we were about to propose
   -

      Best way to eliminate this undefined behavior is to have this sort of
      state tracking in the browser
      -

   DJ: Whether it does it via a native state tracking or not is fine.
   -

   CW: agree. NXT is based around the assumption that doing this state
   tracking is fast, and not too constraining for the application. NXT design
   doc has a section on this (TODO(cwallez): add link).
   -

   MM: One other point is: if we are going to do the state tracking, if the
   user tries to use a resource that isn’t in the correct state, the browser
   transitions it.
   -

   CW: either we do the transitions implicitly, when user uses resource in
   a different way, or do it explicitly. NXT asks user to do it explicitly.
   “Command buffer, transition this buffer to this usage”.
   -

   MM: if you’re going to do all the tracking, why not just issue the
   correct barriers?
   -

   JG: one of the common complaints: sometimes the user doesn’t know what
   state tracking they did wrong. Also, forcing the user to say “this is the
   state tracking I’m doing”.
   -

      So we say “we only do the transitions you ask us to do”, so you know
      where memory barriers are done.
      -

   MM: it’s easier for authors if we do the right thing
   -

   JG: harder for authors to get performant code
   -

   CW: explicit usage transitions allow you to bulk them together. Results
   in only one D3D memory barrier operation, so only one GPU “WaitForIdle”
   instead of ten of them.
   -

   DJ: We should get feedback from the Metal driver team on the impact of
   having implicit barriers
   -

   DM: Not optimistic about automatic tracking if you submit to multiple
   queues and synchronize them with semaphores. Doing automatic tracking on
   the CPU side becomes hard.
   -

   DM: Metal has fewer synchronization features so it makes the CPU tracking
   easier in Metal.
   -

   JG: Amongst complaints about OpenGL is that it is hard to know the
   memory barriers that happen because they are implicit. Their being
   explicit is an advantage of the explicit APIs.
   -

   MM: this is the point that Dean just made: two of the APIs expose these
   transitions, but Metal’s successful (JG: in its own goals), without doing so
   -

   CW: Metal has to run on fewer platforms, so either you run on dGPU or
   mobile GPU designed by Apple. Point JG’s trying to make: it’s not because
   Metal was able to get this working for a limited number of GPUs that we
   should be able to do this on D3D or Vulkan.
   -

   DJ: you’re confident that D3D and Vulkan backends will be able to run
   performantly even though they’re doing this state tracking themselves?
   -

   CW: suggesting that all the usage tracking is done in NXT. Not relying
   on the debug mode of the API. CPU cost of doing implicit tracking == CPU
   cost of validating things are done in the right order. Might as well be
   explicit because it has a performance advantage (not on Metal, though).
   -

   DJ: think it will simplify the API by doing it for the author.
   -

   MM: Metal runs on Macs and Macs use off-the-shelf GPUs, and Metal also
   runs on phones. GPUs are close enough to GPUs of other APIs that Metal
   would have a big advantage on memory barriers.
   -

   MM: theoretically possible to get an old tower Mac Pro and dual-boot it
   into macOS and Windows
   -

   CW: can MM ask the Metal team how they do the barriers? Do they issue it
   at the last moment when they see the resource is being used in a different
   way? Or do they parse the command buffer and try to coalesce the memory
   barriers?
   -

   DJ/MM: think it’s the latter, we’ll ask them. But even if they tell us
   the answer we might not be able to repeat it. And it might be just an
   implementation detail (different on different drivers.)
   -

   KD: No matter which implicit synchronization the API will do, we’ll need
   to specify it clearly in the spec so applications can predict where memory
   barriers will be inserted.
   -

   CW: if we have implicit memory barriers, disagree that it should be
   specified where they’re inserted because it’s an implementation detail.
   Might depend on the backend which way’s the most efficient.
   -

   DJ: More important that things work consistently vs. having the spec say
   where barriers happen. On the Web reproducibility is more important than
   perf.
   -

   KD: if you have interactive content performance is part of the result
   -

   DJ: agree, but interoperability more important than performance
   -

   JG: true in a long term sense but in the short term interop is only
   guaranteed by using something like WebGL
   -

   DJ: was lucky with WebGL because behavior could be defined well, and
   because of large amount of work in interoperability tests
   -

   KD: my experience is that WebGL sometimes doesn’t work in some browser
   or another
   -

   JG: think WebGL will behave more consistently than WebGPU for a few years
   -

   TO: how is that relevant?
   -

   JG: depends on what we’re going to do with this API
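
The usage rule CW describes above (a resource's state is either an "or" of read usages or exactly one write usage) and the explicit, batchable transitions can be sketched as follows. The usage names, the split into read and write usages, and the class and function names are assumptions made for illustration. This is not NXT's actual tracking code.

```python
from enum import IntFlag


class Usage(IntFlag):
    NONE = 0
    VERTEX = 1          # read usages
    UNIFORM = 2
    SAMPLED = 4
    TRANSFER_SRC = 8
    STORAGE = 16        # write usages
    RENDER_TARGET = 32
    TRANSFER_DST = 64


READ_ONLY = Usage.VERTEX | Usage.UNIFORM | Usage.SAMPLED | Usage.TRANSFER_SRC


def is_valid_usage(usage):
    """Valid if any combination of read bits, or exactly one write bit."""
    bits = int(usage)
    if bits == 0:
        return False
    if bits & ~int(READ_ONLY) == 0:
        return True                    # only read usages are set
    return bits & (bits - 1) == 0      # a single bit, which is a write usage


class TrackedResource:
    def __init__(self, name, usage):
        if not is_valid_usage(usage):
            raise ValueError(f"invalid initial usage for {name}")
        self.name = name
        self.usage = usage

    def transition(self, new_usage):
        """Explicit transition; returns the barrier to record, or None."""
        if not is_valid_usage(new_usage):
            raise ValueError(f"invalid usage transition for {self.name}")
        if new_usage == self.usage:
            return None
        barrier = (self.name, self.usage, new_usage)
        self.usage = new_usage
        return barrier

    def check_use(self, required):
        """Validation step: the resource must already be in this usage."""
        if (self.usage & required) != required:
            raise ValueError(f"{self.name} is not in usage {required!r}")


def transition_batch(requests):
    """Collect several transitions so a backend could issue them together."""
    barriers = [resource.transition(usage) for resource, usage in requests]
    return [b for b in barriers if b is not None]
```

In this sketch the validation cost is a few bit operations per use, and grouping transitions into one batch is what would let a backend record them as a single barrier operation, which is the performance argument made for explicit transitions in the discussion.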

Agenda for next meeting

   -

   Memory barriers
   -

   More on render passes.
   -

   Dean to chair next meeting


   -

   In three weeks talk about shaders (second week of August, the 9th).

Received on Friday, 21 July 2017 20:46:40 UTC