Minutes for the 2017-07-26 meeting

GPU Web 2017-07-26

Chair: Dean

Scribe: Ken

Location: Google Hangout

Minutes from last meeting
<https://docs.google.com/document/d/1ULhowm6p576LOL10DViB4Ct2-0qGJIoffWrqnFNHqU0/>

Tentative agenda

   -

   Administrative stuff (if any)


   -

   Individual design and prototype status


   -

   More on render passes.
   -

   Memory barriers
   -

   Agenda for next meeting

Attendance

   -

   Dean Jackson (Apple)
   -

   Julien Chaintron (Apple)
   -

   Myles C. Maxfield (Apple)
   -

   Austin Eng (Google)
   -

   Corentin Wallez (Google)
   -

   Kai Ninomiya (Google)
   -

   Ken Russell (Google)
   -

   Daniel Johnston (Intel)
   -

   Ben Constable (Microsoft)
   -

   Rafael Cintron (Microsoft)
   -

   Dzmitry Malyshau (Mozilla)
   -

   Fernando Serrano (Mozilla)
   -

   Jeff Gilbert (Mozilla)
   -

   Alex Kluge (Vizit Solutions)
   -

   Doug Twilleager (ZSpace)
   -

   Elviss Strazdiņš
   -

   Tyler Larson

Administrative items

   -

   Details on the September meeting:
   https://lists.w3.org/Archives/Member/internal-gpu/2017Jul/0004.html
   -

   Reply to Corentin’s email if you’re going to attend
   -

   CW: Deadline for replying is September 3rd.
   -

   Software license update
   -

      DJ: told by Apple’s legal team; meeting next week with Microsoft,
      Mozilla and Google. Seems to be going well

Individual design and prototype status

   -

   Google
   -

      CW: Kai has been working on making actual SwapChains in NXT
      -

         Can later try to port this on the web front and render to multiple
         canvases with the same device
         -

         Austin has been adding a bunch of fixed function state: primitive
         topology, … .
         -

         Trying to add essential features for a demo or game-like thing.
         -

      DJ: have you done this for WebGL? ImageBitmap, etc.?
      -

         Yes Justin Novosad has been working on this:
         -

            Either render on offscreen canvas, get an imageBitmap and
            transfer it.
            -

            Or give control to the offscreen canvas and render to it.
            This path is currently broken.
            -

         Not sure what the intended model is for rendering to multiple
         GPUWeb canvases.
         -

         In other words, do we need OffscreenCanvas at all for GPUWeb if we
         have SwapChains that can be targeted?
         -

         CW: Discuss at F2F?
         -

            Can ask for a context that’s an “NXT Surface” for a canvas and
            get a swapchain object from it.
            -

         KN: create an NXT device out of thin air, and get SwapChains for
         multiple canvases from it.
         -

            Make it so SwapChains can be transferred to workers and used
            there. Should be like OffscreenCanvas.
            -

         DJ: sounds pretty similar to how Apple’s original prototype
         worked. Made it look like WebGL way of setting things up
         (fetch context, etc.), and then ask for next texture to render
         into. Felt a bit weird.
         -

   Mozilla
   -

      DM: lot of internal work in the graphical abstraction layer, but not
      a lot of new features or milestones to report
      -

   Microsoft
   -

      RC: no code contributions yet (to NXT, etc.)
      -

   Apple
   -

      DJ: No news, but started internal discussions on shading languages,
      trying to get info on the sampleMask issue.


CW: JohnK will be discussing shading languages with us in 2 weeks (August
9), so will get info straight from the source.

Render passes

https://github.com/gpuweb/gpuweb/issues/23

   -

   DJ: mostly had consensus on this last week
   -

      What was left? Inheritance? Subpasses?
      -

   CW: people mostly agreed on having “multi-pass” render passes
   -

      Mostly to see if people had more to say after investigating internally
      -

      Tentatively: let’s do multi-pass render passes.
      -

      DJ: fine with Apple.
      -

      RC: did look at Corentin’s presentation. As long as it’s possible to
      implement efficiently on D3D12, it’s fine.
      -

      MM: conceptually render passes are a good idea. Were discussing which
      pieces of state should be part of a render pass.
      -

   DJ: what are next steps?
   -

      Should Corentin or someone define an API? Or move on with the
      assumption that we have an agreement of the overall shape of this part?
      -

   KR / MM: seems too early to define an API
   -

   DM: agree, also tied in to memory barriers
   -

   CW: maybe we can make a comprehensive list of things we want to talk
   about before looking at the shape of the API
   -

   DJ: Could do this on the wiki/github
   -

   MM: an issue per issue, then close them?
   -

   DM: Github milestones?
   -

   KR: spreadsheet?
   -

   DJ: volunteers?
   -

      MM: I can start!

Pipeline state objects

   -

   DJ: do we have complete consensus on pipeline state objects?
   -

   CW: still details like sampleMask
   -

   MM: probably should include depth/stencil state
   -

      Needs to be the intersection of the three APIs. Metal’s doesn’t
      include this, but WebGPU’s probably should since the other APIs
      do have it included.
      -

   CW: whatever makes backing APIs do no extra work at run time.
   -

      All the things that are given at pipeline state creation time in the
      underlying APIs should be part of the pipeline state creation in WebGPU.
      -

      Except for Vulkan’s scissor state which is insane.
      -

   DJ: Metal team said they had little feedback from developers about
   needing sampleMask. Think it is fine to leave it out in version one.
   -

      CW: Sounds fine to leave it out, can add it easily later.
      -

   DJ: Trying to get contacts of developers from the Metal team so we can
   talk with them.
   -

   DM: In Vulkan, some pipeline state, like separate blend state, is gated
   on device features and might not be available on all hardware. Would
   of course require some Vulkan features to stay sane.
   -

   DJ: any hardware that doesn’t support it?
   -

   DM: Vulkan has a very rich set of flags for device features
   -

   CW: can look at the Vulkan hardware database and see if a feature is
   available universally, ignoring obsolete hardware
   -

      For example, if ARM doesn’t support a feature, we definitely still
      want to support ARM
      -

      Agreement from Apple, Mozilla
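
CW's suggestion of checking a hardware database for universal availability amounts to intersecting per-device feature sets after dropping obsolete hardware. The following is only a minimal sketch of that idea. The device names and the `reports` data are made up for illustration, and `universal_features` is a hypothetical helper, not part of any tool mentioned in the meeting. The feature names follow `VkPhysicalDeviceFeatures` fields.

```python
from typing import Dict, FrozenSet, Set


def universal_features(reports: Dict[str, Set[str]],
                       obsolete: FrozenSet[str] = frozenset()) -> Set[str]:
    """Return the features reported by every device that is not marked obsolete."""
    relevant = [feats for name, feats in reports.items() if name not in obsolete]
    if not relevant:
        return set()
    return set.intersection(*relevant)


# Made-up reports (assumption); real data would come from a hardware database export.
reports = {
    "desktop_gpu_a": {"independentBlend", "sampleRateShading"},
    "mobile_gpu_b": {"sampleRateShading"},
    "legacy_gpu_c": set(),
}

# Ignoring the legacy device, only sampleRateShading is common to all devices,
# so independentBlend could not be assumed by the API in this made-up data set.
print(universal_features(reports, obsolete=frozenset({"legacy_gpu_c"})))
```

With this data, a feature missing on one current device drops out of the result, which matches the point that a feature cannot be required if a platform the group wants to support lacks it.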

Memory barriers

Continuation from last week’s meeting


   -

   CW: were talking about implicit vs. explicit barriers
   -

      Implicit barriers make it easier on the app, give more leeway to the
      implementation to optimize
      -

      Explicit: give more control to the application to batch memory
      barriers together & have less CPU overhead
      -

   MM: why do they have less CPU overhead?
   -

   CW: if you have implicit ones and want to batch them, then if you’re the
   driver, it’s somewhat easy. You have the command buffer in memory and can
   go backward, say “want this memory barrier to be bigger”.
   -

      But if you’re doing this in an encoding API, you have to walk
      backwards in the command buffer and then walk back forward.
      -

      Have looked at the Intel and AMD open-source Vulkan drivers.
      -

      Barriers in these drivers are just a couple of opcodes. Can edit if
      you submitted them in the past.
      -

      But you can’t do this with any of the backing APIs.
      -

   MM: this is an argument for why the driver’s better at this than the
   browser, not that the browser’s better at this than the author.
   -

   JG: automated batching is not an easy problem.
   -

   CW: Metal can do this more easily
   -

      The argument “Metal does this, so it’s possible” is not a good
      argument because we don’t have as much control over this as the
      driver does.
      -

   MM: sounds like somebody would need to try it. Having a fundamental
   disagreement about how difficult this is.
   -

   CW: what do we want to know exactly? Build two prototypes, one with
   implicit and one with explicit transitions?
   -

   DJ: it’s difficult to ask a browser to implement implicit transitions
   and batching to prove that it works…
   -

   MM: at some point you have to check all the resources are in the right
   state, as late as possible. Why not just issue the barriers there?
   -

   CW: barriers are “flush the whole L1 cache”, for example. Want to
   coalesce them.
   -

   MM: coalescing would happen as late as possible. Checks you would do are
   at the same time that the coalescing would occur.
   -

   CW: disagree. Example: doing some dynamic mesh with compute. Have
   writeable buffers. Then start using as vertex buffers. Barriers have to be
   “inside” the command buffer, not at encoder start/end.
   -

      Analysis sees a draw, looks at needed resources. Sees a vertex
      buffer. Issues a barrier.
      -

      It sees many such draw calls. Can’t do this in the past.
      -

   MM: you have a command buffer. Add as a writeable attachment. Then issue
   a draw. Are these not the times you would both check, and issue a barrier?
   -

   CW: e.g. using 3 writeable vertex buffers
   -

      Use buffer 1: have to issue a barrier.
      -

      Don’t know the future: or have to do an analysis pass first, then
      encode.
      -

   KN:
   -

      (W)rite (T)ransition (R)ead resource 1/2/3
      -

      Batched: W1 W2 W3 T1 T2 T3 R1 R2 R3
      -

      Just-in-time: W1 W2 W3 T1 R1 T2 R2 T3 R3
      -

   MM: would have to do this analysis at commit time
   -

      Don’t think this would be as complicated as CW says
      -

   BC: D3D has had runtimes which did this both implicitly (<= D3D11) and
   explicitly (D3D12)
   -

      The code that does this implicitly is large and complex, in
      particular handling multithreading
      -

      Would have to have basically the same system for Vulkan and D3D12
      -

      Easy to do for the simplistic case. But Corentin and others have
      already pointed out where you’d need a complex system to make
      this correct. Not over-adding barriers, and never missing one.
      -

      D3D has experience doing this: debug layer. This is a significant
      chunk of work, and don’t have to care about performance.
      -

      Alternate proposal of making the barriers explicit would only require
      Apple to no-op them.
      -

   DJ: what if the barriers were forgotten in Corentin’s example?
   -

      CW: validation error.
      -

   DJ: what about seeing when the user should have put in a barrier but
   didn’t?
   -

   CW: decided to validate resource usages / memory barriers.
   -

      Believe it doesn’t add much cost to some other feature -- like
      destroying textures but still having them around.
      -

      On every command buffer, have to check liveness of textures. At the
      same time, on the same cache line, can validate states.
      -

   KN: have to insert / execute some code to do this validation.
   -

      Think it’s roughly the same amount of time as the just-in-time usage
      transition.
      -

      As we’ve learned from the Intel and AMD drivers, this isn’t efficient.
      Have to coalesce. Coalescing is expensive.
      -

   DJ: is the coalescing benefit in Corentin’s example that you would only
   insert a barrier for buffers 1, 2, 3 since you know you’re done with them?
   -

      CW: yes. Rather than: barrier 1, draw 1, barrier 2, draw 2, barrier
      3, draw 3.
      -

   DJ: b/c you’re doing dependency tracking anyway, have to track that
   you’ve put the barrier in for B1, can’t you just do them all?
   -

      CW: no! you don’t know how they’re going to be used in the future.
      And we shouldn’t guess what the hardware does.
      -

   BC: heard people mentioning cache flushes, etc. But another thing that
   GPUs can do is change the compression format based on usage. Can see lots
   of thrashing on some GPUs. On all hardware Ben knows about, these
   transitions are needed.
   -

   DM: like to expand on explicit case. Not just barriers. Also: putting
   things at the end of the usage as a resource, so that you give the hardware
   time to do the transition.
   -

      AMD: use split barriers. Barrier start -> barrier end. Do
      decompression, then flush caches at appropriate time. These sorts of
      optimizations are not easy to automate. But: still easy to validate that
      the transitions are in place.
      -

      Validation becomes increasingly easier than automatically putting
      transitions in place.
      -

   DJ: leaning toward what Ben suggested, which is to go forward with
   explicit barriers, and Metal will no-op them. Would still like to hear from
   major developers using Metal and other APIs what performance hints they’re
   hitting with Metal doing it behind the scenes. Dean assumes that Metal may
   have some code for this but that it’ll never be as optimal as what the
   developer could have done themselves.
   -

   MM: it’s not a no-op. The validation still has to occur. (Agreement)
   -

   CW: there’s CPU overhead on Metal, and there’s mental overhead for the
   developer.
   -

   DJ: Metal wouldn’t have to do anything more than any other
   implementation, just not actually call the barrier.
   -

   MM: DM just gave an example of a particular graphics card. Requiring the
   web author to know the optimization possibilities of all the cards out
   there is not reasonable. Think it would be better for the browser to do a
   “good enough” job.
   -

   DJ: turning it around: it would be terrible if the developer optimized
   for one piece of hardware and it behaved poorly on other hardware.
   -

   BC: have seen developers port things to D3D12. Have seen things they do
   to make things performant pay off on multiple kinds of hardware. Don’t see
   things behave radically differently on Intel, AMD, NVIDIA, etc. In general
   are able to treat all the hardware types roughly the same.
   -

      Even UMA vs. non-UMA devices: there’s no way to do this in a
      performant way with the same set of code.
      -

      You’ll have to test on: tiled/non-tiled, discrete/UMA.
      -

      You have to do this with modern graphics implementations.
      -

      This is why Vulkan/D3D12 exposed the knobs.
      -

   CW: something written with WebGPU should be portable. But we’re talking
   about performance portability right here.
   -

   MM: seems unreasonable to ask every web author to test on 5 different
   platforms.
   -

   CW: look at WebGL right now. NYTimes cares about it running; not the
   last 5% of performance.
   -

   MM: if we only care about a few percent, the browser can just do it
   automatically.
   -

   CW: one case is using Three.js, and you don’t care about the last ounce
   of performance. Compare to Unity/Unreal, which would want coalescing, etc.
   If you do implicit barriers, you prevent the engine authors from optimizing.
   -

   KR: sounds a lot like compiler optimizations like software pipelining,
   inserting prefetches, etc. It’s hard. Seems a lot easier to just validate
   transitions the developer inserted earlier.
   -

   BC: barriers *are* some of the harder things to deal with.
   -

   CW: also, this is an online problem -- affects every command buffer
   submission.
   -

   DJ: have to adjourn discussion now -- continue August 9.
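
The batched and just-in-time orderings in KN's example, and the validation approach CW describes, can be modeled with a toy command stream. This is a minimal sketch under stated assumptions, not anything from NXT or a real driver. It assumes only three operations (W, T, R), that a run of adjacent transitions costs one barrier, and that a write puts a resource back into the written state. All function names are hypothetical.

```python
from typing import List, Tuple

# A command is (op, resource): "W" = write, "T" = transition, "R" = read.
Command = Tuple[str, int]


def parse(stream: str) -> List[Command]:
    """Parse a stream such as 'W1 W2 T1 R1' into (op, resource) pairs."""
    return [(tok[0], int(tok[1:])) for tok in stream.split()]


def validate(commands: List[Command]) -> None:
    """Explicit model: raise ValueError if a written resource is read untransitioned."""
    written = set()  # resources written since their last transition
    for op, res in commands:
        if op == "W":
            written.add(res)
        elif op == "T":
            written.discard(res)
        elif op == "R" and res in written:
            raise ValueError(f"resource {res} read without a transition")


def insert_just_in_time(commands: List[Command]) -> List[Command]:
    """Implicit model: emit a transition right before a read that needs one."""
    out: List[Command] = []
    written = set()
    for op, res in commands:
        if op == "W":
            written.add(res)
        elif op == "T":
            written.discard(res)
        elif op == "R" and res in written:
            out.append(("T", res))
            written.discard(res)
        out.append((op, res))
    return out


def count_barrier_groups(commands: List[Command]) -> int:
    """Count runs of adjacent transitions (assumed to cost one barrier each)."""
    groups = 0
    previous_was_transition = False
    for op, _ in commands:
        is_transition = op == "T"
        if is_transition and not previous_was_transition:
            groups += 1
        previous_was_transition = is_transition
    return groups


batched = parse("W1 W2 W3 T1 T2 T3 R1 R2 R3")
just_in_time = insert_just_in_time(parse("W1 W2 W3 R1 R2 R3"))

validate(batched)       # explicit, application-batched transitions pass validation
validate(just_in_time)  # so does the automatically generated stream
print(count_barrier_groups(batched), count_barrier_groups(just_in_time))
```

In this model the validator is a single forward pass over the stream, while producing the batched ordering automatically would require knowing future reads before emitting the first transition, which is the analysis pass CW refers to.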

Agenda for next meeting

   -

   Start discussion on shaders - August 9th
   -

   F2F: Swap chain vs. offscreen canvas

Received on Thursday, 27 July 2017 20:54:32 UTC