Minutes for the 2017-12-06 meeting

GPU Web 2017-12-06

Chair: Corentin

Scribe: Ken

Location: Google Hangout
Minutes from last meeting
<https://docs.google.com/document/d/12JQDw8gNE2MslMWRcK5sjZ3CsafhzIRZ6RZDyX-8MZk>
TL;DR

   - Status updates
      - Apple: posted a Secure HLSL
      <https://github.com/gpuweb/gpuweb/wiki/Secure-HLSL-Explainer> doc
      - Google: posted a UMA buffer mapping doc, discussed in this meeting
      - Mozilla: specialization constants in gfx-rs
   - Buffer Mapping and UMA doc
   <https://docs.google.com/document/d/1HFzMMvDGHFtTgjNT0j-0SQ1fNU9R7woZ4JuNJdAXBjg/edit#heading=h.1qqspik2b22p>
      - Google proposing alternative 2, a UMA extension
      - Discussion of alternatives to callbacks for synchronization:
      promises, or synchronous non-blocking queries.
         - WEBGL_get_buffer_sub_data_async discussions don't bridge to
         WebGPU because the structure of the APIs is different.
         - Should ask for ISV input to know what's most useful to them.
      - Persistently mapped buffers (PMBs) are important to applications.
      The proposal doesn't expose them directly.
         - Concern that PMBs wouldn't be device independent.
         - One more discussion of device independence vs. best performance.
         Suggestion that we have opt-in device-dependent behavior.
         - Do we want to be able to map-write to half a buffer while the
         other half is in use?

Tentative agenda

   - Buffer mapping and UMA
   - Agenda for next meeting

Attendance

   - Apple
      - Dean Jackson
      - Myles C. Maxfield
      - Theresa O'Connor
   - Google
      - Corentin Wallez
      - Victor Miura
      - Kai Ninomiya
      - Ken Russell
   - Microsoft
      - Rafael Cintron
   - Mozilla
      - Dzmitry Malyshau
      - Jeff Gilbert
      - Markus Siglreithmaier
   - Yandex
      - Kirill Dmitrenko
   - Elviss Strazdiņš

Buffer mapping and UMA

Google’s buffer mapping doc with UMA alternative
<https://docs.google.com/document/d/1HFzMMvDGHFtTgjNT0j-0SQ1fNU9R7woZ4JuNJdAXBjg/edit#heading=h.1qqspik2b22p>


   - CW: For UMA vs. discrete GPUs:
      - They're contrary to one another. What is efficient on UMA is
      different from what is efficient on discrete GPUs.
      - Want one code path that is efficient on both.
      - To be most efficient: either have two code paths, or have all
      resources be mappable.
      - Creation flag for whether a resource is mappable or not.
      - Either: a UMA extension that allows mapping resources, or implicit
      staging on the discrete GPU.
      - There was a third alternative: "I want to be able to map this and
      maybe it'll be staged", but that was strictly inferior.
   - MM: upon reading the 3 options, was wondering: it seemed all of the
   affordances for avoiding CPU-GPU data races had been removed from the UMA
   code; is that true?
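The "creation flag" idea above can be sketched as follows. This is a minimal illustration, not any shipped API: the names `GPUBufferUsage`, `createBuffer`, and `isMappable` are hypothetical.

```javascript
// Hypothetical usage flags: a buffer declares at creation time whether it is
// mappable, and validation can key off that.
const GPUBufferUsage = {
  MAP_WRITE: 1 << 0,
  TRANSFER_DST: 1 << 1,
  VERTEX: 1 << 2,
};

function createBuffer(descriptor) {
  return {
    size: descriptor.size,
    usage: descriptor.usage,
    // Only buffers that asked for MAP_WRITE at creation may ever be mapped.
    isMappable: (descriptor.usage & GPUBufferUsage.MAP_WRITE) !== 0,
  };
}

// A staging buffer opts in to mapping; a vertex buffer does not.
const staging = createBuffer({
  size: 256,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.TRANSFER_DST,
});
const vertices = createBuffer({ size: 1024, usage: GPUBufferUsage.VERTEX });
```

On UMA, a `MAP_WRITE` resource could be mapped directly; on discrete, the implementation (or an extension) decides whether to stage.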
      - CW: no, it's the async map mechanism that was presented at the
      start of the document.
      - It's a question of what's allowed on what type of resource.
      - Alternative 3: all resources can do anything. On UMA, can say "Map
      Write Async" and get a pointer; writes go directly to the resource. The
      callback mechanism ensures no data races. On discrete, it gives you back
      a staging buffer instead. When you unmap, it does a copy.
      - Is it clear this still prevents data races?
      - JG: why are we using async?
      - MM: because the GPU might be using the resource at the time the CPU
      wants it.
      - CW: in NXT we want to eliminate data races.
      - JG: usually have read/write exclusion. Don't need async for that.
      - CW: yes, we could tell the application that you have to transition
      to MAP_READ usage and then wait for the resource to no longer be used,
      then map synchronously. Would fail if simultaneous use can't be ruled
      out.
      - Then the app wouldn't have to guess when WebGPU thinks it's safe to
      write the resource. That's why it's async.
      - JG: callbacks are traditionally tricky to put into an API because
      you don't know what's on the call stack when the callback's invoked.
      - MM: we get to design this.
      - JG: it's simpler if you don't have these things.
      - CW: what about a Promise?
      - JG: same thing, more steps. Seems unnecessary. If you're writing
      data out to the GPU, you WRITE_DISCARD something, write into it, and
      send it off to be committed into the GPU resources. In particular, why
      make writes async?
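The promise-based variant under discussion can be sketched like this. Everything here is an illustrative mock under assumed semantics (`MockBuffer`, `mapWriteAsync`, `unmap` are hypothetical names): mapping resolves only once the GPU is done with the buffer, so the application can never write into memory the GPU is reading.

```javascript
// Mock of an async map-write: on UMA the returned view would alias the
// resource; on discrete it would be a staging buffer copied on unmap.
class MockBuffer {
  constructor(size) {
    this.size = size;
    this.data = new ArrayBuffer(size);
    this.gpuBusyUntil = Promise.resolve(); // resolves when the GPU is done
  }
  mapWriteAsync() {
    // Wait for any in-flight GPU use, then hand back writable memory.
    return this.gpuBusyUntil.then(() => new Uint8Array(this.data));
  }
  unmap() {
    // On a discrete GPU, this is where the staging copy would be enqueued.
  }
}

async function upload(buffer, bytes) {
  const view = await buffer.mapWriteAsync(); // no data race: GPU has released it
  view.set(bytes);
  buffer.unmap();
}
```

The race-freedom argument is structural: the only way to get a pointer is through the promise, and the promise only resolves when the implementation knows the GPU no longer owns the memory.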
      - TOC: it's unusual on the web for things to not be async.
      - JG: it's unusual to do low-level graphics APIs on the web.
      - CW: there's also MapWriteSync. Pipeline the write if necessary,
      otherwise schedule the write now.
      - To be race-free, the writeback machinery is only needed when the
      buffer is in use by the GPU. How does the app know the resource isn't in
      use by the GPU any more?
      - Can have a polling loop. But the callback tells you when it's
      ready. Understood that callbacks aren't preferred.
      - JG: think they're not necessary. In the investigation for WebGL's
      GetBufferSubDataAsync, didn't find any advantage to doing writes async.
      Unless you want to use the same shmem and a deep enough pipeline that
      you want to rewrite into the shmem before it's committed…
      - MM: this is one of the general discussions we've had between ease
      of implementation vs. ease of use. From the web's standpoint it's better
      to not have to ask every frame "is this resource ready?"
      - KD: if you have async then you're typically forced to wait for the
      callback, put the operation into a queue, and schedule it. If you're in
      the middle of an animation you don't want to do expensive writes. There
      will be some scheduling around it anyway. Most applications carefully
      schedule expensive operations like writes.
      - VM: wouldn't an application have to use fences/queries to ensure a
      buffer is writable?
      - JG: for reading yes, but for writing no. Say you update skinning
      data by splatting the data into a buffer. Next frame, you send a new
      update of the skinning data to the GPU. If it's not ready to be written
      you would stall.
      - VM: you would use multi-buffering to ensure you don't stall, and
      use some sort of synchronization.
      - RC: think Kirill is on the right track. In a typical web app you
      render in response to requestAnimationFrame. When rAF comes, you ask
      "which buffers can I use?" and check your own data buffers to figure out
      which can be used. Would use multiple buffers and put them on some list
      when their callbacks are called. Not much different from exposing to the
      user "this buffer is ready/not ready".
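The multi-buffering pattern RC describes can be sketched as a small pool whose "ready" list is fed by the release callback. All names (`BufferPool`, `acquire`, `onBufferReady`) are hypothetical; the point is that whether the signal arrives as a callback, a promise, or a boolean, the app-side bookkeeping looks the same.

```javascript
// Pool of upload buffers: acquire one per frame; the release callback
// (fired when the GPU is done) returns buffers to the free list.
class BufferPool {
  constructor(count) {
    this.free = [];
    for (let i = 0; i < count; i++) this.free.push({ id: i, inUse: false });
  }
  acquire() {
    // Non-blocking: returns null if every buffer is still in flight.
    const buf = this.free.pop() || null;
    if (buf) buf.inUse = true;
    return buf;
  }
  // Invoked from the map/fence callback when the GPU releases the buffer.
  onBufferReady(buf) {
    buf.inUse = false;
    this.free.push(buf);
  }
}

const pool = new BufferPool(2);
const a = pool.acquire();
const b = pool.acquire();
const c = pool.acquire(); // null: both buffers are in flight
pool.onBufferReady(a);    // callback hands `a` back to the pool
const d = pool.acquire(); // gets `a` again
```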
      - MM: agree with Rafael.
      - CW: agree with everything except that the app could do the same
      with fences, etc. Question is: if the app doesn't put in fences at all,
      that puts a lot of pressure on the implementation to order things
      properly. Guaranteeing e.g. "all buffers before this fence are now
      ready".
      - RC: just saying that a callback mechanism is worse than, or the
      same amount of effort as, apps managing their own fences. Or we have to
      arrange our own fences so that we have enough bookkeeping to answer the
      user's question. In general, against the use of callbacks. Do enough
      bookkeeping to tell users the current state of things.
      - MM: the bookkeeping you describe is exactly the infrastructure in
      Promises.
      - JG: would have to track which fences have passed, etc., and whether
      the buffer is in use / ready to write to. Orthogonal to
      promises/callbacks/etc. Track which fences have passed. Be able to tell
      if this buffer is ready. Can: trigger a callback, set a boolean, trigger
      a promise.
      - MM: this sync problem needs to be solved somewhere. The browser can
      do it just as well as the app.
      - JG: I found that using a synchronous command structure worked just
      as well.
      - KR: This isn't about getBufferSubDataAsync. We agree your proposal
      would be possible to do. We are concerned about how easy it is to fall
      off the fast path. For the readback case you need to do the readback and
      wait for a fence to pass before you can read the data. We would like
      things to be more explicit than that, with an open query for the
      readback; found it makes the tracking easier in our implementation.
      - JG: Appreciate the feedback, but GL is a well-ordered API, so for
      GL using a fence should be OK.
      - KR: In our implementation it makes things harder, with the ordering
      of fences, and we have to delay things.
      - CW: Let's move on, because a lot of these details are very WebGL
      specific.
      - KR: The question is whether we should do async vs. sync for WebGPU.
      For developer convenience it seems we should follow the native explicit
      APIs and have a synchronous API.
      - VM: may want to talk with ISVs about what they expect. They
      probably have a synchronous pipeline. They may use async callbacks...
      - CW: ISVs will probably have strong opinions about this. Was
      assuming that async was best for the web, though sync might work.
      Whether you use callbacks, syncs, promises, etc. – we can solve these
      differences – are there comments on the UMA proposal?
      - RC: to what degree should we hold the developer's hand and make
      sure they can't screw it up? Or if they screw it up and things don't
      look good, is that OK? Should we go the extra mile to do validation so
      they can't mess up?
      - MM: we should try very hard to make it very difficult for users to
      shoot themselves in the foot. Unless I've misunderstood, it should be
      impossible for the API to block the main thread. A sync API would cause
      the API to block, right?
      - CW: no, the sync version would be "can I map you now?" And if you
      try to map while the answer is "no", you get null back and an error. All
      non-blocking.
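The synchronous, non-blocking variant CW describes can be sketched as below. Names (`MockDevice`, `tryMapWrite`, `gpuInUse`) are hypothetical mocks: the call never blocks; it either succeeds immediately or returns null and records an error.

```javascript
// Sync non-blocking map: "can I map you now?" Null plus an error when the
// GPU still owns the buffer; never a main-thread stall.
class MockDevice {
  constructor() { this.errors = []; }
}

class MockBuffer {
  constructor(device, size) {
    this.device = device;
    this.data = new ArrayBuffer(size);
    this.gpuInUse = false; // cleared when the implementation sees a fence pass
  }
  tryMapWrite() {
    if (this.gpuInUse) {
      this.device.errors.push("buffer still in use by the GPU");
      return null; // non-blocking failure
    }
    return new Uint8Array(this.data);
  }
}

const device = new MockDevice();
const buffer = new MockBuffer(device, 16);
buffer.gpuInUse = true;
const first = buffer.tryMapWrite();  // null + error: GPU owns the buffer
buffer.gpuInUse = false;
const second = buffer.tryMapWrite(); // succeeds once the GPU is done
```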
      - CW: agree the API should prevent shooting yourself in the foot. Can
      see the difference between discrete GPU and UMA architectures. On a
      discrete GPU your app might work because your GPU has caches which hide
      things. On UMA, your writes are directly visible to the GPU; the app
      might be broken on UMA only. If you have a persistently mapped buffer
      that you can write to while the GPU's using it, it's hard or impossible
      to write a validation layer.
      - DM: can't we prevent persistent mapping in general?
      - MM: yes; if you can keep something mapped for 1000 frames, you'll
      run into sync problems on some drivers.
      - DM: it's a really powerful use case for some users. Not something
      that can be discarded trivially. Games will like to update parts of the
      texture atlas while others are in use.
      - MM: there's a general principle that the web is device independent.
      If you can write an app that works on your device but is broken on all
      others, that's bad.
      - JG: We already made the Web device dependent.
      - KR: Not true; in WebGL we got rid of MapBuffer because in Chromium
      it is not possible to have these synchronous APIs. A lot of work in
      WebGL is to make the API device independent.
      - JG: For example, different limits in WebGL.
      - CW: I think we should learn from WebGL. If you want something above
      the limits you should have to ask for it. It's a different discussion.
      - VM: Strongly opposed to fragmentation. Need to learn from the
      lessons we've learned. The more we can ensure write-once, run-everywhere
      across browsers and devices, the more this API will see adoption.
      - JG: past a certain point WORA doesn't work for performant
      applications. Have to compromise on this a little bit for best
      performance.
      - VM: it's a question of supporting the few or the many. Most
      developers can't test on all devices. Strongly feel we should enable
      apps to work by default.
      - JG: yes. And if they use Three.js Version 2.0, they will. Don't
      want to do too much handholding, because the ecosystem will do a better
      job than we can.
      - CW: the goal isn't to make an API that's easy to use. Whether
      you're an expert or a beginner, if you make an app that doesn't ask for
      extensions or limits, it should work on another machine.
      - JG: agree.
      - CW: not sure that the app will be optimally performant, but it will
      work. Goal is to make device-dependent behavior opt-in.
      - VM: we're making user agents here. Try to prevent bad experiences
      for users. Shouldn't be just up to the developer to "not mess it up".
      - JG: strongly caution against going too far in that regard.
      Preventing screw-ups also prevents well-meaning developers from doing a
      good engineering job.
      - CW: goal is to have low-level, deterministic primitives.
      - CW: mapping a buffer persistently is scary from this standpoint.
      It opts the developer in to races.
      - JG: only if you allow multiple writes at the same time. Can we have
      a single-writer API? Read-write mutually exclusive.
      - MM: thought the problem was CPU vs. GPU collisions.
      - VM: this was a use case requested earlier. Writing to half a buffer
      while the other half's in use. Is this something we want?
      - MM: maybe we can have a system that makes this well defined.
      Partition buffers?
      - KN: think it's not that difficult to make something that works this
      way, but it might have performance penalties.
      - CW: can make sub-buffers and persistently map a larger buffer. Can
      get in the way of robust buffer access checks done by the hardware. Can
      round-robin between multiple buffers.
      - MM: sounds like there's a persistent mapping, but it's in the
      implementation?
      - CW: yes. You can map the whole buffer at once, but can't really do
      a ring buffer (instead, a ring buffer of buffers). But there's an API
      for "I want to update this sub-part of a buffer now". It does a copy.
      - RC: so in your implementation the "buffer" the user sees is a
      window into a bigger API buffer?
      - CW: no, two mechanisms. Can update a buffer by mapping it – the
      discussion earlier – only map the whole thing. But, if that annoys you,
      you can say "I want to put this data at that offset in this buffer".
      Essentially BufferSubData.
      - RC: does it let you do that while it's in use by the GPU?
      - CW: yes. To do that you need the buffer in the TRANSFER_DESTINATION
      usage.
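The second mechanism CW describes (a BufferSubData-style "put this data at that offset") can be sketched as follows. The names `setSubData` and `Usage` are illustrative; the validation rule (the buffer must carry a transfer-destination usage) follows the discussion above, while a real implementation would stage the bytes internally and enqueue a GPU-side copy rather than writing directly.

```javascript
// Hypothetical usage flags for the sketch.
const Usage = { TRANSFER_DST: 1, MAP_WRITE: 2 };

function setSubData(buffer, offset, bytes) {
  // Only buffers created with a transfer-destination usage may be updated
  // this way; that's what makes it safe while the GPU is using the buffer.
  if ((buffer.usage & Usage.TRANSFER_DST) === 0) {
    throw new Error("buffer lacks TRANSFER_DST usage");
  }
  // Real implementation: copy into an internal ring buffer and schedule a
  // GPU copy at the given offset. Here we just write directly.
  new Uint8Array(buffer.data).set(bytes, offset);
}

const buffer = { usage: Usage.TRANSFER_DST, data: new ArrayBuffer(8) };
setSubData(buffer, 4, [9, 9]); // update a sub-range without mapping
```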
      - MM: agree with the general tenets of this approach – preventing you
      from writing into buffers that are in use by the GPU. Need to look at
      the general mechanism in more detail.
      - RC: will our code for doing this automatically do better than code
      written by an application that knows what its intent is?
      - MM: depends on what you mean by "better". Portable? Definitely. If
      you're writing for one specific device you might be able to do better.
      - RC: meaning that the web developer might need fewer fences, group
      things using fewer primitives.
      - VM: interesting presentation from AMD: if you want the optimal
      thing you might have to present a graph to the API. Neither the driver
      nor the developer can do the best thing for each architecture.
      - JG: how much of that graph is what we get from submitting command
      buffers with dependencies?
      - CW: let's not get into a task graph API. The industry hasn't
      converged yet.
      - JG: this is a task graph API already.
      - CW: NXT is an online task graph API. AMD's proposal was an offline
      plus online task graph API.
      - JG: Vulkan's a task graph API.
      - KN: AMD was saying that Vulkan's task graph API isn't enough –
      need more higher-level stuff.
      - VM: need hindsight to do the optimal thing. Vulkan's very low
      level. Have to expose more and more flags to the developer to do the
      right thing. Also can't express future architectures.
      - JG: please put a link to that in the minutes.
      - CW: will try to; not sure it was public.
      - RC: if we generalize this and make it so the web developer can't
      shoot themselves in the foot, will their code be as performant as a
      developer doing it themselves?
      - CW: no. The best we can do is the best the developer can do. The
      question is how much we lose. I think we'll lose very little. The worst
      you do is a ring buffer of buffers instead of a single buffer. That's
      probably the biggest loss. Aside from that, we're probably doing what
      any sensible app would do.
      - CW: while doing NXT's Vulkan backend, have been doing this buffer
      upload stuff, and have been writing code exactly like what an
      application would do. We end up doing per-resource tracking, while the
      app could plausibly do this at a coarser granularity.
      - RC / CW: when is this checking done? Queue submits.
      - RC: if we think that we won't lose much perf, there's a lot to be
      said for things being more portable, and we'd have fewer bug reports
      from people where the problem is their own bookkeeping.
      - JG: would like to learn more about when this would be the case and
      what guarantees the APIs we're building on top of have. Would be good if
      we didn't have to lock things down too badly.
      - KD: can we disallow people from shooting themselves in the foot on
      the web, but allow opt-in performance?
      - CW: scary, because everyone will opt in to the high-performance
      unsafe mode. But it seems OK to add more knobs for apps that request
      them.
      - CW: this goes back to UMA and buffer mapping. That's why we'd like
      to have a UMA extension which provides that functionality. Opting in to
      footguns that would make apps non-portable doesn't sound OK.
      - JG: you're saying, better to have performance footguns than
      correctness footguns?
      - CW: yes. If you have correctness footguns that give you 1% more
      perf, everyone will use them and the apps will be non-portable.
      - JG: for some developers, performance is part of the minimum viable
      product. Let's not throw it out blindly. Should be trying to avoid
      performance cliffs.

   Google / CW: still somewhat happy with our proposal (async vs. sync is a
   problem that we’ll have to decide). Concern about persistent buffers used
   for ring buffers. Doesn’t account for sub-buffer usage. Could be a problem
   for ISVs but aren’t sure. Our points: avoid footguns but have no perf
   impact from avoiding footguns.
   -

      KR: think sync uplodads, async readbacks.
      -

      KN: in NXT, have both async and sync uploads in the design. Think
      async uploads have uses but not sure.
      -

   Apple / MM: should be difficult to make an app where uploads/downloads
   are non-portable. Should be an async API. Should be one code path for UMA
   and discrete. Safety and portability are important and should not be
   sacrificed. Some small perf penalty is worth that correctness.
   -

   Microsoft / RC: if we can make uploads portable and not impact perf then
   we should do that. If we do UMA as an extension, let’s do that. They added
   it to D3D12 because they thought there was a need for it.
   -

   Mozilla / JG: struggling to apply WebGL investigation to these APIs. Big
   Q is what ordering guarantees look like. Hope there’s a good solution that
   keeps these looser ordering guarantees in mind and allows very safe
   read/write exclusion that makes sense and prevent footguns. Think there’s
   not a strong benefit for async uploads. Think sync uploads are just as fast
   and don’t hurt. Readbacks, not so sure, esp. because orderings are looser.
   -

   Mozilla / DM: explicit memory management is one of the biggest
   advantages of the low level APIs. Persistent mapping is a critical feature.
   Would prefer to have only two timelines, the app side and one per hardware
   queue that executes it. This means a sync based API is preferred. (Can I
   map/use this.)

Agenda for next meeting

Next week: shading languages.

Received on Monday, 11 December 2017 23:21:33 UTC