Minutes for the 2017-11-15 meeting

GPU Web 2017-11-15

Chair: Dean

Scribe: Ken w/ help from Kai

Location: Google Hangout
Minutes from last meeting
<https://docs.google.com/document/d/1oI7_SbWYs2bO0D4NqjOd8L7wAqNAswZa2nDfE1xmy9M>
TL;DR

   - Recap of the WASM CG meeting.
   - Status updates:
      - Apple: Has codegen from Secure HLSL to SPIR-V.
      - Google: Wrote docs about NXT memory barriers and buffer mapping.
   - Resource upload/mapping:
      - The goal of NXT’s design is to minimize copies. Explanation of the transition concept.
      - Having resources be mappable is the best way to minimize copies on UMA.
      - Discussion of having all resources be stageable vs. requiring a creation flag; became a discussion of hints vs. explicit parameters.
      - NXT’s design uses a callback to signal when the mapped buffer is ready; discussion of whether this is necessary to avoid CPU-GPU races.
   - No meeting next week.

Tentative agenda

   - Administrative stuff (if any)
   - Individual design and prototype status
   - Resource upload / mapping
   - Agenda for next meeting

Attendance

   - Apple
      - Dean Jackson
      - Myles C. Maxfield
   - Google
      - John Kessenich
      - Kai Ninomiya
      - Ken Russell
   - Microsoft
      - Chas Boyd
      - Rafael Cintron
   - Mozilla
      - Dzmitry Malyshau
      - Jeff Gilbert
   - Yandex
      - Kirill Dmitrenko
   - Elviss Strazdiņš
   - Markus Siglreithmaier

Administrative items

   - Later: brief recap of TPAC meeting
   - License: Dean: still no update from legal folks
      - Microsoft: no updates
   - Next week is Thanksgiving in the USA; ~half the group will be on vacation
      - Will skip next week’s meeting
   - Myles: we’ve also been on a particular cadence regarding shading languages
      - Topic for the next discussion (in two weeks) will be shading languages
   - Myles: brief recap of TPAC meeting:
      - ~4 broad topics covered:
      - How an object model would look in WebAssembly
         - WebAssembly today has four types: two integer and two floating-point types. No native support for a “WebGPUDevice”.
         - How would opaque types be handled?
         - Identified by number to the application. When WASM calls a WebGPU function, it passes the number; the bindings look the object up in a table and pass it to an implementation function.
         - Kai: The tables concept exists already. Right now there’s exactly one table. The idea is to add multiple tables, one per type: WebGPUTexture, WebGPUBuffer, etc.
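The table-of-handles idea above can be illustrated with a toy sketch. This is purely hypothetical binding code (the function and class names are invented for illustration); WASM itself would only see the integer handles:

```javascript
// Hypothetical sketch of JS bindings exposing opaque WebGPU objects to WASM
// as integer handles via a per-type table, as discussed above.
class HandleTable {
  constructor() {
    this.entries = [];   // index -> object
    this.freeList = [];  // recycled indices
  }
  add(obj) {
    const id = this.freeList.length ? this.freeList.pop() : this.entries.length;
    this.entries[id] = obj;
    return id; // the integer the WASM module sees
  }
  get(id) {
    const obj = this.entries[id];
    if (obj === undefined) throw new Error(`invalid handle ${id}`);
    return obj;
  }
  remove(id) {
    this.entries[id] = undefined;
    this.freeList.push(id);
  }
}

// One table per opaque type, as proposed in the meeting.
const bufferTable = new HandleTable();

// Binding layer: WASM passes an integer; the binding looks up the object.
function wgpuBufferSetSubData(bufferId, offset, bytes) {
  const buffer = bufferTable.get(bufferId); // throws on a bogus handle
  buffer.data.set(bytes, offset);
}

const id = bufferTable.add({ data: new Uint8Array(8) });
wgpuBufferSetSubData(id, 2, [1, 2, 3]);
```

The lookup also acts as a validation point: a forged or stale handle throws instead of touching another object.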
      - How to do threading
         - In general: want the option to have threading in the WebGPU API
         - Kai: they’re close to having (or already have) SharedArrayBuffer shared among threads, including atomics
         - Kai: the WASM group said they haven’t thought much about sharing opaque objects between threads. SharedArrayBuffer is the only thing that can be shared concurrently between web workers. Transferables allow objects to be given between web workers.
         - We might want to add more “shareable” objects.
      - Property bags / extensions / expandability
         - Have structs / fields / know where these lie
         - In Chicago, agreed there should be a way to do extensions
         - Probably don’t want a model where a struct is as big as every possible extension; probably want a more abstract “property bag” or “dictionary” API where the WASM author specifies they want these few extensions.
         - WASM doesn’t have any affordances for this right now but would be happy to investigate
         - Ken: questions about how many primitives really need to be added, or whether much of this can be done in the bindings
         - MM: talked about this. Discussed two models: one was a new primitive; the other was to wrap the property bags inside opaque objects and make calls to set/get things.
         - Kai: Like a C interface to a JavaScript API. Emscripten already does this for WebGL. The main issue is performance.
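The “property bag wrapped in an opaque object” model could look roughly like the sketch below. All class, method, and property names here are invented for illustration, not a real proposal:

```javascript
// Sketch of the property-bag model discussed above: instead of a fixed-size
// struct covering every possible extension, descriptors are opaque objects
// manipulated with set/get calls. Names are illustrative.
class PropertyBag {
  constructor() { this.props = new Map(); }
  set(key, value) { this.props.set(key, value); return this; }
  get(key, fallback) {
    return this.props.has(key) ? this.props.get(key) : fallback;
  }
}

// A consumer reads the base fields it knows about and probes for extension
// fields, so adding extensions never changes a struct layout.
function describeTexture(bag) {
  const width = bag.get('width', 1);
  const height = bag.get('height', 1);
  const tiling = bag.get('ext.vendorTiling', null); // extension probe
  return { width, height, tiling };
}

const desc = new PropertyBag().set('width', 256).set('height', 128);
const info = describeTexture(desc);
```

The trade-off Kai raises is visible here: every field access becomes a call through the bindings rather than a plain memory load, which is where the performance concern comes from.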
      - Buffer mapping
         - Would be great if a WASM program could load/store GPU memory using WASM load/store ops, not function calls. Talked about it, have some ideas, but nothing exists yet.
         - Two models. First: a special ArrayBuffer, where the result of Map is an ArrayBuffer and loads/stores are function calls. Second: WebAssembly could mmap() and replace a region of the WASM heap with the mmap’ed GPU memory.
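The first model might behave like the toy sketch below. This is an illustration of the idea only (the class and its methods are invented), not the actual WASM or WebGPU proposal:

```javascript
// Toy sketch of the "special ArrayBuffer" mapping model mentioned above:
// map() exposes a CPU-side staging ArrayBuffer that the app writes into,
// and unmap() is where an implementation would copy the bytes to
// (simulated) GPU-local memory on a discrete GPU.
class MappableBuffer {
  constructor(size) {
    this.gpuMemory = new Uint8Array(size); // stand-in for GPU-local memory
    this.staging = null;
  }
  map() {
    if (this.staging) throw new Error('already mapped');
    this.staging = new Uint8Array(this.gpuMemory.length);
    return this.staging.buffer; // the ArrayBuffer the app sees
  }
  unmap() {
    if (!this.staging) throw new Error('not mapped');
    this.gpuMemory.set(this.staging); // the copy a discrete GPU would need
    this.staging = null;
  }
}

const buf = new MappableBuffer(4);
new Uint8Array(buf.map()).set([10, 20, 30, 40]);
buf.unmap(); // contents now "visible" to the GPU
```

In the second (mmap-style) model, the copy in `unmap()` would disappear because the WASM heap region and the GPU memory would be the same pages.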
      - DM: question about property bags:
         - Vulkan has a standard struct and a pointer at the end to chain extensions together. If we make the property bag a field at the end of the standard struct, while the basics are regular fields, we can defer the solution for property bags until after the MVP. Could consider this one of the choices.
         - In Mozilla’s WebGPU prototype, all objects have an ID; they’re a bag on the GPU process side and can’t be represented on the client side. Would be nice to avoid double indirection of IDs, though it’s an implementation detail from the user’s point of view.
            - MM: correct that this is an implementation detail. We shouldn’t impose restrictions on browsers that don’t use that design. There could be security issues there, for example if you have one GPU process and all textures are represented in a single array, because you have to ensure that pages can’t see other pages’ textures.

Individual design and prototype status

   - Apple
      - MM: now have a SPIR-V codegen phase for the JavaScript compiler (from Secure HLSL)! Not fully featured, but it has compiled at least one shader.
      - This SPIR-V works in Apple’s example API structure for WebGPU that’s checked in to the WebKit repo, implemented on top of Vulkan!
      - Now can use the best shading language in the world with the best API structure in the world. :) :P
      - At some point will implement a Metal backend for this library.
   - Google
      - CW: written two documents with updated views on memory barriers and buffer mapping. Wrote them to hopefully use in today’s discussion.
      - Uploaded a bunch of changes to index formats so that they live in the pipeline instead of the command buffer.
   - Microsoft
      - RC: no status updates. Digesting emails and documents.
   - Mozilla
      - DM: nothing substantial

Resource upload / mapping

   - Documents from Google:
      - Memory barriers:
        https://docs.google.com/document/d/1k7lPmxP7M7MMQR4g210lNC5TPwmXCMLgKOQWNiuJxzA
      - Buffer mapping:
        https://docs.google.com/document/d/1HFzMMvDGHFtTgjNT0j-0SQ1fNU9R7woZ4JuNJdAXBjg
   - CW: in NXT the idea for data uploads was to have one path which minimizes copies. To do this you have some sort of buffer-mapping concept. Want to eliminate copies from the browser side into the WebGPU-visible side.
   - So, questions: how do we make memory mapping safe? And how do we avoid memory races?
   - In NXT, each resource can be in a combination of read-only usages, *or* in a single writeable usage. This restriction *eliminates* data races.
   - When you map a buffer, that only works when it’s in the correct state, and it can’t be used for anything else while mapped: no reading vertices from it, etc. This means that NXT is data-race free in this regard, and it eliminates copies behind the scenes. Also, the GPU won’t read from the buffer until you’re done writing it.
   - MM: questions:
      1. The docs sounded like a resource could be in a state where it’s readable by the CPU and usable by the GPU?
         - A: KN: when you create a resource you specify the allowed usages. Then later you specify the current usage.
         - If a buffer has the allowed usage MapWrite, the only other valid allowed usage is TransferSrc. Then the current usage of the buffer, at any given time, can only be either MapWrite or TransferSrc.
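KN’s allowed-usage vs. current-usage rule can be sketched as a small state machine. This is loosely modeled on the NXT design as described in the meeting; the constant and method names are illustrative, not NXT’s actual API:

```javascript
// Sketch of NXT-style usage validation as described above: a buffer is
// created with a set of allowed usages, and its current usage may only
// transition within that set. Names are illustrative.
const Usage = { MapWrite: 'MapWrite', TransferSrc: 'TransferSrc',
                TransferDst: 'TransferDst', Vertex: 'Vertex' };

class Buffer {
  constructor(allowedUsages) {
    // Per the discussion: MapWrite may only be combined with TransferSrc.
    if (allowedUsages.includes(Usage.MapWrite)) {
      const ok = allowedUsages.every(
        u => u === Usage.MapWrite || u === Usage.TransferSrc);
      if (!ok) throw new Error('MapWrite only combines with TransferSrc');
    }
    this.allowed = new Set(allowedUsages);
    this.current = null;
  }
  transitionTo(usage) {
    if (!this.allowed.has(usage)) {
      throw new Error(`usage ${usage} not in allowed set`);
    }
    this.current = usage;
  }
}

const upload = new Buffer([Usage.MapWrite, Usage.TransferSrc]);
upload.transitionTo(Usage.MapWrite);    // ok: CPU fills the buffer
upload.transitionTo(Usage.TransferSrc); // ok: GPU copies out of it
// upload.transitionTo(Usage.Vertex);        // would throw: not allowed
// new Buffer([Usage.MapWrite, Usage.Vertex]); // would throw at creation
```

Because validity is decided at creation time, an implementation knows up front which buffers can ever be CPU-visible.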
         - CW: probably want buffers to be mappable and be more than just transfer sources or destinations. Gets tricky because there are some other interactions.
         - MM: that was my thought. If the only way to use the feature of mapping data on the CPU is to “prepare” a future copy, which is a command enqueued in the command stream, you haven’t reduced the copies.
         - CW: there are always two copies. First, when you write the data on the CPU: uncompress an image, etc. Then, on a discrete GPU, a copy from CPU-visible memory to GPU-local memory for best performance.
            - (When you map the buffer you write as much as possible into GPU-visible memory: decompressing data into the GPU-visible buffer, then copying from CPU-and-GPU-visible memory to GPU-local memory.)
            - This is the minimal number of copies.
            - But: the current design isn’t great for UMA yet. Need to figure out how to make that work better.
         - MM: Apple’s primary platform is UMA.
         - CW: for UMA: have an extension saying “you’re on UMA”? The app would need to do the right thing.
         - Second option: two types of mapping. (1) Transfer buffer only; the app has to do a copy. (2) “Staged”: let me map this buffer even though it’s GPU-local (incurs an extra staging buffer and copy on discrete GPUs, but not on UMA).
         - Think it’s similar to Apple’s approach where all 3 usages are always allowed. Makes things simpler on the GPUWeb implementation side because you know which things have staging copies and which don’t.
         - MM: why not allow all resources to have staging copies?
         - CW: it’s less explicit. Apps don’t want all resources to be staged; they want control over this. If we make all resources staged then it’s blurry.
         - JG: from OpenGL and D3D12, we have types of buffers based on heap type and/or a usage hint during BufferData. This provides an optimization hint about how the resource will behave, but you can do whatever you want. You can map upload heaps; it’ll be slow. But it lets the implementation optimize for uploads and minimize the number of copies.
         - MM: think we should have hints about where things should be.
         - CW: hints aren’t great. The app should say “I’m going to use this thing this way”. With hints you need to support all paths in the implementation anyway. If the app says it will only use a resource in a particular way, then the implementation can optimize things much more.
         - CW: glBufferData(..., GL_STATIC_DRAW): the driver assumes the buffer lives in GPU-local memory, but you can really do whatever you want. Many OpenGL drivers have heuristics and use the hint to prime the heuristic. But the implementation still needs usage counters, optimizations, etc. A hint that doesn’t enforce anything makes things worse: the implementation still has to handle all possible cases.
         - CW: in ANGLE we keep up to 4 copies of a buffer because it could be used in 4 different ways. Since it’s just a “hint”, there’s no way to eliminate this.
         - KR: a huge goal of a new explicit API is to remove the possibility of implicit shadow copies and give memory control to the application.
         - JG: appreciate how it can make things easier if you can reliably say “I just need a copy of this”, but at the same time the primitive we need is “I have a resource, and it’s either CPU-mappable or not”. Based on CPU-mappability, use a staging buffer or not. A sophisticated implementation might decide more behind the scenes. There’s also the possibility of polyfills: can copy to a staging buffer and back, and give you a memory-mapped copy later. But let’s focus on the primitives the hardware needs to give us.
         - MM: reducing the API to the lowest-level primitives isn’t always the best way to make an API. There are many ways to judge an API.
         - JG: the main idea is that you can do whatever you want if you have these primitives. But if you use a higher-level primitive, then you force this decision on users of the API.
         - MM: there’s a difference between letting someone write a library and forcing them to do so.
         - CW: the hope was that the buffer mapping proposal would seem simple enough and also provide low-level primitives to satisfy both points of view: get the best performance, but at the same time be easy to use.
            - In NXT, we have one extra affordance: sometimes you want to put the data in the buffer right now, and not have to implement ring buffers, etc. Proposing: “put this data in this buffer right now”, too.
         - MM: there might be a solution where both are available in the API. Not sure we’ve done enough research to know. Would like to see something written about the UMA approach. Think it’s possible to come to a compromise.
         - CW: AI: write a proposal to cover UMA and minimizing the number of copies.
   - JG: question about the callback mechanism. Why not just use fences and signaling?
      - KN: NXT doesn’t have fences and signaling yet. :)
      - CW: also, with a callback, the buffer only becomes valid once the callback is called. Makes it easier in Chrome, where we may need to do fences, etc., and where memory may or may not become valid in a certain order. There are also complexities with multiple queues.
      - JG: callbacks add another weird lifetime-management issue. What thread do they run on, etc.
      - CW: it’s a Promise. The only reason it’s a callback is that there’s no Promise handling in our code generators yet. Part of the UI that we didn’t spend time on.
      - RC: how’s the callback implemented?
         - CW: implemented in NXT, on D3D12, by incrementing a fence periodically. Once the counter reaches a certain value (the GPU stream has executed past that point), the callback is fired.
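CW’s fence-serial mechanism can be sketched as follows. This is a toy illustration of the described behavior, not NXT’s actual code; all names are invented:

```javascript
// Toy sketch of the fence-serial mechanism CW describes: each submit gets a
// serial number; map callbacks are queued against the last submitted serial
// and fired once the "GPU" reports completion past that point.
class Queue {
  constructor() {
    this.lastSubmittedSerial = 0;
    this.pendingMaps = []; // { serial, callback }
  }
  submit() {
    return ++this.lastSubmittedSerial; // work that may write the buffer
  }
  mapReadAsync(callback) {
    // The mapping becomes valid once everything submitted so far has run.
    this.pendingMaps.push({ serial: this.lastSubmittedSerial, callback });
  }
  // Called when the fence value advances (polled periodically, or via an
  // Event on D3D12 as RC suggests below).
  onFenceValue(completedSerial) {
    const ready = this.pendingMaps.filter(p => p.serial <= completedSerial);
    this.pendingMaps = this.pendingMaps.filter(p => p.serial > completedSerial);
    for (const p of ready) p.callback(); // contents now safe to read
  }
}

const q = new Queue();
q.submit();                       // serial 1: GPU work writing the buffer
let mapped = false;
q.mapReadAsync(() => { mapped = true; });
q.onFenceValue(0);                // GPU not there yet: callback held back
q.onFenceValue(1);                // GPU passed serial 1: callback fires
```

The point of the serial gate is that the callback can never observe the buffer before the GPU work that produces its contents has completed.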
         - RC: interesting that we just had the discussion about having a library vs. making things easier, and here’s a way that NXT is making things easier.
         - RC: D3D12 has a way to set an Event when a GPU instruction is passed.
         - CW: might be able to use this.
   - DM: don’t understand why we’ve made a separate callback mechanism for each resource, and different usage transitions outside the command buffer.
      - JG: are callbacks a germane or an incidental part of the proposal?
      - CW: our proposal does have a statement “you can now read the pointer”. But the way it’s communicated isn’t important: set a flag, call a callback, etc.
      - DM’s question was about why there’s a transition at the device level / as an immediate command when the other transitions are in command buffers.
      - It’s just the map operation that can’t be in the command buffer.
      - DM: so you have a separate mechanism for mapping where you could just put a fence in the command buffer?
      - DM: in D3D12 and Vulkan, you’d wait on a fence that gates executing the GPU command buffers. When the wait is over, the user makes sure they’ve transitioned the resource to a mappable state. Then they map it, etc. Would need some mechanism through fences to call back from the GPU anyway. The machinery done for buffers seems to duplicate that mechanism.
      - CW: the part about transitioning to the MAP_READ state is because we are trying to prevent data races (a single-writer-multiple-reader pattern, like Rust borrowing). Need to flush the CPU cache, etc.
      - CW: when you submit a command buffer that has transitioned a bunch of buffers to MAP_READ, with this serial number on this queue -- when you map, we make sure the app can’t see memory before it’s available, for both security and portability.
      - DM: we are going to have a general mechanism for the CPU to be called back by the GPU. Call map(), etc.
      - CW: how will the browser enforce that the memory is up to date for the app?
      - CW: think we should enforce portability as long as it doesn’t have a performance impact. This is the same thing the app would do itself, but it enforces that the app can’t see e.g. uninitialized GPU memory.
      - MM: agree with CW: avoiding these data races is important.
      - CW: hope these documents help you agree that the performance is the same as if the application did it itself.
      - DM: have yet to read all the documents. May come back with follow-up questions by email.
   - MM: general philosophy: if the browser can do the same thing the web author would do, then having the browser do it is the best decision.
   - DM: concerned about the redundancy in the API.
   - KR: very complicated (in a multiprocess browser) to avoid synchronous stalls and provide a mapping primitive. Synchronous MapBuffer is impossible. The browser either incurs extra copies, or stalls to synchronize between the GPU, the GPU process, and the JS process.
   - JG: it’s easier in WebGL because there’s a monotonic queue and a before/after effect with fences. Less clear in Vulkan. Think there are ways to do this which preserve the Map-style API while still allowing remoting implementations to do this well, capably and transparently. It relies heavily on usage hints or attribute settings to indicate whether a buffer read will happen or not, holding on to a shmem you’re going to copy into, and holding on to staging buffers. There are issues about signaling the client, but think it can be done.

Agenda for next meeting

   - Next week: no meeting (US Thanksgiving).
   - In two weeks: shading languages.
      - AI: Apple to provide a document on Secure HLSL
      - AI: Google to draft a small SPIR-V execution environment for WebGPU; how it would be embedded, etc.
   - Next non-shader meeting:
      - AI: Google to make a proposal for UMA.

Received on Tuesday, 28 November 2017 20:24:09 UTC