Minutes for the 2017-11-01 meeting.

GPU Web 2017-11-01

Chair: Dean

Scribe: Dean, Ken

Location: Google Hangout

Minutes from last meeting:
<https://docs.google.com/document/d/1HToUA9xTVU8LASHe4PIMmg1n7txvFgX42w_Qt-7iBwI/>
TL;DR

   - Meeting with WebAssembly at TPAC.
      - Guests attending in person will have to pay a discounted fee; email
        w3t-tpregister@w3.org. There should be a WebEx to attend remotely.
      - Will make a short presentation on the status of WebGPU, then will
        start discussions on several subjects for the interaction of WASM
        and WebGPU.
      - On Wednesday will present a demo at the TPAC demo session.
   - Status updates:
      - Apple: working on SL, and open-sourced a prototype structure for
        WebGPU
        <https://trac.webkit.org/browser/webkit/trunk/Tools/WebGPUAPIStructure>
      - Google: working on TPAC demo, have a GLSL -> SPIR-V compiler in WASM.
   - Resource updates:
      - The API should prevent data races between CPU and GPU.
      - Suggestion to have a “host access pass” similar to compute / render
        passes.
         - Concern that the approach can’t reach the minimal number of
           copies.
         - Concern that this doesn’t map to D3D12 / Vulkan concepts (mostly
           resolved).
      - Problem is similar to WebGL’s getBufferSubDataAsync.
      - Discussion whether different memory heaps should be exposed to the
        application.
         - Suggestion to make all buffers writable by the CPU and hide
           staging buffers. Would eliminate the difference between discrete
           GPU and UMA.
         - Concern that this would prevent apps from having optimal
           performance.

Tentative agenda

   - Administrative stuff (if any)
   - Individual design and prototype status
   - Resource updates (data upload / download)
   - Agenda for next meeting

Attendance

   - Apple
      - Dean Jackson
      - Myles C. Maxfield
      - Theresa O'Connor
   - Google
      - Corentin Wallez
      - David Neto
      - John Kessenich
      - Kai Ninomiya
      - Ken Russell
   - Microsoft
      - Chas Boyd
      - Rafael Cintron
   - Mozilla
      - Dzmitry Malyshau
      - Jeff Gilbert
   - Yandex
      - Kirill Dmitrenko
   - Joshua Groves

Administrative items

   - TPAC (next week)
      - Meeting with WebASM: Tuesday, 7th of November, from 2 PM to 3:30 PM.
      - Vague topic: what we want out of WebASM, and what they want as an
        API they can import into their environment.
      - Requires registration with W3C, but can get a single-day discount.
         - The registration form was closed the same day the email was sent.
         - Email address: w3t-tpregister@w3.org
         - They know about the joint meeting and know about the special
           rate.
      - Apple, Google, Mozilla at least in attendance.
      - Hopefully there will be a dial-in; info will be sent out ASAP;
        assume WebEx.
      - Corentin will be giving a demo on the Wednesday, in an
        exhibition-style session, basically showing the content he gave at
        the Khronos meeting in January.
      - Apple is happy with Google’s demo being the one shown (i.e. Apple
        doesn’t need a spot).
      - Mozilla’s Patrick Walton might be able to show their demo.
   - No feedback yet on the suitability of the license, from either Apple’s
     or Mozilla’s legal teams.
      - Microsoft heard it would be ready “soon”.

Web Assembly Meeting

   - Corentin has emailed a list of topics:
      - Fast bindings for Web platform APIs (call overhead and GC-less
        render loop)
      - How to make the API interoperable between JS and WASM
      - Is the WASM host mapping buffers in the WASM module memory space a
        thing? (glMapBuffer equivalent)
      - Multithreading story for APIs
      - API extensibility in WASM (can't just add a dictionary entry like JS)
      - Any questions you might have!
   - DM: bitflags and use of enums

Individual design and prototype status

   - Apple
      - MM: working on shading language implementation. Hoping to have
        finished SPIR-V codegen; not quite done. Have the rewriter piece
        (takes a logical mode program and rewrites it to not have pointers
        or array references). Good progress overall.
      - MM: also checked in a demo API to WebKit, and an implementation on
        top of Vulkan!
        https://trac.webkit.org/browser/webkit/trunk/Tools/WebGPUAPIStructure
      - DJ: apologies for posting one link to GitHub which turned into a
        beast of an Issue. Going to split it into 4 issues regarding shading
        languages.
      - CW: can we have a document from Apple about the security features,
        etc. of Secure HLSL?
         - DJ: yes, Apple is working on a position paper.
   - Google
      - CW: progress on the TPAC demo, using the WebAssembly shader compiler
        in the webpage demo, so no binary numbers in JavaScript.
         - DJ: so this is a native app cross-compiled to WASM?
         - KN: yes. A compiler and library for compiling GLSL to SPIR-V.
         - DJ: cool, how big is the executable?
         - KN: about 500K for a functional binary.
         - CW: talked with David Neto. Think we can make it smaller.
           GLSLang is large for no reason.
         - DJ: shaderc is the name of the Khronos reference compiler?
         - CW: no. It’s a wrapper around GLSLang. Helpers for shader input,
           more.
         - KN: only exposed 4 entry points. Maybe exposing more will make
           it bigger. Aggressive dead code elimination.
         - DJ: think this is really good progress.
      - KN: shaderc builds without intervention!
         - It’s checked into Kai’s private GitHub shaderc fork right now.
           Planning to make it less hacky before upstreaming. [Kai to
           provide link] (A hypothetical usage sketch follows.)
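   Below is a minimal sketch of how such a WASM shader compiler might be
   invoked from a page. Everything in it is hypothetical: the file name, the
   exported entry points, and their signatures are invented for
   illustration, since the fork mentioned above was not linked in the
   minutes.

      // Instantiate the hypothetical GLSL -> SPIR-V compiler module.
      const bytes = await (await fetch('glslang.wasm')).arrayBuffer();
      const { instance } = await WebAssembly.instantiate(bytes);

      // Assume a small C-style surface (the minutes mention only 4 exported
      // entry points) plus the module's linear memory. All names invented.
      const { compileGlslToSpirv, malloc, free, memory } =
          instance.exports as any;

      function compile(source: string, stage: number): Uint32Array {
        // Copy the NUL-terminated source into WASM linear memory.
        const src = new TextEncoder().encode(source + '\0');
        const srcPtr = malloc(src.length);
        new Uint8Array(memory.buffer).set(src, srcPtr);

        // Hypothetical result layout: u32 word count, then the SPIR-V words.
        const outPtr = compileGlslToSpirv(srcPtr, stage);
        const count = new Uint32Array(memory.buffer, outPtr, 1)[0];
        const spirv =
            new Uint32Array(memory.buffer, outPtr + 4, count).slice();

        free(srcPtr);
        free(outPtr);
        return spirv;
      }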
   - Microsoft
      - RC: responding to Dean’s mega-issue. No progress on code.
      - CB: looking at designs for how to implement some of Apple’s
        proposals in HLSL (in the DXIL toolchain).
   - Mozilla
      - DM: lots of internal work. Pipeline states, synchronization
        guarantees. Nothing public-facing.
   - Others?

Resource updates

   - DJ: how to get a buffer from the app into the GPU. Talked with MM a bit
     earlier.
   - CW: sent the design of mapping buffers, etc. in NXT to the mailing list
     earlier today. Kai can cover it.
   - MM: unfortunately didn’t see CW’s email before now.
   - CW (offline): that’s OK, Kai can cover it.
   - MM: how to schedule updates so they’re not modified by the CPU while
     the GPU’s using them.
      - One solution: any buffer updates have the same pass-type structure
        as a render or compute pass. (A hypothetical sketch follows this
        list.)
      - 2 properties:
      - 1. The host program isn’t able to read/write to the [?/] point they
        choose. They go in at a particular point in the pipeline. Can only
        happen when the GPU’s not using the resource.
      - 2. Passes already have sync points before and after them. These are
        natural points where CPU/GPU sync occurs. The sync is already
        handled from an API point of view. (Explicit vs. not is not being
        discussed right now.)
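   A minimal TypeScript sketch of what the “host access pass” idea could
   look like. No API surface was agreed on in the meeting; every name below
   is hypothetical.

      // CPU reads/writes are recorded like render/compute passes, so the
      // sync points before and after the pass (property 2) keep the CPU
      // and GPU from racing on the resource (property 1).
      type BufferHandle = object; // stand-in for a real buffer type

      interface HostAccessPass {
        // Legal only inside the pass; the write takes effect at this point
        // in the queue, when the GPU is guaranteed not to use the buffer.
        writeBuffer(buffer: BufferHandle, offset: number,
                    data: ArrayBuffer): void;
      }

      interface CommandRecorder {
        beginRenderPass(/* ... */): unknown;
        beginComputePass(/* ... */): unknown;
        beginHostAccessPass(): HostAccessPass; // same pass-type structure
        endPass(): void;
      }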
   - RC: how many copies of the ArrayBuffer will be made when they call the
     update function?
   - MM: envisioning two things, uploads and downloads.
      - Uploads: make an initial copy, or hand off.
      - KR: you can transfer ArrayBuffers, or you could use
        SharedArrayBuffers.
      - MM: the API has access to the resource. Map, copy, unmap.
      - Downloads: the caller would provide a handle to the resource, an
        offset, and a range. Return a Promise, and the Promise would give
        access to an ArrayBuffer. (Sketched below.)
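   A sketch of the download shape MM describes, with hypothetical names: the
   caller passes a resource handle, offset, and range, and gets a Promise
   that resolves to an ArrayBuffer once the copy has completed at that point
   in the queue.

      type BufferHandle = object; // stand-in for a real buffer type

      interface Device {
        downloadBuffer(buffer: BufferHandle, offset: number,
                       size: number): Promise<ArrayBuffer>;
      }

      // The resolved bytes reflect all GPU work enqueued before the call,
      // and none of the work enqueued after it.
      async function readback(device: Device, buf: BufferHandle) {
        const bytes = await device.downloadBuffer(buf, 0, 256);
        return new Uint8Array(bytes);
      }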
   - JG: we are coincidentally discussing this exact topic in the WebGL WG.
     Maybe some cross-pollination is useful. Not resolved in that WG yet,
     but have some ideas.
      - With downloads from the GPU: want to enqueue in the command stream
        a copy into system memory. Need some way to wait on this or poll it
        to see when it’s done.
      - Minimum number of copies. Downloading into shmem.
      - A couple of possibilities for sync: using FenceSync, or using a
        Query to which the readback is attached.
   - MM: no strong opinion on whether it should be callback- or
     polling-based. Key issue is that at the point where you tell the API
     you want a download to occur, that request is associated with a
     particular point in the queue. Operations before that in the queue will
     be visible; operations after it will not be.
      - JG: harder in Vulkan concepts. Easy in OpenGL-style concepts where
        the command stream is more serial.
      - In Vulkan, there’s a looser guarantee of when things happen.
      - MM: maybe the earlier statement was too strong. At the point you
        enqueue an upload/download request, it has the same sync properties
        we’ve been discussing. If it’s encapsulated into a pass, then it
        all works together.
      - JG: thinking of staging buffers from D3D9. Always using copy
        mechanisms, even GPU-to-GPU. Resource type “CPU Resource”, and
        enqueue a copy from the GPU to the CPU resource. End up with local
        memory instead of GPU memory.
      - MM: probably something we can agree to. Whether the source is an
        ArrayBuffer or some API object that was created from one doesn’t
        really matter to Apple.
      - RC: think this is similar to what Corentin proposed. He didn’t call
        it “staging” but it’s a buffer you fill, and there are 3 states.
        Once it’s in the command stream, you can’t touch it. There was an
        alternate proposal supporting different ranges in the buffer. (A
        sketch of such a three-state buffer follows this list.)
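   A sketch of the three-state buffer RC describes, under the assumption
   (from the minutes, not any spec) that the states are roughly available /
   mapped / in the command stream. All names are hypothetical.

      type State = 'available' | 'mapped' | 'in-command-stream';

      class StagingBuffer {
        private state: State = 'available';
        private readonly data: Uint8Array;

        constructor(size: number) { this.data = new Uint8Array(size); }

        // The CPU may fill the buffer only while it is mapped.
        map(): Uint8Array {
          if (this.state !== 'available') throw new Error('buffer in use');
          this.state = 'mapped';
          return this.data;
        }

        // Hand the contents to the command stream; no CPU access until the
        // queue signals completion (fence/callback mechanism not shown).
        submit(): void { this.state = 'in-command-stream'; }
        onGpuComplete(): void { this.state = 'available'; }
      }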
   - MM: sounds like we’re fairly in agreement; some details to be worked
     out.
   - MM: next topic: in Vulkan (and D3D?), there is a concept of regions of
     memory (heaps?). A resource created in one region might be CPU-visible
     but slow. Maybe another region is not CPU-visible, but fast.
      - A texture might be in tiled form or linear. Maybe in tiled form you
        can’t read/write it from the CPU, but some GPUs might not be able
        to sample from a linear texture.
      - It is common in both APIs for a data upload/download to be done in
        2 phases. First upload into a resource that’s CPU-accessible / slow
        / linear. Phase 2 is to issue a GPU command to copy from the slow
        resource to the fast resource.
   - DM: in Vulkan it’s device memory. Some implementations may introduce
     memory types that are fast and visible to the CPU.
   - MM: question is: should these sorts of staging resources be explicitly
     described by the API?
      - Thinking is: if all uploads and downloads will take this “hop”
        approach, and it’s possible for the web author to do this wrong,
        and possible for an impl to [??] these buffer locations, then the
        API should not have this distinction of buffer type.
      - Instead, the API should have a buffer which has 2 or 3 buffers
        behind it.
      - Might be a couple of buffer copies behind the scenes.
   - RC: so you don’t want to have staging resources? Or handle it all for
     you?
   - MM: the latter. The web developer doesn’t say anything about staging
     buffers. All buffers are readable/writable from the CPU, and
     sample-able from a pixel shader. But that might not be true underneath.
     The API is responsible for marshalling data between platform textures.
   - KN: so talking about hiding the differences between unified memory
     architectures and discrete memory architectures?
   - MM: yes. On UMA there is a texture type that supports fast CPU access.
     But on a discrete GPU the marshalling will happen. Upside is that the
     web author’s code doesn’t care about the distinction. Code will
     execute optimally. (A sketch of how an implementation might hide this
     follows.)
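   A sketch of how an implementation might hide the staging hop, per the
   suggestion above. This is purely illustrative of the idea, not a real
   backend design; every name is invented.

      class ImplBuffer {
        constructor(private readonly uma: boolean,
                    private readonly deviceMemory: Uint8Array,
                    private readonly staging?: Uint8Array) {}

        // The web author sees a single writable buffer either way.
        cpuWrite(offset: number, bytes: Uint8Array): void {
          if (this.uma) {
            // Unified memory: the CPU can write device memory directly.
            this.deviceMemory.set(bytes, offset);
          } else {
            // Discrete GPU: write a hidden staging allocation, then
            // enqueue a GPU copy into the device-local allocation.
            this.staging!.set(bytes, offset);
            this.enqueueStagingCopy(offset, bytes.length);
          }
        }

        private enqueueStagingCopy(offset: number, size: number): void {
          // Record a staging -> deviceMemory copy in the queue (omitted).
        }
      }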
   - DM: are you sure it’s optimal? If you’re reading something back to the
     CPU every frame, and know you have UMA, you can schedule the readback
     earlier. If we do it automatically, we can’t do it as efficiently.
      - Example: queue up something and read it back to the CPU each frame.
        Fill on the GPU. Enqueue a transfer from the GPU to CPU-local
        memory. Later, map the buffer and read it. MM is suggesting
        combining the last two phases, so there’s no opportunity to
        schedule these things more optimally.
   - MM: if this happens every frame the browser can figure it out. Or, the
     buffer structure can take a hint parameter about its usage. (“Try to
     keep close to the CPU”)
   - KN: sounds good for uploads, but don’t see it working well for
     downloads.
   - JG: have some concerns about this.
   - DJ: next step: Myles writes an issue for this.

Agenda for next meeting

   - Let’s skip next week because of TPAC.
   - Let’s continue the buffer upload/mapping discussion in the meeting
     after that. Maybe discuss shading languages after that.
