- From: Corentin Wallez <cwallez@google.com>
- Date: Tue, 28 Nov 2017 15:23:22 -0500
- To: public-gpu <public-gpu@w3.org>
- Message-ID: <CAGdfWNMm9BK16fi06eFY=qd+Q61-RTxaCtxeB33h6z-FP0t1ig@mail.gmail.com>
GPU Web 2017-11-15

Chair: Dean
Scribe: Ken, with help from Kai
Location: Google Hangout

Minutes from last meeting: <https://docs.google.com/document/d/1oI7_SbWYs2bO0D4NqjOd8L7wAqNAswZa2nDfE1xmy9M>

TL;DR
- Recap of the WASM CG meeting.
- Status updates:
  - Apple: has codegen from Secure HLSL to SPIR-V.
  - Google: wrote documents about NXT memory barriers and buffer mapping.
- Resource upload/mapping:
  - The goal of NXT's design is to minimize copies. Explanation of the transition concept.
  - Having resources be mappable is the best way to minimize copies on UMA.
  - Discussion of having all resources be stageable vs. requiring a creation flag. Became a discussion of hints vs. explicit parameters.
  - NXT's design uses a callback to signal when the mapped buffer is ready; discussion of whether this is necessary to avoid CPU-GPU races.
- No meeting next week.

Tentative agenda
- Administrative stuff (if any)
- Individual design and prototype status
- Resource upload / mapping
- Agenda for next meeting

Attendance
- Apple
  - Dean Jackson
  - Myles C. Maxfield
- Google
  - John Kessenich
  - Kai Ninomiya
  - Ken Russell
- Microsoft
  - Chas Boyd
  - Rafael Cintron
- Mozilla
  - Dzmitry Malyshau
  - Jeff Gilbert
- Yandex
  - Kirill Dmitrenko
- Elviss Strazdiņš
- Markus Siglreithmaier

Administrative items
- Later: brief recap of the TPAC meeting.
- License: Dean: still no update from the legal folks.
- Microsoft: no updates.
- Next week is Thanksgiving in the USA; about half the group will be on vacation.
  - Will skip next week's meeting.
- Myles: we've also been on a particular cadence regarding shading languages.
  - The topic for the next discussion (in two weeks) will be shading languages.
- Myles: brief recap of the TPAC meeting. Four broad topics were covered:
  - How an object model would look in WebAssembly:
    - WebAssembly today has four types: two integer and two floating-point. There is no native support for something like a "WebGPUDevice".
    - How would opaque types be handled? They would be identified by number to the application. When WASM calls a WebGPU function, it passes the number, and the bindings look it up in a table and pass the object to an implementation function.
    - Kai: the tables concept exists already. Right now there's exactly one table; the idea is to add one table per type: WebGPUTexture, WebGPUBuffer, etc.
  - How to do threading:
    - In general: want the option to have threading in the WebGPU API.
    - Kai: they're close to having (or already have) SharedArrayBuffer shared among threads, including atomics.
    - Kai: the WASM group said they haven't thought much about sharing opaque objects between threads. SharedArrayBuffer is the only thing that can be shared concurrently between web workers; Transferables allow objects to be handed between web workers.
      - We might want to add more "shareable" objects.
  - Property bags / extensions / expandability:
    - Have structs / fields / know where these lie.
    - In Chicago, agreed there should be a way to do extensions.
    - Probably don't want a model where a struct is as big as every possible extension; probably want a more abstract "property bag" or "dictionary" API where the WASM author specifies the few extensions they want.
    - WASM doesn't have any affordances for this right now, but the group would be happy to investigate.
    - Ken: questions about how many primitives really need to be added, or whether much of this can be done in the bindings.
    - MM: talked about this. Discussed two models: one adds a new primitive; the other wraps the property bags inside opaque objects, with calls to set/get things.
    - Kai: like a C interface to a JavaScript API. Emscripten already does this for WebGL. The main issue is performance.
  - Buffer mapping:
    - It would be great if a WASM program could load/store GPU memory using WASM load/store ops, not function calls. Talked about it, have some ideas, but nothing exists yet.
    - Two models. First: a special ArrayBuffer, where Map yields an ArrayBuffer and loads/stores are function calls. Second: WebAssembly could mmap() and replace a region of the WASM heap with the mmap'ed GPU memory.
- DM: questions about property bags:
  - Vulkan has a standard struct with a pointer at the end to chain extensions together. If we make the property bag a field at the end of the standard struct, while the basics are regular fields, we can defer the property-bag solution until after the MVP. Could consider this one of the choices.
  - In Mozilla's WebGPU prototype, all objects have an ID; they're a bag on the GPU-process side and can't be represented on the client side. Would be nice to avoid double indirection through IDs, though it's an implementation detail from the user's point of view.
- MM: correct that this is an implementation detail. We shouldn't impose restrictions on browsers that don't use that design. There could be security issues there, for example with one GPU process where all textures are represented in a single array, because you have to ensure a page can't see other pages' textures.

Individual design and prototype status
- Apple
  - MM: now have a SPIR-V codegen phase for the JavaScript compiler (from Secure HLSL)! Not fully featured, but it has compiled at least one shader.
  - This SPIR-V works in Apple's example API structure for WebGPU that's checked in to the WebKit repo, implemented on top of Vulkan!
  - Now we can use the best shading language in the world with the best API structure in the world. :) :P
  - At some point will implement a Metal backend for this library.
- Google
  - CW: wrote two documents with updated views on memory barriers and buffer mapping, hopefully for use in today's discussion.
  - Uploaded a bunch of changes so that index formats live in the pipeline instead of the command buffer.
- Microsoft
  - RC: no status updates. Digesting emails and documents.
- Mozilla
  - DM: nothing substantial.

Resource upload / mapping
- Documents from Google:
  - Memory barriers: https://docs.google.com/document/d/1k7lPmxP7M7MMQR4g210lNC5TPwmXCMLgKOQWNiuJxzA
  - Buffer mapping: https://docs.google.com/document/d/1HFzMMvDGHFtTgjNT0j-0SQ1fNU9R7woZ4JuNJdAXBjg
- CW: in NXT, the idea for data uploads was to have one path which minimizes copies. To do this you need some sort of buffer-mapping concept. Want to eliminate copies from the browser side into the WebGPU-visible side.
  - So, two questions: how do we make mmapping safe, and how do we avoid memory races?
  - In NXT, each resource can be in a combination of read-only states, *or* in a single writable state. This restriction *eliminates* data races.
  - Mapping a buffer only works when it's in the correct state, and while mapped it can't be used for anything else: no reading vertices from it, etc. This makes NXT data-race free in this regard, and it eliminates copies behind the scenes. Also, the GPU won't read from the buffer until you're done writing it.
- MM: questions:
  - 1. The documents sounded like a resource could be in a state where it's readable by the CPU and usable by the GPU?
    - KN: when you create a resource you specify the allowed usages. Then, later, you specify the current usage.
    - If a buffer has the allowed usage MapWrite, the only other valid allowed usage is TransferSrc. Then the current usage of the buffer, at any given time, can only be either MapWrite or TransferSrc.
  - CW: probably want buffers to be mappable and be more than just transfer sources or destinations. Gets tricky because there are some other interactions.
  - MM: that was my thought. If the only way to use the feature of mapping data on the CPU is to "prepare" a future copy, i.e. a command enqueued in the command stream, you haven't reduced the copies.
- CW: there are always two copies. First, when you write the data on the CPU: uncompressing an image, etc.
  Then, on a discrete GPU, a copy from CPU-visible memory to GPU-local memory for best performance.
  - (When you map the buffer you write as much as possible directly into GPU-visible memory: decompressing data into the GPU-visible buffer, then copying from CPU-and-GPU-visible memory to GPU-local memory.)
  - This is the minimal number of copies.
  - But the current design isn't great for UMA yet. Need to figure out how to make that work better.
- MM: Apple's primary platform is UMA.
- CW: for UMA, one option: have an extension saying "you're on UMA"? The app would then need to do the right thing.
  - Second option: two types of mapping. (1) Transfer buffers only; the app has to do a copy. (2) "Staged": let me map this buffer even though it's GPU-local (incurs an extra staging buffer and copy on discrete GPUs, but not on UMA).
  - Think it's similar to Apple's approach where all three usages are always allowed. Makes things simpler on the WebGPU implementation side because you know which things have staging copies.
- MM: why not allow all resources to have staging copies?
- CW: it's less explicit. Apps don't want all resources to be staged; they want control over this. If we make all resources staged, that gets blurry.
- JG: OpenGL and D3D12 have types of buffers based on heap type and/or a usage hint given at BufferData time. The hint says how the resource is expected to behave, but you can still do whatever you want: you can map upload heaps, it'll just be slow. It lets the implementation optimize for uploads and minimize the number of copies.
- MM: think we should have hints about where things should live.
- CW: hints aren't great. The app should say "I'm going to use this thing this way". With a hint you need to support all paths in the implementation anyway; if the app instead promises to use a resource only in a particular way, the implementation can optimize much more.
- CW: consider glBufferData(..., GL_STATIC_DRAW). The driver assumes the buffer lives in GPU-local memory, but you can really do whatever you want. Many OpenGL drivers have heuristics and use the hint only to prime the heuristic. The implementation still needs usage counters, optimizations, etc. A hint that doesn't enforce anything makes things worse: the implementation still has to handle all possible cases.
- CW: in ANGLE we keep up to four copies of a buffer, because it could be used in four different ways. Since it's just a "hint", there's no way to eliminate them.
- KR: a huge goal of the new explicit APIs is to remove the possibility of implicit shadow copies and give memory control to the application.
- JG: appreciate how it can make things easier if you can reliably say "I just need a copy of this", but at the same time the primitive we need is "I have a resource, and it's either CPU-mappable or not". Based on CPU-mappability, use a staging buffer or not. A sophisticated implementation might decide more behind the scenes. There's also the possibility of polyfills: copy to a staging buffer and back, and hand out an mmap'ed copy later. But let's focus on the primitives the hardware needs to give us.
- MM: reducing the API to the lowest-level primitives isn't always the best way to make an API. There are many ways to judge an API.
- JG: the main idea is that you can do whatever you want if you have these primitives; if you only expose a higher-level primitive, you force that decision on users of the API.
- MM: there's a difference between letting someone write a library and forcing them to do so.
- CW: the hope was that the buffer-mapping proposal would seem simple enough while also providing low-level primitives, satisfying both points of view: best performance, but easy to use.
  - In NXT we have one extra affordance: sometimes you want to put data in a buffer right now, without having to implement ring buffers, etc. So we also propose a "put this data in this buffer right now" operation.
- MM: there might be a solution where both are available in the API; not sure we've done enough research to know. Would like to see something written about the UMA approach. Think it's possible to come to a compromise.
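The allowed-usage / current-usage rule discussed above can be sketched as a small validation routine. This is a hypothetical illustration, not NXT's actual API: the flag names and the helper are made up, and only the rule itself (the current usage must be a subset of the allowed usages, and a writable usage is exclusive) comes from the minutes.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative usage flags, loosely modeled on the ones named in the minutes.
enum Usage : uint32_t {
    kNone        = 0,
    kMapWrite    = 1 << 0,  // CPU-writable while mapped
    kMapRead     = 1 << 1,
    kTransferSrc = 1 << 2,
    kTransferDst = 1 << 3,
    kVertex      = 1 << 4,
};

// Usages that write to the resource; a resource in a writable usage must not
// be readable (or writable) through any other usage at the same time.
constexpr uint32_t kWritableUsages = kMapWrite | kTransferDst;

// A current usage is valid if it stays within the allowed set declared at
// creation, and any writable usage stands alone (single-writer XOR
// multiple-readers), which is what rules out CPU/GPU data races.
bool IsValidUsage(uint32_t allowed, uint32_t current) {
    if ((current & ~allowed) != 0) return false;  // outside the allowed set
    if ((current & kWritableUsages) != 0) {
        // Writable: exactly one bit may be set.
        return (current & (current - 1)) == 0;
    }
    return true;  // any combination of read-only usages is fine
}
```

Under this rule a buffer created with MapWrite | TransferSrc can at any moment be either mapped for CPU writes or used as a GPU copy source, but never both at once, which is the race-freedom property CW describes.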
- CW: AI: write a proposal covering UMA and minimizing the number of copies.
- JG: question about the callback mechanism: why not just use fences and signaling?
- KN: NXT doesn't have fences and signaling yet. :)
- CW: also, with a callback, the buffer only becomes valid once the callback fires. That makes it easier in Chrome, where we may need to use fences, etc., and where memory may or may not become valid in a certain order. There are also complexities with multiple queues.
- JG: callbacks add another weird lifetime-management issue: what thread do they run on, etc.?
- CW: conceptually it's a Promise. The only reason it's a callback is that there's no Promise handling in our code generators yet; it's a part of the UI that we didn't spend time on.
- RC: how is the callback implemented?
- CW: on D3D12, NXT implements it by incrementing a fence periodically. Once the counter reaches a certain value (meaning the GPU stream has executed past that point), the callback is fired.
- RC: interesting that we just discussed providing a library vs. making things easier, and here's a place where NXT makes things easier.
- RC: D3D12 has a way to set an Event when a GPU instruction is passed.
- CW: might be able to use that.
- DM: don't understand why we've made a separate callback mechanism for each resource, and different usage transitions outside the command buffer.
- JG: are callbacks a germane or an incidental part of the proposal?
- CW: our proposal does have a statement that "you can now read the pointer", but how it's communicated isn't important: set a flag, call a callback, etc.
  - DM's question was about why there's a transition at the device level (an immediate command) when other transitions are in the command buffer.
  - It's just the map operation that can't be in the command buffer.
- DM: so you have a separate mechanism for mapping, where you could just put a fence in the command buffer?
- DM: in D3D12 and Vulkan, you'd wait on a fence that gates execution of the GPU command buffers. When the wait is over, the user makes sure they've transitioned the resource to a mappable state; then they map it, etc. You'd need some mechanism through fences to call back from the GPU anyway, so the machinery for buffers seems to duplicate that mechanism.
- CW: the transition to the MAP_READ state exists because we're trying to prevent data races (a single-writer-multiple-readers pattern, like Rust borrowing). Also need to flush CPU caches, etc.
- CW: when you submit a command buffer that transitions a bunch of buffers to MAP_READ, with a given serial number on a given queue: when you map, we make sure the app can't see the memory before it's available, for both security and portability.
- DM: we are going to have a general mechanism for the CPU to be called back by the GPU anyway. Call map(), etc.
- CW: how will the browser enforce that the memory is up to date for the app?
- CW: think we should enforce portability as long as it doesn't have a performance impact. This is the same thing the app would do itself, but it also ensures the app can't see, e.g., uninitialized GPU memory.
- MM: agree with CW: avoiding these data races is important.
- CW: hope these documents help show that the performance is the same as if the application did it itself.
- DM: have yet to read all the documents; may come with follow-up questions by email.
- MM: the general philosophy is that if the browser can do the same thing the web author would do, having the browser do it is the best decision.
- DM: concerned about the redundancy in the API.
- KR: it's very complicated (in a multiprocess browser) to avoid synchronous stalls and still provide a mapping primitive. A synchronous MapBuffer is impossible: the browser either incurs extra copies, or stalls to synchronize between the GPU, the GPU process, and the JS process.
- JG: it's easier in WebGL because there's a monotonic queue and a before/after relationship with fences. Less clear in Vulkan.
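The fence-serial mechanism CW describes for the D3D12 backend could be sketched roughly as below. The names and structure are hypothetical, not NXT's real implementation; the idea from the minutes is just that map requests are queued against a submission serial, and their callbacks fire once the GPU's fence value passes that serial.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>

// Tracks outstanding map requests against queue-submission serials.
// RequestMap must be called in nondecreasing serial order, which holds
// naturally because submissions are serialized on the queue.
class MapRequestTracker {
public:
    // Record that a buffer was transitioned for mapping in the submission
    // with this serial; onReady fires once that submission completes.
    void RequestMap(uint64_t serial, std::function<void()> onReady) {
        pending_.push({serial, std::move(onReady)});
    }

    // Called when the fence value advances (polled, or via an event):
    // fire every callback whose submission the GPU has executed past.
    void OnFenceCompleted(uint64_t completedSerial) {
        while (!pending_.empty() && pending_.front().serial <= completedSerial) {
            pending_.front().onReady();  // memory is now safe to expose
            pending_.pop();
        }
    }

private:
    struct Request {
        uint64_t serial;
        std::function<void()> onReady;
    };
    std::queue<Request> pending_;
};
```

On D3D12, OnFenceCompleted could be driven by polling ID3D12Fence::GetCompletedValue, or by the event-based ID3D12Fence::SetEventOnCompletion mechanism RC mentions.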
- JG: think there are ways to do this that preserve the Map-style API while still allowing remoting implementations to do it well, capably and transparently. It relies heavily on usage hints or attribute settings to indicate whether a buffer read will happen: holding on to a shmem you're going to copy into, holding on to staging buffers. There are issues around signaling the client, but think it can be done.

Agenda for next meeting
- Next week: no meeting (US Thanksgiving).
- In two weeks: shading languages.
  - AI: Apple to provide a document on Secure HLSL.
  - AI: Google to draft a small SPIR-V execution environment for WebGPU: how it would be embedded, etc.
- Next non-shader meeting:
  - AI: Google to make a proposal for UMA.
Received on Tuesday, 28 November 2017 20:24:09 UTC