- From: Corentin Wallez <cwallez@google.com>
- Date: Mon, 11 Dec 2017 18:20:43 -0500
- To: public-gpu <public-gpu@w3.org>
- Message-ID: <CAGdfWNPHJxHV94sPQhT5kcBU8dGHM-oRjPf8M3E=s3R=xaTEBA@mail.gmail.com>
GPU Web 2017-12-06

Chair: Corentin
Scribe: Ken
Location: Google Hangout

Minutes from last meeting <https://docs.google.com/document/d/12JQDw8gNE2MslMWRcK5sjZ3CsafhzIRZ6RZDyX-8MZk>

TL;DR

- Status updates
  - Apple: posted a Secure HLSL explainer <https://github.com/gpuweb/gpuweb/wiki/Secure-HLSL-Explainer>
  - Google: posted the UMA buffer mapping doc discussed in this meeting
  - Mozilla: specialization constants in gfx-rs
- Buffer Mapping and UMA doc <https://docs.google.com/document/d/1HFzMMvDGHFtTgjNT0j-0SQ1fNU9R7woZ4JuNJdAXBjg/edit#heading=h.1qqspik2b22p>
  - Google is proposing alternative 2, a UMA extension.
  - Discussion of alternatives to callbacks for synchronization: promises, or synchronous non-blocking queries.
  - The WEBGL_get_buffer_sub_data_async discussions don't carry over to WebGPU because the structure of the APIs is different.
  - Should ask ISVs for input to learn what's most useful to them.
  - Persistently mapped buffers (PMBs) are important to applications. The proposal doesn't expose them directly.
  - Concern that PMBs wouldn't be device independent.
  - One more discussion of device independence vs. best performance. Suggestion that we have opt-in device-dependent behavior.
  - Do we want to be able to map-write half a buffer while the other half is in use?

Tentative agenda

- Buffer mapping and UMA
- Agenda for next meeting

Attendance

- Apple
  - Dean Jackson
  - Myles C. Maxfield
  - Theresa O'Connor
- Google
  - Corentin Wallez
  - Victor Miura
  - Kai Ninomiya
  - Ken Russell
- Microsoft
  - Rafael Cintron
- Mozilla
  - Dzmitry Malyshau
  - Jeff Gilbert
  - Markus Siglreithmaier
- Yandex
  - Kirill Dmitrenko
- Elviss Strazdiņš

Buffer mapping and UMA

Google's buffer mapping doc with UMA alternative <https://docs.google.com/document/d/1HFzMMvDGHFtTgjNT0j-0SQ1fNU9R7woZ4JuNJdAXBjg/edit#heading=h.1qqspik2b22p>

- CW: For UMA vs. discrete GPUs:
  - They pull in opposite directions: what is efficient on UMA is different from what is efficient on discrete GPUs.
  - Want one code path that is efficient on both.
  - Most efficient: either have two code paths, or have all resources be mappable.
  - Creation flag for whether a resource is mappable or not.
  - Either a UMA extension that allows mapping resources, or implicit staging on discrete GPUs.
  - There was a third alternative, "I want to be able to map this and maybe it'll be staged", but it was strictly inferior.
- MM: Upon reading the 3 options, was wondering: it seemed all of the affordances for avoiding CPU-GPU data races had been removed from the UMA code. Is that true?
- CW: No, it's the map-async mechanism that was presented at the start of the document.
  - It's a question of what's allowed on what type of resource.
  - Alternative 3: all resources can do anything. You can say "MapWriteAsync" and get a pointer. On UMA, this writes directly to the resource, and the callback mechanism ensures there are no data races. On discrete, it gives you back a staging buffer instead; when you unmap, it does a copy.
  - Is it clear this still prevents data races?
- JG: Why are we using async?
- MM: Because the GPU might be using the resource at the time the CPU wants it.
- CW: In NXT we want to eliminate data races.
- JG: You usually have read/write exclusion. Don't need async for that.
- CW: Yes, we could tell the application that it has to transition the resource to MAP_READ usage and wait for it to no longer be in use, then map synchronously. The call would fail and return if it can't guarantee safe simultaneous use.
  - Then the app wouldn't have to guess when WebGPU thinks it's safe to write the resource. That's why it's async.
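[Editorial note: to make the mechanism under discussion concrete, here is a minimal application-side sketch of a callback-based map-write, assuming hypothetical names (GpuBuffer, mapWriteAsync, unmap) that only illustrate the discussion above and are not an agreed API.]

```typescript
// Minimal sketch (illustrative names only) of the callback-based map-write
// mechanism discussed above, not an agreed WebGPU API.
interface GpuBuffer {
  // Invokes the callback only once the GPU is no longer using the buffer,
  // so the CPU can never write while the GPU reads: no data race.
  mapWriteAsync(callback: (data: ArrayBuffer) => void): void;
  // On UMA there is nothing more to do; on a discrete GPU the implementation
  // would copy the staging memory into the real resource here.
  unmap(): void;
}

function uploadVertices(buffer: GpuBuffer, vertices: Float32Array): void {
  buffer.mapWriteAsync((data) => {
    new Float32Array(data).set(vertices); // write into the mapped memory
    buffer.unmap();                       // make the data visible to the GPU
  });
}
```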
- JG: Callbacks are traditionally tricky to put into an API because you don't know what's on the call stack when the callback is invoked.
- MM: We get to design this.
- JG: It's simpler if you don't have these things.
- CW: What about a Promise?
- JG: Same thing, more steps. Seems unnecessary. If you're writing data out to the GPU, you WRITE_DISCARD something, write into it, and send it off to be committed into the GPU resources. In particular, why make writes async?
- TOC: It's unusual on the web for things not to be async.
- JG: It's unusual to do low-level graphics APIs on the web.
- CW: There's also MapWriteSync: pipeline the write if necessary, otherwise schedule the write now.
  - To be race-free, you only need the writeback/staging path when the resource is in use by the GPU. How does the app know the resource isn't in use by the GPU any more?
  - It can have a polling loop, but the callback tells you when it's ready. Understood that callbacks aren't preferred.
- JG: Think they're not necessary. In the investigation for WebGL's GetBufferSubDataAsync, didn't find any advantage to doing writes async. If you want to use the same shmem and a deep enough pipeline that you want to rewrite into the shmem before it's committed…
- MM: This is one of the general discussions we've had between ease of implementation vs. ease of use. From the web's standpoint it's better to not have to ask every frame "is this resource ready?".
- KD: If you have async then you're typically forced to wait for the callback, put the operation into a queue, and schedule it. If you're in the middle of an animation you don't want to do expensive writes. There will be some scheduling around it anyway; most applications carefully schedule expensive operations like writes.
- VM: Wouldn't an application have to use fences/queries to ensure a buffer is writable?
- JG: For reading yes, but for writing no. Say you update skinning data by splatting the data into a buffer. Next frame, you send a new update of the skinning data to the GPU. If it's not ready to be written you would stall.
- VM: You would use multi-buffering to ensure you don't stall, and use some sort of synchronization.
- RC: Think Kirill is on the right track. In a typical web app you render in response to requestAnimationFrame. When rAF comes, you ask "which buffers can I use?" and check your own data buffers to figure out which can be used. Would use multiple buffers and put them on some list when their callbacks are called. Not much different from exposing to the user "this buffer is ready / not ready".
- MM: Agree with Rafael.
- CW: Agree with everything, except that the app could do the same with fences, etc. The question is: if the app doesn't put in fences at all, that puts a lot of pressure on the implementation to order things properly, guaranteeing e.g. "all buffers before this fence are now ready".
- RC: Just saying that a callback mechanism is the same amount of effort as, or worse than, the app managing its own fences. Alternatively, we arrange our own fences so that we have enough bookkeeping to answer the user's question. In general, against the use of callbacks: do enough bookkeeping to tell users the current state of things.
- MM: That bookkeeping you describe is exactly the infrastructure in Promises.
- JG: Would have to track which fences have passed, …, and whether the buffer is in use / ready to write to. That's orthogonal to promises/callbacks/etc.: track which fences have passed, be able to tell if this buffer is ready, and then you can trigger a callback, set a boolean, or resolve a promise.
- MM: This sync problem needs to happen somewhere. The browser can do it just as well as the app.
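[Editorial note: as a rough illustration of the requestAnimationFrame bookkeeping Rafael and Kirill describe, here is a sketch of a multi-buffered upload loop in which a buffer only becomes writable again once its map callback has fired. GpuBuffer, mapWriteAsync, and unmap are assumed, illustrative names.]

```typescript
// Rough sketch of an rAF-driven multi-buffering pattern: a buffer is only
// considered writable this frame if its mapWriteAsync callback has fired.
interface GpuBuffer {
  mapWriteAsync(callback: (data: ArrayBuffer) => void): void;
  unmap(): void;
}

const ready: { buffer: GpuBuffer; data: ArrayBuffer }[] = [];

// Ask to map the buffer again; when the callback fires (the GPU is done
// with it) the buffer joins the writable list.
function recycle(buffer: GpuBuffer): void {
  buffer.mapWriteAsync((data) => ready.push({ buffer, data }));
}

function frame(payload: Float32Array, draw: (b: GpuBuffer) => void): void {
  const entry = ready.pop();
  if (entry === undefined) {
    return; // every buffer is still in flight: skip the update or grow the pool
  }
  new Float32Array(entry.data).set(payload); // write while safely mapped
  entry.buffer.unmap();                      // hand the data to the GPU
  draw(entry.buffer);                        // record and submit work that uses it
  recycle(entry.buffer);                     // map it again for a later frame
}
```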
- JG: I found that using a synchronous command structure worked just as well.
- KR: This isn't just about getBufferSubDataAsync. We agree your proposal would be possible to do. We are concerned about how easy it is to fall off the fast path. For the readback case you need to do the readback and wait for a fence to pass before you can read the data. We would like things to be more explicit than that, with an open query for the readback; we found it makes the tracking easier in our implementation.
- JG: Appreciate the feedback, but GL is a well-ordered API, so for GL using a fence should be OK.
- KR: In our implementation it makes things harder with the ordering of fences, and we have to delay things.
- CW: Let's move on, because a lot of these details are very WebGL-specific.
- KR: The question is whether we should do async vs. sync for WebGPU. For developer convenience it seems we should follow the native explicit APIs and have a synchronous API.
- VM: May want to talk with ISVs about what they expect. They probably have a synchronous pipeline. They may use async callbacks...
- CW: ISVs will probably have strong opinions about this. Was assuming that async was best for the web, though sync might work. Whether you use callbacks, sync, promises, etc. – we can solve those differences – are there comments on the UMA proposal?
- RC: To what degree should we hold the developer's hand and make sure they can't screw it up? Or if they screw it up and things don't look good, is that OK? Should we go the extra mile to do validation so they can't mess up?
- MM: We should try very hard to make it very difficult for users to shoot themselves in the foot. Unless I've misunderstood, it should be impossible for the API to block the main thread. A sync API would cause the API to block, right?
- CW: No, the sync version would be "can I map you now?". If you try to map while the answer is "no", you get null back and an error. All non-blocking.
- CW: Agree the API should prevent shooting yourself in the foot. Can see the difference between discrete GPU and UMA architectures: on a discrete GPU your app might work because the GPU has caches which hide things; on UMA, your writes are directly visible to the GPU, so an app might be broken on UMA only. If you have a persistently mapped buffer that you can write to while the GPU is using it, it is hard or impossible to write a validation layer.
- DM: Can't we prevent persistent mapping in general?
- MM: Yes; if you can keep something mapped for 1000 frames, you'll run into sync problems on some drivers.
- DM: It's a really powerful use case for some users, not something that can be discarded trivially. Games will want to update parts of a texture atlas while other parts are in use.
- MM: There's a general principle that the web is device independent. If you can write an app that works on your device but is broken on all others, that's bad.
- JG: We've already made the web device dependent.
- KR: Not true; in WebGL we got rid of MapBuffer because in Chromium it is not possible to have these synchronous APIs. A lot of work in WebGL goes into making the API device independent.
- JG: For example, different limits in WebGL.
- CW: I think we should learn from WebGL: if you want something above the limits you should have to ask for it. It's a different discussion.
- VM: Strongly opposed to fragmentation. Need to learn from the lessons we've learned. The more we can ensure write-once, run-everywhere across browsers and devices, the more this API will see adoption.
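[Editorial note: for comparison with the callback sketch earlier, the non-blocking synchronous variant Corentin describes above ("can I map you now?", returning null when the buffer is still busy) might look roughly like this. mapWriteSync is an assumed, illustrative name.]

```typescript
// Hypothetical non-blocking synchronous variant: either the buffer is idle
// and you get its memory immediately, or you get null (plus a validation
// error) and must try again later. Nothing ever blocks the main thread.
interface GpuBuffer {
  mapWriteSync(): ArrayBuffer | null; // assumed name, for illustration
  unmap(): void;
}

function tryUpload(buffer: GpuBuffer, payload: Float32Array): boolean {
  const data = buffer.mapWriteSync();
  if (data === null) return false; // still in use by the GPU; poll again next frame
  new Float32Array(data).set(payload);
  buffer.unmap();
  return true;
}
```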
- JG: Past a certain point WORA doesn't work for performant applications. Have to compromise on this a little bit for best performance.
- VM: It's a question of supporting the few or the many. Most developers can't test on all devices. Strongly feel we should enable apps to work by default.
- JG: Yes. And if they use Three.js version 2.0, they will. Don't want to do too much handholding, because the ecosystem will do a better job than we can.
- CW: The goal isn't just to make an API that's easy to use. Whether you're an expert or a beginner, if you make an app that doesn't ask for extensions or limits, it should work on another machine.
- JG: Agree.
- CW: Not sure that the app will be optimally performant, but it will work. The goal is to make it opt-in to be device dependent.
- VM: We're making user agents here. Try to prevent bad experiences for users. It shouldn't be just up to the developer to "not mess it up".
- JG: Strongly caution against going too far in that regard. Preventing screw-ups also prevents well-meaning developers from doing a good engineering job.
- CW: The goal is to have low-level, deterministic primitives.
- CW: Mapping a buffer persistently is scary from this standpoint. It opts the developer in to races.
- JG: Only if you allow multiple writes at the same time. Can we have a single-writer API, with read and write mutually exclusive?
- MM: Thought the problem was CPU vs. GPU collisions.
- VM: This was a use case requested earlier: writing to half a buffer while the other half is in use. Is this something we want?
- MM: Maybe we can have a system that makes this well defined. Partition buffers?
- KN: Think it's not that difficult to make something that works this way, but it might have performance penalties.
- CW: You can make sub-buffers and persistently map a larger buffer. It can get in the way of the robust buffer access checks done by the hardware. You can round-robin between multiple buffers.
- MM: Sounds like there's a persistent mapping, but it's in the implementation?
- CW: Yes. You can map the whole buffer at once, but you can't really do a ring buffer (instead, a ring buffer of buffers). But there's an API for "I want to update this sub-part of a buffer now"; it does a copy.
- RC: So in your implementation the "buffer" the user sees is a window into a bigger API buffer?
- CW: No, two mechanisms. You can update a buffer by mapping it – the discussion earlier – but you only map the whole thing. If that annoys you, you can say "I want to put this data at that offset in this buffer", essentially BufferSubData.
- RC: Does it let you do that while the buffer is in use by the GPU?
- CW: Yes. To do that you need the buffer in the TRANSFER_DESTINATION usage.
- MM: Agree with the general tenets of this approach – preventing you from writing into buffers that are in use by the GPU. Need to look at the general mechanism in more detail.
- RC: Will our code for doing this automatically do better than code written by an application that knows its own intent?
- MM: Depends on what you mean by "better". Portable? Definitely. If you're writing for one specific device you might be able to do better.
- RC: Meaning that the web developer might need fewer fences, and could group things using fewer primitives.
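[Editorial note: a sketch of the two update paths Corentin describes above – mapping the whole buffer, or a BufferSubData-style update at an offset. setSubData and the usage-flag behavior are assumptions for illustration; the key idea is that offset updates require a transfer-destination usage and are turned into a copy internally.]

```typescript
// Sketch (illustrative names only) of the two buffer-update mechanisms
// discussed above.
interface GpuBuffer {
  // Mechanism 1: map the whole buffer for writing, race-free and async.
  mapWriteAsync(callback: (data: ArrayBuffer) => void): void;
  unmap(): void;
  // Mechanism 2: update only [offset, offset + data.byteLength). The buffer
  // would need a transfer-destination usage; internally the implementation
  // stages the bytes and schedules a copy, so it is valid even while other
  // parts of the buffer are in use by the GPU.
  setSubData(offset: number, data: ArrayBufferView): void;
}

// Update one region of a large, long-lived buffer without mapping all of it.
function updateRegion(buffer: GpuBuffer, offset: number, bytes: Uint8Array): void {
  buffer.setSubData(offset, bytes);
}
```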
- VM: There was an interesting presentation from AMD: if you want the optimal thing, you might have to present a graph to the API. Neither the driver nor the developer can do the best thing for each architecture.
- JG: How much of that graph is what we get from submitting command buffers with dependencies?
- CW: Let's not get into a task graph API. The industry hasn't converged yet.
- JG: This is a task graph API already.
- CW: NXT is an online task graph API. AMD's proposal was an offline plus online task graph API.
- JG: Vulkan is a task graph API.
- KN: AMD was saying the way Vulkan does task graphs isn't enough – you need more, higher-level information.
- VM: You need hindsight to do the optimal thing. Vulkan is very low level; you have to expose more and more flags to the developer to do the right thing, and it also can't express future architectures.
- JG: Please put a link to that in the minutes.
- CW: Will try to; not sure it was public.
- RC: If we generalize this and make it so the web developer can't shoot themselves in the foot, will their code be as performant as a developer doing it themselves?
- CW: No. The best we can do is the best the developer can do. The question is how much we lose, and I think we'll lose very little. The worst case is a ring buffer of buffers instead of a single buffer; that's probably the biggest loss. Aside from that, we're probably doing what any sensible app would do.
- CW: While doing NXT's Vulkan backend, have been doing this buffer upload work and have been writing code exactly like what an application would do. We end up doing per-resource tracking, while the app could plausibly do this at a coarser granularity.
- RC / CW: When is this checking done? At queue submits.
- RC: If we think that we won't lose much perf, there's a lot to be said for things being more portable, with fewer bug reports from people when the real problem is their own bookkeeping.
- JG: Would like to learn more about when this would be the case and what guarantees the APIs we're building on top of have. Would be good if we didn't have to lock things down too badly.
- KD: Can we disallow people from shooting themselves in the foot on the web, but allow opt-in performance?
- CW: Scary, because everyone will opt in to the high-performance unsafe mode. But it seems OK to add more knobs for apps that request them.
- CW: This goes back to UMA and buffer mapping. That's why we'd like to have a UMA extension which provides that functionality. Opting into footguns that would make apps non-portable doesn't sound OK.
- JG: You're saying it's better to have performance footguns than correctness footguns?
- CW: Yes. If you have correctness footguns that give you 1% more perf, everyone will use them and the apps will be non-portable.
- JG: For some developers, performance is part of the minimum viable product. Let's not throw it out blindly. Should be trying to avoid performance cliffs.
- Google / CW: Still somewhat happy with our proposal (async vs. sync is a problem that we'll have to decide). Concern about persistent buffers used as ring buffers: the proposal doesn't account for sub-buffer usage. Could be a problem for ISVs, but we aren't sure. Our points: avoid footguns, but have no perf impact from avoiding footguns.
- KR: Think sync uploads, async readbacks.
- KN: In NXT, we have both async and sync uploads in the design. Think async uploads have uses, but not sure.
- Apple / MM: It should be difficult to make an app where uploads/downloads are non-portable. Should be an async API. Should be one code path for UMA and discrete. Safety and portability are important and should not be sacrificed; some small perf penalty is worth that correctness.
- Microsoft / RC: If we can make uploads portable without impacting perf, then we should do that. If we do UMA as an extension, let's do that: they added it to D3D12 because they thought there was a need for it.
- Mozilla / JG: Struggling to apply the WebGL investigation to these APIs. The big question is what the ordering guarantees look like. Hope there's a good solution that keeps these looser ordering guarantees in mind and allows very safe read/write exclusion that makes sense and prevents footguns. Think there's not a strong benefit to async uploads; sync uploads are just as fast and don't hurt. Readbacks, not so sure, especially because the orderings are looser.
- Mozilla / DM: Explicit memory management is one of the biggest advantages of the low-level APIs. Persistent mapping is a critical feature. Would prefer to have only two timelines: the app side, and one per hardware queue that executes it. This means a sync-based API ("can I map/use this?") is preferred.

Agenda for next meeting

Next week: shading languages.
Received on Monday, 11 December 2017 23:21:33 UTC