- From: Kai Ninomiya <kainino@google.com>
- Date: Mon, 5 Aug 2019 16:17:54 -0700
- To: "Myles C. Maxfield" <mmaxfield@apple.com>
- Cc: Dzmitry Malyshau <dmalyshau@mozilla.com>, public-gpu <public-gpu@w3.org>
- Message-ID: <CANxMeyBaemqj8j+BpyF4GOA6kcLT_spuP47UJFsUSjUe_oqdyQ@mail.gmail.com>
It isn't very specific about exactly how much performance is left on the table by not having hardware-specific code-paths. I suspect it is not that huge. Applications that want to optimize for different hardware should be empowered to do so, but I don't think it's critical for the vast majority of developers. And there are probably some hardware specific optimizations that we can do inside of implementations as well in the long run.

I just don't think we can possibly get into the business of making the API write-once-run-everywhere-with-maximum-performance.

On Thu, Aug 1, 2019 at 9:29 AM Myles C. Maxfield <mmaxfield@apple.com> wrote:

> The biggest takeaway I got from this is “try many things and measure which option is best.”
>
> E.g. “Reuse fragments recorded in bundles” vs “Don’t expect lots of list reuse”
> “Don’t record ... just very few command lists” vs “Don’t create too many ... command lists”
> “Minimize the number of Root Signature changes” vs “Don’t bloat your root signature and descriptor tables to be able to reuse them”
>
> Also, the fact that it says engines need hardware-specific codepaths for each IHV is scary in the context of a Web API.
>
> > On May 6, 2019, at 11:40 AM, Dzmitry Malyshau <dmalyshau@mozilla.com> wrote:
> >
> > Thank you for the link, Ken!
> >
> > > - Expect to write a separate render path for each IHV minimum, because the app has to replace the driver's reasoning about how to efficiently drive the hardware.
> >
> > I believe we are addressing this by a combination of:
> >
> > 1. some level of automation on the users behalf, i.e. we the memory allocation and topology is hidden
> >
> > 2. solid defaults, i.e. maximum number of resource groups and vertex buffers used
> >
> > 3. exposed limits to some capacity and/or allowing the user to request specific limits when creating a device
> >
> > > - Compile pipeline state objects on background threads, because that's where shader compilation happens.
> >
> > This would be addressed internally by the implementations.
> >
> > > - Record command buffers on multiple threads in order to achieve best parallelism.
> >
> > The plan is to encourage the users to record command buffers on the web workers, if possible. This, however, doesn't necessarily translate to the parallel recording at the API level, since some implementations need to cross IPC (or use "Wire" protocol). Those are going to be implementation differences. Hopefully, the users would see the difference when recording commands on multiple workers.
> >
> > > - Avoid changing the root signature unnecessarily as doing so is costly
> >
> > This one is also related to root constants:
> >
> > > - Constants that sit directly in root can speed up pixel shaders significantly on NVIDIA hardware – specifically consider shader constants that toggle parts of uber-shaders
> >
> > From these two points I would conclude that implementing push constants via dx12 root constants would not be very efficient, and the latter are better suited for low-frequency changing switches, almost like specialization constants (but not that drastic). It's not clear to me at this point how the implementations would take the benefit of the root constants being fast.
> >
> > Other things that are relevant:
> >
> > > - Limit the shader visibility of CBVs, SRVs and UAVs to only the necessary stages
> >
> > Related to our discussion on defaults, whether resources should be visible to all stages by default, or none of them.
> >
> > > - Use split barriers when possible. This helps the driver doing a more efficient job.
> >
> > We haven't considered the use of split barriers in wgpu-rs internally yet. It doesn't appear feasible when live-recording a command buffer (unlike the approach of encoding at submit time).
> >
> > Thank you,
> >
> > Dzmitry
> >
> > On 5/6/19 2:03 PM, Ken Russell wrote:
> >> NVIDIA recently published an article "DX12 Do's and Don'ts" which is an interesting read. There are many, many details that should be managed correctly in order to achieve a performant DX12 application.
> >>
> >> https://developer.nvidia.com/dx12-dos-and-donts
> >>
> >> Some points made include:
> >>
> >> - Expect to write a separate render path for each IHV minimum, because the app has to replace the driver's reasoning about how to efficiently drive the hardware.
> >>
> >> - Compile pipeline state objects on background threads, because that's where shader compilation happens.
> >>
> >> - Record command buffers on multiple threads in order to achieve best parallelism.
> >>
> >> - Avoid changing the root signature unnecessarily as doing so is costly
> >>
> >> And many more.
> >>
> >> Any thoughts on this article and how WebGPU abstracts away any of its concerns?
> >>
> >> -Ken
Received on Monday, 5 August 2019 23:18:31 UTC