Re: DX12 Do's and Don'ts from Myles C. Maxfield on 2019-08-01 (public-gpu@w3.org from August 2019)

From: Myles C. Maxfield <mmaxfield@apple.com>
Date: Thu, 01 Aug 2019 09:25:30 -0700
To: Dzmitry Malyshau <dmalyshau@mozilla.com>
Cc: public-gpu@w3.org
Message-id: <D569D7F6-B428-4874-81EB-D6EDE1403EBE@apple.com>
The biggest takeaway I got from this is “try many things and measure which option is best.”

E.g. “Reuse fragments recorded in bundles” vs “Don’t expect lots of list reuse”
“Don’t record ... just very few command lists” vs “Don’t create too many ... command lists”
“Minimize the number of Root Signature changes” vs “Don’t bloat your root signature and descriptor tables to be able to reuse them”

Also, the fact that it says engines need hardware-specific codepaths for each IHV is scary in the context of a Web API.
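
A minimal sketch of the "try many things and measure" approach, in TypeScript. Everything here is an assumption for illustration: `Strategy`, `measure`, and `pickFastest` are hypothetical names, not part of WebGPU or D3D12, and the timing is synchronous CPU-side wall-clock time. Real GPU work completes asynchronously, so an actual comparison would have to wait for the submitted work to finish before reading the clock.

```typescript
// Hypothetical sketch: none of these names belong to WebGPU or D3D12.
type Strategy = { name: string; run: () => void };

interface Timing {
  name: string;
  medianMs: number;
}

// Median is less sensitive to one-off stalls than the mean.
function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Run every candidate `iterations` times and record its median duration.
// `now` is injectable so the clock can be replaced (e.g. in tests).
function measure(
  strategies: Strategy[],
  iterations: number = 5,
  now: () => number = () => Date.now()
): Timing[] {
  return strategies.map(({ name, run }) => {
    const samples: number[] = [];
    for (let i = 0; i < iterations; i++) {
      const start = now();
      run();
      samples.push(now() - start);
    }
    return { name, medianMs: median(samples) };
  });
}

// Pick the candidate with the lowest median time.
function pickFastest(timings: Timing[]): Timing {
  if (timings.length === 0) {
    throw new Error("no strategies to compare");
  }
  return timings.reduce((best, t) => (t.medianMs < best.medianMs ? t : best));
}
```

An application would pass one `Strategy` per option it is weighing (say, one that reuses bundles and one that re-records commands each frame) and keep whichever `pickFastest` returns on the current device.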

> On May 6, 2019, at 11:40 AM, Dzmitry Malyshau <dmalyshau@mozilla.com> wrote:
> 
> Thank you for the link, Ken!
> 
> > - Expect to write a separate render path for each IHV minimum, because the app has to replace the driver's reasoning about how to efficiently drive the hardware.
> 
> I believe we are addressing this by a combination of:
> 
>   1. some level of automation on the user's behalf, i.e. the memory allocation and topology are hidden
> 
>   2. solid defaults, i.e. maximum number of resource groups and vertex buffers used
> 
>   3. exposed limits to some capacity and/or allowing the user to request specific limits when creating a device
> 
> 
> > - Compile pipeline state objects on background threads, because that's where shader compilation happens.
> 
> This would be addressed internally by the implementations.
> 
> > - Record command buffers on multiple threads in order to achieve best parallelism.
> 
> The plan is to encourage the users to record command buffers on the web workers, if possible. This, however, doesn't necessarily translate to the parallel recording at the API level, since some implementations need to cross IPC (or use "Wire" protocol). Those are going to be implementation differences. Hopefully, the users would see the difference when recording commands on multiple workers.
> 
> 
> > - Avoid changing the root signature unnecessarily as doing so is costly
> 
> This one is also related to root constants:
> 
> - Constants that sit directly in root can speed up pixel shaders significantly on NVIDIA hardware – specifically consider shader constants that toggle parts of uber-shaders
> 
> 
> From these two points I would conclude that implementing push constants via DX12 root constants would not be very efficient, and the latter are better suited for switches that change at low frequency, almost like specialization constants (but not that drastic). It's not clear to me at this point how the implementations would take advantage of the root constants being fast.
> 
> 
> Other things that are relevant:
> 
> - Limit the shader visibility of CBVs, SRVs and UAVs to only the necessary stages
> 
> Related to our discussion on defaults, whether resources should be visible to all stages by default, or none of them.
> 
> - Use split barriers when possible. This helps the driver doing a more efficient job.
> 
> We haven't considered the use of split barriers in wgpu-rs internally yet. It doesn't appear feasible when live-recording a command buffer (unlike the approach of encoding at submit time).
> 
> 
> Thank you,
> 
> Dzmitry
> 
> On 5/6/19 2:03 PM, Ken Russell wrote:
>> NVIDIA recently published an article "DX12 Do's and Don'ts" which is an interesting read. There are many, many details that should be managed correctly in order to achieve a performant DX12 application.
>> 
>> https://developer.nvidia.com/dx12-dos-and-donts
>> 
>> Some points made include:
>> 
>>  - Expect to write a separate render path for each IHV minimum, because the app has to replace the driver's reasoning about how to efficiently drive the hardware.
>> 
>>  - Compile pipeline state objects on background threads, because that's where shader compilation happens.
>> 
>>  - Record command buffers on multiple threads in order to achieve best parallelism.
>> 
>>  - Avoid changing the root signature unnecessarily as doing so is costly
>> 
>> And many more.
>> 
>> Any thoughts on this article and how WebGPU abstracts away any of its concerns?
>> 
>> -Ken
>> 
>
Received on Thursday, 1 August 2019 16:28:54 UTC