Re: DX12 Do's and Don'ts from Dzmitry Malyshau on 2019-05-06 (public-gpu@w3.org from May 2019)

From: Dzmitry Malyshau <dmalyshau@mozilla.com>
Date: Mon, 6 May 2019 14:40:22 -0400
To: public-gpu@w3.org
Message-ID: <451c074f-feb3-f16c-df12-135f046198d9@mozilla.com>
Thank you for the link, Ken!

 > - Expect to write a separate render path for each IHV minimum, 
because the app has to replace the driver's reasoning about how to 
efficiently drive the hardware.

I believe we are addressing this by a combination of:

   1. some level of automation on the user's behalf, i.e. the memory 
allocation and topology are hidden

   2. solid defaults, i.e. maximum number of resource groups and vertex 
buffers used

   3. limits that are exposed to some degree, and/or letting the user 
request specific limits when creating a device
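
As a rough illustration of points 2 and 3, here is a minimal sketch of 
defaults plus explicitly requested limits. The `Limits` struct, its 
field names, the default values and `request_device` are all 
hypothetical and are not taken from any proposal or from wgpu-rs.

```rust
/// Hypothetical limits structure; field names are illustrative only.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Limits {
    pub max_resource_groups: u32,
    pub max_vertex_buffers: u32,
}

impl Default for Limits {
    /// "Solid defaults": placeholder numbers, not proposed values.
    fn default() -> Self {
        Limits { max_resource_groups: 4, max_vertex_buffers: 8 }
    }
}

/// Returns the limits the device would be created with, or an error
/// naming the first limit the adapter cannot satisfy.
/// Passing `None` means "use the defaults".
pub fn request_device(adapter: &Limits, requested: Option<Limits>) -> Result<Limits, String> {
    let wanted = requested.unwrap_or_default();
    if wanted.max_resource_groups > adapter.max_resource_groups {
        return Err(format!(
            "max_resource_groups: requested {}, adapter supports {}",
            wanted.max_resource_groups, adapter.max_resource_groups
        ));
    }
    if wanted.max_vertex_buffers > adapter.max_vertex_buffers {
        return Err(format!(
            "max_vertex_buffers: requested {}, adapter supports {}",
            wanted.max_vertex_buffers, adapter.max_vertex_buffers
        ));
    }
    Ok(wanted)
}
```

An application that never asks for more than the defaults would then 
behave the same on every adapter that meets them, while one that needs 
more has to say so up front and handle the failure.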


 > - Compile pipeline state objects on background threads, because 
that's where shader compilation happens.

This would be addressed internally by the implementations.

 > - Record command buffers on multiple threads in order to achieve best 
parallelism.

The plan is to encourage users to record command buffers on web 
workers where possible. This, however, doesn't necessarily translate to 
parallel recording at the native API level, since some implementations 
need to cross an IPC boundary (or use a "Wire" protocol). Those are 
going to be implementation differences. Hopefully, users would still 
see a benefit when recording commands on multiple workers.
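
The general pattern of recording on several workers and submitting in 
order can be sketched as follows. This is only a model using native 
threads from the Rust standard library; `Command`, `CommandBuffer`, 
`record_chunk` and `record_in_parallel` are hypothetical names, and 
nothing here reflects how a browser would route the work over IPC.

```rust
use std::thread;

/// Hypothetical recorded command.
#[derive(Debug, PartialEq)]
pub enum Command {
    Draw { vertices: u32 },
}

/// Hypothetical command buffer: just a list of recorded commands.
#[derive(Debug, PartialEq)]
pub struct CommandBuffer {
    pub commands: Vec<Command>,
}

/// Records one buffer from a slice of draw sizes.
pub fn record_chunk(draws: &[u32]) -> CommandBuffer {
    CommandBuffer {
        commands: draws.iter().map(|&v| Command::Draw { vertices: v }).collect(),
    }
}

/// Splits the draws into at most `workers` contiguous chunks, records
/// each chunk on its own thread, and returns the buffers in the
/// original order so they can be submitted sequentially.
pub fn record_in_parallel(draws: &[u32], workers: usize) -> Vec<CommandBuffer> {
    let workers = workers.max(1);
    let chunk_len = ((draws.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        // Spawn every worker first, then join them in order.
        let handles: Vec<_> = draws
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || record_chunk(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("recording thread panicked"))
            .collect()
    })
}
```

Whether the recording itself ends up parallel underneath is exactly 
the implementation difference described above; the sketch only shows 
the shape of the user-facing pattern.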


 > - Avoid changing the root signature unnecessarily as doing so is costly

This one is also related to root constants:

- Constants that sit directly in root can speed up pixel shaders 
significantly on NVIDIA hardware – specifically consider shader 
constants that toggle parts of uber-shaders


 From these two points I would conclude that implementing push constants 
via DX12 root constants would not be very efficient, and that the latter 
are better suited for switches that change at low frequency, almost like 
specialization constants (but not that drastic). It's not clear to me at 
this point how implementations would take advantage of root constants 
being fast.


Other things that are relevant:

- Limit the shader visibility of CBVs, SRVs and UAVs to only the 
necessary stages

This relates to our discussion on defaults: whether resources should be 
visible to all stages by default, or to none of them.

- Use split barriers when possible. This helps the driver doing a more 
efficient job.

We haven't considered the use of split barriers in wgpu-rs internally 
yet. It doesn't appear feasible when live-recording a command buffer 
(unlike the approach of encoding at submit time).
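
One possible reading of the feasibility problem, sketched below: the 
"begin" half of a split barrier belongs right after the last use of a 
resource in its old state, but with live recording that pass has 
already been emitted by the time the next use reveals that a transition 
is needed. The model is deliberately tiny (one resource per pass), and 
`State`, `Pass`, `Op`, `encode_live` and `encode_at_submit` are 
hypothetical names, not wgpu-rs or D3D12 API.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum State {
    RenderTarget,
    ShaderRead,
}

/// A pass that uses a single resource in a single state (simplification).
pub struct Pass {
    pub resource: u32,
    pub state: State,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Op {
    Pass(usize),
    FullBarrier { resource: u32, to: State },
    BeginBarrier { resource: u32, to: State },
    EndBarrier { resource: u32, to: State },
}

/// Live recording: ops are emitted as passes arrive, so a transition is
/// only discovered at the next use and must be a full barrier there.
pub fn encode_live(passes: &[Pass]) -> Vec<Op> {
    let mut current: HashMap<u32, State> = HashMap::new();
    let mut ops = Vec::new();
    for (i, pass) in passes.iter().enumerate() {
        if let Some(old) = current.insert(pass.resource, pass.state) {
            if old != pass.state {
                ops.push(Op::FullBarrier { resource: pass.resource, to: pass.state });
            }
        }
        ops.push(Op::Pass(i));
    }
    ops
}

/// Encoding at submit time: the whole pass list is known, so the begin
/// half can go right after the previous use and the end half right
/// before the next one.
pub fn encode_at_submit(passes: &[Pass]) -> Vec<Op> {
    let mut last_use: HashMap<u32, (usize, State)> = HashMap::new();
    let mut begin_after = vec![None; passes.len()];
    let mut needs_end = vec![false; passes.len()];
    for (j, pass) in passes.iter().enumerate() {
        if let Some((i, old)) = last_use.insert(pass.resource, (j, pass.state)) {
            if old != pass.state {
                begin_after[i] = Some(pass.state);
                needs_end[j] = true;
            }
        }
    }
    let mut ops = Vec::new();
    for (i, pass) in passes.iter().enumerate() {
        if needs_end[i] {
            ops.push(Op::EndBarrier { resource: pass.resource, to: pass.state });
        }
        ops.push(Op::Pass(i));
        if let Some(to) = begin_after[i] {
            ops.push(Op::BeginBarrier { resource: pass.resource, to });
        }
    }
    ops
}
```

With an unrelated pass between the two uses, the submit-time encoder 
gives the driver that pass to overlap with the transition, while the 
live encoder can only stall right before the second use.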


Thank you,

Dzmitry

On 5/6/19 2:03 PM, Ken Russell wrote:
> NVIDIA recently published an article "DX12 Do's and Don'ts" which is 
> an interesting read. There are many, many details that should be 
> managed correctly in order to achieve a performant DX12 application.
>
> https://developer.nvidia.com/dx12-dos-and-donts
>
> Some points made include:
>
>  - Expect to write a separate render path for each IHV minimum, 
> because the app has to replace the driver's reasoning about how to 
> efficiently drive the hardware.
>
>  - Compile pipeline state objects on background threads, because 
> that's where shader compilation happens.
>
>  - Record command buffers on multiple threads in order to achieve best 
> parallelism.
>
>  - Avoid changing the root signature unnecessarily as doing so is costly
>
> And many more.
>
> Any thoughts on this article and how WebGPU abstracts away any of its 
> concerns?
>
> -Ken
>
Received on Monday, 6 May 2019 18:40:49 UTC