Re: DX12 Do's and Don'ts from Kai Ninomiya on 2019-08-05 (public-gpu@w3.org from August 2019)

From: Kai Ninomiya <kainino@google.com>
Date: Mon, 5 Aug 2019 16:17:54 -0700
To: "Myles C. Maxfield" <mmaxfield@apple.com>
Cc: Dzmitry Malyshau <dmalyshau@mozilla.com>, public-gpu <public-gpu@w3.org>
Message-ID: <CANxMeyBaemqj8j+BpyF4GOA6kcLT_spuP47UJFsUSjUe_oqdyQ@mail.gmail.com>
It isn't very specific about exactly how much performance is left on the
table by not having hardware-specific code-paths. I suspect it is not that
huge. Applications that want to optimize for different hardware should be
empowered to do so, but I don't think it's critical for the vast majority
of developers.

And there are probably some hardware specific optimizations that we can do
inside of implementations as well in the long run. I just don't think we
can possibly get into the business of making the API
write-once-run-everywhere-with-maximum-performance.

On Thu, Aug 1, 2019 at 9:29 AM Myles C. Maxfield <mmaxfield@apple.com>
wrote:

> The biggest takeaway I got from this is “try many things and measure which
> option is best.”
>
> E.g. "Reuse fragments recorded in bundles" vs "Don’t expect lots of list
> reuse"
> "Don’t record ... just very few command lists" vs "Don’t create too many
> ... command lists"
> "Minimize the number of Root Signature changes" vs "Don’t bloat your root
> signature and descriptor tables to be able to reuse them"
>
> Also, the fact that it says engines need hardware-specific codepaths for
> each IHV is scary in the context of a Web API.
>
> > On May 6, 2019, at 11:40 AM, Dzmitry Malyshau <dmalyshau@mozilla.com>
> wrote:
> >
> > Thank you for the link, Ken!
> >
> > > - Expect to write a separate render path for each IHV minimum, because
> the app has to replace the driver's reasoning about how to efficiently
> drive the hardware.
> >
> > I believe we are addressing this by a combination of:
> >
> >   1. some level of automation on the user's behalf, i.e. the memory
> allocation and topology are hidden
> >
> >   2. solid defaults, i.e. maximum number of resource groups and vertex
> buffers used
> >
> >   3. exposed limits to some capacity and/or allowing the user to request
> specific limits when creating a device
> >
> >
> > > - Compile pipeline state objects on background threads, because that's
> where shader compilation happens.
> >
> > This would be addressed internally by the implementations.
> >
> > > - Record command buffers on multiple threads in order to achieve best
> parallelism.
> >
> > The plan is to encourage the users to record command buffers on the web
> workers, if possible. This, however, doesn't necessarily translate to the
> parallel recording at the API level, since some implementations need to
> cross IPC (or use "Wire" protocol). Those are going to be implementation
> differences. Hopefully, the users would see the difference when recording
> commands on multiple workers.
> >
> >
> > > - Avoid changing the root signature unnecessarily as doing so is costly
> >
> > This one is also related to root constants:
> >
> > - Constants that sit directly in root can speed up pixel shaders
> significantly on NVIDIA hardware – specifically consider shader constants
> that toggle parts of uber-shaders
> >
> >
> > From these two points I would conclude that implementing push constants
> via dx12 root constants would not be very efficient, and the latter are
> better suited for low-frequency changing switches, almost like
> specialization constants (but not that drastic). It's not clear to me at
> this point how the implementations would benefit from the root
> constants being fast.
> >
> >
> > Other things that are relevant:
> >
> > - Limit the shader visibility of CBVs, SRVs and UAVs to only the
> necessary stages
> >
> > Related to our discussion on defaults, whether resources should be
> visible to all stages by default, or none of them.
> >
> > - Use split barriers when possible. This helps the driver doing a more
> efficient job.
> >
> > We haven't considered the use of split barriers in wgpu-rs internally
> yet. It doesn't appear feasible when live-recording a command buffer
> (unlike the approach of encoding at submit time).
> >
> >
> > Thank you,
> >
> > Dzmitry
> >
> > On 5/6/19 2:03 PM, Ken Russell wrote:
> >> NVIDIA recently published an article "DX12 Do's and Don'ts" which is an
> interesting read. There are many, many details that should be managed
> correctly in order to achieve a performant DX12 application.
> >>
> >> https://developer.nvidia.com/dx12-dos-and-donts
> >>
> >> Some points made include:
> >>
> >>  - Expect to write a separate render path for each IHV minimum, because
> the app has to replace the driver's reasoning about how to efficiently
> drive the hardware.
> >>
> >>  - Compile pipeline state objects on background threads, because that's
> where shader compilation happens.
> >>
> >>  - Record command buffers on multiple threads in order to achieve best
> parallelism.
> >>
> >>  - Avoid changing the root signature unnecessarily as doing so is costly
> >>
> >> And many more.
> >>
> >> Any thoughts on this article and how WebGPU abstracts away any of its
> concerns?
> >>
> >> -Ken
> >>
> >
>
>
>
Received on Monday, 5 August 2019 23:18:31 UTC