- From: Chris Blume (ProgramMax) <programmax@gmail.com>
- Date: Wed, 24 Sep 2025 20:44:14 -0400
- To: "Portable Network Graphics (PNG) Working Group" <public-png@w3.org>
- Message-ID: <CAG3W2Kd3xhm5t3dMgW+uEe7VTHS8je9cvanNhQY+5yJQojeiWQ@mail.gmail.com>
Hello everyone,

I just uploaded a first pass at a new tool to help us in our effort to improve PNG decoding: FileReadSpeedTest <https://github.com/ProgramMax/FileReadSpeedTest>.

It works by finding the optimal buffer size for a given drive & file system, then loading a file buffer-by-buffer. It reports the time each buffer arrives. This allows us to "replay" file loading for performance testing. (The drive and OS will cache data, changing the load speeds & performance results. We can instead feed the buffers to the test rig at known good intervals to keep tests consistent.) Rough sketches of both the recording and the replay ideas appear at the end of this note.

This is how the hardware works under the hood; it does not load an entire file in one go. It also gives us a target for our decode speeds. In an ideal world, we can decode a buffer faster than the next buffer arrives. That would mean the decode speed is limited by the drive, not by the format/algorithm.

I tested on my laptop, which has a WD_BLACK SN770 2TB drive. That is a seriously fast drive: advertised speeds are "up to 5,150 MB/s", and I was able to reach 5,259 MB/s. It is formatted with NTFS (BitLocker encrypted). Windows reports the ideal buffer size is 4096 bytes, and a buffer load took ~10 microseconds. So a simple, not-quite-accurate reaction is "our format should decode 4096 bytes in ~10 microseconds for this machine". For a machine with a more typical drive, we'll have even more time. (To be more accurate, I'll also need to measure IO command queuing and OS call overhead. That'll come soon.)

An important thing to note is that this is a moving target. CPU speeds have leveled out, but drive speeds are still increasing. And if we want to target crazy-fast server hardware, that's a different target again. Additionally, that 4096 bytes in ~10 microseconds assumes a single-threaded workload. If we were able to spread the work across 8 threads, we would have ~80 microseconds per buffer. But with command queuing, multiple threads' worth of buffered data might arrive at nearly the same time, reducing the available per-thread decode time back down.

I'll work on improving the tool and gathering more sample data for us to replay under various conditions. I'll also add Linux and Mac support when I can.
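To make the recording half concrete, here is a minimal sketch of the measurement loop. It is not the actual FileReadSpeedTest code: the hard-coded 4096-byte buffer and the use of std::ifstream are assumptions for illustration, and an honest measurement would use unbuffered/direct I/O so neither the OS cache nor the stream's own buffer skews the timings.

```cpp
// Minimal sketch of the measurement loop, not the actual FileReadSpeedTest code.
// The 4096-byte buffer is the ideal size Windows reported for my drive; a real
// tool would determine this per drive/file system instead of hard-coding it.
// A real measurement would also use unbuffered/direct I/O (e.g.
// FILE_FLAG_NO_BUFFERING on Windows) so the OS cache and std::ifstream's own
// buffering don't skew the numbers.
#include <chrono>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <file>\n";
        return 1;
    }

    constexpr std::size_t kBufferSize = 4096;
    std::vector<char> buffer(kBufferSize);

    std::ifstream file(argv[1], std::ios::binary);
    if (!file) {
        std::cerr << "could not open " << argv[1] << "\n";
        return 1;
    }

    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    Clock::time_point previous = start;

    // Load the file buffer-by-buffer and report when each buffer arrived.
    while (file) {
        file.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = file.gcount();
        if (got <= 0) break;

        const Clock::time_point now = Clock::now();
        const auto sincePrevious =
            std::chrono::duration_cast<std::chrono::microseconds>(now - previous);
        std::cout << got << " bytes after " << sincePrevious.count() << " us\n";
        previous = now;
    }

    const auto total =
        std::chrono::duration_cast<std::chrono::microseconds>(Clock::now() - start);
    std::cout << "total: " << total.count() << " us\n";
}
```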
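And here is a similarly rough sketch of the replay half, assuming a hypothetical BufferRecord structure and a decode_chunk callback standing in for the decoder under test. The pass/fail question is simply: did each buffer finish decoding before the next one "arrived"?

```cpp
// Rough sketch of the replay idea: feed previously recorded buffers to a
// decoder at their recorded arrival times, so test runs don't depend on the
// drive or the OS cache. BufferRecord and decode_chunk are placeholders for
// illustration, not part of FileReadSpeedTest.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// One recorded buffer: its contents and when it arrived, relative to the
// start of the original read.
struct BufferRecord {
    std::vector<std::uint8_t> bytes;
    std::chrono::microseconds arrival_offset;
};

// Feeds the recorded buffers to decode_chunk at their recorded arrival times.
// Returns true if every buffer was decoded before the next one "arrived",
// i.e. the drive, not the format/algorithm, would be the bottleneck.
bool replay(const std::vector<BufferRecord>& recording,
            const std::function<void(const std::uint8_t*, std::size_t)>& decode_chunk) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    bool kept_up = true;

    for (std::size_t i = 0; i < recording.size(); ++i) {
        std::this_thread::sleep_until(start + recording[i].arrival_offset);
        decode_chunk(recording[i].bytes.data(), recording[i].bytes.size());

        if (i + 1 < recording.size() &&
            Clock::now() > start + recording[i + 1].arrival_offset) {
            kept_up = false;  // decode fell behind the recorded arrival rate
        }
    }
    return kept_up;
}

int main() {
    // Tiny fake recording: three 4096-byte buffers, one every ~10 microseconds
    // (the numbers from my laptop). Real data would come from the tool's output.
    std::vector<BufferRecord> recording;
    for (int i = 0; i < 3; ++i) {
        recording.push_back({std::vector<std::uint8_t>(4096),
                             std::chrono::microseconds(10 * i)});
    }
    const bool ok = replay(recording, [](const std::uint8_t*, std::size_t) {
        // Decoder under test goes here.
    });
    return ok ? 0 : 1;
}
```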
Received on Thursday, 25 September 2025 00:44:30 UTC