Re: [PNG] New tool to identify decode speed targets

Following up on this:
- I now report the time taken at each step, which exposes OS call overhead
and can reveal IO command queuing.
- The ~10 microseconds for 4096 bytes seemed *really* fast. It turns out
that was the OS cache.
- I am now using unbuffered IO, which bypasses the OS cache (a rough sketch
of what that means is below). The new times are ~300-400 ms for 4096 bytes,
which is great news: that is a much easier target to hit. And keep in mind,
this is on a fast drive & interface.
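
For reference, here is a rough sketch (not the tool's actual code) of what
an unbuffered, per-buffer-timed read loop looks like on Windows:
FILE_FLAG_NO_BUFFERING bypasses the OS file cache, and
QueryPerformanceCounter times each ReadFile. The file name and buffer size
are placeholders.

#include <windows.h>
#include <malloc.h>   // _aligned_malloc / _aligned_free
#include <cstdio>

int main() {
    // FILE_FLAG_NO_BUFFERING requires sector-aligned buffers and sizes.
    const DWORD kBufferSize = 4096;

    HANDLE file = CreateFileA("test.bin", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING,
                              FILE_FLAG_NO_BUFFERING, // bypass the OS cache
                              nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    void* buffer = _aligned_malloc(kBufferSize, kBufferSize);

    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);

    DWORD bytesRead = 0;
    do {
        QueryPerformanceCounter(&start);
        if (!ReadFile(file, buffer, kBufferSize, &bytesRead, nullptr)) break;
        QueryPerformanceCounter(&end);
        double us = (end.QuadPart - start.QuadPart) * 1e6 / freq.QuadPart;
        std::printf("%lu bytes in %.1f us\n", bytesRead, us);
    } while (bytesRead == kBufferSize);

    _aligned_free(buffer);
    CloseHandle(file);
    return 0;
}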

Interesting side observations:
- When the OS cache is used, the benefit of overlapped IO is ~zero. You
might as well issue plain sequential read requests. It's all just API call
speed and memory speed at that point. (A minimal overlapped-read sketch is
below for reference.)
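
For comparison, a minimal sketch of an overlapped (asynchronous) read, i.e.
the approach whose benefit disappears once everything is already in the OS
cache. Again illustrative only, not code from the tool; file name and sizes
are placeholders.

#include <windows.h>
#include <cstdio>

int main() {
    HANDLE file = CreateFileA("test.bin", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING,
                              FILE_FLAG_OVERLAPPED, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    static char buffer[4096];
    OVERLAPPED ov = {};
    ov.hEvent = CreateEventA(nullptr, TRUE, FALSE, nullptr);
    ov.Offset = 0; // read the first 4096 bytes of the file

    // Issue the read; it completes in the background...
    DWORD bytesRead = 0;
    if (!ReadFile(file, buffer, sizeof(buffer), nullptr, &ov) &&
        GetLastError() != ERROR_IO_PENDING) {
        CloseHandle(ov.hEvent);
        CloseHandle(file);
        return 1;
    }

    // ...while the caller could be decoding the previous buffer. Then wait.
    GetOverlappedResult(file, &ov, &bytesRead, TRUE);
    std::printf("read %lu bytes\n", bytesRead);

    CloseHandle(ov.hEvent);
    CloseHandle(file);
    return 0;
}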

On Wed, Sep 24, 2025 at 8:44 PM Chris Blume (ProgramMax) <
programmax@gmail.com> wrote:

> Hello everyone,
>
> I just uploaded a first pass at a new tool to help us in our effort to
> improve PNG decoding: FileReadSpeedTest
> <https://github.com/ProgramMax/FileReadSpeedTest>.
>
> It works by finding the optimal buffer size for a given drive & file
> system, then loading the file buffer by buffer. It reports the time at
> which each buffer arrives. This lets us "replay" file loading for
> performance testing.
> (The drive and OS will cache data, changing the load speeds & performance
> results. We can instead feed the buffers to the test rig at known good
> intervals to keep tests consistent.)
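>
> To make the replay idea concrete, here is a rough, hypothetical sketch
> (these names are not the tool's actual API) of feeding recorded buffers
> to a decoder at the measured arrival intervals:
>
> #include <chrono>
> #include <cstdint>
> #include <thread>
> #include <vector>
>
> struct RecordedBuffer {
>     std::vector<std::uint8_t> data;   // the bytes as they arrived
>     std::chrono::microseconds delay;  // measured gap before this buffer
> };
>
> // decode_chunk stands in for whatever PNG decoder is under test.
> void replay(const std::vector<RecordedBuffer>& recording,
>             void (*decode_chunk)(const std::uint8_t*, std::size_t)) {
>     for (const RecordedBuffer& b : recording) {
>         std::this_thread::sleep_for(b.delay); // simulate drive timing
>         decode_chunk(b.data.data(), b.data.size());
>     }
> }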
>
> This is how the hardware works under the hood: it does not load an entire
> file in one go. This also gives us a target for our decode speeds. In an
> ideal world, we would decode each buffer faster than the next buffer
> arrives. That would mean decode speed is limited by the drive, not by the
> format/algorithm.
>
> I tested on my laptop, which has a WD_BLACK SN770 2TB drive. That is a
> seriously fast drive. Advertised speeds are "up to 5,150 MB/s". I was able
> to reach 5,259 MB/s. It is formatted with NTFS (BitLocker encrypted).
> Windows reports the ideal buffer size is 4096 bytes.
>
> A buffer load took ~10 microseconds. So a simple, not-quite-accurate
> takeaway is "our format should decode 4096 bytes in ~10 microseconds on
> this machine."
> For a machine with a more normal drive, we'll have even more time.
>
> (In order to be more accurate, I'll need to also measure IO command
> queuing and OS call overhead. That'll come soon.)
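>
> As a back-of-the-envelope check on those numbers: at the measured 5,259
> MB/s, the raw transfer of a single 4096-byte buffer would take well under
> a microsecond, which suggests nearly all of the ~10 microseconds is call
> overhead and caching rather than data movement from the drive. A tiny
> illustration:
>
> #include <cstdio>
>
> int main() {
>     const double buffer_bytes = 4096.0;
>     const double drive_bps    = 5259.0 * 1e6; // measured 5,259 MB/s
>     const double raw_us       = buffer_bytes / drive_bps * 1e6;
>     const double measured_us  = 10.0;         // per-buffer time above
>
>     std::printf("raw transfer ~%.2f us, measured ~%.1f us, "
>                 "~%.1f us of overhead\n",
>                 raw_us, measured_us, measured_us - raw_us);
>     return 0;
> }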
>
> An important thing to note is that this is a moving target. CPU speeds
> have leveled off, but drive speeds are still increasing. And crazy-fast
> server hardware is a different target altogether. Additionally, that
> 4096 bytes in ~10 microseconds assumes a single-threaded workload. If we
> were able to spread the work across 8 threads, we would have ~80
> microseconds per buffer. But with command queuing, multiple threads' worth
> of buffered data might arrive at nearly the same time, reducing the
> available per-thread decode time back down.
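>
> A small illustration of that threading arithmetic (assuming buffers are
> handed out round-robin and arrive evenly spaced, which command queuing
> can break):
>
> #include <cstdio>
>
> int main() {
>     const double per_buffer_us = 10.0; // ~10 us per 4096-byte buffer
>     for (int threads = 1; threads <= 8; threads *= 2) {
>         std::printf("%d thread(s): ~%.0f us to decode each buffer\n",
>                     threads, per_buffer_us * threads);
>     }
>     return 0;
> }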
>
> I'll work on improving the tool and gathering more sample data for us to
> replay under various conditions. I'll also add Linux and Mac support when I
> can.
>

Received on Thursday, 25 September 2025 19:57:45 UTC