- From: Chris Blume (ProgramMax) <programmax@gmail.com>
- Date: Wed, 24 Sep 2025 20:44:14 -0400
- To: "Portable Network Graphics (PNG) Working Group" <public-png@w3.org>
- Message-ID: <CAG3W2Kd3xhm5t3dMgW+uEe7VTHS8je9cvanNhQY+5yJQojeiWQ@mail.gmail.com>
Hello everyone,

I just uploaded a first pass at a new tool to help us in our effort to improve PNG decoding: FileReadSpeedTest <https://github.com/ProgramMax/FileReadSpeedTest>.

It works by finding the optimal buffer size for a given drive & file system, then loading a file buffer-by-buffer. It reports the time each buffer arrives. This allows us to "replay" file loading for performance testing. (The drive and OS will cache data, changing the load speeds & performance results. We can instead feed the buffers to the test rig at known good intervals to keep tests consistent.) Rough sketches of both the recording and the replay ideas appear at the end of this note.

This is how the hardware works under the hood; it does not load an entire file in one go. It also gives us a target for our decode speeds. In an ideal world, we can decode a buffer faster than the next buffer arrives. That would mean the decode speed is limited by the drive, not by the format/algorithm.

I tested on my laptop, which has a WD_BLACK SN770 2TB drive. That is a seriously fast drive: advertised speeds are "up to 5,150 MB/s", and I was able to reach 5,259 MB/s. It is formatted with NTFS (BitLocker encrypted). Windows reports the ideal buffer size is 4096 bytes, and a buffer load took ~10 microseconds. So a simple, not-quite-accurate reaction is "our format should decode 4096 bytes in ~10 microseconds for this machine". For a machine with a more typical drive, we'll have even more time. (To be more accurate, I'll also need to measure IO command queuing and OS call overhead. That'll come soon.)

An important thing to note is that this is a moving target. CPU speeds have leveled out, but drive speeds are still increasing. And if we want to target crazy-fast server hardware, that's a different target again. Additionally, that 4096 bytes in ~10 microseconds assumes a single-threaded workload. If we were able to spread the work across 8 threads, we would have ~80 microseconds per buffer. But with command queuing, multiple threads' worth of buffered data might arrive at nearly the same time, reducing the available per-thread decode time back down.

I'll work on improving the tool and gathering more sample data for us to replay under various conditions. I'll also add Linux and Mac support when I can.
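To make the recording half concrete, here is a minimal sketch of the measurement loop. It is not the actual FileReadSpeedTest code: the hard-coded 4096-byte buffer and the use of std::ifstream are assumptions for illustration, and an honest measurement would use unbuffered/direct I/O so neither the OS cache nor the stream's own buffer skews the timings.

```cpp
// Minimal sketch of the measurement loop, not the actual FileReadSpeedTest code.
// The 4096-byte buffer is the ideal size Windows reported for my drive; a real
// tool would determine this per drive/file system instead of hard-coding it.
// A real measurement would also use unbuffered/direct I/O (e.g.
// FILE_FLAG_NO_BUFFERING on Windows) so the OS cache and std::ifstream's own
// buffering don't skew the numbers.
#include <chrono>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <file>\n";
        return 1;
    }

    constexpr std::size_t kBufferSize = 4096;
    std::vector<char> buffer(kBufferSize);

    std::ifstream file(argv[1], std::ios::binary);
    if (!file) {
        std::cerr << "could not open " << argv[1] << "\n";
        return 1;
    }

    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    Clock::time_point previous = start;

    // Load the file buffer-by-buffer and report when each buffer arrived.
    while (file) {
        file.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = file.gcount();
        if (got <= 0) break;

        const Clock::time_point now = Clock::now();
        const auto sincePrevious =
            std::chrono::duration_cast<std::chrono::microseconds>(now - previous);
        std::cout << got << " bytes after " << sincePrevious.count() << " us\n";
        previous = now;
    }

    const auto total =
        std::chrono::duration_cast<std::chrono::microseconds>(Clock::now() - start);
    std::cout << "total: " << total.count() << " us\n";
}
```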
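And here is a similarly rough sketch of the replay half, assuming a hypothetical BufferRecord structure and a decode_chunk callback standing in for the decoder under test. The pass/fail question is simply: did each buffer finish decoding before the next one "arrived"?

```cpp
// Rough sketch of the replay idea: feed previously recorded buffers to a
// decoder at their recorded arrival times, so test runs don't depend on the
// drive or the OS cache. BufferRecord and decode_chunk are placeholders for
// illustration, not part of FileReadSpeedTest.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// One recorded buffer: its contents and when it arrived, relative to the
// start of the original read.
struct BufferRecord {
    std::vector<std::uint8_t> bytes;
    std::chrono::microseconds arrival_offset;
};

// Feeds the recorded buffers to decode_chunk at their recorded arrival times.
// Returns true if every buffer was decoded before the next one "arrived",
// i.e. the drive, not the format/algorithm, would be the bottleneck.
bool replay(const std::vector<BufferRecord>& recording,
            const std::function<void(const std::uint8_t*, std::size_t)>& decode_chunk) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    bool kept_up = true;

    for (std::size_t i = 0; i < recording.size(); ++i) {
        std::this_thread::sleep_until(start + recording[i].arrival_offset);
        decode_chunk(recording[i].bytes.data(), recording[i].bytes.size());

        if (i + 1 < recording.size() &&
            Clock::now() > start + recording[i + 1].arrival_offset) {
            kept_up = false;  // decode fell behind the recorded arrival rate
        }
    }
    return kept_up;
}

int main() {
    // Tiny fake recording: three 4096-byte buffers, one every ~10 microseconds
    // (the numbers from my laptop). Real data would come from the tool's output.
    std::vector<BufferRecord> recording;
    for (int i = 0; i < 3; ++i) {
        recording.push_back({std::vector<std::uint8_t>(4096),
                             std::chrono::microseconds(10 * i)});
    }
    const bool ok = replay(recording, [](const std::uint8_t*, std::size_t) {
        // Decoder under test goes here.
    });
    return ok ? 0 : 1;
}
```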
Received on Thursday, 25 September 2025 00:44:30 UTC