Re: [PNG] New tool to identify decode speed targets

I suspect larger reads and/or more concurrency could get much higher
bandwidth. 4096 bytes / 300 microseconds = 13 MiB/second or only a bit over
100 Mbit/second.
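A quick sanity check of that arithmetic, sketched in Python using the numbers from the thread (one outstanding 4096-byte read completing in ~300 microseconds):

```python
# Back-of-the-envelope throughput for a single outstanding 4096-byte read.
read_bytes = 4096
latency_s = 300e-6  # ~300 microseconds per unbuffered read

bytes_per_s = read_bytes / latency_s
mib_per_s = bytes_per_s / (1024 * 1024)
mbit_per_s = bytes_per_s * 8 / 1e6

print(f"{mib_per_s:.1f} MiB/s, {mbit_per_s:.0f} Mbit/s")  # → 13.0 MiB/s, 109 Mbit/s
```

Larger reads or more reads in flight would amortize that per-request latency, which is why the drive's advertised multi-GB/s figures are only reachable with bigger buffers or deeper queues.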

Jonathan

On Thu, Sep 25, 2025 at 2:19 PM Chris Blume (ProgramMax) <
programmax@gmail.com> wrote:

> Correction: I said "ms" when I meant "μs".
> Not ~300-400 ms (which would be ~0.3-0.4 seconds), but ~300-400 microseconds
> (~0.0003-0.0004 seconds).
>
> On Thu, Sep 25, 2025 at 3:57 PM Chris Blume (ProgramMax) <
> programmax@gmail.com> wrote:
>
>> Following up on this:
>> - I now report the time taken at each step, which shows OS call overhead
>> and can show IO command queuing.
>> - The ~10 microseconds for 4096 bytes seemed *really* fast. Turns out,
>> that was OS caching.
>> - I am now using unbuffered IO, which circumvents OS caching. The new
>> times are ~300-400 ms for 4096 bytes, which is great news: That is a much
>> easier target to hit. And keep in mind, this is on a fast drive & interface.
>>
>> Interesting side observations:
>> - When the OS cache is used, the benefit of overlapped IO is ~zero. You
>> might as well do sequential read requests. It's all just API call speed and
>> memory speed at that point.
>>
>> On Wed, Sep 24, 2025 at 8:44 PM Chris Blume (ProgramMax) <
>> programmax@gmail.com> wrote:
>>
>>> Hello everyone,
>>>
>>> I just uploaded a first pass at a new tool to help us in our effort to
>>> improve PNG decoding: FileReadSpeedTest
>>> <https://github.com/ProgramMax/FileReadSpeedTest>.
>>>
>>> It works by finding the optimal buffer size for a given drive & file
>>> system, then loads a file buffer-by-buffer. It reports the time each buffer
>>> arrives. This allows us to "replay" file loading for performance testing.
>>> (The drive and OS will cache data, changing the load speeds & performance
>>> results. We can instead feed the buffers to the test rig at known good
>>> intervals to keep tests consistent.)
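The replay idea could be sketched roughly like this. The trace format and the `replay` helper are hypothetical illustrations, not the tool's actual output format:

```python
import time

# Hypothetical trace: (buffer_size_bytes, arrival_interval_seconds) per read,
# as a tool like FileReadSpeedTest might record it.
trace = [(4096, 350e-6)] * 4

def replay(trace, consume):
    """Feed buffers to `consume` at the recorded intervals, so decode
    benchmarks see consistent timing regardless of OS/drive caching."""
    for size, interval in trace:
        time.sleep(interval)   # wait as long as the real read took
        consume(bytes(size))   # hand over a buffer of that size

total = 0
def count(buf):
    global total
    total += len(buf)

replay(trace, count)
print(total)  # → 16384
```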
>>>
>>> This mirrors how the hardware works under the hood: it does not load an
>>> entire file in one go. This also gives us a target for our decode speeds.
>>> In an ideal world, we can decode a buffer faster than the next buffer
>>> arrives. That would mean the decode speed is limited by the drive, not the
>>> format/algorithm.
>>>
>>> I tested on my laptop, which has a WD_BLACK SN770 2TB drive. That is a
>>> seriously fast drive. Advertised speeds are "up to 5,150 MB/s". I was able
>>> to reach 5,259 MB/s. It is formatted with NTFS (BitLocker encrypted).
>>> Windows reports the ideal buffer size is 4096 bytes.
>>>
>>> A buffer load took ~10 microseconds. So a simple, not-quite-accurate
>>> reaction is "Our format should decode 4096 bytes in ~10 microseconds for
>>> this machine".
>>> For a machine with a more normal drive, we'll have even more time.
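As a rough sketch of what that target implies, assuming a steady stream of 4096-byte buffers arriving every ~10 microseconds (the OS-cached figure measured above):

```python
buffer_bytes = 4096
arrival_s = 10e-6  # ~10 microseconds between buffer arrivals (OS-cached)

# Sustained decode throughput needed to keep up with buffer arrivals.
required_mb_s = buffer_bytes / arrival_s / 1e6
print(f"{required_mb_s:.1f} MB/s")  # → 409.6 MB/s
```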
>>>
>>> (In order to be more accurate, I'll need to also measure IO command
>>> queuing and OS call overhead. That'll come soon.)
>>>
>>> An important thing to note is this is a moving target. CPU speeds have
>>> leveled out, but drive speeds are still increasing. If we want to target
>>> crazy fast server hardware, that's a different target. Additionally, that
>>> 4096 bytes in ~10 microseconds assumes a single-threaded workload. If we
>>> were able to spread the work across 8 threads, we would have ~80
>>> microseconds. But with command queuing, multiple threads worth of buffered
>>> data might arrive at nearly the same time, reducing the available
>>> per-thread decode time back down.
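The multi-thread budget above can be sketched under the optimistic assumption that buffers are handed out round-robin, so each thread only sees every Nth buffer (command queuing, as noted, can erode this in practice):

```python
single_thread_budget_us = 10  # ~10 us between buffer arrivals, one thread
threads = 8

# Round-robin distribution: each thread receives a new buffer only every
# `threads` arrivals, multiplying its per-buffer decode budget.
per_thread_budget_us = single_thread_budget_us * threads
print(per_thread_budget_us)  # → 80
```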
>>>
>>> I'll work on improving the tool and gathering more sample data for us to
>>> replay under various conditions. I'll also add Linux and Mac support when I
>>> can.
>>>
>>

Received on Monday, 29 September 2025 09:25:58 UTC