- From: Chris Blume (ProgramMax) <programmax@gmail.com>
- Date: Sat, 27 Sep 2025 06:27:22 -0400
- To: Jonathan Behrens <fintelia@gmail.com>
- Cc: "Portable Network Graphics (PNG) Working Group" <public-png@w3.org>
- Message-ID: <CAG3W2Kf1_GQJtKbQNKr8oE=wpd06XDBMcz73KYrLQJw9MG+Tvw@mail.gmail.com>
Agreed. But I was using a fairly typically sized PNG. Larger test data would
likely hit faster read speeds but wouldn't represent a typical PNG, IMHO.

That said, the tool is open and you can run it on whatever test data you
want. Actually, I'll want to do that anyway and keep a list of data->result.

I suspect there are things my code is doing incorrectly, too. For example,
right now my buffers are exactly the size Windows suggests is optimal. But I
wonder if the buffers should be a multiple of that size.

On Fri, Sep 26, 2025 at 3:28 PM Jonathan Behrens <fintelia@gmail.com> wrote:

> I suspect larger reads and/or more concurrency could get much higher
> bandwidth. 4096 bytes / 300 microseconds = 13 MiB/second, or only a bit
> over 100 Mbit/second.
>
> Jonathan
>
> On Thu, Sep 25, 2025 at 2:19 PM Chris Blume (ProgramMax) <
> programmax@gmail.com> wrote:
>
>> Correction: I said "ms" when I meant "μs".
>> Not ~300-400 ms (which would be ~0.3 of a second), but ~300-400
>> microseconds (~0.0003 of a second).
>>
>> On Thu, Sep 25, 2025 at 3:57 PM Chris Blume (ProgramMax) <
>> programmax@gmail.com> wrote:
>>
>>> Following up on this:
>>> - I now report the time taken at each step, which shows OS call overhead
>>> and can show IO command queuing.
>>> - The ~10 microseconds for 4096 bytes seemed *really* fast. It turns
>>> out that was OS caching.
>>> - I am now using unbuffered IO, which circumvents OS caching. The new
>>> times are ~300-400 ms for 4096 bytes, which is great news: that is a
>>> much easier target to hit. And keep in mind, this is on a fast drive &
>>> interface.
>>>
>>> Interesting side observations:
>>> - When the OS cache is used, the benefit of overlapped IO is ~zero. You
>>> might as well do sequential read requests. It's all just API call speed
>>> and memory speed at that point.
>>>
>>> On Wed, Sep 24, 2025 at 8:44 PM Chris Blume (ProgramMax) <
>>> programmax@gmail.com> wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> I just uploaded a first pass at a new tool to help us in our effort to
>>>> improve PNG decoding: FileReadSpeedTest
>>>> <https://github.com/ProgramMax/FileReadSpeedTest>.
>>>>
>>>> It works by finding the optimal buffer size for a given drive & file
>>>> system, then loading a file buffer-by-buffer. It reports the time each
>>>> buffer arrives. This allows us to "replay" file loading for performance
>>>> testing. (The drive and OS will cache data, changing the load speeds &
>>>> performance results. We can instead feed the buffers to the test rig at
>>>> known-good intervals to keep tests consistent.)
>>>>
>>>> This is how the hardware works under the hood: it does not load an
>>>> entire file in one go. This also gives us a target for our decode
>>>> speeds. In an ideal world, we can decode a buffer faster than the next
>>>> buffer arrives. That would mean the decode speed is limited by the
>>>> drive, not the format/algorithm.
>>>>
>>>> I tested on my laptop, which has a WD_BLACK SN770 2TB drive. That is a
>>>> seriously fast drive. Advertised speeds are "up to 5,150 MB/s"; I was
>>>> able to reach 5,259 MB/s. It is formatted with NTFS (BitLocker
>>>> encrypted). Windows reports the ideal buffer size is 4096 bytes.
>>>>
>>>> A buffer load took ~10 microseconds. So a simple, not-quite-accurate
>>>> reaction is "our format should decode 4096 bytes in ~10 microseconds on
>>>> this machine". For a machine with a more normal drive, we'll have even
>>>> more time.
>>>>
>>>> (To be more accurate, I'll also need to measure IO command queuing and
>>>> OS call overhead. That will come soon.)
>>>>
>>>> An important thing to note is that this is a moving target. CPU speeds
>>>> have leveled out, but drive speeds are still increasing. If we want to
>>>> target crazy-fast server hardware, that's a different target.
>>>> Additionally, that 4096 bytes in ~10 microseconds assumes a
>>>> single-threaded workload. If we were able to spread the work across 8
>>>> threads, we would have ~80 microseconds. But with command queuing,
>>>> multiple threads' worth of buffered data might arrive at nearly the
>>>> same time, reducing the available per-thread decode time back down.
>>>>
>>>> I'll work on improving the tool and gathering more sample data for us
>>>> to replay under various conditions. I'll also add Linux and Mac support
>>>> when I can.
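[Editor's note: as a sanity check, Jonathan's bandwidth arithmetic above works out as follows. This is a quick Python sketch of the calculation, not part of the tool.]

```python
# One 4096-byte read completing in ~300 microseconds, as measured
# with unbuffered IO in the thread above.
buffer_bytes = 4096
read_seconds = 300e-6

bytes_per_second = buffer_bytes / read_seconds
mib_per_second = bytes_per_second / (1024 * 1024)   # binary megabytes
mbit_per_second = bytes_per_second * 8 / 1e6        # decimal megabits

print(f"{mib_per_second:.1f} MiB/s")    # ~13 MiB/s
print(f"{mbit_per_second:.0f} Mbit/s")  # a bit over 100 Mbit/s
```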
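[Editor's note: the per-thread decode budget described in the Sep 24 message is simple multiplication; a minimal sketch, using the figures quoted in the thread:]

```python
# Decode-time budget per 4096-byte buffer, per the thread above:
# single-threaded, decode must keep pace with one buffer load (~10 us);
# spread across 8 threads, each thread naively gets 8x the time.
buffer_load_us = 10
threads = 8

single_thread_budget_us = buffer_load_us
multi_thread_budget_us = buffer_load_us * threads

print(single_thread_budget_us, multi_thread_budget_us)  # 10 80
```

As the message notes, command queuing can deliver several threads' buffers nearly simultaneously, so the 8x figure is an upper bound, not a guarantee.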
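[Editor's note: the buffer-by-buffer timing approach described in the Sep 24 message can be sketched in a few lines of Python. This is illustrative only: the function name is hypothetical, and the real FileReadSpeedTest tool uses unbuffered OS-level IO to bypass caching, which plain Python reads do not.]

```python
import io
import time

def time_buffered_reads(f, buffer_size=4096):
    """Read a file object buffer-by-buffer, recording when each buffer arrives.

    Returns a list of arrival times (seconds since the first read began),
    which could later be used to "replay" file loading at known intervals.
    """
    arrival_times = []
    start = time.perf_counter()
    while f.read(buffer_size):
        arrival_times.append(time.perf_counter() - start)
    return arrival_times

# Example on an in-memory "file" of exactly 4 buffers:
times = time_buffered_reads(io.BytesIO(b"\x00" * 4096 * 4))
print(len(times))  # 4
```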
Received on Saturday, 27 September 2025 10:27:37 UTC