- From: Chris Blume (ProgramMax) <programmax@gmail.com>
- Date: Sat, 27 Sep 2025 06:27:22 -0400
- To: Jonathan Behrens <fintelia@gmail.com>
- Cc: "Portable Network Graphics (PNG) Working Group" <public-png@w3.org>
- Message-ID: <CAG3W2Kf1_GQJtKbQNKr8oE=wpd06XDBMcz73KYrLQJw9MG+Tvw@mail.gmail.com>
Agreed. But I was using a fairly typically sized PNG. Larger test data would
likely hit faster read speeds but wouldn't represent a typical PNG, IMHO.

That said, the tool is open and you can run it on whatever test data you
want. Actually, I'll want to do that anyway and keep a list of data->result.

I suspect there are things my code is doing incorrectly, too. For example,
right now my buffers are exactly the size Windows suggests is optimal. But I
wonder if the buffers should be a multiple of that size.

On Fri, Sep 26, 2025 at 3:28 PM Jonathan Behrens <fintelia@gmail.com> wrote:

> I suspect larger reads and/or more concurrency could get much higher
> bandwidth. 4096 bytes / 300 microseconds = 13 MiB/second, or only a bit
> over 100 Mbit/second.
>
> Jonathan
>
> On Thu, Sep 25, 2025 at 2:19 PM Chris Blume (ProgramMax) <
> programmax@gmail.com> wrote:
>
>> Correction: I said "ms" when I meant "μs".
>> Not ~300-400 ms (which would be ~0.3 of a second), but ~300-400
>> microseconds (~0.0003 of a second).
>>
>> On Thu, Sep 25, 2025 at 3:57 PM Chris Blume (ProgramMax) <
>> programmax@gmail.com> wrote:
>>
>>> Following up on this:
>>> - I now report the time taken at each step, which shows OS call overhead
>>> and can show IO command queuing.
>>> - The ~10 microseconds for 4096 bytes seemed *really* fast. It turns
>>> out that was OS caching.
>>> - I am now using unbuffered IO, which circumvents OS caching. The new
>>> times are ~300-400 ms for 4096 bytes, which is great news: that is a
>>> much easier target to hit. And keep in mind, this is on a fast drive &
>>> interface.
>>>
>>> Interesting side observations:
>>> - When the OS cache is used, the benefit of overlapped IO is ~zero. You
>>> might as well do sequential read requests. It's all just API call speed
>>> and memory speed at that point.
>>>
>>> On Wed, Sep 24, 2025 at 8:44 PM Chris Blume (ProgramMax) <
>>> programmax@gmail.com> wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> I just uploaded a first pass at a new tool to help us in our effort to
>>>> improve PNG decoding: FileReadSpeedTest
>>>> <https://github.com/ProgramMax/FileReadSpeedTest>.
>>>>
>>>> It works by finding the optimal buffer size for a given drive & file
>>>> system, then loading a file buffer-by-buffer. It reports the time each
>>>> buffer arrives. This allows us to "replay" file loading for performance
>>>> testing. (The drive and OS will cache data, changing the load speeds &
>>>> performance results. We can instead feed the buffers to the test rig at
>>>> known-good intervals to keep tests consistent.)
>>>>
>>>> This is how the hardware works under the hood: it does not load an
>>>> entire file in one go. This also gives us a target for our decode
>>>> speeds. In an ideal world, we can decode a buffer faster than the next
>>>> buffer arrives. That would mean the decode speed is limited by the
>>>> drive, not the format/algorithm.
>>>>
>>>> I tested on my laptop, which has a WD_BLACK SN770 2TB drive. That is a
>>>> seriously fast drive. Advertised speeds are "up to 5,150 MB/s"; I was
>>>> able to reach 5,259 MB/s. It is formatted with NTFS (BitLocker
>>>> encrypted). Windows reports the ideal buffer size is 4096 bytes.
>>>>
>>>> A buffer load took ~10 microseconds. So a simple, not-quite-accurate
>>>> reaction is "our format should decode 4096 bytes in ~10 microseconds on
>>>> this machine". For a machine with a more normal drive, we'll have even
>>>> more time.
>>>>
>>>> (To be more accurate, I'll also need to measure IO command queuing and
>>>> OS call overhead. That will come soon.)
>>>>
>>>> An important thing to note is that this is a moving target. CPU speeds
>>>> have leveled out, but drive speeds are still increasing. If we want to
>>>> target crazy-fast server hardware, that's a different target.
>>>> Additionally, that 4096 bytes in ~10 microseconds assumes a
>>>> single-threaded workload. If we were able to spread the work across 8
>>>> threads, we would have ~80 microseconds. But with command queuing,
>>>> multiple threads' worth of buffered data might arrive at nearly the
>>>> same time, reducing the available per-thread decode time back down.
>>>>
>>>> I'll work on improving the tool and gathering more sample data for us
>>>> to replay under various conditions. I'll also add Linux and Mac support
>>>> when I can.
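[Editor's note: as a sanity check, Jonathan's bandwidth arithmetic above works out as follows. This is a quick Python sketch of the calculation, not part of the tool.]

```python
# One 4096-byte read completing in ~300 microseconds, as measured
# with unbuffered IO in the thread above.
buffer_bytes = 4096
read_seconds = 300e-6

bytes_per_second = buffer_bytes / read_seconds
mib_per_second = bytes_per_second / (1024 * 1024)   # binary megabytes
mbit_per_second = bytes_per_second * 8 / 1e6        # decimal megabits

print(f"{mib_per_second:.1f} MiB/s")    # ~13 MiB/s
print(f"{mbit_per_second:.0f} Mbit/s")  # a bit over 100 Mbit/s
```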
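[Editor's note: the per-thread decode budget described in the Sep 24 message is simple multiplication; a minimal sketch, using the figures quoted in the thread:]

```python
# Decode-time budget per 4096-byte buffer, per the thread above:
# single-threaded, decode must keep pace with one buffer load (~10 us);
# spread across 8 threads, each thread naively gets 8x the time.
buffer_load_us = 10
threads = 8

single_thread_budget_us = buffer_load_us
multi_thread_budget_us = buffer_load_us * threads

print(single_thread_budget_us, multi_thread_budget_us)  # 10 80
```

As the message notes, command queuing can deliver several threads' buffers nearly simultaneously, so the 8x figure is an upper bound, not a guarantee.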
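[Editor's note: the buffer-by-buffer timing approach described in the Sep 24 message can be sketched in a few lines of Python. This is illustrative only: the function name is hypothetical, and the real FileReadSpeedTest tool uses unbuffered OS-level IO to bypass caching, which plain Python reads do not.]

```python
import io
import time

def time_buffered_reads(f, buffer_size=4096):
    """Read a file object buffer-by-buffer, recording when each buffer arrives.

    Returns a list of arrival times (seconds since the first read began),
    which could later be used to "replay" file loading at known intervals.
    """
    arrival_times = []
    start = time.perf_counter()
    while f.read(buffer_size):
        arrival_times.append(time.perf_counter() - start)
    return arrival_times

# Example on an in-memory "file" of exactly 4 buffers:
times = time_buffered_reads(io.BytesIO(b"\x00" * 4096 * 4))
print(len(times))  # 4
```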
Received on Saturday, 27 September 2025 10:27:37 UTC