- From: Jonathan Behrens <fintelia@gmail.com>
- Date: Fri, 26 Sep 2025 12:28:04 -0700
- To: "Chris Blume (ProgramMax)" <programmax@gmail.com>
- Cc: "Portable Network Graphics (PNG) Working Group" <public-png@w3.org>
- Message-ID: <CANnJOVH6RGo_-f43R+2p7grGbAY-PTaEjySMjB-6CjFCvGWSZw@mail.gmail.com>
I suspect larger reads and/or more concurrency could get much higher
bandwidth. 4096 bytes / 300 microseconds = 13 MiB/second, or only a bit
over 100 Mbit/second.

Jonathan

On Thu, Sep 25, 2025 at 2:19 PM Chris Blume (ProgramMax) <programmax@gmail.com> wrote:

> Correction: I said "ms" when I meant "μs".
> Not ~300-400 ms (which would be ~0.3 of a second), but ~300-400
> microseconds (~0.0003 of a second).
>
> On Thu, Sep 25, 2025 at 3:57 PM Chris Blume (ProgramMax) <programmax@gmail.com> wrote:
>
>> Following up on this:
>> - I now report the time taken at each step, which shows OS call
>> overhead and can show IO command queuing.
>> - The ~10 microseconds for 4096 bytes seemed *really* fast. It turns
>> out that was OS caching.
>> - I am now using unbuffered IO, which circumvents OS caching. The new
>> times are ~300-400 ms for 4096 bytes, which is great news: that is a
>> much easier target to hit. And keep in mind, this is on a fast drive
>> & interface.
>>
>> An interesting side observation:
>> - When the OS cache is used, the benefit of overlapped IO is ~zero.
>> You might as well issue sequential read requests. It's all just API
>> call speed and memory speed at that point.
>>
>> On Wed, Sep 24, 2025 at 8:44 PM Chris Blume (ProgramMax) <programmax@gmail.com> wrote:
>>
>>> Hello everyone,
>>>
>>> I just uploaded a first pass at a new tool to help us in our effort
>>> to improve PNG decoding: FileReadSpeedTest
>>> <https://github.com/ProgramMax/FileReadSpeedTest>.
>>>
>>> It works by finding the optimal buffer size for a given drive & file
>>> system, then loading a file buffer by buffer. It reports the time
>>> each buffer arrives, which allows us to "replay" file loading for
>>> performance testing. (The drive and OS will cache data, changing the
>>> load speeds & performance results. We can instead feed the buffers
>>> to the test rig at known-good intervals to keep tests consistent.)
>>>
>>> This is how the hardware works under the hood: it does not load an
>>> entire file in one go. This also gives us a target for our decode
>>> speeds. In an ideal world, we can decode a buffer faster than the
>>> next buffer arrives. That would mean the decode speed is limited by
>>> the drive, not the format/algorithm.
>>>
>>> I tested on my laptop, which has a WD_BLACK SN770 2TB drive. That is
>>> a seriously fast drive; advertised speeds are "up to 5,150 MB/s",
>>> and I was able to reach 5,259 MB/s. It is formatted with NTFS
>>> (BitLocker encrypted). Windows reports the ideal buffer size is 4096
>>> bytes.
>>>
>>> A buffer load took ~10 microseconds. So a simple, not-quite-accurate
>>> reaction is "our format should decode 4096 bytes in ~10 microseconds
>>> on this machine". For a machine with a more typical drive, we'll
>>> have even more time.
>>>
>>> (To be more accurate, I'll also need to measure IO command queuing
>>> and OS call overhead. That will come soon.)
>>>
>>> An important thing to note is that this is a moving target. CPU
>>> speeds have leveled out, but drive speeds are still increasing. If
>>> we want to target crazy-fast server hardware, that is a different
>>> target. Additionally, that 4096 bytes in ~10 microseconds assumes a
>>> single-threaded workload. If we were able to spread the work across
>>> 8 threads, we would have ~80 microseconds per buffer. But with
>>> command queuing, multiple threads' worth of buffered data might
>>> arrive at nearly the same time, reducing the available per-thread
>>> decode time back down.
>>>
>>> I'll work on improving the tool and gathering more sample data for
>>> us to replay under various conditions. I'll also add Linux and Mac
>>> support when I can.
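
For concreteness, here is a minimal sketch of the kind of timed,
unbuffered, buffer-by-buffer read loop described above. It assumes
Windows, a 4096-byte buffer, and that FILE_FLAG_NO_BUFFERING is how the
unbuffered IO is done. It is illustrative only, not the actual
FileReadSpeedTest code:

    // Timed, unbuffered, buffer-by-buffer read loop (sketch).
    // FILE_FLAG_NO_BUFFERING bypasses the OS file cache but requires the
    // buffer, file offset, and read size to be sector-aligned.
    #include <windows.h>
    #include <malloc.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
        if (argc < 2) { std::fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

        HANDLE file = CreateFileA(argv[1], GENERIC_READ, FILE_SHARE_READ, nullptr,
                                  OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, nullptr);
        if (file == INVALID_HANDLE_VALUE) { std::fprintf(stderr, "open failed\n"); return 1; }

        const DWORD kBufferSize = 4096;  // the size Windows reported as ideal above
        void* buffer = _aligned_malloc(kBufferSize, kBufferSize);

        LARGE_INTEGER freq, start, now;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&start);

        std::vector<double> arrivalMicros;  // per-buffer arrival times, for later replay
        DWORD bytesRead = 0;
        while (ReadFile(file, buffer, kBufferSize, &bytesRead, nullptr) && bytesRead > 0) {
            QueryPerformanceCounter(&now);
            arrivalMicros.push_back((now.QuadPart - start.QuadPart) * 1e6 / freq.QuadPart);
        }

        for (size_t i = 0; i < arrivalMicros.size(); ++i)
            std::printf("buffer %zu arrived at %.1f us\n", i, arrivalMicros[i]);

        _aligned_free(buffer);
        CloseHandle(file);
        return 0;
    }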
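
Likewise, a sketch of the overlapped-IO variant, keeping several reads
in flight at once. This is the kind of added concurrency that could
push bandwidth well past 13 MiB/second; the queue depth of 4 is an
arbitrary choice and error handling is simplified:

    // Overlapped IO: keep several unbuffered reads in flight at once (sketch).
    #include <windows.h>
    #include <malloc.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        if (argc < 2) return 1;

        HANDLE file = CreateFileA(argv[1], GENERIC_READ, FILE_SHARE_READ, nullptr,
                                  OPEN_EXISTING,
                                  FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, nullptr);
        if (file == INVALID_HANDLE_VALUE) return 1;

        const DWORD kBufferSize = 4096;
        const int kInFlight = 4;  // queue depth; fast NVMe drives benefit from more
        OVERLAPPED ov[kInFlight] = {};
        void* buffers[kInFlight] = {};

        // Issue every read up front. Each OVERLAPPED carries its own file
        // offset and completion event, so the drive can service the requests
        // concurrently instead of one at a time.
        for (int i = 0; i < kInFlight; ++i) {
            buffers[i] = _aligned_malloc(kBufferSize, kBufferSize);
            ov[i].Offset = i * kBufferSize;
            ov[i].hEvent = CreateEvent(nullptr, TRUE, FALSE, nullptr);
            // Returning FALSE with ERROR_IO_PENDING is the normal
            // asynchronous path; error handling is elided for brevity.
            ReadFile(file, buffers[i], kBufferSize, nullptr, &ov[i]);
        }

        // Wait for each request and report how much data arrived.
        for (int i = 0; i < kInFlight; ++i) {
            DWORD bytes = 0;
            GetOverlappedResult(file, &ov[i], &bytes, TRUE);
            std::printf("request %d completed: %lu bytes\n", i, (unsigned long)bytes);
            CloseHandle(ov[i].hEvent);
            _aligned_free(buffers[i]);
        }

        CloseHandle(file);
        return 0;
    }

With command queuing, several of these requests may complete at nearly
the same time, which is exactly the per-thread decode-time caveat noted
in the thread above.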
Received on Monday, 29 September 2025 09:25:58 UTC