Re: [PNG] Cancel upcoming meeting?

That's a good point.
Since storage access will be slower, it's useful to keep the compressed
images prefetched & ready in memory.
And if you had infinite memory you might pre-decode them. But since that
isn't reality, there is some limit on how much you want to pre-decode.

At which point this parallel decoding becomes very useful again :D

On Fri, May 9, 2025 at 5:01 AM Leo Barnes <barnes.leo@gmail.com> wrote:

> > If you instead decode as the packets/pages are loaded, the format which
> > compresses better will use fewer packets/pages and thus decode faster.
> > Unless decoding is *so slow* that storage/network is faster. But there are
> > several orders of magnitude in the CPU's favor here.
>
> This makes a bunch of assumptions that may be valid for web usage. For the
> web, the transfer speed may indeed be the limiting factor.
>
> But for generic OS frameworks I would say it's pretty common to have image
> files already fully loaded in memory. At which point the decoding
> speed is the limiting factor and multithreading ends up being very
> important.
>
> As an example take the generic grid view of images. Say you have 10000
> images in the grid, but only N are actually visible on the screen at
> any given time. The fully decoded images are much larger to keep in memory
> than the compressed image files. If you want to optimize for
> scroll speed *and* memory you would do something like this (see the sketch
> after the list):
> - Keep 5 * N image files loaded into memory
> - Decode the visible N image files and show those on screen
> - (optional) Preemptively decode the next image to be displayed if the
> user scrolls in either direction
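>
> In pseudo-C the policy might look something like this (just a sketch; the
> Blob/Bitmap types and the load_file/decode_png helpers are placeholders,
> not any real API):
>
> #include <stddef.h>
>
> typedef struct { unsigned char *bytes; size_t len; } Blob;    /* compressed file */
> typedef struct { unsigned char *pixels; int w, h; } Bitmap;   /* decoded pixels  */
>
> extern Blob   load_file(int index);    /* hypothetical: read file i into memory */
> extern Bitmap decode_png(Blob b);      /* hypothetical: full decode of one file */
> extern void   free_blob(Blob *b);      /* frees and NULLs bytes                 */
> extern void   free_bitmap(Bitmap *b);  /* frees and NULLs pixels                */
>
> /* Called whenever the visible range [first, first + N) changes. */
> void update_grid_cache(int first, int N, int total, int scroll_dir,
>                        Blob *blobs, Bitmap *bitmaps)
> {
>     int lo = first - 2 * N, hi = first + 3 * N;         /* ~5*N files kept      */
>     int next = scroll_dir >= 0 ? first + N : first - 1; /* pre-decode candidate */
>
>     for (int i = 0; i < total; i++) {
>         int keep_blob   = i >= lo && i < hi;
>         int keep_bitmap = (i >= first && i < first + N) || i == next;
>
>         if (keep_blob && blobs[i].bytes == NULL)
>             blobs[i] = load_file(i);                    /* keep compressed copy  */
>         else if (!keep_blob && blobs[i].bytes != NULL)
>             free_blob(&blobs[i]);                       /* scrolled far away     */
>
>         if (keep_bitmap && bitmaps[i].pixels == NULL)
>             bitmaps[i] = decode_png(blobs[i]);          /* visible or next up    */
>         else if (!keep_bitmap && bitmaps[i].pixels != NULL)
>             free_bitmap(&bitmaps[i]);                   /* drop expensive pixels */
>     }
> }
>
> (A real implementation would only touch the slots entering or leaving the
> two windows instead of scanning all 10000, but the policy is the same.)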
>
> Cheers,
> //Leo
>
> On Fri, 9 May 2025 at 10:20, Chris Blume (ProgramMax) <
> programmax@gmail.com> wrote:
>
>> @Pierre-Anthony, I've tagged you in a section below--highlighted in
>> purple--about the image benchmarks, which we might want to change.
>>
>>
>>
>>
>>> These timing figures are, however, very misleading.  In practice it is
>>> normally faster to compress data **on disk** and then spend the CPU
>>> time decompressing it than it is to read uncompressed data.
>>
>>
>> Agreed :)
>> I have a little rant about this on Twitter.
>>
>> I tried explaining to the QOI creator that their stated goal of fast
>> decode is reasonable but I disagreed with their method and results. They
>> load the entire file and then time the decodes. Almost never do you have
>> the entire file in memory already. The only real cases I can think of are
>> ROMs or if the image is stored inside the executable. (Or maybe with OSes
>> today, which use spare memory to preload hot files. If you are loading the
>> same image over and over frequently, then maybe? But that is a far-fetched
>> situation.)
>>
>> If you instead decode as the packets/pages are loaded, the format which
>> compresses better will use fewer packets/pages and thus decode faster.
>> Unless decoding is *so slow* that storage/network is faster. But there are
>> several orders of magnitude in the CPU's favor here.
>>
>> I think a better way to test would be to first write a program that
>> records and plays back the storage device's timing of data arrival. That
>> way you can simulate the real device's real behavior, removing inter-run
>> file caching by the device/OS. And then the test can read from memory,
>> decoding at that device's real pace. This gives a better measure of the
>> format, rather than measuring a bad implementation / usage.
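>>
>> Roughly something like this (just a sketch; the trace format and the
>> feed_decoder hook are made up for illustration): record (elapsed
>> microseconds, bytes arrived) pairs from the real device once, then replay
>> the same pacing from memory while the decoder runs.
>>
>> #include <stddef.h>
>> #include <stdint.h>
>> #include <time.h>
>>
>> typedef struct { uint64_t t_us; uint32_t nbytes; } TraceEntry; /* recorded arrival */
>>
>> extern void feed_decoder(const unsigned char *buf, size_t len); /* hypothetical */
>>
>> void replay(const TraceEntry *trace, size_t n, const unsigned char *file_in_memory)
>> {
>>     struct timespec start, now;
>>     clock_gettime(CLOCK_MONOTONIC, &start);
>>     size_t offset = 0;
>>
>>     for (size_t i = 0; i < n; i++) {
>>         for (;;) {  /* wait until the recorded arrival time for this packet/page */
>>             clock_gettime(CLOCK_MONOTONIC, &now);
>>             int64_t elapsed_us = (int64_t)(now.tv_sec - start.tv_sec) * 1000000
>>                                + (now.tv_nsec - start.tv_nsec) / 1000;
>>             if (elapsed_us >= (int64_t)trace[i].t_us)
>>                 break;
>>             struct timespec nap = { 0, 200000 };  /* sleep 0.2 ms and re-check */
>>             nanosleep(&nap, NULL);
>>         }
>>         /* Hand over exactly as much data as the real device had delivered. */
>>         feed_decoder(file_in_memory + offset, trace[i].nbytes);
>>         offset += trace[i].nbytes;
>>     }
>> }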
>>
>> I know Pierre-Anthony has an image benchmark. Perhaps we can work on this?
>>
>> Granted, lots of code does this the easier way--loading the entire image
>> in one step, then decoding the entire image in one step. So it is a
>> common-enough scenario to care about. But the solution to that isn't "Which
>> format is faster?" The solution is fixing the bad implementation.
>>
>> FWIW, Chrome decodes images as packets arrive.
>> But it does so with the granularity of a row of pixels, which is common
>> in many image decoding libraries.
>> So wide images might re-decode the same pixels several times (if there
>> wasn't enough data to finish the row).
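>>
>> (For reference, this feed-bytes-as-they-arrive, row-at-a-time style is what
>> libpng's progressive reader exposes. A minimal sketch, with placeholder
>> callback bodies and the usual setjmp error handling omitted:)
>>
>> #include <png.h>
>>
>> static void on_info(png_structp png, png_infop info)
>> {
>>     png_set_expand(png);               /* whatever transforms the app wants */
>>     png_read_update_info(png, info);
>> }
>>
>> static void on_row(png_structp png, png_bytep new_row,
>>                    png_uint_32 row_num, int pass)
>> {
>>     /* libpng calls this for each row it can complete with the data seen so
>>        far; the app would copy new_row into its framebuffer here. */
>>     (void)png; (void)new_row; (void)row_num; (void)pass;
>> }
>>
>> static void on_end(png_structp png, png_infop info)
>> {
>>     (void)png; (void)info;             /* whole image decoded */
>> }
>>
>> /* Setup once:
>>      png_structp png  = png_create_read_struct(PNG_LIBPNG_VER_STRING, 0, 0, 0);
>>      png_infop   info = png_create_info_struct(png);
>>      png_set_progressive_read_fn(png, app_state, on_info, on_row, on_end);
>>    Then, for every packet/page as it arrives:
>>      png_process_data(png, info, buffer, bytes_received);                    */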
>>
>> ***NOTE: All of this would undermine the value of parallel decoding.***
>>
>> Except for animations in Chrome, which waits until the 2nd+ frame is
>> completely downloaded and then does a big decode. That is intentional: This
>> generally happens on page load, and we want to focus the CPU on the load.
>> (Although, a better implementation might be to prioritize tasks and put
>> 2nd+ frame decode tasks as a lower priority. I don't have data on when a
>> low-priority task might slip through and block an important load task.)
>>
>>
>>
>>> Apple may care to comment.
>>
>> ...
>>
>>> which are often 4K square or 8K square and for light domes which are
>>> often 16Kx8K and will certainly be larger in the future.
>>
>>
>> Apple made an iDOT chunk for parallel decoding. So I certainly would
>> appreciate their input. :)
>>
>> The way I see it:
>>
>>    - If decoding happens faster than data is loaded (and loading happens
>>    serially) then the image size shouldn't matter.
>>    - If decoding happens faster BUT there can be parallel streams*, this
>>    is a candidate for parallel decoding.
>>    - If decoding is not faster than data load**, candidate for parallel
>>    decoding.
>>    - If we accept that too many apps do it the bad way (one load call,
>>    one decode call)***, MAYBE this is a candidate for parallel decoding.
>>
>> * I'm not sure if this is a thing that exists. Maybe NVMe drives can
>> service multiple requests across their multiple PCIe lanes and the OS
>> doesn't already do this? But if that's the case, it feels like the OS
>> should be fixed. So I don't think this is a great candidate.
>>
>> ** This is highly situational. Is the program using an optimized decoder
>> library? Is the device CPU highly constrained? We'll want to gather
>> situational data on decode speeds to try to plot out the Pareto front.
>>
>> *** If apps aren't going to change to the new way, are they going to
>> change and use the parallel APIs? Surely, the library shouldn't create its
>> own threads. But heck, maybe if the situation is that dire. Or if the image
>> decode is an OS call and the OS already created worker threads.
>>
>>
>>
>>> the real problem is that the technique involves a "full"
>>> flush and that destroys the state of the compressor;
>>
>> ...
>>
>>> Mark has this to say about using a
>>> "full" flush in https://www.zlib.net/manual.html:
>>>
>>> "Using Z_FULL_FLUSH too often can seriously degrade compression."
>>
>>
>> Right. But suppose an image is a solid background and then suddenly a
>> high-contrast-pixel-to-pixel logo.
>> The existing compressor state wouldn't have been helpful anyway when it
>> got to the logo. So might as well throw it out.
>>
>> And if you follow that example, you can extend it: any row whose filter
>> uses UP is a strong indication that it reuses the compressor's state. If it
>> *doesn't* use UP there is still a chance, but a more awkward/forced one.
>> Like the repeated pattern can be found up & left 3 pixels. Or 2 rows up
>> (and hopefully that is still in the window).
>>
>> Part of our data collection would likely be how strong the correlation is
>> between the UP filter and reuse of compressor state. If we can show a strong
>> correlation, we probably pay very little penalty with the Z_FULL_FLUSH.
>>
>>
>>
>>> But this does not apply to **large** images to the same extent.
>>
>> ...
>>> It is most unlikely that the flush would make a significant effect on
>>> compression...
>>
>>
>> Agreed. :)
>>
>>
>>
>>> So in that limiting test using an image which is quite noisy the
>>> filter restriction cost me about 1% in compression.
>>
>>
>> That is WAYY better than I expected.
>> In my head, I was imagining people weighing 10% file size for 4x decode
>> speed.
>> Granted, it is a large image. But not *that* large today. And we aren't
>> resetting the compressor yet.
>> This is great news!
>> Especially given the UP & compressor reuse I expect.
>>
>>
>>
>>> ...formalized in a public chunk by Apple...
>>
>>
>> I think iDOT is not formally specified.
>> I think it has been mostly (but not 100%) reverse engineered.
>>
>> Unless Apple provides a formal specification, I think we would have to
>> reinvent a differently named chunk using that iDOT-like behavior.
>>
>>
>>
>>> To me that is an a priori demonstration
>>> of utility.  The basic requirement is very very simple and trivial to
>>> implement (at least with zlib)
>>
>>
>> Agreed.
>> It also carves a path for Area of Interest (AoI) / Region of Interest
>> (RoI), which I planned on proposing after the core idea of deflate reset &
>> marker was accepted.
>> Even if we decode faster than data loads and deem parallel to not be
>> fruitful, those concepts *are* still useful for AoI/RoI.
>>
>>
>>> I can see no reason **not** to do this.
>>
>>
>> I'm also eager for it. :)
>> Especially since adding it to the Fourth Edition docket is just a commitment
>> to work on it, not a guarantee that it goes in.
>>
>> But it is reasonable to want clear data on when it is and isn't
>> beneficial. And that data will certainly help us shape the proposal itself.
>> So I get the idea of not adding it to the docket until that data is ready.
>>
>> On Thu, May 8, 2025 at 7:10 PM John Bowler <
>> john.cunningham.bowler@gmail.com> wrote:
>>
>>> On Thu, May 8, 2025 at 11:47 AM Chris Blume (ProgramMax)
>>> <programmax@gmail.com> wrote:
>>> > And for data, how much speed can we gain for how much file size
>>> > sacrifice? Is the speed gain because most of the time is spent in the
>>> > inflate step? If we optimize the inflate, does that reduce the reward?
>>>
>>> Most of the **CPU** time required to decode a PNG goes inside zlib
>>> decompression.  This is based on many timing experiments using libpng
>>> and another library written at the end of 1996.  Both used Mark
>>> Adler's zlib; zlib-ng is meant to be faster.  Within both libpng and,
>>> IIRC, the other library the only significant CPU consumer is the Paeth
>>> filter.  As a result the other library and, by default, libpng 1.7,
>>> disfavour its use.
>>>
>>> These timing figures are, however, very misleading.  In practice it is
>>> normally faster to compress data **on disk** and then spend the CPU
>>> time decompressing it than it is to read uncompressed data.  This is
>>> **real** time of course; it does not appear in "user" time or, indeed,
>>> "system" time because those don't include time where the process is
>>> idle; waiting for stuff to be delivered off disk.  It may be that some
>>> modern file systems can reverse this conclusion but when I first
>>> tested this on a commercial system 30 years ago the results were very
>>> very skewed towards compressed data even when the time to compress it
>>> was taken into account!  For PNG we are talking about just the
>>> decompression side which is a lot faster than PNG (LZ) compression.
>>> (LZW is more balanced.)
>>>
>>> > I think the hesitation is we wanted to have data to show when it is
>>> > good/bad, and by how much.
>>>
>>> Apple may care to comment.  When I was working on commercial software
>>> I chose to compress images in-memory _to save RAM_.  PNG was the
>>> memory and file format for everything except JPEG and, later, GIF.  In
>>> this kind of architecture the full uncompressed image is never
>>> required.  I know that at least one commercial 3D rendering program
>>> can store texture maps compressed when rendering.  I believe this
>>> might actually happen in the NVidia Iray render engine with the
>>> compressed textures on the graphics card.  In all these cases there
>>> **might** be an advantage in parallel decode for large (think print
>>> resolution) images, for current texture maps which are often 4K square
>>> or 8K square and for light domes which are often 16Kx8K and will
>>> certainly be larger in the future.
>>>
>>> > For example, very wide images will have more data per row. That leaves
>>> > fewer places for the new filter & restart marker.
>>>
>>> Conceivably someone might make a panorama that is not very high or
>>> have an existing set of images containing multiple frames which are
>>> organised horizontally but so what?  If this technique is applied to
>>> small images the real problem is that the technique involves a "full"
>>> flush and that destroys the state of the compressor; each "segment" of
>>> the image is encoded as a completely independent compressed stream.
>>> The overhead of the 4 byte marker is also not the issue for small
>>> images; it's the requirement, at least in mARK, for a new IDAT chunk;
>>> that alone has a 12 byte overhead.  Mark has this to say about using a
>>> "full" flush in https://www.zlib.net/manual.html:
>>>
>>> "Using Z_FULL_FLUSH too often can seriously degrade compression."
>>>
>>> But this does not apply to **large** images to the same extent.  To
>>> use my example of a 4K texture there are 4096 lines and therefore
>>> there can be 4096 segments.  A typical texture is 24-bit RGB so each
>>> line has 12289 bytes and the deflate algorithm uses a sliding window
>>> of at most 32768 bytes.  Consequently dividing a 4K texture into, say,
>>> 64 segments (offering the use of 64 cores at once) gives compressed blocks
>>> corresponding to over 24 window-widths each.  It is most unlikely that the
>>> flush would have a significant effect on compression and an extra 16
>>> bytes per segment certainly won't.
>>>
>>> This is, in fact, measurable even without an implementation; it should
>>> be a relatively simple hack to any encoder that uses zlib or a similar
>>> implementation.  If the encoder is a library (libpng, libspng etc)
>>> then the result will change the output of all programs that use the
>>> library.  Even commercial software could be tested so long as the
>>> implementation uses a system DLL.
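>>>
>>> The hack I have in mind is roughly this (a sketch only; it assumes the
>>> filtered scanlines are already in memory, that `out` is big enough for the
>>> whole image, and it skips the avail_out refill loop a real encoder needs):
>>>
>>> #include <string.h>
>>> #include <zlib.h>
>>>
>>> /* Deflate num_rows filtered scanlines of row_bytes bytes each, issuing
>>>    Z_FULL_FLUSH every rows_per_segment rows so each segment starts with a
>>>    clean compressor state.  Comparing the return value against a run with
>>>    rows_per_segment > num_rows measures the cost of the technique. */
>>> size_t deflate_segmented(unsigned char **rows, size_t row_bytes, int num_rows,
>>>                          int rows_per_segment, unsigned char *out, size_t out_cap)
>>> {
>>>     z_stream s;
>>>     memset(&s, 0, sizeof s);
>>>     if (deflateInit(&s, Z_DEFAULT_COMPRESSION) != Z_OK)
>>>         return 0;
>>>
>>>     s.next_out  = out;
>>>     s.avail_out = (uInt)out_cap;
>>>
>>>     for (int r = 0; r < num_rows; r++) {
>>>         int last  = (r == num_rows - 1);
>>>         int flush = last ? Z_FINISH
>>>                   : ((r + 1) % rows_per_segment == 0 ? Z_FULL_FLUSH : Z_NO_FLUSH);
>>>
>>>         s.next_in  = rows[r];
>>>         s.avail_in = (uInt)row_bytes;
>>>         if (deflate(&s, flush) == Z_STREAM_ERROR) {
>>>             deflateEnd(&s);
>>>             return 0;
>>>         }
>>>         /* s.total_out at each Z_FULL_FLUSH is the offset a mARK/iDOT-style
>>>            chunk would record for random access into that segment. */
>>>     }
>>>
>>>     size_t total = s.total_out;
>>>     deflateEnd(&s);
>>>     return total;
>>> }
>>>
>>> The difference between the two totals, over a corpus of images and a range
>>> of segment counts, would give the kind of numbers requested above.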
>>>
>>> > And it might mean forcing none/sub filters has a bigger impact on file
>>> > size. OR maybe the bigger rows means the filter was more likely to be none
>>> > anyway.
>>>
>>> That can be tested in the same way but it's also possible to examine
>>> the benefit of each filter on real world PNG files by using a PNG
>>> optimiser which allows the filters to be specified (probably any of
>>> them!).  This works because the filter choice only affects the specific
>>> row, so it is valid to try optimising first with all filters allowed,
>>> then with just NONE+SUB, and finally compare the two sizes; this is the
>>> limiting case, though of course it is not necessarily the worst case
>>> given that encoders may make just plain bad choices of filters!
>>>
>>> I tried oxipng on a 4K (2304x4096) rendered image in 32-bit RGBA format:
>>>
>>> `oxipng -P --interlace keep --filters <filters> -nx --zopfli <image>`
>>>
>>> Using <filters> of either '0,1,2,3,4,9' (the '9' argument minimizes the
>>> size) or '0,1,9', I got these results:
>>>
>>> Raw image bytes: 37'748'736 bytes (100%)
>>> Original: 16'632'959 bytes (44.1%)
>>> All filters: 16'223'712 bytes (43.0%)
>>> NONE+SUB: (44.0%)
>>>
>>> So in that limiting test using an image which is quite noisy the
>>> filter restriction cost me about 1% in compression.
>>>
>>> At this point my own feeling is that the technique has already been
>>> adopted and, apparently, formalized in a public chunk by Apple and,
>>> possibly, libspng (it is not clear to me if Randy has published his
>>> implementation in libspng).  To me that is an a priori demonstration
>>> of utility.  The basic requirement is very very simple and trivial to
>>> implement (at least with zlib); it is the location of the 4-byte
>>> marker (or, as in mARK, the point immediately after a restart marker,
>>> though that isn't completely clear) followed by a deflate block which
>>> starts with a clear state (no reliance on prior data).
>>>
>>> iDOT uses a file offset and therefore permits random access at the
>>> file level to "segments" (using Randy's terminology).  From this I
>>> deduce that random access is an Apple requirement but, of course,
>>> Apple should comment.  mARK accommodates this but also allows an IDAT
>>> chunk offset which is more robust but requires searching forward
>>> through the IDAT chunks (they do not have to be read beyond the 4-byte
>>> length).  So at this point reconciliation is required or, of course,
>>> adopting both chunks verbatim (if Apple provide the words...)
>>>
>>> I can see no reason **not** to do this.  Apple have already
>>> commandeered the iDOT chunk, so no loss in formalising it because it
>>> is already entirely ignorable.  It's not clear if mARK is present in
>>> publicly readable files at this point; if it were I would say the same
>>> thing; if not, it seems to be a superset of the iDOT capability with a
>>> more robust though maybe slightly slower solution.  Maybe Apple could be
>>> persuaded to adopt mARK?
>>>
>>> John Bowler
>>>
>>

Received on Friday, 9 May 2025 10:34:33 UTC