Re: [PNG] Cancel upcoming meeting?

@Pierre-Anthony, I've tagged you in a section below--highlighted in
purple--which we might want to change regarding the image benchmarks.




> These timing figures are, however, very misleading.  In practice it is
> normally faster to compress data **on disk** and then spend the CPU
> time decompressing it than it is to read uncompressed data.


Agreed :)
I have a little rant about this on Twitter.

I tried explaining to the QOI creator that their stated goal of fast decode
is reasonable, but I disagree with their method and results. They load the
entire file and then time the decodes. Almost never do you have the entire
file in memory already. The only real cases I can think of are ROMs or an
image stored inside the executable. (Or maybe on modern OSes, which use
spare memory to cache hot files--if you load the same image over and over,
then maybe? But that is a far-fetched situation.)

If you instead decode as the packets/pages arrive, the format that
compresses better will use fewer packets/pages and thus finish decoding
sooner. Unless decoding is *so slow* that the storage/network delivers data
faster than the CPU can consume it. But there are several orders of
magnitude in the CPU's favor here.
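
To make that concrete, here is a minimal sketch of the decode-as-data-arrives
pattern, using a bare zlib stream as a stand-in for a real PNG decoder; the
chunk source, row size, and callback are invented for illustration:

```
import zlib

ROW_BYTES = 1 + 4096 * 3  # filter byte + one 4096-pixel RGB row (illustrative)

def decode_streaming(chunks, handle_row):
    """Decode while packets/pages arrive instead of load-then-decode.

    `chunks` is any iterable of byte strings as they come off the
    device/network; `handle_row` is called once per completed row.
    """
    inflater = zlib.decompressobj()
    pending = b""
    for chunk in chunks:                  # CPU work overlaps the I/O waits
        pending += inflater.decompress(chunk)
        while len(pending) >= ROW_BYTES:  # emit each finished row immediately
            handle_row(pending[:ROW_BYTES])
            pending = pending[ROW_BYTES:]
    leftover = pending + inflater.flush()
    if leftover:
        handle_row(leftover)
```

Under this pattern the wall-clock time is roughly max(transfer time, decode
time) rather than their sum, which is why the format that compresses better
tends to finish first.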

I think a better way to test would be to first write a program that records
the storage device's timing of data arrival and can play it back. That way
you can reproduce the real device's behavior, removing inter-run file
caching by the device/OS. The test then reads from memory but decodes at
that device's real pace. This measures the format itself, rather than a bad
implementation or usage pattern.
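
A rough sketch of that record/replay idea (everything here is hypothetical
scaffolding, not an existing tool): record when each chunk actually arrived
from the device, then replay the same chunks from memory at the same pace so
every run sees identical, cache-free timing.

```
import time

def record_arrivals(read_chunk):
    """Run once against the real device: capture each chunk and the
    elapsed time at which it arrived. `read_chunk` is whatever callable
    performs the real I/O (file read, socket recv, ...)."""
    chunks, arrivals = [], []
    start = time.perf_counter()
    while chunk := read_chunk():
        chunks.append(chunk)
        arrivals.append(time.perf_counter() - start)
    return chunks, arrivals

def replay(chunks, arrivals):
    """Yield the recorded chunks from memory, delayed to match the
    recorded arrival times, so the decoder under test runs at the
    device's real pace without inter-run file caching."""
    start = time.perf_counter()
    for chunk, due in zip(chunks, arrivals):
        delay = due - (time.perf_counter() - start)
        if delay > 0:
            time.sleep(delay)
        yield chunk
```

The benchmark then feeds replay(chunks, arrivals) to whichever decoder is
being measured, so different formats are compared against the same simulated
device.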

I know Pierre-Anthony has an image benchmark. Perhaps we can work on this?

Granted, lots of code does this the easier way--loading the entire image in
one step, then decoding the entire image in one step. So it is a
common-enough scenario to care about. But the solution to that isn't "Which
format is faster?" The solution is fixing the bad implementation.

FWIW, Chrome decodes images as packets arrive.
But it does so with the granularity of a row of pixels, which is common in
many image decoding libraries.
So wide images might re-decode the same pixels several times (if there
wasn't enough data to finish the row).

***NOTE: All of this would undermine the value of parallel decoding.***

Except for animations in Chrome: it waits until a 2nd+ frame is completely
downloaded and then does one big decode. That is intentional: this
generally happens during page load, and we want to focus the CPU on the
load. (Although a better implementation might be to make 2nd+ frame decodes
lower-priority tasks. I don't have data on when a low-priority task might
slip through and block an important load task.)



> Apple may care to comment.

...

> ...for current texture maps which are often 4K square
> or 8K square and for light domes which are often 16Kx8K and will
> certainly be larger in the future.


Apple made an iDOT chunk for parallel decoding. So I certainly would
appreciate their input. :)

The way I see it:

   - If decoding happens faster than data is loaded (and loading happens
   serially) then the image size shouldn't matter.
   - If decoding happens faster BUT there can be parallel streams*, this is
   a candidate for parallel decoding.
   - If decoding is not faster than data load**, candidate for parallel
   decoding.
   - If we accept that too many apps do it the bad way (one load call, one
   decode call)***, MAYBE this is a candidate for parallel decoding.

* I'm not sure if this is a thing that exists. Maybe NVMe drives can
service multiple requests across their multiple PCIe lanes and the OS
doesn't already do this? But if that's the case, it feels like the OS
should be fixed. So I don't think this is a great candidate.

** This is highly situational. Is the program using an optimized decoder
library? Is the device's CPU highly constrained? We'll want to gather
real-world data on decode speeds to try to plot out the Pareto front (a
rough way to frame that comparison is sketched after these notes).

*** If apps aren't going to change to the new way, are they going to change
and use the parallel APIs? Surely, the library shouldn't create its own
threads. But heck, maybe if the situation is that dire. Or if the image
decode is an OS call and the OS already created worker threads.
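
To make the "is decoding faster than loading" comparison concrete, here is a
hedged back-of-envelope helper. The model (a streaming decoder, so serial
wall time is roughly max(I/O, decode)) and the example numbers are
assumptions, not measurements:

```
def wall_time_estimates(file_bytes, raw_bytes, io_bytes_per_s,
                        decode_bytes_per_s, cores):
    """Return (serial, parallel) wall-time estimates in seconds.

    Assumes a streaming decoder, so the serial estimate is
    max(transfer time, decode time); parallel decode can only
    shrink the decode term.
    """
    io_time = file_bytes / io_bytes_per_s
    decode_time = raw_bytes / decode_bytes_per_s
    return max(io_time, decode_time), max(io_time, decode_time / cores)

# Purely illustrative numbers: a 16Kx8K RGB light dome, ~180 MB file,
# 500 MB/s storage, 300 MB/s single-core inflate, 8 cores.
serial, parallel = wall_time_estimates(
    file_bytes=180e6, raw_bytes=16384 * 8192 * 3,
    io_bytes_per_s=500e6, decode_bytes_per_s=300e6, cores=8)
print(f"serial ~{serial:.2f}s, parallel ~{parallel:.2f}s")
```

Plugging in measured numbers from the benchmark is what would actually place
a given device/decoder combination on one side of that comparison or the
other.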



> ...the real problem is that the technique involves a "full"
> flush and that destroys the state of the compressor;

...

> Mark has this to say about using a
> "full" flush in https://www.zlib.net/manual.html:
>
> "Using Z_FULL_FLUSH too often can seriously degrade compression."


Right. But suppose an image is a solid background and then suddenly a
high-contrast-pixel-to-pixel logo.
The existing compressor state wouldn't have been helpful anyway when it got
to the logo. So might as well throw it out.

And if you follow that example, you can extend it: any row whose filter
uses UP is a strong indication that it reuses the compressor's state. If it
*doesn't* use UP there is still a chance, but a more awkward/forced one--
say, the repeated pattern is found up & left 3 pixels, or 2 rows up (and
hopefully that is still within the window).

Part of our data collection would likely be how strong the correlation is
between rows filtered with UP and reuse of compressor state. If we can show
a strong correlation, we probably pay very little penalty for the
Z_FULL_FLUSH.
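
For reference, here is a minimal zlib sketch of what the reset buys us,
mirroring the solid-background-then-logo example above. It uses a raw
deflate stream (wbits=-15) for simplicity; PNG's IDAT data is the same thing
inside a zlib wrapper:

```
import zlib

flat_row = bytes(12288)                      # solid background: 4096 RGB pixels of one color
busy_row = bytes(range(256)) * 48            # "logo" rows with high pixel-to-pixel contrast

comp = zlib.compressobj(level=9, wbits=-15)  # raw deflate, as carried inside PNG's zlib stream
seg1 = comp.compress(flat_row * 32) + comp.flush(zlib.Z_FULL_FLUSH)  # full flush resets the window
seg2 = comp.compress(busy_row * 32) + comp.flush(zlib.Z_FINISH)

# The stream still inflates normally end to end...
whole = zlib.decompressobj(wbits=-15)
assert whole.decompress(seg1 + seg2) == flat_row * 32 + busy_row * 32

# ...but because the full flush discarded the compressor's state, the
# second segment also inflates from a cold start -- exactly what a
# parallel or random-access decoder needs.
alone = zlib.decompressobj(wbits=-15)
assert alone.decompress(seg2) == busy_row * 32
```

The only cost is that back-references cannot cross the reset point, which is
the Z_FULL_FLUSH penalty being weighed here.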



> But this does not apply to **large** images to the same extent.

...
> It is most unlikely that the flush would make a significant effect on
> compression...


Agreed. :)



> So in that limiting test using an image which is quite noisy the
> filter restriction cost me about 1% in compression.


That is WAY better than I expected.
In my head, I was imagining people weighing a 10% file-size cost against a
4x decode-speed gain.
Granted, it is a large image. But not *that* large today. And we aren't even
resetting the compressor yet.
This is great news!
Especially given the UP-filter/compressor-reuse correlation I expect.



> ...formalized in a public chunk by Apple...


I think iDOT is not formally specified.
I think it has been mostly (but not 100%) reverse engineered.

Unless Apple provides a formal specification, I think we would have to
define a new, differently named chunk with that iDOT-like behavior.



> To me that is an a priori demonstration
> of utility.  The basic requirement is very very simple and trivial to
> implement (at least with zlib)


Agreed.
It also carves a path for Area of Interest (AoI) / Region of Interest
(RoI), which I planned to propose after the core idea of a deflate reset &
marker was accepted.
Even if we decode faster than data loads and deem parallel decoding not
fruitful, those concepts *are* still useful for AoI/RoI.
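
As a sketch of that AoI/RoI angle (the marker table layout below is
invented, not mARK's or iDOT's actual format): given one entry per reset
point mapping a starting row to its byte offset in the raw deflate data, a
decoder can jump straight to the segment covering the requested rows.

```
import bisect
import zlib

def decode_row_range(deflate_data, markers, first_row, last_row, row_bytes):
    """Decode only the rows [first_row, last_row] of the image.

    `markers` is a sorted list of (start_row, byte_offset) pairs, one per
    independently inflatable segment (i.e. one per full-flush reset).
    Hypothetical layout, and PNG row filtering is ignored for brevity --
    just enough to show the idea.
    """
    start_rows = [row for row, _ in markers]
    seg = bisect.bisect_right(start_rows, first_row) - 1  # segment containing first_row
    seg_row, seg_off = markers[seg]

    skip = (first_row - seg_row) * row_bytes              # rows before the region
    want = (last_row - first_row + 1) * row_bytes
    inflater = zlib.decompressobj(wbits=-15)              # raw deflate segments
    out = inflater.decompress(deflate_data[seg_off:], skip + want)
    return out[skip:]
```

A crop or tile viewer only touches the bytes from that marker onward, which
is the same property parallel decode relies on.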


> I can see no reason **not** to do this.


I'm also eager for it. :)
Especially since adding it to the Fourth Edition docket only means we will
work on it, not a guarantee that it goes in.

But it is reasonable to want clear data on when it is and isn't beneficial.
And that data will certainly help us shape the proposal itself.
So I get the idea of not adding it to the docket until that data is ready.

On Thu, May 8, 2025 at 7:10 PM John Bowler <john.cunningham.bowler@gmail.com>
wrote:

> On Thu, May 8, 2025 at 11:47 AM Chris Blume (ProgramMax)
> <programmax@gmail.com> wrote:
> > And for data, how much speed can we gain for how much file size
> sacrifice? Is the speed gain because most of the time is spent in the
> inflate step? If we optimize the inflate, does that reduce the reward?
>
> Most of the **CPU** time required to decode a PNG goes inside zlib
> decompression.  This is based on many timing experiments using libpng
> and another library written at the end of 1996.  Both used Mark
> Adler's zlib; zlib-ng is meant to be faster.  Within both libpng and,
> IIRC, the other library, the only significant CPU consumer is the Paeth
> filter.  As a result the other library and, by default, libpng 1.7,
> disfavour its use.
>
> These timing figures are, however, very misleading.  In practice it is
> normally faster to compress data **on disk** and then spend the CPU
> time decompressing it than it is to read uncompressed data.  This is
> **real** time of course; it does not appear in "user" time or, indeed,
> "system" time because those don't include time where the process is
> idle; waiting for stuff to be delivered off disk.  It may be that some
> modern file systems can reverse this conclusion but when I first
> tested this on a commercial system 30 years ago the results were very
> very skewed towards compressed data even when the time to compress it
> was taken into account!  For PNG we are talking about just the
> decompression side which is a lot faster than PNG (LZ) compression.
> (LZW is more balanced.)
>
> > I think the hesitation is we wanted to have data to show when it is
> good/bad, and by how much.
>
> Apple may care to comment.  When I was working on commercial software
> I chose to compress images in-memory _to save RAM_.  PNG was the
> memory and file format for everything except JPEG and, later, GIF.  In
> this kind of architecture the full uncompressed image is never
> required.  I know that at least one commercial 3D rendering program
> can store texture maps compressed when rendering.  I believe this
> might actually happen in the NVidia Iray render engine with the
> compressed textures on the graphics card.  In all these cases there
> **might** be an advantage in parallel decode for large (think print
> resolution) images, for current texture maps which are often 4K square
> or 8K square and for light domes which are often 16Kx8K and will
> certainly be larger in the future.
>
> > For example, very wide images will have more data per row. That leaves
> fewer places for the new filter & restart marker.
>
> Conceivably someone might make a panorama that is not very high or
> have an existing set of images containing multiple frames which are
> organised horizontally but so what?  If this technique is applied to
> small images the real problem is that the technique involves a "full"
> flush and that destroys the state of the compressor; each "segment" of
> the image is encoded as a completely independent compressed stream.
> The overhead of the 4 byte marker is also not the issue for small
> images, it's the requirement at least in mARK for a new IDAT chunk;
> that alone has a 12 byte overhead.  Mark has this to say about using a
> "full" flush in https://www.zlib.net/manual.html:
>
> "Using Z_FULL_FLUSH too often can seriously degrade compression."
>
> But this does not apply to **large** images to the same extent.  To
> use my example of a 4K texture there are 4096 lines and therefore
> there can be 4096 segments.  A typical texture is 24-bit RGB so each
> line has 12289 bytes and the deflate algorithm uses a sliding window
> of at most 32768 bytes.  Consequently dividing a 4K texture into, say,
> 64 (offering the use of 64 cores at once) gives compressed blocks
> corresponding to over 24 window-widths.  It is most unlikely that the
> flush would make a significant effect on compression and an extra 16
> bytes per segment certainly won't.
>
> This is, in fact, measurable even without an implementation; it should
> be a relatively simple hack to any encoder that uses zlib or a similar
> implementation.  If the encoder is a library (libpng, libspng etc)
> then the result will change the output of all programs that use the
> library.  Even commercial software could be tested so long as the
> implementation uses a system DLL.
>
> >And it might mean forcing none/sub filters has a bigger impact on file
> size. OR maybe the bigger rows means the filter was more likely to be none
> anyway.
>
> That can be tested in the same way but it's also possible to examine
> the benefit of each filter on real world PNG files by using a PNG
> optimiser which allows the filters to be specified (probably any of
> them!)  This works because the filter choice only affects the specific
> row, so it is valid to try optimising first with all filters allowed
> then with just NONE+SUB and finally check the two sizes; this is the
> limiting case though, of course, it is not necessarily the worst case
> given that encoders may make just plain bad choices of filters!
>
> I tried oxipng on a 4K (2304x4096) rendered image in 32-bit RGBA format:
>
> `oxipng -P --interlace keep --filters <filters> -nx --zopfli <image>`
>
> Using <filters> of either 0,1,2,3,4,9 (the '9' argument minimizes the
> size) or '0,1,9' and got these results:
>
> Raw image bytes: 37'748'736 bytes (100%)
> Original: 16632959 bytes (44.1%)
> All filters: 16223712 bytes (43.0%)
> NONE+SUB: (44.0%)
>
> So in that limiting test using an image which is quite noisy the
> filter restriction cost me about 1% in compression.
>
> At this point my own feeling is that the technique has already been
> adopted and, apparently, formalized in a public chunk by Apple and,
> possibly, libspng (it is not clear to me if Randy has published his
> implementation in libspng).  To me that is an a priori demonstration
> of utility.  The basic requirement is very very simple and trivial to
> implement (at least with zlib); it is the location of the 4-byte
> marker (or, as in mARK, the point immediately after a restart marker,
> though that isn't completely clear) followed by a deflate block which
> starts with a clear state (no reliance on prior data).
>
> iDOT uses a file offset and therefore permits random access at the
> file level to "segments" (using Randy's terminology).  From this I
> deduce that random access is an Apple requirement but, of course,
> Apple should comment.  mARK accommodates this but also allows an IDAT
> chunk offset which is more robust but requires searching forward
> through the IDAT chunks (they do not have to be read beyond the 4-byte
> length).  So at this point reconciliation is required or, of course,
> adopting both chunks verbatim (if Apple provide the words...)
>
> I can see no reason **not** to do this.  Apple have already
> commandeered the iDOT chunk, so no loss in formalising it because it
> is already entirely ignorable.   It's not clear if mARK is present in
> publicly readable files at this point; if it were I would say the same
> thing, if not it seems to superset the iDOT capability with a more
> robust though maybe slightly slower solution.  Maybe Apple could be
> persuaded to adopt mARK?
>
> John Bowler
>

Received on Friday, 9 May 2025 08:19:54 UTC