Re: [PNG] Cancel upcoming meeting?

On Thu, May 8, 2025 at 11:47 AM Chris Blume (ProgramMax)
<programmax@gmail.com> wrote:
> And for data, how much speed can we gain for how much file size sacrifice? Is the speed gain because most of the time is spent in the inflate step? If we optimize the inflate, does that reduce the reward?

Most of the **CPU** time required to decode a PNG is spent inside zlib
decompression.  This is based on many timing experiments using libpng
and another library written at the end of 1996.  Both used Mark
Adler's zlib; zlib-ng is meant to be faster.  Within both libpng and,
IIRC, the other library the only other significant CPU consumer is the
Paeth filter.  As a result the other library and, by default, libpng
1.7, disfavour its use.
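
For reference, here is a rough sketch of the Paeth predictor as
defined in the PNG specification (illustrative C, not libpng's actual
code); the data-dependent branches, executed once per filtered byte,
are what make it comparatively expensive:

    #include <stdlib.h>

    /* Paeth predictor from the PNG specification (illustrative only).
     * a = byte to the left, b = byte above, c = byte above-left. */
    static unsigned char paeth_predictor(unsigned char a, unsigned char b,
                                         unsigned char c)
    {
        int p  = (int)a + (int)b - (int)c;   /* initial estimate */
        int pa = abs(p - (int)a);            /* distance to each neighbour */
        int pb = abs(p - (int)b);
        int pc = abs(p - (int)c);

        /* Return the neighbour nearest to the estimate; these branches
         * run for every byte of every Paeth-filtered row. */
        if (pa <= pb && pa <= pc) return a;
        if (pb <= pc) return b;
        return c;
    }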

These timing figures are, however, very misleading.  In practice it is
normally faster to compress data **on disk** and then spend the CPU
time decompressing it than it is to read uncompressed data.  This is
**real** time of course; it does not appear in "user" time or, indeed,
"system" time, because those don't include time where the process is
idle, waiting for data to be delivered off disk.  It may be that some
modern file systems can reverse this conclusion, but when I first
tested this on a commercial system 30 years ago the results were very
heavily skewed towards compressed data even when the time to compress
it was taken into account!  For PNG we are talking about just the
decompression side, which is a lot faster than PNG (LZ) compression.
(LZW is more balanced.)
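
To make the real/user distinction concrete, here is a small timing
sketch (my own, nothing to do with libpng; do_work is a placeholder
for "read and decompress an image"); wall-clock time includes the disk
waits, process CPU time does not:

    #include <stdio.h>
    #include <time.h>

    static void do_work(void)
    {
        /* Placeholder: a real test would read and inflate a PNG here. */
    }

    static double seconds(clockid_t id)
    {
        struct timespec ts;
        clock_gettime(id, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        double real0 = seconds(CLOCK_MONOTONIC);          /* wall clock */
        double cpu0  = seconds(CLOCK_PROCESS_CPUTIME_ID); /* CPU only   */

        do_work();

        printf("real: %.3fs  cpu: %.3fs\n",
               seconds(CLOCK_MONOTONIC) - real0,
               seconds(CLOCK_PROCESS_CPUTIME_ID) - cpu0);
        return 0;
    }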

> I think the hesitation is we wanted to have data to show when it is good/bad, and by how much.

Apple may care to comment.  When I was working on commercial software
I chose to compress images in-memory _to save RAM_; PNG was the
memory and file format for everything except JPEG and, later, GIF.  In
this kind of architecture the full uncompressed image is never
required.  I know that at least one commercial 3D rendering program
can store texture maps compressed while rendering.  I believe this
might actually happen in the NVidia Iray render engine, with the
compressed textures held on the graphics card.  In all these cases there
**might** be an advantage in parallel decode for large (think print
resolution) images, for current texture maps, which are often 4K or 8K
square, and for light domes, which are often 16Kx8K and will certainly
be larger in the future.

> For example, very wide images will have more data per row. That leaves fewer places for the new filter & restart marker.

Conceivably someone might make a panorama that is not very high, or
have an existing set of images containing multiple frames organised
horizontally, but so what?  If this technique is applied to small
images the real problem is that it involves a "full" flush, and that
destroys the state of the compressor; each "segment" of the image is
encoded as a completely independent compressed stream.  The overhead
of the 4-byte marker is also not the issue for small images; it's the
requirement, at least in mARK, for a new IDAT chunk, and that alone
has a 12-byte overhead.  Mark has this to say about using a "full"
flush in https://www.zlib.net/manual.html:

"Using Z_FULL_FLUSH too often can seriously degrade compression."

But this does not apply to **large** images to the same extent.  To
use my example of a 4K texture, there are 4096 lines and therefore
there can be up to 4096 segments.  A typical texture is 24-bit RGB, so
each line has 12289 bytes (including the filter byte) and the deflate
algorithm uses a sliding window of at most 32768 bytes.  Consequently
dividing a 4K texture into, say, 64 segments (offering the use of 64
cores at once) gives 64 lines, or 64 x 12289 = 786,496 uncompressed
bytes, per segment: over 24 window-widths.  It is most unlikely that
the flush would have a significant effect on compression, and an extra
16 bytes per segment certainly won't.

This is, in fact, measurable even without a full implementation; it
should be a relatively simple hack to any encoder that uses zlib or a
similar library.  If the encoder is a library (libpng, libspng, etc.)
then the change will affect the output of all programs that use the
library.  Even commercial software could be tested, so long as the
implementation uses a system DLL.
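
To show the kind of hack I mean, here is a rough sketch (my own,
untested, not taken from any encoder) of compressing filtered rows
with zlib and issuing Z_FULL_FLUSH every rows_per_segment rows; the
offsets reported after each flush are exactly the restart points a
marker chunk would have to record:

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    /* Compress 'height' filtered rows of 'row_bytes' each, with a
     * Z_FULL_FLUSH every rows_per_segment rows so each segment can later
     * be inflated without reference to earlier data.  Assumes 'out' is
     * large enough for the whole stream; error handling is minimal. */
    static size_t compress_segmented(const unsigned char *rows,
                                     size_t row_bytes, unsigned height,
                                     unsigned rows_per_segment,
                                     unsigned char *out, size_t out_size)
    {
        z_stream z;
        memset(&z, 0, sizeof z);
        if (deflateInit(&z, Z_DEFAULT_COMPRESSION) != Z_OK)
            return 0;

        z.next_out  = out;
        z.avail_out = (uInt)out_size;

        for (unsigned y = 0; y < height; ++y) {
            int flush = (y + 1 == height) ? Z_FINISH
                      : ((y + 1) % rows_per_segment == 0) ? Z_FULL_FLUSH
                      : Z_NO_FLUSH;

            z.next_in  = (Bytef *)(rows + (size_t)y * row_bytes);
            z.avail_in = (uInt)row_bytes;

            if (deflate(&z, flush) == Z_STREAM_ERROR) {
                deflateEnd(&z);
                return 0;
            }

            if (flush == Z_FULL_FLUSH)
                /* z.total_out is the byte offset at which the next,
                 * independently decodable, segment starts. */
                printf("segment boundary after row %u at offset %lu\n",
                       y + 1, (unsigned long)z.total_out);
        }

        deflateEnd(&z);
        return (size_t)z.total_out;
    }

Comparing the output size from this loop against the same loop with
the flush removed gives the size cost directly.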

>And it might mean forcing none/sub filters has a bigger impact on file size. OR maybe the bigger rows means the filter was more likely to be none anyway.

That can be tested in the same way, but it's also possible to examine
the benefit of each filter on real-world PNG files by using a PNG
optimiser which allows the filters to be specified (probably any of
them does!).  This works because the filter choice only affects the
specific row, so it is valid to optimise first with all filters
allowed, then with just NONE+SUB, and finally compare the two sizes.
This is the limiting case, though of course it is not necessarily the
worst case, given that encoders may make just plain bad choices of
filters!

I tried oxipng on a 4K (2304x4096) rendered image in 32-bit RGBA format:

`oxipng -P --interlace keep --filters <filters> -nx --zopfli <image>`

Using <filters> of either '0,1,2,3,4,9' (the '9' argument minimizes
the size) or '0,1,9', I got these results:

Raw image: 37'748'736 bytes (100%)
Original: 16'632'959 bytes (44.1%)
All filters: 16'223'712 bytes (43.0%)
NONE+SUB: (44.0%)

So in that limiting test, using an image which is quite noisy, the
filter restriction cost me about one percentage point of compression.

At this point my own feeling is that the technique has already been
adopted and, apparently, formalized in a public chunk by Apple and,
possibly, libspng (it is not clear to me if Randy has published his
implementation in libspng).  To me that is an a priori demonstration
of utility.  The basic requirement is very simple and trivial to
implement (at least with zlib): it is the location of the 4-byte
marker (or, as in mARK, the point immediately after a restart marker,
though that isn't completely clear) followed by a deflate block which
starts with a clear state (no reliance on prior data).
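
For what it's worth, decoding one such segment on its own is equally
simple.  A sketch (mine; it assumes the recorded offset points at the
byte-aligned deflate data immediately after a full-flush point, so raw
inflate with windowBits of -15 applies; only the very start of the
whole stream carries the 2-byte zlib header):

    #include <string.h>
    #include <zlib.h>

    /* Inflate one independent segment of a segmented zlib stream.
     * 'seg' points just after a Z_FULL_FLUSH boundary, so the data is
     * byte-aligned raw deflate with no dependence on earlier history.
     * Returns 0 on success.  Illustrative only. */
    static int inflate_segment(const unsigned char *seg, size_t seg_len,
                               unsigned char *dst, size_t dst_len)
    {
        z_stream z;
        memset(&z, 0, sizeof z);
        if (inflateInit2(&z, -15) != Z_OK)  /* -15: raw deflate, 32K window */
            return -1;

        z.next_in   = (Bytef *)seg;
        z.avail_in  = (uInt)seg_len;
        z.next_out  = dst;
        z.avail_out = (uInt)dst_len;

        /* A non-final segment has no end-of-stream marker, so inflate
         * will not return Z_STREAM_END; stop when the expected output
         * has been produced. */
        int ret = inflate(&z, Z_NO_FLUSH);
        int ok  = (ret == Z_STREAM_END) ||
                  ((ret == Z_OK || ret == Z_BUF_ERROR) && z.avail_out == 0);

        inflateEnd(&z);
        return ok ? 0 : -1;
    }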

iDOT uses a file offset and therefore permits random access at the
file level to "segments" (using Randy's terminology).  From this I
deduce that random access is an Apple requirement but, of course,
Apple should comment.  mARK accommodates this but also allows an IDAT
chunk offset, which is more robust but requires searching forward
through the IDAT chunks (they do not have to be read beyond the 4-byte
length field).  So at this point reconciliation is required or, of
course, adopting both chunks verbatim (if Apple provide the words...)
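
The forward search is cheap: a reader only needs the 8-byte
length/type header of each chunk and can seek past the data and CRC.
A sketch (mine, not from either proposal):

    #include <stdio.h>
    #include <string.h>

    /* Print the file offset of every IDAT chunk in a PNG.  Only the
     * 4-byte length and 4-byte type are read per chunk; the data and
     * 4-byte CRC are skipped with fseek.  Illustrative only. */
    static int list_idat_offsets(const char *path)
    {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;

        if (fseek(f, 8, SEEK_SET) != 0) {   /* skip the PNG signature */
            fclose(f);
            return -1;
        }

        unsigned char hdr[8];
        while (fread(hdr, 1, 8, f) == 8) {
            unsigned long len = ((unsigned long)hdr[0] << 24) |
                                ((unsigned long)hdr[1] << 16) |
                                ((unsigned long)hdr[2] << 8)  |
                                 (unsigned long)hdr[3];

            if (memcmp(hdr + 4, "IDAT", 4) == 0)
                printf("IDAT at offset %ld, %lu data bytes\n",
                       ftell(f) - 8, len);

            if (memcmp(hdr + 4, "IEND", 4) == 0)
                break;

            if (fseek(f, (long)len + 4, SEEK_CUR) != 0)  /* data + CRC */
                break;
        }

        fclose(f);
        return 0;
    }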

I can see no reason **not** to do this.  Apple have already
commandeered the iDOT chunk, so there is no loss in formalising it
because it is already entirely ignorable.  It's not clear if mARK is
present in publicly readable files at this point; if it were, I would
say the same thing; if not, it seems to be a superset of the iDOT
capability with a more robust, though maybe slightly slower, solution.
Maybe Apple could be persuaded to adopt mARK?

John Bowler

Received on Thursday, 8 May 2025 23:10:42 UTC