- From: John Bowler <john.cunningham.bowler@gmail.com>
- Date: Thu, 8 May 2025 16:10:26 -0700
- To: "Chris Blume (ProgramMax)" <programmax@gmail.com>
- Cc: Randy <randy408@protonmail.com>, public-png@w3.org
On Thu, May 8, 2025 at 11:47 AM Chris Blume (ProgramMax) <programmax@gmail.com> wrote:

> And for data, how much speed can we gain for how much file size
> sacrifice? Is the speed gain because most of the time is spent in the
> inflate step? If we optimize the inflate, does that reduce the reward?

Most of the **CPU** time required to decode a PNG goes inside zlib decompression. This is based on many timing experiments using libpng and another library written at the end of 1996. Both used Mark Adler's zlib; zlib-ng is meant to be faster. Within both libpng and, IIRC, the other library, the only other significant CPU consumer is the Paeth filter. As a result the other library and, by default, libpng 1.7 disfavour its use.

These timing figures are, however, very misleading. In practice it is normally faster to compress data **on disk** and then spend the CPU time decompressing it than it is to read uncompressed data. This is **real** time of course; it does not appear in "user" time or, indeed, "system" time, because those don't include time where the process is idle, waiting for stuff to be delivered off disk. It may be that some modern file systems can reverse this conclusion, but when I first tested this on a commercial system 30 years ago the results were very, very skewed towards compressed data, even when the time to compress it was taken into account!

For PNG we are talking about just the decompression side, which for LZ-based compression like deflate is a lot faster than the compression side. (LZW is more balanced.)

> I think the hesitation is we wanted to have data to show when it is
> good/bad, and by how much.

Apple may care to comment. When I was working on commercial software I chose to compress images in-memory _to save RAM_; PNG was the memory and file format for everything except JPEG and, later, GIF. In this kind of architecture the full uncompressed image is never required. I know that at least one commercial 3D rendering program can store texture maps compressed when rendering. I believe this might actually happen in the NVidia Iray render engine, with the compressed textures on the graphics card.

In all these cases there **might** be an advantage in parallel decode for large (think print resolution) images, for current texture maps which are often 4K square or 8K square, and for light domes which are often 16Kx8K and will certainly be larger in the future.

> For example, very wide images will have more data per row. That leaves
> fewer places for the new filter & restart marker.

Conceivably someone might make a panorama that is not very high, or have an existing set of images containing multiple frames which are organised horizontally, but so what? If this technique is applied to small images the real problem is that it involves a "full" flush, and that destroys the state of the compressor: each "segment" of the image is encoded as a completely independent compressed stream. The overhead of the 4-byte marker is also not the issue for small images; it's the requirement, at least in mARK, for a new IDAT chunk, and that alone has a 12-byte overhead.

Mark has this to say about using a "full" flush in https://www.zlib.net/manual.html:

  "Using Z_FULL_FLUSH too often can seriously degrade compression."

But this does not apply to **large** images to the same extent. To use my example of a 4K texture, there are 4096 lines and therefore there can be at most 4096 segments. A typical texture is 24-bit RGB, so each line has 12289 bytes (including the filter byte) and the deflate algorithm uses a sliding window of at most 32768 bytes. Consequently, dividing a 4K texture into, say, 64 segments (offering the use of 64 cores at once) gives compressed blocks corresponding to over 24 window-widths. It is most unlikely that the flush would have a significant effect on compression, and an extra 16 bytes per segment certainly won't.
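As a concrete illustration (not part of either proposal), here is a minimal sketch of the encoder side with stock zlib. The function name, the rows_per_segment parameter and the single pre-sized output buffer are illustrative assumptions, and the filtering step, IDAT chunking and the recording of offsets for an iDOT/mARK-style index are elided.

/* Minimal sketch: compress pre-filtered scanlines (filter byte + pixels)
 * in independently decodable segments.  Assumes "out" is large enough to
 * hold the whole compressed stream; a real encoder loops on avail_out. */
#include <zlib.h>
#include <string.h>

static int compress_segmented(unsigned char *filtered,
                              size_t row_bytes, size_t height,
                              size_t rows_per_segment,
                              unsigned char *out, size_t out_size,
                              size_t *out_used)
{
    z_stream z;
    memset(&z, 0, sizeof z);
    if (deflateInit(&z, Z_DEFAULT_COMPRESSION) != Z_OK)
        return -1;

    z.next_out = out;
    z.avail_out = (uInt)out_size;

    for (size_t row = 0; row < height; row += rows_per_segment) {
        size_t rows = height - row < rows_per_segment
                    ? height - row : rows_per_segment;

        z.next_in = filtered + row * row_bytes;
        z.avail_in = (uInt)(rows * row_bytes);

        /* Z_FULL_FLUSH pads to a byte boundary, emits the 00 00 FF FF
         * marker bytes and resets the compression state, so the next
         * segment depends on nothing before it; Z_FINISH ends the
         * stream (and writes the zlib Adler-32 trailer). */
        int flush = (row + rows >= height) ? Z_FINISH : Z_FULL_FLUSH;
        int ret = deflate(&z, flush);
        if (ret != Z_OK && ret != Z_STREAM_END) {
            deflateEnd(&z);
            return -1;
        }
        /* A segment index would record z.total_out here. */
    }

    *out_used = z.total_out;
    deflateEnd(&z);
    return 0;
}

The only departure from a conventional single-stream encoder is passing Z_FULL_FLUSH instead of Z_NO_FLUSH at each segment boundary; that is what makes each segment decodable without the preceding data.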
The effect on compression is, in fact, measurable even without an implementation of either chunk; it should be a relatively simple hack to any encoder that uses zlib or a similar implementation. If the encoder is a library (libpng, libspng, etc.) then the result will change the output of all programs that use the library. Even commercial software could be tested so long as the implementation uses a system DLL.

> And it might mean forcing none/sub filters has a bigger impact on file
> size. OR maybe the bigger rows means the filter was more likely to be
> none anyway.

That can be tested in the same way, but it's also possible to examine the benefit of each filter on real-world PNG files by using a PNG optimiser which allows the filters to be specified (probably any of them does!). This works because the filter choice only affects the specific row, so it is valid to try optimising first with all filters allowed, then with just NONE+SUB, and finally compare the two sizes. This is the limiting case, though; of course it is not necessarily the worst case, given that encoders may make just plain bad choices of filters!

I tried oxipng on a 4K (2304x4096) rendered image in 32-bit RGBA format:

  `oxipng -P --interlace keep --filters <filters> -nx --zopfli <image>`

using <filters> of either '0,1,2,3,4,9' (the '9' argument minimizes the size) or '0,1,9', and got these results:

  Raw image bytes:  37'748'736 bytes (100%)
  Original:         16'632'959 bytes (44.1%)
  All filters:      16'223'712 bytes (43.0%)
  NONE+SUB:                            (44.0%)

So in that limiting test, using an image which is quite noisy, the filter restriction cost me about 1% in compression.

At this point my own feeling is that the technique has already been adopted and, apparently, formalized in a public chunk by Apple and, possibly, libspng (it is not clear to me if Randy has published his implementation in libspng). To me that is an a priori demonstration of utility.

The basic requirement is very, very simple and trivial to implement (at least with zlib); it is the location of the 4-byte marker (or, as in mARK, the point immediately after a restart marker, though that isn't completely clear) followed by a deflate block which starts with a clear state (no reliance on prior data).

iDOT uses a file offset and therefore permits random access at the file level to "segments" (using Randy's terminology). From this I deduce that random access is an Apple requirement but, of course, Apple should comment. mARK accommodates this but also allows an IDAT chunk offset, which is more robust but requires searching forward through the IDAT chunks (they do not have to be read beyond the 4-byte length).

So at this point reconciliation is required or, of course, adopting both chunks verbatim (if Apple provide the words...). I can see no reason **not** to do this. Apple have already commandeered the iDOT chunk, so there is no loss in formalising it because it is already entirely ignorable. It's not clear if mARK is present in publicly readable files at this point; if it were I would say the same thing; if not, it seems to superset the iDOT capability with a more robust though maybe slightly slower solution. Maybe Apple could be persuaded to adopt mARK?

John Bowler
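For the decode side, here is a minimal sketch of inflating one segment independently, assuming the segment's compressed bytes and its decompressed size have already been located through an iDOT- or mARK-style index (again, the function name and parameters are illustrative, not part of either proposal).

/* Minimal sketch: decode one independently compressed segment.
 * "seg" points just past the restart marker, so what follows is a
 * headerless ("raw") deflate stream starting from a clear state; the
 * 2-byte zlib header only precedes the first segment and the Adler-32
 * trailer covers the whole IDAT stream, so neither is checked here. */
#include <zlib.h>
#include <string.h>

static int inflate_segment(unsigned char *seg, size_t seg_len,
                           unsigned char *rows, size_t rows_len)
{
    z_stream z;
    memset(&z, 0, sizeof z);

    /* Negative windowBits selects raw deflate (no zlib wrapper). */
    if (inflateInit2(&z, -15) != Z_OK)
        return -1;

    z.next_in = seg;
    z.avail_in = (uInt)seg_len;
    z.next_out = rows;
    z.avail_out = (uInt)rows_len;

    /* Decode until this segment's filtered rows have been produced;
     * only the final segment of the image ends with Z_STREAM_END. */
    int ret;
    do {
        ret = inflate(&z, Z_NO_FLUSH);
    } while (ret == Z_OK && z.avail_out != 0 && z.avail_in != 0);

    inflateEnd(&z);
    return (z.avail_out == 0 || ret == Z_STREAM_END) ? 0 : -1;
}

Each such call is self-contained, so segments can be handed to separate threads, or a single segment can be decoded on its own for random access.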
Received on Thursday, 8 May 2025 23:10:42 UTC