- From: Chris Blume (ProgramMax) <programmax@gmail.com>
- Date: Fri, 9 May 2025 06:34:17 -0400
- To: Leo Barnes <barnes.leo@gmail.com>
- Cc: jbowler@acm.org, Pierre-Anthony Lemieux <pal@sandflow.com>, Randy <randy408@protonmail.com>, public-png@w3.org
- Message-ID: <CAG3W2KfmQWV2uMvDDsA_ki6_0mJj7PUmaXN2tPN=Zsd1Bz_q0w@mail.gmail.com>
That's a good point. Since storage access will be slower, it's useful to keep the compressed image prefetched & ready in memory. And if you had infinite memory you might pre-decode them. But since that isn't reality, there is some limit on how much you want to pre-decode. At which point this parallel decoding becomes very useful again :D

On Fri, May 9, 2025 at 5:01 AM Leo Barnes <barnes.leo@gmail.com> wrote:

> If you instead decode as the packets/pages are loaded, the format which compresses better will use fewer packets/pages and thus decode faster. Unless decoding is *so slow* that storage/network is faster. But there are several orders of magnitude in the CPU's favor here.
>
> This makes a bunch of assumptions that may be valid for web usage. For the web, the transfer speed may indeed be the limiting factor.
>
> But for generic OS frameworks I would say it's pretty common to have image files already fully loaded in memory. At which point the decoding speed is the limiting factor and multithreading ends up being very important.
>
> As an example, take the generic grid view of images. Say you have 10000 images in the grid, but only N are actually visible on the screen at any given time. The fully decoded images are much larger to keep in memory than the compressed image files. If you want to optimize for scroll speed *and* memory you would do something like this:
>
> - Keep 5 * N image files loaded into memory
> - Decode the visible N image files and show those on screen
> - (optional) Preemptively decode the next image to be displayed if the user scrolls in either direction
>
> Cheers,
> //Leo
>
> On Fri, 9 May 2025 at 10:20, Chris Blume (ProgramMax) <programmax@gmail.com> wrote:
>
>> @Pierre-Anthony, I've tagged you in a section below--highlighted in purple--which we might want to change about the image benchmarks.
>>
>>> These timing figures are, however, very misleading. In practice it is normally faster to compress data **on disk** and then spend the CPU time decompressing it than it is to read uncompressed data.
>>
>> Agreed :)
>> I have a little rant about this on Twitter.
>>
>> I tried explaining to the QOI creator that their stated goal of fast decode is reasonable but I disagreed with their method and results. They load the entire file and then time the decodes. Almost never do you have the entire file in memory already. The only real cases I can think of are ROMs or if the image is stored inside the executable. (Or maybe with OSes today, which use spare memory to preload hot files. If you are loading the same image over and over frequently, then maybe? But that is a far-fetched situation.)
>>
>> If you instead decode as the packets/pages are loaded, the format which compresses better will use fewer packets/pages and thus decode faster. Unless decoding is *so slow* that storage/network is faster. But there are several orders of magnitude in the CPU's favor here.
>>
>> I think a better way to test would be to first write a program that records and plays back the storage device's timing of data arrival. That way you can simulate the real device's real behavior, removing inter-run file caching by the device/OS. And then the test can read from memory, decoding at that device's real pace. This gives a better measure of the format, rather than measuring a bad implementation / usage.
>>
>> I know Pierre-Anthony has an image benchmark. Perhaps we can work on this?
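>> Something like this minimal sketch, say, using libpng's progressive reader and a hypothetical trace of (bytes available, delay) records captured from a real device. The names (`trace_rec`, `replay_decode`) are illustrative, not an existing harness:

```c
/* Sketch: replay a recorded storage-arrival trace, decoding as the data
 * "arrives". Only the device's timing is simulated; the PNG bytes are
 * preloaded so the decoder never touches the real disk during the run. */
#include <png.h>
#include <setjmp.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

typedef struct {
    size_t   bytes_ready; /* total bytes available at this point in the trace */
    uint64_t delay_ns;    /* delay since the previous arrival */
} trace_rec;

static void info_cb(png_structp p, png_infop i) { png_start_read_image(p); }
static void row_cb(png_structp p, png_bytep row, png_uint_32 y, int pass) { /* consume row */ }
static void end_cb(png_structp p, png_infop i) { puts("decode finished"); }

void replay_decode(unsigned char *file, size_t file_len,
                   const trace_rec *trace, size_t n_recs)
{
    png_structp png = png_create_read_struct(PNG_LIBPNG_VER_STRING, NULL, NULL, NULL);
    png_infop info = png_create_info_struct(png);
    if (setjmp(png_jmpbuf(png))) {        /* libpng reports errors via longjmp */
        png_destroy_read_struct(&png, &info, NULL);
        return;
    }
    png_set_progressive_read_fn(png, NULL, info_cb, row_cb, end_cb);

    size_t fed = 0;
    for (size_t r = 0; r < n_recs && fed < file_len; r++) {
        /* Wait as long as the real device made us wait... */
        struct timespec ts = { (time_t)(trace[r].delay_ns / 1000000000u),
                               (long)(trace[r].delay_ns % 1000000000u) };
        nanosleep(&ts, NULL);
        /* ...then decode exactly the bytes that had arrived by then. */
        size_t upto = trace[r].bytes_ready < file_len ? trace[r].bytes_ready : file_len;
        png_process_data(png, info, file + fed, upto - fed);
        fed = upto;
    }
    png_destroy_read_struct(&png, &info, NULL);
}
```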
>> Granted, lots of code does this the easier way--loading the entire image in one step, then decoding the entire image in one step. So it is a common-enough scenario to care about. But the solution to that isn't "Which format is faster?" The solution is fixing the bad implementation.
>>
>> FWIW, Chrome decodes images as packets arrive.
>> But it does so with the granularity of a row of pixels, which is common in many image decoding libraries.
>> So wide images might re-decode the same pixels several times (if there wasn't enough data to finish the row).
>>
>> ***NOTE: All of this would undermine the value of parallel decoding.***
>>
>> Except for animations in Chrome, which waits until the 2nd+ frame is completely downloaded and then does a big decode. That is intentional: this generally happens on page load, and we want to focus the CPU on the load. (Although a better implementation might be to prioritize tasks and put 2nd+ frame decode tasks at a lower priority. I don't have data on how often a low-priority task might slip through and block an important load task.)
>>
>>> Apple may care to comment.
>>> ...
>>> which are often 4K square or 8K square and for light domes which are often 16Kx8K and will certainly be larger in the future.
>>
>> Apple made an iDOT chunk for parallel decoding. So I certainly would appreciate their input. :)
>>
>> The way I see it:
>>
>> - If decoding happens faster than data is loaded (and loading happens serially), then the image size shouldn't matter.
>> - If decoding happens faster BUT there can be parallel streams*, this is a candidate for parallel decoding.
>> - If decoding is not faster than data load**, candidate for parallel decoding.
>> - If we accept that too many apps do it the bad way (one load call, one decode call)***, MAYBE this is a candidate for parallel decoding.
>>
>> * I'm not sure if this is a thing that exists. Maybe NVMe drives can service multiple requests across their multiple PCIe lanes and the OS doesn't already do this? But if that's the case, it feels like the OS should be fixed. So I don't think this is a great candidate.
>>
>> ** This is highly situational. Is the program using an optimized decoder library? Is the device CPU highly constrained? We'll want to gather situational data on decode speeds to try to plot out the Pareto front.
>>
>> *** If apps aren't going to change to the new way, are they going to change and use the parallel APIs? Surely the library shouldn't create its own threads. But heck, maybe if the situation is that dire. Or if the image decode is an OS call and the OS already created worker threads.
>>
>>> the real problem is that the technique involves a "full" flush and that destroys the state of the compressor;
>>> ...
>>> Mark has this to say about using a "full" flush in https://www.zlib.net/manual.html:
>>>
>>> "Using Z_FULL_FLUSH too often can seriously degrade compression."
>>
>> Right. But suppose an image is a solid background and then suddenly a high-contrast-pixel-to-pixel logo.
>> The existing compressor state wouldn't have been helpful anyway when it got to the logo. So might as well throw it out.
>>
>> And if you follow that example, you can extend it: any row which uses the UP filter is strongly indicating that it reuses the compressor's state. If it *doesn't* use UP there is still a chance, but a more awkward/forced one. Like the repeated pattern can be found up & left 3 pixels. Or 2 rows up (and hopefully that is still in the window).
>>
>> Part of our data collection would likely be how strong the correlation is between the UP filter and reusing compressor state. If we can show a strong correlation, we probably pay very little penalty with the Z_FULL_FLUSH.
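>> (For concreteness, the encoder-side change really is tiny. A minimal sketch with zlib--illustrative names, not libpng API, and assuming the output buffer is large enough that deflate never stalls:)

```c
/* Sketch: compress filtered scanlines so that each group of rows becomes an
 * independently decodable segment, by issuing Z_FULL_FLUSH at segment
 * boundaries. Z_FULL_FLUSH byte-aligns the output AND forgets the sliding
 * window, so the next segment makes no back-references into this one. */
#include <string.h>
#include <zlib.h>

int compress_segments(const unsigned char *rows, size_t n_rows, size_t row_bytes,
                      size_t rows_per_seg,
                      unsigned char *out, size_t out_cap,
                      size_t *seg_offsets, size_t *n_segs)
{
    z_stream zs;
    memset(&zs, 0, sizeof zs);
    if (deflateInit(&zs, Z_BEST_COMPRESSION) != Z_OK)
        return -1;

    zs.next_out  = out;
    zs.avail_out = (uInt)out_cap;   /* sketch: one big output buffer; real code
                                       must loop when avail_out runs dry */
    *n_segs = 0;
    seg_offsets[(*n_segs)++] = 0;   /* segment 0 starts at the stream start */

    for (size_t r = 0; r < n_rows; r++) {
        int last    = (r + 1 == n_rows);
        int seg_end = ((r + 1) % rows_per_seg == 0);
        zs.next_in  = (Bytef *)(rows + r * row_bytes);
        zs.avail_in = (uInt)row_bytes;
        int flush   = last ? Z_FINISH : (seg_end ? Z_FULL_FLUSH : Z_NO_FLUSH);
        if (deflate(&zs, flush) == Z_STREAM_ERROR)
            return -1;
        if (seg_end && !last)       /* the next segment begins right here */
            seg_offsets[(*n_segs)++] = (size_t)zs.total_out;
    }
    deflateEnd(&zs);
    return 0;
}
```

>> The recorded offsets are exactly what an iDOT/mARK-style index chunk would carry.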
>>> But this does not apply to **large** images to the same extent.
>>> ...
>>> It is most unlikely that the flush would make a significant effect on compression...
>>
>> Agreed. :)
>>
>>> So in that limiting test using an image which is quite noisy the filter restriction cost me about 1% in compression.
>>
>> That is WAY better than I expected.
>> In my head, I was imagining people weighing 10% file size for 4x decode speed.
>> Granted, it is a large image. But not *that* large today. And we aren't resetting the compressor yet.
>> This is great news!
>> Especially given the UP & compressor reuse I expect.
>>
>>> ...formalized in a public chunk by Apple...
>>
>> I think iDOT is not formally specified.
>> I think it has been mostly (but not 100%) reverse engineered.
>>
>> Unless Apple provides a formal specification, I think we would have to reinvent a differently named chunk using that iDOT-like behavior.
>>
>>> To me that is an a priori demonstration of utility. The basic requirement is very very simple and trivial to implement (at least with zlib)
>>
>> Agreed.
>> It also carves a path for Area of Interest (AoI) / Region of Interest (RoI), which I planned on proposing after the core idea of deflate reset & marker was accepted.
>> Even if we decode faster than data loads and deem parallel decoding to not be fruitful, those concepts *are* still useful for AoI/RoI.
>>
>>> I can see no reason **not** to do this.
>>
>> I'm also eager for it. :)
>> Especially since adding it to the Fourth Edition docket just means we work on it; it doesn't guarantee it goes in.
>>
>> But it is reasonable to want clear data on when it is and isn't beneficial. And that data will certainly help us shape the proposal itself. So I get the idea of not adding it to the docket until that data is ready.
>>
>> On Thu, May 8, 2025 at 7:10 PM John Bowler <john.cunningham.bowler@gmail.com> wrote:
>>
>>> On Thu, May 8, 2025 at 11:47 AM Chris Blume (ProgramMax) <programmax@gmail.com> wrote:
>>> > And for data, how much speed can we gain for how much file size sacrifice? Is the speed gain because most of the time is spent in the inflate step? If we optimize the inflate, does that reduce the reward?
>>>
>>> Most of the **CPU** time required to decode a PNG goes inside zlib decompression. This is based on many timing experiments using libpng and another library written at the end of 1996. Both used Mark Adler's zlib; zlib-ng is meant to be faster. Within both libpng and, IIRC, the other library, the only significant CPU consumer is the Paeth filter. As a result the other library and, by default, libpng 1.7, disfavour its use.
>>>
>>> These timing figures are, however, very misleading. In practice it is normally faster to compress data **on disk** and then spend the CPU time decompressing it than it is to read uncompressed data. This is **real** time of course; it does not appear in "user" time or, indeed, "system" time, because those don't include time where the process is idle, waiting for stuff to be delivered off disk. It may be that some modern file systems can reverse this conclusion, but when I first tested this on a commercial system 30 years ago the results were very, very skewed towards compressed data even when the time to compress it was taken into account! For PNG we are talking about just the decompression side, which is a lot faster than PNG (LZ) compression. (LZW is more balanced.)
>>>
>>> > I think the hesitation is we wanted to have data to show when it is good/bad, and by how much.
>>>
>>> Apple may care to comment. When I was working on commercial software I chose to compress images in-memory _to save RAM_. PNG was the memory and file format for everything except JPEG and, later, GIF. In this kind of architecture the full uncompressed image is never required. I know that at least one commercial 3D rendering program can store texture maps compressed when rendering. I believe this might actually happen in the NVidia Iray render engine, with the compressed textures on the graphics card. In all these cases there **might** be an advantage in parallel decode for large (think print resolution) images, for current texture maps which are often 4K square or 8K square, and for light domes which are often 16Kx8K and will certainly be larger in the future.
>>>
>>> > For example, very wide images will have more data per row. That leaves fewer places for the new filter & restart marker.
>>>
>>> Conceivably someone might make a panorama that is not very high, or have an existing set of images containing multiple frames which are organised horizontally, but so what? If this technique is applied to small images the real problem is that the technique involves a "full" flush and that destroys the state of the compressor; each "segment" of the image is encoded as a completely independent compressed stream. The overhead of the 4-byte marker is also not the issue for small images; it's the requirement, at least in mARK, for a new IDAT chunk. That alone has a 12-byte overhead. Mark has this to say about using a "full" flush in https://www.zlib.net/manual.html:
>>>
>>> "Using Z_FULL_FLUSH too often can seriously degrade compression."
>>>
>>> But this does not apply to **large** images to the same extent. To use my example of a 4K texture there are 4096 lines and therefore there can be 4096 segments. A typical texture is 24-bit RGB so each line has 12289 bytes and the deflate algorithm uses a sliding window of at most 32768 bytes. Consequently dividing a 4K texture into, say, 64 segments (offering the use of 64 cores at once) gives compressed blocks corresponding to over 24 window-widths. It is most unlikely that the flush would make a significant effect on compression and an extra 16 bytes per segment certainly won't.
>>>
>>> This is, in fact, measurable even without an implementation; it should be a relatively simple hack to any encoder that uses zlib or a similar implementation. If the encoder is a library (libpng, libspng etc) then the result will change the output of all programs that use the library. Even commercial software could be tested so long as the implementation uses a system DLL.
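>>> (For illustration, the decode side of one such independent segment needs nothing beyond a fresh raw-inflate state. A sketch, with illustrative names, where the segment offset and length are assumed to come from an iDOT/mARK-style index:)

```c
/* Sketch: inflate one segment of a full-flushed deflate stream on its own.
 * Because the encoder issued Z_FULL_FLUSH, the bytes at seg_offset start a
 * deflate block with no reliance on the prior window, so a fresh inflate
 * state can begin there -- which is what makes per-segment threads possible. */
#include <string.h>
#include <zlib.h>

int decode_segment(const unsigned char *zdata, size_t seg_offset, size_t seg_len,
                   unsigned char *dst, size_t dst_cap)
{
    z_stream zs;
    memset(&zs, 0, sizeof zs);
    /* windowBits = -15 selects raw deflate: the 2-byte zlib header belongs to
     * segment 0 only (which would use plain inflateInit to consume it);
     * later segments are bare deflate blocks. */
    if (inflateInit2(&zs, -15) != Z_OK)
        return -1;

    zs.next_in   = (Bytef *)(zdata + seg_offset);
    zs.avail_in  = (uInt)seg_len;
    zs.next_out  = dst;
    zs.avail_out = (uInt)dst_cap;

    int ret = inflate(&zs, Z_SYNC_FLUSH); /* non-final segments end without
                                             Z_STREAM_END; that's expected */
    inflateEnd(&zs);
    return (ret == Z_OK || ret == Z_STREAM_END) ? 0 : -1;
}
```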
>>> > And it might mean forcing none/sub filters has a bigger impact on file size. OR maybe the bigger rows means the filter was more likely to be none anyway.
>>>
>>> That can be tested in the same way, but it's also possible to examine the benefit of each filter on real-world PNG files by using a PNG optimiser which allows the filters to be specified (probably any of them!). This works because the filter choice only affects the specific row, so it is valid to try optimising first with all filters allowed, then with just NONE+SUB, and finally compare the two sizes. This is the limiting case though, of course, not necessarily the worst case, given that encoders may make just plain bad choices of filters!
>>>
>>> I tried oxipng on a 4K (2304x4096) rendered image in 32-bit RGBA format:
>>>
>>> `oxipng -P --interlace keep --filters <filters> -nx --zopfli <image>`
>>>
>>> Using <filters> of either 0,1,2,3,4,9 (the '9' argument minimizes the size) or 0,1,9, I got these results:
>>>
>>> Raw image bytes: 37'748'736 bytes (100%)
>>> Original:        16'632'959 bytes (44.1%)
>>> All filters:     16'223'712 bytes (43.0%)
>>> NONE+SUB:        (44.0%)
>>>
>>> So in that limiting test, using an image which is quite noisy, the filter restriction cost me about 1% in compression.
>>>
>>> At this point my own feeling is that the technique has already been adopted and, apparently, formalized in a public chunk by Apple and, possibly, libspng (it is not clear to me if Randy has published his implementation in libspng). To me that is an a priori demonstration of utility. The basic requirement is very, very simple and trivial to implement (at least with zlib); it is the location of the 4-byte marker (or, as in mARK, the point immediately after a restart marker, though that isn't completely clear) followed by a deflate block which starts with a clear state (no reliance on prior data).
>>>
>>> iDOT uses a file offset and therefore permits random access at the file level to "segments" (using Randy's terminology). From this I deduce that random access is an Apple requirement but, of course, Apple should comment. mARK accommodates this but also allows an IDAT chunk offset, which is more robust but requires searching forward through the IDAT chunks (they do not have to be read beyond the 4-byte length; see the sketch after this message). So at this point reconciliation is required or, of course, adopting both chunks verbatim (if Apple provide the words...).
>>>
>>> I can see no reason **not** to do this. Apple have already commandeered the iDOT chunk, so there is no loss in formalising it because it is already entirely ignorable. It's not clear if mARK is present in publicly readable files at this point; if it were, I would say the same thing. If not, it seems to be a superset of the iDOT capability with a more robust though maybe slightly slower solution. Maybe Apple could be persuaded to adopt mARK?
>>>
>>> John Bowler
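(For a sense of how cheap that forward search through the IDAT chunks is: each PNG chunk is a 4-byte big-endian length, a 4-byte type, the data, and a 4-byte CRC, so the walk only touches 8 header bytes per chunk. A sketch, with illustrative names; the chunk layout is per the PNG specification:)

```c
/* Sketch: find the file offset of the n-th IDAT chunk by walking the chunk
 * list. Only each chunk's 8-byte header needs to be examined; chunk data and
 * CRCs are skipped over. Assumes buf holds the whole file, signature included. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

long find_nth_idat(const unsigned char *buf, size_t len, unsigned n)
{
    size_t off = 8;                        /* skip the 8-byte PNG signature */
    while (off + 8 <= len) {
        uint32_t chunk_len = ((uint32_t)buf[off]     << 24) |
                             ((uint32_t)buf[off + 1] << 16) |
                             ((uint32_t)buf[off + 2] <<  8) |
                              (uint32_t)buf[off + 3];
        if (memcmp(buf + off + 4, "IDAT", 4) == 0 && n-- == 0)
            return (long)off;              /* offset of the chunk header */
        if (memcmp(buf + off + 4, "IEND", 4) == 0)
            break;                         /* end of the chunk stream */
        off += 12 + (size_t)chunk_len;     /* length + type + data + CRC */
    }
    return -1;                             /* not found */
}
```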
Received on Friday, 9 May 2025 10:34:33 UTC