- From: David Krauss <potswa@gmail.com>
- Date: Wed, 11 Jun 2014 07:53:11 +0800
- To: Roberto Peon <grmocg@gmail.com>
- Cc: Martin Thomson <martin.thomson@gmail.com>, Mike Bishop <Michael.Bishop@microsoft.com>, Johnny Graettinger <jgraettinger@chromium.org>, Martin Nilsson <nilsson@opera.com>, HTTP Working Group <ietf-http-wg@w3.org>
- Message-Id: <0D63DA55-3AE7-4A51-AE71-2B206CE82C39@gmail.com>
I did a little investigation about more efficient codings, but I wasn’t sure that anyone still cared about these incremental improvements. This program produces encouraging results: http://ideone.com/e77b6S . The input is a corpus with one line per entry. It’s in C++11, so running it outside ideone.com will require a moderately recent C++ compiler, passing it the -std=c++11 flag.

What you see at the output line “ideal size = …” is the projected size (in bytes) of the compressed corpus if each output symbol is given its own Huffman table, populated by the *conditional* probabilities for the next symbol. The succeeding lines give the incremental cost in bytes of combining all those tables, i.e. treating the next-symbol predictions of the given characters as the same. It follows a simple greedy algorithm. At the bottom, you get the projected size as if there were just one Huffman table, which should roughly match existing results. It doesn’t actually do the encoding but calculates an ideal information content, which corresponds more closely to arithmetic encoding.

The last few lines should show a large potential gain from having just 3-5 distinct static tables. Using ~1.5 MB of URLs from my browser history, the last few lines of output are:

    7485 x (+-./;=[_|~ += ABCDEFGHIJKLMNOPQRSTUVWXYZbj
    7999 x &?ir += aeou
    19332 x &?aeioru += )*:]cdfghklmnpqstvwxyz
    20017 x !"#$%',0123456789<>@\^`{} += (+-./;=ABCDEFGHIJKLMNOPQRSTUVWXYZ[_bj|~
    48991 x !"#$%'(+,-./0123456789;<=>@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\^_`bj{|}~ += &)*:?]acdefghiklmnopqrstuvwxyz
    single-table size = 989717 = 989718 vs 1425268

This means that for this corpus, there’s a savings of 88 KB or ~9% compound if separate tables are used for:

- Lowercase consonants
- Numbers
- Uppercase letters and most punctuation
- Lowercase vowels

I’m working on improving the algorithm; I think the scheme should do better if duplicates are allowed between tables. Also, the low-probability tails of the tables should be merged.

In principle, runs of Base64 text should be compressed at a constant 25% savings, because that’s the information content of Base64: each Base64 character carries log2 64 = 6 bits but occupies an 8-bit byte. It’s really not too much to ask. However, you guys have to be aware that Huffman codes are not magic; they just reflect the frequencies of the characters you put in. If you arbitrarily want Base64 to encode better, then just increase the frequencies of the letters it uses. But remember that this will decrease the relative frequency/probability of all the other symbols, and flatten the relative differences between the Base64 characters.

Let me know if there’s interest here. Otherwise I’ll just pursue this independently, because the idea works well enough. Huffman tables with conditional probability are still just Huffman tables; little generality is lost.

It would be very helpful if I could see the corpus you guys are working with. My URL history is simply too small and biased, and as you can see from the above results, there are some artifacts that may result from that.
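For readers who want to see the shape of this calculation, here is a minimal sketch under the same assumptions as the description above (a corpus read from stdin, one entry per line); it is not the program at the ideone link, and the structure and names are illustrative only. It tallies next-byte frequencies conditioned on the preceding byte, prints the ideal (entropy) size with one table per context, and then greedily merges the pair of context tables whose pooled table costs the fewest extra bits, printing each merge’s incremental cost in bytes.

```cpp
// Illustrative sketch only (not the program at http://ideone.com/e77b6S):
// read a corpus from stdin, one entry per line; count next-byte frequencies
// conditioned on the preceding byte; report the ideal (entropy) size with one
// table per context; then greedily merge the pair of context tables whose
// pooled table costs the fewest extra bits, printing each merge's cost.
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Ideal compressed size of one frequency table, in bits: sum of n * log2(N/n).
static double ideal_bits(const std::vector<std::uint64_t>& counts) {
    std::uint64_t total = 0;
    for (auto n : counts) total += n;
    double bits = 0;
    for (auto n : counts)
        if (n) bits += n * std::log2(double(total) / n);
    return bits;
}

int main() {
    // counts[c] = frequencies of the byte that follows byte c in the corpus.
    std::map<unsigned char, std::vector<std::uint64_t>> counts;
    std::string line;
    while (std::getline(std::cin, line))
        for (std::size_t i = 1; i < line.size(); ++i) {
            auto& tab = counts[(unsigned char)line[i - 1]];
            if (tab.empty()) tab.resize(256);
            ++tab[(unsigned char)line[i]];
        }

    // Working set: (contexts merged so far, combined frequency table).
    std::vector<std::pair<std::string, std::vector<std::uint64_t>>> tables;
    double total_bits = 0;
    for (auto& kv : counts) {
        tables.emplace_back(std::string(1, kv.first), kv.second);
        total_bits += ideal_bits(kv.second);
    }
    std::cout << "ideal size = " << std::uint64_t(total_bits / 8) << "\n";

    while (tables.size() > 1) {
        // Find the cheapest pair to merge (extra bits caused by pooling two tables).
        double best_cost = 0;
        std::size_t best_a = 0, best_b = 1;
        std::vector<std::uint64_t> best_merged;
        bool first = true;
        for (std::size_t a = 0; a < tables.size(); ++a)
            for (std::size_t b = a + 1; b < tables.size(); ++b) {
                std::vector<std::uint64_t> merged = tables[a].second;
                for (int c = 0; c < 256; ++c) merged[c] += tables[b].second[c];
                double cost = ideal_bits(merged) - ideal_bits(tables[a].second)
                                                 - ideal_bits(tables[b].second);
                if (first || cost < best_cost) {
                    first = false;
                    best_cost = cost; best_a = a; best_b = b;
                    best_merged = std::move(merged);
                }
            }
        // Print the cost in bytes, then the two context sets being combined.
        std::cout << std::uint64_t(best_cost / 8) << " x " << tables[best_a].first
                  << " += " << tables[best_b].first << "\n";
        tables[best_a].first += tables[best_b].first;
        tables[best_a].second = std::move(best_merged);
        tables.erase(tables.begin() + best_b);
        total_bits += best_cost;
    }
    std::cout << "single-table size = " << std::uint64_t(total_bits / 8) << "\n";
}
```

This sketch ignores entry-boundary context and does not allow duplicate symbols across tables, so it will not reproduce the numbers quoted above exactly; it is only meant to show the shape of the conditional-table and greedy-merge idea.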
On 2014-06-11, at 5:27 AM, Roberto Peon <grmocg@gmail.com> wrote:

> I'd prefer to use the table from the larger corpus if folks accepted it, and the base64 improvement you quote is a fairly big deal.
> I have nothing to assure us that my input data was anywhere near the same quality (which would be why I convinced Johnny to do this experiment! Thanks Johnny!!!), and every reason to doubt it.
> -=R
>
> On Tue, Jun 10, 2014 at 2:20 PM, Martin Thomson <martin.thomson@gmail.com> wrote:
> On 10 June 2014 13:40, Mike Bishop <Michael.Bishop@microsoft.com> wrote:
> > Given that the difference is slight (less than 3% better), we would support accepting this as confirmation that the current code is good and leave things as they are, unless broader deployment shows that the Huffman table is further off-base for common headers than this indicates.
>
> The straight base64 tests I've run show a compound performance increase of ~7.5%, which would seem to be fairly significant.
>
> I don't know about the size of the data set that Roberto used originally, nor do I know how good the distribution of both sets of test data were (I might theorize that these are too WEIRD [1], or maybe even too biased toward Google's workloads).
>
> I'm somewhat inclined toward a change here, but mainly due to the robustness of the data set, unless Roberto can produce evidence that his input data was comparable in quality. It took me less than five minutes to make a change in my code, so that's not the issue.
>
> [1] http://schott.blogs.nytimes.com/2010/07/14/weird/?_php=true&_type=blogs&_r=0
Received on Tuesday, 10 June 2014 23:53:48 UTC