Re: Optimizing Huffman inputs

On Sat, Apr 12, 2014 at 12:01:21PM -0700, Martin Thomson wrote:
> 
> With the differing encoded lengths the Huffman tables add, it's better
> to use certain characters in preference to others.  Combined with the
> restrictions on the strings we encode, this can make encoding binary
> blobs less than optimal.
> 
> Base64 (and the URI-safe variant) manage pretty well because those
> characters are favoured by HPACK.  Same goes for the cookie
> characters.
> 
> I wanted to see if it was possible to do better.
> 
> The results for 100,000 different 20 byte sequences, chosen at random,
> are interesting:
> 
> Average sizes: min/ave/max (size compared against Base64+Huffman)
> Raw Huffman coding (invalid):           25/48.20803/64 (203.13%)
> Base 64 (no Huffman):                   27/27/27 (113.77%)
> Base 64 with Huffman:                   20/23.73286/27 (100%)
> bhpack cookie with Huffman:             19/23.6052/30 (99.46%)
> bhpack URI safe with Huffman:           20/22.98342/28 (96.84%)
> bhpack URI query with Huffman:          20/23.01238/28 (96.96%) [3]
> 
> This shows that HPACK is pretty good at encoding cookies, but there
> are small gains to be had for URI components.  If character
> restrictions are wildly different to these common values, this could
> be even better.

The character set of cookies seems odd. Is '(' allowed without quoting?
And why is there '>' but not '<'?

> This hasn't been completely optimized, so there might be a few bits
> left to save.  I'm using Huffman encoding as the basis of this - in
> reverse, of course - which isn't quite as good as arithmetic or range
> coding. 

Is the idea to choose the encoding of random octet strings so that it is
of minimal size after the Huffman pass?
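If so, a toy sketch of the "reverse Huffman" idea: build a prefix code
over the allowed characters and run the *decoder* over the input bits to
pick characters, so characters with short codes come out more often and
the later Huffman pass re-emits roughly the original bits. The
seven-character alphabet and code lengths below are made up for
illustration (not the real HPACK table):

```python
# Hypothetical per-character code lengths; Kraft sum is exactly 1, so the
# code is complete and every bit string decodes.
LENGTHS = {'a': 2, 'b': 2, 'c': 3, 'd': 3, 'e': 3, 'f': 4, 'g': 4}

def canonical_code(lengths):
    """Assign canonical codewords (as bit strings) from code lengths."""
    code, next_code, prev_len = {}, 0, 0
    for sym, ln in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        next_code <<= (ln - prev_len)
        code[sym] = format(next_code, '0{}b'.format(ln))
        next_code += 1
        prev_len = ln
    return code

CODE = canonical_code(LENGTHS)

def bits_to_chars(bits):
    """'Reverse Huffman': decode the input bits against the code, emitting
    characters; pad the tail with 1-bits so the last codeword completes."""
    rev = {v: k for k, v in CODE.items()}
    padded = bits + '1' * (max(LENGTHS.values()) - 1)
    out, buf = [], ''
    for i, b in enumerate(padded):
        buf += b
        if buf in rev:
            out.append(rev[buf])
            buf = ''
            if i + 1 >= len(bits):
                break
    return ''.join(out)

def chars_to_bits(chars):
    """The later Huffman pass: re-encode characters to their codewords."""
    return ''.join(CODE[c] for c in chars)
```

Round-tripping gives back the original bits plus at most a few 1-bits of
padding, which is where the "few bits left to save" come from.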

I tried to calculate how well one could theoretically[1] do at this
(20 input octets):
- URI safe: 159-168 bits (average ~167 bits).
- URI safe (constant length): 169 bits.
- URI query: 158-166 bits (average ~165 bits).
- URI query (constant length): 167 bits.

The cookies have both token and quoted-string forms, which makes
calculating these figures quite nasty...

The non-constant codes take variable-length octet-string inputs and have
the property that a shorter input produces a shorter output.

The constant-length ones compress strings of a given length so that the
output is of constant length after Huffman coding (though not of constant
length before the Huffman pass).
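The constant-length figures can be bounded with a small dynamic program:
for a fixed number of output characters, count how many strings reach
each total Huffman-coded length, then find the smallest post-Huffman
size B under which at least 2^160 strings fit (20 octets = 160 bits of
payload). The alphabet and code lengths below are hypothetical; the
numbers above come from the actual HPACK table.

```python
from collections import Counter

# Hypothetical per-character Huffman code lengths (not the real HPACK table).
LENGTHS = {'a': 2, 'b': 2, 'c': 3, 'd': 3, 'e': 3, 'f': 4, 'g': 4}

def min_coded_bits(n_chars, payload_bits):
    """Smallest B such that at least 2**payload_bits strings of n_chars
    characters have total coded length <= B, or None if n_chars is too
    small to carry the payload at all."""
    counts = Counter({0: 1})          # coded length -> number of strings
    for _ in range(n_chars):
        nxt = Counter()
        for bits, ways in counts.items():
            for ln in LENGTHS.values():
                nxt[bits + ln] += ways
        counts = nxt
    need, total = 2 ** payload_bits, 0
    for b in sorted(counts):
        total += counts[b]
        if total >= need:
            return b                  # always >= payload_bits, since a
    return None                       # prefix code keeps strings distinct
```

For example, `min_coded_bits(64, 160)` gives the best constant post-Huffman
length achievable with 64 output characters under these assumed lengths.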


[1] There seem to be encoding and decoding algorithms for these codes
that run in polynomial time, but they are likely too slow for actual use.
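One such polynomial-time approach is enumerative (rank/unrank) coding
driven by the same kind of counting. The sketch below, again with a
hypothetical alphabet, maps an integer index to the index-th string, in
lexicographic order, among the strings of k characters whose total coded
length fits in a bit budget; ordering by coded length instead, as a
length-preserving code would want, needs a somewhat fancier DP. The
counts involved are huge integers, consistent with the note above that
this may be too slow for real use.

```python
from functools import lru_cache

# Hypothetical per-character Huffman code lengths (not the real HPACK table).
LENGTHS = {'a': 2, 'b': 2, 'c': 3, 'd': 3, 'e': 3, 'f': 4, 'g': 4}
SYMS = sorted(LENGTHS)

@lru_cache(maxsize=None)
def count(k, budget):
    """Number of k-character strings with total coded length <= budget."""
    if budget < 0:
        return 0
    if k == 0:
        return 1
    return sum(count(k - 1, budget - LENGTHS[c]) for c in SYMS)

def unrank(index, k, budget):
    """Encode: map 0 <= index < count(k, budget) to the index-th string."""
    out = []
    for pos in range(k):
        for c in SYMS:
            ways = count(k - pos - 1, budget - LENGTHS[c])
            if index < ways:
                out.append(c)
                budget -= LENGTHS[c]
                break
            index -= ways
        else:
            raise ValueError('index out of range')
    return ''.join(out)

def rank(s, budget):
    """Decode: recover the integer index from the string."""
    k, index = len(s), 0
    for pos, ch in enumerate(s):
        for c in SYMS:
            if c == ch:
                break
            index += count(k - pos - 1, budget - LENGTHS[c])
        budget -= LENGTHS[ch]
    return index
```

Both directions do O(k * |alphabet|) table lookups, but each lookup
manipulates integers as wide as the payload.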


-Ilari

Received on Sunday, 13 April 2014 09:04:33 UTC