Optimizing Huffman inputs

Now that we have a (promised) final Huffman table in HPACK, I was able
to do the little project I've been meaning to do.

With differing encoded lengths the Huffman tables add, it's not better
to use certain characters in preference to others.  Combined with the
restrictions on the strings we encode, this can make encoding binary
blobs less than optimal.

Base64 (and the URI-safe variant) manage pretty well because those
characters are favoured by HPACK.  Same goes for the cookie
characters.

I wanted to see if it was possible to do better.

It is.  But not by much.

I've put my code on github [1] and npm [2]. This module is essentially
a replacement for base64 that is optimized for HPACK and for different
character sets: the characters permitted in cookies, the characters
permitted in URI path components, and the characters permitted in
query components.

The results for 100,000 different 20 byte sequences, chosen at random,
are interesting:

Average sizes: min/ave/max (size compared against Base64+Huffman)
Raw Huffman coding (invalid):           25/48.20803/64 (203.13%)
Base 64 (no Huffman):                   27/27/27 (113.77%)
Base 64 with Huffman:                   20/23.73286/27 (100%)
bhpack cookie with Huffman:             19/23.6052/30 (99.46%)
bhpack URI safe with Huffman:           20/22.98342/28 (96.84%)
bhpack URI query with Huffman:          20/23.01238/28 (96.96%) [3]

This shows that HPACK is pretty good at encoding cookies, but there
are small gains to be had for URI components.  If character
restrictions are wildly different to these common values, this could
be even better.

This hasn't been completely optimized, so there might be a few bits
left to save.  I'm using Huffman encoding as the basis of this - in
reverse, of course - which isn't quite as good as arithmetic or range
coding.  I could probably manage the trailing bits more elegantly.
Pull requests happily accepted.

--Martin

[1] https://github.com/martinthomson/bhpack
[2] npm install bhpack
[3] In theory, this is better than the URI safe option, but as chance
would have it, not in this run.

Received on Saturday, 12 April 2014 19:01:48 UTC