Re: Optimizing Huffman inputs

On 13/04/2014 9:04 p.m., Ilari Liusvaara wrote:
> On Sat, Apr 12, 2014 at 12:01:21PM -0700, Martin Thomson wrote:
>>
>> With differing encoded lengths the Huffman tables add, it's not better
>> to use certain characters in preference to others.  Combined with the
>> restrictions on the strings we encode, this can make encoding binary
>> blobs less than optimal.
>>
>> Base64 (and the URI-safe variant) manage pretty well because those
>> characters are favoured by HPACK.  Same goes for the cookie
>> characters.
>>
>> I wanted to see if it was possible to do better.
>>
>> The results for 100,000 different 20 byte sequences, chosen at random,
>> are interesting:
>>
>> Average sizes: min/ave/max (size compared against Base64+Huffman)
>> Raw Huffman coding (invalid):           25/48.20803/64 (203.13%)
>> Base 64 (no Huffman):                   27/27/27 (113.77%)
>> Base 64 with Huffman:                   20/23.73286/27 (100%)
>> bhpack cookie with Huffman:             19/23.6052/30 (99.46%)
>> bhpack URI safe with Huffman:           20/22.98342/28 (96.84%)
>> bhpack URI query with Huffman:          20/23.01238/28 (96.96%) [3]
>>
>> This shows that HPACK is pretty good at encoding cookies, but there
>> are small gains to be had for URI components.  If character
>> restrictions are wildly different to these common values, this could
>> be even better.
> 
> The character set of cookies seems odd. Is '(' allowed without quoting?
> And why is there '>' but not '<'?
> 
>> This hasn't been completely optimized, so there might be a few bits
>> left to save.  I'm using Huffman encoding as the basis of this - in
>> reverse, of course - which isn't quite as good as arithmetic or range
>> coding. 
> 
> Is the idea to choose encoding of random octet strings so that it is of
> minimal size after Huffman?
> 
> I tried to calculate how well one could theoretically[1] do at this
> (20 input octets):
> - URI safe: 159-168 bits (average ~167 bits).
> - URI safe (constant length): 169 bits.
> - URI query: 158-166 bits (average ~165 bits).
> - URI query (constant length): 167 bits.
> 
> The cookies have token and quoted-string forms, making calculating these
> figures quite nasty...

Another twist to throw into this mix while you are all trying to figure
it out... Anyone interested in figuring out the answer?

With these huffman tables the DIGIT characters are represented by 4-5
huffman bits, the a-f lower case hex represented by 4-6 huffman bits and
the '%' is represented by 6 huffman bits.

Therefore selectively URL-encoding characters whose huffman
representation is over 17 bits in length will result in a compression
savings.

Encoders can trivially be written with short Huffman table where some
characters huffman symbol is actually the sequence of the three
URL-encoding characters.

Question is whether a decoder can be written with a special case to also
perform url-decode when locating the % symbol and still be fast overall ?


Secondary question; What if the Huffman symbols were further optimized
to reduce the bit length of the URL-encoding characters, and in
particular the %.


Amos

Received on Sunday, 13 April 2014 09:45:29 UTC