Re: Optimizing Huffman inputs

On 2014–04–13, at 3:01 AM, Martin Thomson <martin.thomson@gmail.com> wrote:

> This shows that HPACK is pretty good at encoding cookies, but there
> are small gains to be had for URI components.  If character
> restrictions are wildly different to these common values, this could
> be even better.
> 
> This hasn't been completely optimized, so there might be a few bits
> left to save.  

Please see my earlier message on this topic, "Improving on Huffman” sent 3/17/14. In all the use cases, we tend to see sequences of similar characters: numeric, uppercase, lowercase. This corresponds to two bits of reliable prediction.

State was eliminated from the compression for security reasons, but as I argue in the previous message, I think that giving drastically different compression characteristics to uppercase, lowercase, and binary opens more security problems than carrying over two bits from the previous character does.

> I'm using Huffman encoding as the basis of this - in
> reverse, of course - which isn't quite as good as arithmetic or range
> coding.  I could probably manage the trailing bits more elegantly.
> Pull requests happily accepted.

Arithmetic and range codings save only up to one bit per symbol, as I understand. The current symbols tend to be larger because the coding isn’t very efficient, but a better coding with smaller symbols would see a bigger incremental gain.

Received on Sunday, 13 April 2014 05:49:36 UTC