delta-encoding compressor code

From: Roberto Peon <grmocg@gmail.com>
Date: Thu, 29 Nov 2012 14:51:11 -0800
Message-ID: <CAP+FsNed22u_hHw5WPZX7+MvSXeB+t8HBJh8wxdQsk0ZXrzr5Q@mail.gmail.com>
To: HTTP Working Group <ietf-http-wg@w3.org>
I've rewritten the Python version, which now encodes and decodes the new
format; the new format is smaller than the old one.
I've also cleaned up the code a fair bit and added some documentation for
the more important bits.

As usual, the code is available here:

Example output over my dataset, generated by running
./headers_sample.py -v 0 ../test-data/*.har
is this:
                                       http1   |   spdy3   |   spdy4
Req                Compressed Sums:    830525  |   944453  |   106185
Req              Uncompressed Sums:     67237  |    87886  |   106185
Rsp                Compressed Sums:    508189  |   627505  |   152962
Rsp              Uncompressed Sums:    105626  |   128226  |   152962
Req   Compressed/uncompressed HTTP:   0.08096  |  0.10582  |  0.12785
Rsp   Compressed/uncompressed HTTP:   0.20785  |  0.25232  |  0.30099
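
The ratio rows can be re-derived from the sums. The sketch below assumes
each ratio is the smaller sum in a column divided by the larger http1 sum
for the same direction; that reading is inferred from the arithmetic, not
stated by the tool. The names SUMS and ratios_vs_http1 are made up for
illustration and do not come from headers_sample.py.

```python
# Figures copied from the table above. "baseline" is the larger http1 sum
# for each direction (an assumption about what the ratio is taken against).
SUMS = {
    "Req": {"baseline": 830525, "http1": 67237, "spdy3": 87886, "spdy4": 106185},
    "Rsp": {"baseline": 508189, "http1": 105626, "spdy3": 128226, "spdy4": 152962},
}


def ratios_vs_http1(row):
    """Return each column's size relative to the http1 baseline, as 5-decimal strings."""
    base = row["baseline"]
    return {name: f"{row[name] / base:.5f}" for name in ("http1", "spdy3", "spdy4")}


for direction, row in SUMS.items():
    print(direction, ratios_vs_http1(row))
```

Under that assumption the computed values match the two ratio rows printed
in the table.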

As a reminder, compressing HTTP/1.X or SPDY3 (raw) with gzip isn't safe, and
is included only for comparison/reference.

There are a few parameters in headers_codec.py which may be interesting to
play with (as are the majority of the TODOs, which indicate my thoughts on
future research direction/work).
In particular, look for:
string_length_field_bitlen, strings_use_eof, strings_padded_to_byte_boundary,
and strings_use_huffman
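
To give a feel for the kind of trade-off such parameters control, here is a
toy sketch of string framing. It is not the code in headers_codec.py: the
function frame_string, its arguments, and the all-zero-byte EOF marker are
assumptions made up for illustration, loosely named after the parameters
above, and Huffman coding is left out entirely.

```python
def frame_string(text, length_field_bitlen=8, use_eof=False, pad_to_byte=True):
    """Toy framing of an ASCII string, returned as a string of '0'/'1' bits."""
    body = "".join(f"{byte:08b}" for byte in text.encode("ascii"))
    if use_eof:
        # Terminator framing: append a hypothetical all-zero EOF marker.
        bits = body + "00000000"
    else:
        # Length-prefixed framing: the prefix width limits the string length.
        if len(text) >= 1 << length_field_bitlen:
            raise ValueError("string too long for the length field")
        bits = f"{len(text):0{length_field_bitlen}b}" + body
    if pad_to_byte and len(bits) % 8:
        # Pad with zero bits up to the next byte boundary.
        bits += "0" * (8 - len(bits) % 8)
    return bits


print(len(frame_string("GET", length_field_bitlen=5, pad_to_byte=False)))
print(len(frame_string("GET", length_field_bitlen=5, pad_to_byte=True)))
print(len(frame_string("GET", use_eof=True)))
```

In this toy model a narrower length field saves bits only when strings are
not padded to a byte boundary, which is the sort of interaction worth
measuring against real data.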

One remaining TODO here: I haven't ensured that the first bit of any
Huffman-encoded string is 1. I did intend to have that in there, but it
isn't super critical right now. It is one possible way of indicating
whether or not a string is Huffman-encoded.
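
A minimal sketch of how a decoder might use such a marker bit, assuming
that non-Huffman strings are plain 7-bit ASCII (so their first bit is
always 0). Both that assumption and the helper name looks_huffman_encoded
are hypothetical and not taken from the code.

```python
def looks_huffman_encoded(encoded: bytes) -> bool:
    """Return True if the first bit of the first byte is 1.

    Assumes (hypothetically) that raw strings are 7-bit ASCII, so a set
    high bit can only come from a Huffman-encoded string.
    """
    if not encoded:
        return False
    return bool(encoded[0] & 0x80)


print(looks_huffman_encoded(b"GET"))
print(looks_huffman_encoded(bytes([0b10110010, 0x41])))
```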

As a reminder, though the Python version is not written for performance,
the compressor design here is made with performance in mind and includes
features intended to let high-throughput proxies operate efficiently; as a
result we do trade off some compression.

The C++ version of this (which currently only does compression) is useful
for determining approximate speed (~3X faster than gzip when doing
compression, which proxies should be able to avoid).
