Compression analysis of perfect atom-based compressor

Here is some data based on an analysis of a perfect (i.e.
resource-unconstrained) atom-based compressor.
I've removed any serialization sizes from this analysis-- the results
should hold for HeaderDiff or Delta or whatever, presupposing it does
atom-based compression.
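
For concreteness, here is a minimal sketch of how such a measurement
could be made (illustrative only, not the actual analysis code),
assuming an "atom" is a whole key or value string and that a perfect
compressor spends zero bytes on any atom it has already seen:

    # Python sketch: tally the bytes a resource-unconstrained atom-based
    # compressor would save, per header name, split by key vs. value.
    from collections import defaultdict

    def tally_compressible(headers):
        # headers: iterable of (key, value) pairs in arrival order
        seen_keys = set()
        seen_pairs = set()
        saved = defaultdict(lambda: [0, 0])  # key-name -> [key bytes, value bytes]
        for key, value in headers:
            if key in seen_keys:
                saved[key][0] += len(key)    # repeated key atom compresses away
            else:
                seen_keys.add(key)
            if (key, value) in seen_pairs:
                saved[key][1] += len(value)  # repeated value atom likewise
            else:
                seen_pairs.add((key, value))
        return saved

The key% for a given header name is then saved[k][0] / float(sum(saved[k])).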

Feel free to jump to the bottom for take-aways if looking at the data is
boring :)

The most compressible headers are listed below (key% is the percentage
of the compressible bytes that came from the key, and val% the
percentage that came from the value):
                       |compressed
key-name               |  bytes     key%  val%
-----------------------+-----------------------
user-agent             |  2443472  9.37% 90.63%
cookie                 |  1957141  2.88% 97.12%
referer                |  1340198 11.89% 88.11%
via                    |   980712  4.53% 95.47%
accept-charset         |   700174 32.62% 67.38%
accept-language        |   685850 48.86% 51.14%
accept-encoding        |   676516 49.53% 50.47%
cache-control          |   571590 50.27% 49.73%
date                   |   544049 16.31% 83.69%
x-cache-lookup         |   523453 37.32% 62.68%
content-type           |   521956 50.88% 49.12%
:host                  |   506321 23.66% 76.34%
accept                 |   416278 32.65% 67.35%
x-cache                |   414766 28.26% 71.74%
last-modified          |   398843 58.89% 41.11%
proxy-connection       |   389619 63.48% 36.52%
server                 |   369522 32.27% 67.73%
:path                  |   337031 35.55% 64.45%
content-length         |   313985 94.79%  5.21%
expires                |   305406 35.69% 64.31%
:method                |   239569 70.02% 29.98%
:status                |   237574 70.61% 29.39%
p3p                    |   206589  2.93% 97.07%
accept-ranges          |   198471 73.02% 26.98%
content-encoding       |   124912 81.30% 18.70%
vary                   |   110368 22.38% 77.62%
age                    |    70211 66.89% 33.11%
set-cookie             |    66181 23.90% 76.10%
etag                   |    62416 48.63% 51.37%
x-powered-by           |    54561 46.49% 53.51%
x-content-type-options |    47171 77.61% 22.39%
x-varnish-server       |    33947 51.61% 48.39%
x-varnish              |    33376 89.04% 10.96%
x-xss-protection       |    28685 58.73% 41.27%
x-cdn                  |    27428 23.06% 76.94%
location               |    26152 19.79% 80.21%
transfer-encoding      |    25147 72.94% 27.06%
xcache                 |    24800 18.75% 81.25%
...

The distribution of backreferences falls off extremely quickly as
distance from the newest element increases (first column: distance,
second column: frequency):
1       35460
2       20774
3       16886
4       12926
5       9601
6       6947
7       5100
8       4304
9       3658
10      3313
11      2789
12      2710
13      2678
14      2354
15      2372
16      2180
17      2066
18      2023
19      1979
20      1885
21      1888
22      1799
23      1724
24      1673
25      1584
26      1530
27      1506
28      1435
29      1395
30      1305
31      1355
32      1283
33      1305
34      1341
35      1339
36      1178
37      1200
38      1223
39      1169
40      1180
41      1111
42      1105
43      1074
44      1079
45      1043
46      1050
47      1030
48      972
49      963
50      942
51      935
52      916
...

I've attached a PNG of a graph of this (y-axis = frequency, x-axis =
distance from the newest element).
The knee in the graph is very nice indeed.
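
These distances are what a move-to-front (i.e. LRU-ordered) list
produces; a sketch of how such a histogram could be generated (again
illustrative, not the actual tooling):

    # Python sketch: histogram of dist-from-newest for each backreference,
    # with the table kept in LRU order.
    from collections import Counter

    def distance_histogram(atoms):
        lru = []               # lru[0] is the newest (most recently used) atom
        hist = Counter()
        for atom in atoms:
            if atom in lru:
                hist[lru.index(atom) + 1] += 1  # 1 == newest element
                lru.remove(atom)
            lru.insert(0, atom)                 # any use renews the atom
        return hist

The steep fall-off means the common distances are also the cheapest
ones to encode.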


Take-aways?

   - We need a better survey of headers from everywhere :)
   - Compression over our corpus should scale favorably even with small
   table (and state) sizes.
   - Encoding the index as dist-from-newest works really well, and LRU
   appears to be extremely effective as an expiration policy (the attached
   graph looks good).
   - We're getting substantial compression from both key and value
   backreferences/tokenization.
   - Algorithmically, there isn't a whole lot left to do-- the devil is
   really in the serialization details and the tradeoffs involved in
   generating/parsing (see the sketch after this list). There are obvious
   tweaks that compressors could make when space-constrained (e.g. using
   the first table above as an estimate of likely benefit and deciding
   what to keep based on that), but the same data that shows the LRU is so
   effective also suggests this benefit is likely limited unless they can
   predict the future :)
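
On the serialization point: the distribution above interacts nicely with
a prefix-coded varint, which is one plausible encoding for a
dist-from-newest index (the 5-bit prefix below is an illustrative
assumption, not taken from any particular draft):

    # Python sketch: prefix-coded integer. Distances up to
    # 2**prefix_bits - 2 fit entirely in the single prefix byte, so given
    # the knee above, almost every backreference would cost one byte.
    def encode_prefix_int(value, prefix_bits=5):
        limit = (1 << prefix_bits) - 1         # e.g. 31 for a 5-bit prefix
        if value < limit:
            return bytes([value])              # fits in the prefix itself
        out = [limit]                          # prefix saturated; continue
        value -= limit
        while value >= 128:
            out.append((value & 0x7F) | 0x80)  # 7 data bits + continuation bit
            value >>= 7
        out.append(value)
        return bytes(out)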


-=R
