Re: Compression analysis of perfect atom-based compressor

How did you handle connection coalescing (the "streamifier" issue)?

Cheers,


On 05/04/2013, at 10:55 AM, Roberto Peon <grmocg@gmail.com> wrote:

> Here is some data based on an analysis of a perfect (i.e. resource unconstrained) atom-based compressor.
> I've removed any serialization sizes from this-- this will hold true for HeaderDiff or Delta or whatever, presupposing it does atom-based compression.
> 
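> To make "atom-based" concrete, here is a rough Python sketch (names are made up for illustration, not taken from any real implementation) of the kind of bookkeeping behind the numbers below: each header is split into a key atom and a value atom, and any atom that has been seen before counts its bytes as compressed away by an unconstrained index.
> 
>   from collections import defaultdict
> 
>   # Rough sketch only: counts the bytes a perfect (unconstrained)
>   # atom-based compressor could avoid re-sending; not a real encoder.
>   def tally_atom_savings(messages):
>       """messages: iterable of header lists, each a list of (key, value) pairs."""
>       seen_keys, seen_pairs = set(), set()
>       saved = defaultdict(lambda: [0, 0])   # key-name -> [key bytes, value bytes]
>       for headers in messages:
>           for key, value in headers:
>               if key in seen_keys:
>                   saved[key][0] += len(key)        # key atom was a backreference
>               else:
>                   seen_keys.add(key)
>               if (key, value) in seen_pairs:
>                   saved[key][1] += len(value)      # value atom was a backreference
>               else:
>                   seen_pairs.add((key, value))
>       return saved
> 
> key% and val% below are then just the key and value shares of that per-name total.
> 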
> Feel free to jump to the bottom for take-aways if looking at the data is boring :)
> 
> The most compressible headers are (key% is the percentage of the compressible bytes that were from the key; val% likewise for the value):
>                        |compressed
> key-name               |  bytes     key%  val%
> -----------------------+-----------------------
> user-agent             |  2443472  9.37% 90.63%
> cookie                 |  1957141  2.88% 97.12%
> referer                |  1340198 11.89% 88.11%
> via                    |   980712  4.53% 95.47%
> accept-charset         |   700174 32.62% 67.38%
> accept-language        |   685850 48.86% 51.14%
> accept-encoding        |   676516 49.53% 50.47%
> cache-control          |   571590 50.27% 49.73%
> date                   |   544049 16.31% 83.69%
> x-cache-lookup         |   523453 37.32% 62.68%
> content-type           |   521956 50.88% 49.12%
> :host                  |   506321 23.66% 76.34%
> accept                 |   416278 32.65% 67.35%
> x-cache                |   414766 28.26% 71.74%
> last-modified          |   398843 58.89% 41.11%
> proxy-connection       |   389619 63.48% 36.52%
> server                 |   369522 32.27% 67.73%
> :path                  |   337031 35.55% 64.45%
> content-length         |   313985 94.79%  5.21%
> expires                |   305406 35.69% 64.31%
> :method                |   239569 70.02% 29.98%
> :status                |   237574 70.61% 29.39%
> p3p                    |   206589  2.93% 97.07%
> accept-ranges          |   198471 73.02% 26.98%
> content-encoding       |   124912 81.30% 18.70%
> vary                   |   110368 22.38% 77.62%
> age                    |    70211 66.89% 33.11%
> set-cookie             |    66181 23.90% 76.10%
> etag                   |    62416 48.63% 51.37%
> x-powered-by           |    54561 46.49% 53.51%
> x-content-type-options |    47171 77.61% 22.39%
> x-varnish-server       |    33947 51.61% 48.39%
> x-varnish              |    33376 89.04% 10.96%
> x-xss-protection       |    28685 58.73% 41.27%
> x-cdn                  |    27428 23.06% 76.94%
> location               |    26152 19.79% 80.21%
> transfer-encoding      |    25147 72.94% 27.06%
> xcache                 |    24800 18.75% 81.25%
> ...
> 
> The distribution of backreferences falls off extremely quickly as the distance from the newest element increases (distance, count):
> 1       35460
> 2       20774
> 3       16886
> 4       12926
> 5       9601
> 6       6947
> 7       5100
> 8       4304
> 9       3658
> 10      3313
> 11      2789
> 12      2710
> 13      2678
> 14      2354
> 15      2372
> 16      2180
> 17      2066
> 18      2023
> 19      1979
> 20      1885
> 21      1888
> 22      1799
> 23      1724
> 24      1673
> 25      1584
> 26      1530
> 27      1506
> 28      1435
> 29      1395
> 30      1305
> 31      1355
> 32      1283
> 33      1305
> 34      1341
> 35      1339
> 36      1178
> 37      1200
> 38      1223
> 39      1169
> 40      1180
> 41      1111
> 42      1105
> 43      1074
> 44      1079
> 45      1043
> 46      1050
> 47      1030
> 48      972
> 49      963
> 50      942
> 51      935
> 52      916
> ...
> 
> I've attached a PNG of a graph of this (y-axis = frequency, x-axis = distance from newest element).
> The knee in the graph is very nice indeed.
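> 
> For reference, "dist-from-newest" indexing is just a most-recently-used list referenced by position, so a distribution skewed this hard toward small distances means most backreferences encode as very small integers. A toy sketch (illustrative only, not any particular proposal's encoding):
> 
>   # Toy illustration of dist-from-newest referencing over an MRU list.
>   # Small, frequent distances -> small integers -> few bits on the wire.
>   def refer(mru, entry):
>       """Return the distance to emit for entry, then promote it to the front."""
>       if entry in mru:
>           dist = mru.index(entry) + 1   # 1 = newest element, matching the counts above
>           mru.remove(entry)
>           mru.insert(0, entry)
>           return dist                   # encode this, e.g. with a varint
>       mru.insert(0, entry)              # miss: send a literal and index it
>       return None
> 
> With the counts above, the bulk of references land at distances that fit comfortably in a single byte under almost any variable-length integer scheme.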
> 
> 
> Take-aways?
> 	• We need a better survey of headers from everywhere :)
> 	• Compression over our corpus should scale favorably with small table (and state) size.
> 	• Encoding index as dist-from-newest really works well, and LRU appears to be extremely effective as an expiration policy (the attached graph looks good).
> 	• We're getting substantial compression from both key and value backreferences/tokenization.
> 	• Algorithmically, there isn't a whole lot to do-- the devil is really in the serialization details and the tradeoffs involved in generating/parsing. There are obvious tweaks a compressor could make when space constrained (e.g. using the first table above as an estimate of likely benefit and deciding what to index based on that-- a sketch follows below the list), but the same data that shows the LRU being so effective also suggests this benefit is likely limited unless compressors can predict the future :)
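> 
> As a straw-man for that space-constrained tweak (names and thresholds here are made up): a compressor could consult per-name stats like the first table to decide whether a new entry earns its place in a size-limited table.
> 
>   # Straw-man admission policy for a size-limited header table.
>   def should_index(name, value, reuse_prob, min_expected_savings=8):
>       """reuse_prob: name -> estimated chance this exact (name, value) pair
>       recurs, derived offline from per-name stats like the table above."""
>       entry_size = len(name) + len(value)
>       expected_savings = reuse_prob.get(name, 0.0) * entry_size
>       return expected_savings >= min_expected_savings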
> 
> -=R
> 
> 

--
Mark Nottingham   http://www.mnot.net/

Received on Friday, 5 April 2013 00:03:11 UTC