Re: Compression analysis of perfect atom-based compressor from Roberto Peon on 2013-04-05 (ietf-http-wg@w3.org from April to June 2013)

From: Roberto Peon <grmocg@gmail.com>
Date: Thu, 4 Apr 2013 17:52:31 -0700
To: Mark Nottingham <mnot@mnot.net>
Cc: HTTP Working Group <ietf-http-wg@w3.org>
Message-ID: <CAP+FsNegsAR16OaM5C-h6Xi67YSHh5uOWGHf2D1auQV7OxmYbA@mail.gmail.com>
It used the default streamifier.
Happy to provide data with whatever streamifier...
-=R
On Apr 4, 2013 5:02 PM, "Mark Nottingham" <mnot@mnot.net> wrote:

> How did you handle connection coalescing (the "streamifier" issue)?
>
> Cheers,
>
>
> On 05/04/2013, at 10:55 AM, Roberto Peon <grmocg@gmail.com> wrote:
>
> > Here is some data based on an analysis of a perfect (i.e. resource
> unconstrained) atom-based compressor.
> > I've removed any serialization sizes from this-- this will hold true for
> HeaderDiff or Delta or whatever, presupposing it does atom-based
> compression.
> >
> > Feel free to jump to the bottom for take-aways if looking at the data is
> boring :)
> >
> > The top most compressible headers are (key% is the percentage of the
> compressible bytes where were from the key):
> >                        |compressed
> > key-name               |  bytes     key%  val%
> > -----------------------+-----------------------
> > user-agent             |  2443472  9.37% 90.63%
> > cookie                 |  1957141  2.88% 97.12%
> > referer                |  1340198 11.89% 88.11%
> > via                    |   980712  4.53% 95.47%
> > accept-charset         |   700174 32.62% 67.38%
> > accept-language        |   685850 48.86% 51.14%
> > accept-encoding        |   676516 49.53% 50.47%
> > cache-control          |   571590 50.27% 49.73%
> > date                   |   544049 16.31% 83.69%
> > x-cache-lookup         |   523453 37.32% 62.68%
> > content-type           |   521956 50.88% 49.12%
> > :host                  |   506321 23.66% 76.34%
> > accept                 |   416278 32.65% 67.35%
> > x-cache                |   414766 28.26% 71.74%
> > last-modified          |   398843 58.89% 41.11%
> > proxy-connection       |   389619 63.48% 36.52%
> > server                 |   369522 32.27% 67.73%
> > :path                  |   337031 35.55% 64.45%
> > content-length         |   313985 94.79%  5.21%
> > expires                |   305406 35.69% 64.31%
> > :method                |   239569 70.02% 29.98%
> > :status                |   237574 70.61% 29.39%
> > p3p                    |   206589  2.93% 97.07%
> > accept-ranges          |   198471 73.02% 26.98%
> > content-encoding       |   124912 81.30% 18.70%
> > vary                   |   110368 22.38% 77.62%
> > age                    |    70211 66.89% 33.11%
> > set-cookie             |    66181 23.90% 76.10%
> > etag                   |    62416 48.63% 51.37%
> > x-powered-by           |    54561 46.49% 53.51%
> > x-content-type-options |    47171 77.61% 22.39%
> > x-varnish-server       |    33947 51.61% 48.39%
> > x-varnish              |    33376 89.04% 10.96%
> > x-xss-protection       |    28685 58.73% 41.27%
> > x-cdn                  |    27428 23.06% 76.94%
> > location               |    26152 19.79% 80.21%
> > transfer-encoding      |    25147 72.94% 27.06%
> > xcache                 |    24800 18.75% 81.25%
> > ...
> >
> > The distribution of backreferences is falls off extremely quickly as
> distance from the newest element increases:
> > 1       35460
> > 2       20774
> > 3       16886
> > 4       12926
> > 5       9601
> > 6       6947
> > 7       5100
> > 8       4304
> > 9       3658
> > 10      3313
> > 11      2789
> > 12      2710
> > 13      2678
> > 14      2354
> > 15      2372
> > 16      2180
> > 17      2066
> > 18      2023
> > 19      1979
> > 20      1885
> > 21      1888
> > 22      1799
> > 23      1724
> > 24      1673
> > 25      1584
> > 26      1530
> > 27      1506
> > 28      1435
> > 29      1395
> > 30      1305
> > 31      1355
> > 32      1283
> > 33      1305
> > 34      1341
> > 35      1339
> > 36      1178
> > 37      1200
> > 38      1223
> > 39      1169
> > 40      1180
> > 41      1111
> > 42      1105
> > 43      1074
> > 44      1079
> > 45      1043
> > 46      1050
> > 47      1030
> > 48      972
> > 49      963
> > 50      942
> > 51      935
> > 52      916
> > ...
> >
> > I've attached a png of a graph of this (y axis=frequency, x-axis=dist
> from newest element).
> > The knee in the graph is very nice indeed.
> >
> >
> > Take aways?
> >       • We need a better survey of headers from everywhere :)
> >       • Compression over our corpus should scale favorably with small
> table (and state) size.
> >       • Encoding index as dist-from-newest really works well, and LRU
> appears to be extremely effective as an expiration policy (the attached
> graph looks good).
> >       • We're getting substantial compression from both key and value
> backreferences/tokenization.
> >       • Algorithmically, there isn't a whole lot to do-- the devil is
> really in the serialization details and the tradeoffs involved in
> generating/parsing. There are obvious tweaks that compressors could do when
> space constrained (e.g. looking at the first table, above, as the likely
> benefit and making decisions based upon that), but the data which suggests
> that the LRU is so effective also suggests that this benefit is likely
> limited unless they can predict the future :)
> >
> > -=R
> >
>
> >
>
> --
> Mark Nottingham   http://www.mnot.net/
>
>
>
>
>
Received on Friday, 5 April 2013 00:53:03 UTC