- From: Roberto Peon <grmocg@gmail.com>
- Date: Thu, 4 Apr 2013 16:55:53 -0700
- To: HTTP Working Group <ietf-http-wg@w3.org>
- Message-ID: <CAP+FsNew+0ce6q24KRAdut7g4OjU5ysOSxd8Yk-FBmRrx0550w@mail.gmail.com>
Here is some data based on an analysis of a perfect (i.e. resource-unconstrained)
atom-based compressor.
I've removed any serialization sizes from this analysis-- it will hold true for
HeaderDiff or Delta or whatever, presupposing it does atom-based
compression.
Feel free to jump to the bottom for take-aways if looking at the data is
boring :)
The most compressible headers are (key% is the percentage of the
compressible bytes which were from the key):
key-name               | compressed bytes    key%    val%
-----------------------+----------------------------------
user-agent             |          2443472   9.37%  90.63%
cookie                 |          1957141   2.88%  97.12%
referer                |          1340198  11.89%  88.11%
via                    |           980712   4.53%  95.47%
accept-charset         |           700174  32.62%  67.38%
accept-language        |           685850  48.86%  51.14%
accept-encoding        |           676516  49.53%  50.47%
cache-control          |           571590  50.27%  49.73%
date                   |           544049  16.31%  83.69%
x-cache-lookup         |           523453  37.32%  62.68%
content-type           |           521956  50.88%  49.12%
:host                  |           506321  23.66%  76.34%
accept                 |           416278  32.65%  67.35%
x-cache                |           414766  28.26%  71.74%
last-modified          |           398843  58.89%  41.11%
proxy-connection       |           389619  63.48%  36.52%
server                 |           369522  32.27%  67.73%
:path                  |           337031  35.55%  64.45%
content-length         |           313985  94.79%   5.21%
expires                |           305406  35.69%  64.31%
:method                |           239569  70.02%  29.98%
:status                |           237574  70.61%  29.39%
p3p                    |           206589   2.93%  97.07%
accept-ranges          |           198471  73.02%  26.98%
content-encoding       |           124912  81.30%  18.70%
vary                   |           110368  22.38%  77.62%
age                    |            70211  66.89%  33.11%
set-cookie             |            66181  23.90%  76.10%
etag                   |            62416  48.63%  51.37%
x-powered-by           |            54561  46.49%  53.51%
x-content-type-options |            47171  77.61%  22.39%
x-varnish-server       |            33947  51.61%  48.39%
x-varnish              |            33376  89.04%  10.96%
x-xss-protection       |            28685  58.73%  41.27%
x-cdn                  |            27428  23.06%  76.94%
location               |            26152  19.79%  80.21%
transfer-encoding      |            25147  72.94%  27.06%
xcache                 |            24800  18.75%  81.25%
...
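For anyone who wants to reproduce this kind of tabulation, here is a minimal sketch of how the key%/val% split could be computed. It assumes an idealized atom-based model (any previously-seen name counts its key bytes as compressible, any previously-seen (name, value) atom counts its value bytes as compressible) and a corpus of messages given as lists of (name, value) pairs. This is an illustration of the accounting only, not the exact tooling used to produce the numbers above.

```python
# Sketch only: an idealized atom-based accounting of compressible bytes.
# The corpus format and the byte-counting model are assumptions.
from collections import defaultdict

def tabulate(corpus):
    """corpus: iterable of messages, each a list of (name, value) pairs."""
    seen_names, seen_atoms = set(), set()
    key_bytes = defaultdict(int)   # compressible bytes attributed to the key
    val_bytes = defaultdict(int)   # compressible bytes attributed to the value

    for message in corpus:
        for name, value in message:
            if name in seen_names:
                key_bytes[name] += len(name)
            if (name, value) in seen_atoms:
                val_bytes[name] += len(value)
            seen_names.add(name)
            seen_atoms.add((name, value))

    for name in sorted(key_bytes, key=lambda n: -(key_bytes[n] + val_bytes[n])):
        total = key_bytes[name] + val_bytes[name]
        print(f"{name:<22} | {total:>16} {100*key_bytes[name]/total:6.2f}% "
              f"{100*val_bytes[name]/total:6.2f}%")
```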
The distribution of backreferences falls off extremely quickly as
distance from the newest element increases (first column is the distance,
second is the count):
1 35460
2 20774
3 16886
4 12926
5 9601
6 6947
7 5100
8 4304
9 3658
10 3313
11 2789
12 2710
13 2678
14 2354
15 2372
16 2180
17 2066
18 2023
19 1979
20 1885
21 1888
22 1799
23 1724
24 1673
25 1584
26 1530
27 1506
28 1435
29 1395
30 1305
31 1355
32 1283
33 1305
34 1341
35 1339
36 1178
37 1200
38 1223
39 1169
40 1180
41 1111
42 1105
43 1074
44 1079
45 1043
46 1050
47 1030
48 972
49 963
50 942
51 935
52 916
...
I've attached a png of a graph of this (y-axis = frequency, x-axis = distance
from the newest element).
The knee in the graph is very nice indeed.
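To make the dist-from-newest indexing concrete, here is a minimal sketch of an LRU-ordered table where referencing an atom emits its distance from the most recently used entry and then promotes that entry to the front. The class and its parameters are illustrative assumptions, not the actual Delta/HeaderDiff structures; the point is that the fall-off above means most emitted indices are tiny, so a variable-length integer encoding of the index stays cheap.

```python
# Sketch only: dist-from-newest indexing over an LRU-ordered table.
# Not the actual Delta/HeaderDiff data structures.

class LruTable:
    def __init__(self, max_entries=128):
        self.entries = []              # index 0 = newest (most recently used)
        self.max_entries = max_entries

    def reference(self, atom):
        """Return the dist-from-newest index for atom, or None if absent."""
        try:
            dist = self.entries.index(atom)   # 0 = newest (the data above is 1-based)
        except ValueError:
            return None
        self.entries.insert(0, self.entries.pop(dist))  # promote on reference
        return dist

    def insert(self, atom):
        """Add a new atom as the newest entry, evicting the oldest if full."""
        self.entries.insert(0, atom)
        if len(self.entries) > self.max_entries:
            self.entries.pop()         # LRU expiration: drop the oldest entry

table = LruTable()
for atom in [("user-agent", "X"), (":method", "GET"), ("user-agent", "X")]:
    dist = table.reference(atom)
    if dist is None:
        table.insert(atom)             # would emit a literal, then index it
    # else: would emit dist as a small variable-length integer
```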
Take-aways?
- We need a better survey of headers from everywhere :)
- Compression over our corpus should scale favorably with small table
(and state) size.
   - Encoding the index as dist-from-newest works really well, and LRU appears
   to be extremely effective as an expiration policy (the attached graph looks
   good).
- We're getting substantial compression from both key and value
backreferences/tokenization.
   - Algorithmically, there isn't a whole lot to do-- the devil is really
   in the serialization details and the tradeoffs involved in
   generating/parsing. There are obvious tweaks that compressors could make when
   space constrained (e.g. using the first table, above, as an estimate of
   likely benefit and making decisions based upon that; a rough sketch follows
   this list), but the same data that shows the LRU is so effective also
   suggests that this benefit is likely limited unless they can predict the
   future :)
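Here is the kind of space-constrained tweak I mean, sketched out: consult a static likely-benefit ranking (e.g. derived from the first table) before spending table space on an entry. The name set and the size check below are illustrative assumptions, not measured values.

```python
# Sketch only: a benefit-based admission policy for a space-constrained
# compressor. The HIGH_BENEFIT_NAMES set is an illustrative assumption.
HIGH_BENEFIT_NAMES = {"user-agent", "cookie", "referer", "via",
                      "accept-charset", "accept-language", "accept-encoding"}

def should_index(name, value, table_bytes_free):
    """Decide whether to add (name, value) to the compression table."""
    entry_size = len(name) + len(value)
    if entry_size > table_bytes_free:
        return False                    # don't index if it won't fit without eviction
    return name in HIGH_BENEFIT_NAMES   # otherwise emit a literal, unindexed
```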
-=R
Attachments
- application/octet-stream attachment: freq_vs_dist_from_newest_element.xcf
Received on Thursday, 4 April 2013 23:56:21 UTC