- From: Roberto Peon <grmocg@gmail.com>
- Date: Thu, 4 Apr 2013 16:55:53 -0700
- To: HTTP Working Group <ietf-http-wg@w3.org>
- Message-ID: <CAP+FsNew+0ce6q24KRAdut7g4OjU5ysOSxd8Yk-FBmRrx0550w@mail.gmail.com>
Here is some data based on an analysis of a perfect (i.e. resource-unconstrained) atom-based compressor. I've removed any serialization sizes from this -- this will hold true for HeaderDiff or Delta or whatever, presupposing it does atom-based compression. Feel free to jump to the bottom for take-aways if looking at the data is boring :)

The most compressible headers are (key% and val% are the percentages of the compressible bytes that came from the key and the value, respectively):

 compressed key-name    |   bytes | key%   | val%
------------------------+---------+--------+--------
 user-agent             | 2443472 |  9.37% | 90.63%
 cookie                 | 1957141 |  2.88% | 97.12%
 referer                | 1340198 | 11.89% | 88.11%
 via                    |  980712 |  4.53% | 95.47%
 accept-charset         |  700174 | 32.62% | 67.38%
 accept-language        |  685850 | 48.86% | 51.14%
 accept-encoding        |  676516 | 49.53% | 50.47%
 cache-control          |  571590 | 50.27% | 49.73%
 date                   |  544049 | 16.31% | 83.69%
 x-cache-lookup         |  523453 | 37.32% | 62.68%
 content-type           |  521956 | 50.88% | 49.12%
 :host                  |  506321 | 23.66% | 76.34%
 accept                 |  416278 | 32.65% | 67.35%
 x-cache                |  414766 | 28.26% | 71.74%
 last-modified          |  398843 | 58.89% | 41.11%
 proxy-connection       |  389619 | 63.48% | 36.52%
 server                 |  369522 | 32.27% | 67.73%
 :path                  |  337031 | 35.55% | 64.45%
 content-length         |  313985 | 94.79% |  5.21%
 expires                |  305406 | 35.69% | 64.31%
 :method                |  239569 | 70.02% | 29.98%
 :status                |  237574 | 70.61% | 29.39%
 p3p                    |  206589 |  2.93% | 97.07%
 accept-ranges          |  198471 | 73.02% | 26.98%
 content-encoding       |  124912 | 81.30% | 18.70%
 vary                   |  110368 | 22.38% | 77.62%
 age                    |   70211 | 66.89% | 33.11%
 set-cookie             |   66181 | 23.90% | 76.10%
 etag                   |   62416 | 48.63% | 51.37%
 x-powered-by           |   54561 | 46.49% | 53.51%
 x-content-type-options |   47171 | 77.61% | 22.39%
 x-varnish-server       |   33947 | 51.61% | 48.39%
 x-varnish              |   33376 | 89.04% | 10.96%
 x-xss-protection       |   28685 | 58.73% | 41.27%
 x-cdn                  |   27428 | 23.06% | 76.94%
 location               |   26152 | 19.79% | 80.21%
 transfer-encoding      |   25147 | 72.94% | 27.06%
 xcache                 |   24800 | 18.75% | 81.25%
 ...

The distribution of backreferences falls off extremely quickly as distance from the newest element increases (distance, then frequency):

 dist   freq | dist   freq | dist   freq | dist   freq
    1  35460 |   14   2354 |   27   1506 |   40   1180
    2  20774 |   15   2372 |   28   1435 |   41   1111
    3  16886 |   16   2180 |   29   1395 |   42   1105
    4  12926 |   17   2066 |   30   1305 |   43   1074
    5   9601 |   18   2023 |   31   1355 |   44   1079
    6   6947 |   19   1979 |   32   1283 |   45   1043
    7   5100 |   20   1885 |   33   1305 |   46   1050
    8   4304 |   21   1888 |   34   1341 |   47   1030
    9   3658 |   22   1799 |   35   1339 |   48    972
   10   3313 |   23   1724 |   36   1178 |   49    963
   11   2789 |   24   1673 |   37   1200 |   50    942
   12   2710 |   25   1584 |   38   1223 |   51    935
   13   2678 |   26   1530 |   39   1169 |   52    916
 ...

I've attached a graph of this (y-axis = frequency, x-axis = distance from the newest element). The knee in the graph is very nice indeed.

Take-aways?
- We need a better survey of headers from everywhere :)
- Compression over our corpus should scale favorably with small table (and state) size.
- Encoding the index as distance-from-newest works really well, and LRU appears to be extremely effective as an expiration policy (the attached graph looks good; a minimal sketch of this scheme follows below).
- We're getting substantial compression from both key and value backreferences/tokenization.
- Algorithmically, there isn't a whole lot left to do -- the devil is really in the serialization details and the tradeoffs involved in generating/parsing. There are obvious tweaks that compressors could make when space-constrained (e.g. using the first table above as an estimate of the likely benefit and deciding accordingly), but the same data that shows the LRU is so effective also suggests this benefit is likely limited unless they can predict the future :)

-=R
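For something concrete to poke at, here is a minimal Python sketch -- not Delta or HeaderDiff themselves -- of the two mechanisms the data above speaks to: an LRU-managed table of header atoms and backreferences encoded as distance from the newest entry. The names (LruHeaderTable, encode, MAX_ENTRIES) are illustrative assumptions, and it simplifies by treating whole (name, value) pairs as atoms, whereas the measured compressor tokenizes keys and values separately.

    from collections import OrderedDict

    # Assumed small table size for illustration; not a number from any proposal.
    MAX_ENTRIES = 128


    class LruHeaderTable:
        """LRU table with backreferences encoded as distance-from-newest."""

        def __init__(self, max_entries=MAX_ENTRIES):
            self.max_entries = max_entries
            self.entries = OrderedDict()  # insertion/touch order; newest at the end

        def encode(self, name, value):
            """Return ('ref', dist) on a hit or ('literal', name, value) on a miss.

            dist counts from the newest element: 1 = most recently used entry,
            2 = the one before it, and so on (the x-axis of the attached graph).
            """
            key = (name, value)
            if key in self.entries:
                keys = list(self.entries)
                dist = len(keys) - keys.index(key)  # 1 for the newest entry
                self.entries.move_to_end(key)       # LRU touch
                return ('ref', dist)
            # Miss: emit a literal and make it the newest table entry.
            self.entries[key] = None
            if len(self.entries) > self.max_entries:
                self.entries.popitem(last=False)    # evict the least recently used
            return ('literal', name, value)

Given the distribution above, most 'ref' emissions from a scheme like this would presumably carry a very small dist, which is why distance-from-newest indexing pairs well with a small table: a variable-length integer encoding of dist spends few bits on the common case, and entries past the knee can be evicted with little loss.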
Attachments
- application/octet-stream attachment: freq_vs_dist_from_newest_element.xcf
Received on Thursday, 4 April 2013 23:56:21 UTC