Re: #578: getting real-ish numbers for option 3

Hi Mark,

On Fri, Oct 24, 2014 at 09:33:12PM +1100, Mark Nottingham wrote:
> Toy up at:
>   https://gist.github.com/mnot/434ab029a6e878b2af4c

Thank you, I could use it. I noticed that the random names it produces
can sometimes be used as a custom fixed header, sometimes as a custom
random header. It's no big deal, but I think it does not accurately
model reality since we'd rather have some fixed headers (eg: customer
name) and some always random ones (eg: signature, timestamp). I also
thought that we could have a few partially random values (those who
change from time to time such as x-forwarded-for behind a proxy), but
I don't think it will change things a lot anyway.

So in turn I have worked today :-)

I implemented a simple encoder which parses your program's output and
emits statistics on the output data. It does not emit the output bytes,
it just performs the encoding and counts. It's almost nothing to add,
it is just that I had no use for the output.

It supports 4 encodings :
  - draft 09
  - the proposal I sent that was called "option 3"
  - the proposed revision I sent just after it
  - Greg's proposed revision

It reports various statistics such as number of strings encoded, number
of integers encoded, average integer size etc... I have run some tests
all on the same output from your program, and got interesting findings
already :

Draft-09 :
        Total input bytes : 7455384
        Total output bytes : 2318395            (100%)
        Overall compression ratio : 0.310969    (100%)
        Total encoded integers: 218865
        Total encoded integers bytes: 295036    (100%)
        Avg bytes per integers: 1.348027        (100%)

option3 :
        Total input bytes : 7455384
        Total output bytes : 2268350            (97.84%)
        Overall compression ratio : 0.304257    (97.84%)
        Total encoded integers: 218865
        Total encoded integers bytes: 244991    (83.03%)
        Avg bytes per integers: 1.119370        (83.03%)

revised option3 :
        Total input bytes : 7455384
        Total output bytes : 2264722            (97.68%)
        Overall compression ratio : 0.303770    (97.68%)
        Total encoded integers: 218865
        Total encoded integers bytes: 241363    (81.81%)
        Avg bytes per integers: 1.102794        (81.81%)

Greg's revision :
        Total input bytes : 7455384
        Total output bytes : 2280713            (98.37%)
        Overall compression ratio : 0.305915    (98.37%)
        Total encoded integers: 218865
        Total encoded integers bytes: 257354    (87.23%)
        Avg bytes per integers: 1.175857        (87.23%)

First, the overall compression ratio is never exceptional given that the
input contains a significant amount of random data, so that's expected.
Second, we observe that the integer encoding is 17-18% smaller compared
to draft-09. And if we consider the integer encoding's overhead, then it
is even divided by 3.4 (0.34 byte to 0.10 byte per integer).

The overall savings are 2.1% for "option 3", 2.3% for its revision, and
1.7% for Greg's proposal. To my initial surprise, Greg's proposal provides
less savings here despite being balanced. But in the end there's a reason,
it offers more bits to literals while it's the case where we already have
to pay for the literal overhead so the occasional saving of 1-byte doesn't
save much.

I have experimented with an option in the code to write fully random headers
as literal-without-indexing (as the producer would do, but not a gateway
which doesn't know which ones are stable and which ones are not). And while
doing so improves the compression ratio, the offset from draft-09 and the
other ones does not change.

I have not yet tried to modify your program to vary the output between a
browser (less custom) or a partner site (more custom). But I wanted to share
these results already as I think they can be helpful.

All the code is available here :

  https://github.com/wtarreau/http2-exp

The readme is ugly when parsed as md, I've never written md docs so it
seems I'm lacking some basic practice here. But I'm sure nobody will
care, reading it in the console or as raw is OK.

Ah, there's also a debug mode which indicates what encoding is chosen for
each field and how long the resulting sequence is. It helped me debug it,
and I found it useful to understand how the table evolves.

Comments welcome. It's my first HPACK encoder, it's very possible that I
messed up a lot with certain things, though I didn't notice that. In any
case, feel free to comment/fork/fix/etc.

Best regards,
Willy

Received on Friday, 24 October 2014 18:56:34 UTC