Re: Call for Consensus: Remove "reference set" from HPACK (to address #552) from Willy Tarreau on 2014-07-18 (ietf-http-wg@w3.org from July to September 2014)

From: Willy Tarreau <w@1wt.eu>
Date: Fri, 18 Jul 2014 07:49:10 +0200
To: Greg Wilkins <gregw@intalio.com>
Cc: HTTP Working Group <ietf-http-wg@w3.org>
Message-ID: <20140718054910.GH15425@1wt.eu>

Hi Greg,

On Fri, Jul 18, 2014 at 10:13:09AM +1000, Greg Wilkins wrote:
> Willy,
> 
> I see what you are saying.  However my experience with hpack is that
> fiddling the bit encodings is going to give you +/- <1% difference.

In fact I'm not trying to save much. I'm concerned that with the static
table below the dynamic one, we'll always require one extra byte for
any literal whose name isn't in the static table. And switching the two
very likely changes the importance of the header fields that were put in
the static table: initially it was not *that* important, mattering mostly
for the first request, but now it matters for all requests. Since most
header fields appear to be in the static table, it's not a big deal, but
I'd prefer that we ensure we don't miss any.

For example, some browsers send "TE". It's not present in the static
table, probably because it was not worth consuming an entry for a 2-byte
name which would end up in the dynamic table after the first request. But
now, referencing it from the dynamic table will systematically require one
extra byte if the value changes. I also gave the example of the XFF header,
though that one should not be an issue over the links where byte count matters.

> I can make +/- >5% differences by picking different encoding strategies and
> see similar differences over different stories over the test data.
> 
> So I think tweaking the encodings at this stage is really just operating
> within the noise of the different headers.   So I don't think we should
> change hpack in this way.  We just don't have the data to optimise the last
> 1% nor do we know if natural variation is such that it is pointless to try
> to find a one-size-perfectly-fits-all solution.

I'm really not fond of changing it either because I think it's properly
designed. But if we change the fundamental principle that the dynamic table
contains any recently emitted header at a low index, this changes
the index distribution.

> Removing RefSet is definitely a good thing to do.
> 
> Removing the copy of static entries to the dynamic table makes no measurable
> impact on compression, so I think we should remove it for simplicity and to
> save CPU - it was only added because of the RefSet.
> 
> With those removed,  I'm 90% happy with hpack
> 
> Once you remove the static copy, I do think there is an argument to be made
> to revert the static table to below the dynamic table. This is for
> simplicity and because it will allow the common field indexes to be
> precomputed as a single byte, saving CPU.

I know, and I'm not against this, quite the opposite in fact. You may remember
that two years ago I was discussing whether we should encode common headers as
a single byte or so :-)

> It does not appear to affect the compression efficiency in any significant
> way.

That's the point where I think we should be more careful and ensure we have
enough relevant data (e.g. some mobile browser requests captured before the
operators' transparent proxies).

If at least we had the ability to encode both the static table and the most
recent dynamic entries with a single index, I would feel better. For example
you can have another approach:
  - positive indexes = static table index
  - negative indexes = - dynamic table index

You encode (index + 10), so that you can encode up to the last 10 emitted
fields, and you can encode up to 54 static headers in a single byte (it's
just a matter of proper sorting but it's already reasonably clean). I think
you get the idea.
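
The biased-index idea above can be sketched as follows. This is only one
possible reading: it assumes static entries are numbered from 0 so that 10
dynamic plus 54 static entries fill exactly 64 codes, and the function and
constant names are hypothetical.

```python
from typing import Tuple

BIAS = 10          # recent dynamic entries reachable (from the message)
STATIC_SLOTS = 54  # static entries reachable (from the message)


def encode_index(signed_index: int) -> int:
    """Map a signed index to a code in 0..63.

    Assumption: -1..-10 select the 1st..10th most recent dynamic entry,
    0..53 select a static entry (0-based numbering is this sketch's choice).
    """
    if not -BIAS <= signed_index < STATIC_SLOTS:
        raise ValueError("index not representable in the short form")
    return signed_index + BIAS


def decode_index(code: int) -> Tuple[str, int]:
    """Inverse mapping: return which table the code selects and the position."""
    if not 0 <= code < BIAS + STATIC_SLOTS:
        raise ValueError("code out of range")
    signed_index = code - BIAS
    if signed_index < 0:
        return ("dynamic", -signed_index)  # 1 = most recently emitted
    return ("static", signed_index)
```

Under these assumptions decode_index(encode_index(-1)) selects the most recent
dynamic entry, and both tables share one small code space.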

> With this I'd be 95% happy.
> 
> Encoding dates as integers does appear to give some additional
> compression.  But apparently it has been proposed and rejected before by
> the WG.     So while I think it would be good, I'm not going to advocate
> that we revisit at this stage.

Same here, it's a desirable optimization but if we go down that route, we'll
continue with specific integer encoding for content-length and accept-ranges,
then we'll start to suggest that we encode all common content-type tokens as
a few bits, etc... And we're back to redefining the whole encoder. So I'd
rather avoid touching it in this regard.

BTW, I think that with the data you currently have and with your encoder, you
could check how often a literal refers to the dynamic table in your data set,
which will be a good indication of the relevance or not of my concern above.
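
A rough sketch of such a check, assuming the encoder can log, for each literal
it emits, the index used for the name (or None when the name is sent as a
string). The log format, the function names and the static table size passed
in are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def literal_name_sources(name_refs: Iterable[Optional[int]],
                         static_size: int) -> Counter:
    """Count where literal header names come from.

    Assumes the static table occupies indexes 1..static_size and the
    dynamic table starts right above it.
    """
    counts = Counter()
    for ref in name_refs:
        if ref is None:
            counts["new_name"] += 1   # name sent as a literal string
        elif ref <= static_size:
            counts["static"] += 1
        else:
            counts["dynamic"] += 1
    return counts


def dynamic_share(counts: Counter) -> float:
    """Fraction of literals whose name was referenced from the dynamic table."""
    total = sum(counts.values())
    return counts["dynamic"] / total if total else 0.0
```

A high dynamic share on real traffic would suggest the extra byte matters;
a low one would suggest it is lost in the noise.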

Thanks,
Willy
Received on Friday, 18 July 2014 05:49:34 UTC