Re: incrementally indexed headers should be inserted in index '0' instead of len(table)+1 from Martin Thomson on 2013-08-15 (ietf-http-wg@w3.org from July to September 2013)

From: Martin Thomson <martin.thomson@gmail.com>
Date: Thu, 15 Aug 2013 14:21:38 -0700
To: Roberto Peon <grmocg@gmail.com>
Cc: HTTP Working Group <ietf-http-wg@w3.org>
Message-ID: <CABkgnnU1BmR-HjTPpPz4BW+MtUfjawhdeFpCRksdBg22iWD+0A@mail.gmail.com>

That's an interesting graph.  Let me just clarify two things:

1. Does this show the frequency distribution for the distance from the
current appearance of a given name/value pair to the most recent appearance
of the same pair.  Aren't we more interested in the distance to the first
appearance of the same pair?  Would it make sense to also have data on
distance between last appearance and first appearance as well (maybe we do
want the triangular distribution resulting from multiple consecutive
:method=GET pairings, and number of reuses is interesting, but duration of
usage is also interesting)?

2. When you say "distance", is this number of messages since last seeing
the same values, or is it the number of headers processed since seeing the
same value?


On 15 August 2013 14:06, Roberto Peon <grmocg@gmail.com> wrote:

> I did this testing a while back (look for email entitled "Compression
> analysis of perfect atom-based compressor").
>
> Someone can hack something up to do this again on the new corpus, but the
> data was pretty unequivocal.
> I've attached a PNG version of the freq vs dist-from-newest-element graph.
>
> Again, this was with a "perfect" FIFO-based compressor, which had no
> expiry and infinite space over the test corpus (with each domain having a
> separate compressor, but summing into the final total) that was available
> to us before the new larger, if munged, corpus that was recently
> contributed.
>
>
> This is a graph of the number of times that you'd like to have
> backreferenced something (y-axis) at the distance from the newest element
> (x-axis).
> If we always found that we wished to backreference the 10th most recent
> item (which is most certainly not the case, but IF), then this graph would
> be a single spike at 10.
>
> And here is the data.
> [image: Inline image 1]
> I'd welcome someone computing this again!
>
> -=R
>
>
> On Thu, Aug 15, 2013 at 1:30 PM, Martin Thomson <martin.thomson@gmail.com>wrote:
>
>> On 15 August 2013 13:00, Roberto Peon <grmocg@gmail.com> wrote:
>> > Currently, index 0 is the 'top' of the table, and when a new header is
>> added
>> > using incremental indexing, it is added at the 'bottom' (i.e. the
>> highest
>> > number index).
>>
>> There are two separate issues here.  Strangely, the right decision
>> depends on the same question.  See below.
>>
>> The first is whether the most recently added values should appear at
>> index 0 or index n when adding to the table.  This wouldn't matter at
>> all if it weren't for the fact that small values take less space to
>> encode.  So, any decision we make should be based on what we believe
>> is going to be reused more often: things that are added recently, or
>> things that have been present for a long time.  If it's recent items
>> that are being used more often, we probably need to add at index 0.
>>
>> The second relates to eviction: which items be evicted from the header
>> table?  Those that were most recently added, or those that have been
>> around for a long time?  As I hinted before, this depends on the same
>> question.
>>
>> Intuitively at least, I believe that there will be a split on the
>> header reuse question.  Some things, like :method=GET, are probably
>> going to be quite stable.  Other things, like cookies, tend to change
>> such that old values are invalid (Cookie in particular might be a good
>> candidate for substitution).  If there is a split, then I can imagine
>> that some optimizations could give good returns... maybe something
>> analogous to a generational garbage collector.  But then, I can't
>> imagine a way to avoid the resulting performance and complexity
>> issues.
>>
>> But that's just analysis.  Testing might be an easier way to decide,
>> or at least to inform the choice.
>>
>
>

Attachments

image/png attachment: freq_vs_dist_from_newest_element.png

Received on Thursday, 15 August 2013 21:22:05 UTC