Re: Header Compression Overview from James M Snell on 2013-07-02 (ietf-http-wg@w3.org from July to September 2013)

From: James M Snell <jasnell@gmail.com>
Date: Mon, 1 Jul 2013 23:46:02 -0700
To: Roberto Peon <grmocg@gmail.com>
Cc: Mike Bishop <Michael.Bishop@microsoft.com>, ietf-http-wg@w3.org
Message-ID: <CABP7RbdkPX7_Kp5pwH5vcf6k_KHMbvah79_A705DgH5F8g87dw@mail.gmail.com>
Understood, which is why I agreed the currently spec'd approach is a good
starting point for the first implementation draft. :) I'd honestly be
happier not having stateful header compression at all. But ah well... As
long as whatever we have is clearly explained and extremely well tested, I
can live with it.
On Jul 1, 2013 11:26 PM, "Roberto Peon" <grmocg@gmail.com> wrote:

> This does not adapt well to key-values which change quickly or are small.
> -=R
>
>
> On Mon, Jul 1, 2013 at 11:21 PM, James M Snell <jasnell@gmail.com> wrote:
>
>> After implementing several variations, I want something even simpler...
>> A fixed index range, assigned incrementally with rollover at the end and
>> no substitution. It becomes very simple that way.
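A rough sketch of the simpler scheme described just above, assuming a fixed number of slots that are filled in order and overwritten once the index rolls over. The class name RolloverTable and the slot count are hypothetical and are not taken from any draft.

```python
class RolloverTable:
    """Fixed index range: slots are assigned in order and wrap at the end."""

    def __init__(self, size):
        self.slots = [None] * size
        self.next_slot = 0

    def add(self, name, value):
        # Always write to the next slot. There is no substitution of a
        # caller-chosen index, so both peers can track the position trivially.
        index = self.next_slot
        self.slots[index] = (name, value)
        self.next_slot = (index + 1) % len(self.slots)
        return index


table = RolloverTable(size=3)
assigned = [table.add("foo%d" % i, "bar") for i in range(1, 5)]
print(assigned)     # the fourth add rolls over to slot 0
print(table.slots)
```

With this model the eviction victim is always the slot about to be overwritten, so no per-entry timestamp is needed.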
>>  On Jul 1, 2013 8:29 PM, "Mike Bishop" <Michael.Bishop@microsoft.com>
>> wrote:
>>
>>> Yes -- we just need to clarify the eviction behavior.  I actually prefer
>>> the evict-from-front model, because it means there's no need for metadata
>>> to include a timestamp on each entry in the table.  You know which item
>>> you're going to evict, because it's clear.
>>>
>>> -----Original Message-----
>>> From: James M Snell [mailto:jasnell@gmail.com]
>>> Sent: Monday, July 1, 2013 5:41 PM
>>> To: Mike Bishop
>>> Cc: ietf-http-wg@w3.org
>>> Subject: Re: Header Compression Overview
>>>
>>> Ok, so I'm taking another look through the current header compression
>>> draft and I'd swear that it said some things differently the last time I
>>> read it ;-) ... (which is, of course, impossible... so I'm going to blame
>>> it on the 110+ degree heat here today and the fact that I've read so many
>>> different versions of header compression drafts over the last few months
>>> that my specific recollection of any individual version is a bit gooey in
>>> the middle). So.. with that said, allow me to make a few corrections to the
>>> explanation... Let's see if this is a bit closer to reality:
>>>
>>> The sender and receiver each maintain their own synchronized view of the
>>> compression context.
>>>
>>> The compression context consists of two components:
>>>
>>>   1. A dynamic table of name+value pairs
>>>   2. The current "reference set" of name+value pairs
>>>
>>> When the connection is first established the reference set (#2) is empty
>>> and the header table is prepopulated.
>>>
>>> The sender constructs a logical set of header name+value pairs, where
>>> each name can have multiple values. Each distinct name+value combination is
>>> treated independently of the others.
>>>
>>> For each name+value pair in the set, the sender needs to determine
>>> whether to use an Indexed or a Literal representation. The Indexed
>>> representation is appropriate if the name+value pair already exists in
>>> the header table (#1). The Literal representation is appropriate if
>>> the name+value pair either does not exist in #1, or if the sender does not
>>> want the name+value pair stored in the compression context (more on this
>>> part later).
>>>
>>> For example, let's assume that the header table initially consists of
>>> three name+value pairs:
>>>
>>>   0 = :method = GET
>>>   1 = :method = POST
>>>   2 = :path = /
>>>
>>> Then, let's assume that I've created a new connection and the first set
>>> of headers I send to the server contains the header fields:
>>>
>>>   :method = GET
>>>   :path = /
>>>   foo1 = bar
>>>   foo2 = baz
>>>
>>> When I process this set of header fields, I first check each pair
>>> against the header table and find that :method = GET uses the index
>>> (0) while :path = / uses the index (2). Since I don't find foo1 or
>>> foo2 in the header table, I determine that I have to use the Literal
>>> representation for those two fields.
>>>
>>>   :method = GET  ==>  Indexed as (0)
>>>   :path = / ==> Indexed as (2)
>>>   foo1 = bar ==> Literal
>>>   foo2 = baz ==> Literal
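The lookup step above can be sketched as follows. This is only an illustration: the function name choose_representation is hypothetical, and the header table is modelled as a plain list whose positions double as index values.

```python
# Header table from the example; the list position doubles as the index.
header_table = [(":method", "GET"), (":method", "POST"), (":path", "/")]


def choose_representation(table, name, value):
    """Return ('indexed', i) on a full name+value match, else ('literal', None)."""
    for index, entry in enumerate(table):
        if entry == (name, value):
            return ("indexed", index)
    return ("literal", None)


request = [(":method", "GET"), (":path", "/"), ("foo1", "bar"), ("foo2", "baz")]
choices = [choose_representation(header_table, n, v) for n, v in request]
for (name, value), choice in zip(request, choices):
    print(name, "=", value, "==>", choice)
```

A sender may still pick Literal for a pair that is present in the table; the sketch only covers the "is it there at all" check.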
>>>
>>> Now, still on the sending side, once I determine the kind of
>>> representation, I need to determine what changes need to be made to the
>>> reference set (#2). The reference set is what the sender and receiver both
>>> use to determine what set of header fields are currently active. Since this
>>> is a brand new connection, the reference set is empty, so with this first
>>> set of headers, the changes to the reference set are straightforward... I'm
>>> adding everything...
>>>
>>> To add an Indexed representation of a header, all I do is reference its
>>> index... that's easy enough to do.
>>>
>>> To add a Literal representation of a header, there are two options:
>>> With Indexing, or Without Indexing. With Indexing means that when the
>>> header field is added to the Reference set (#2), the name+value pair is
>>> also added to the header table (#1). Without Indexing means that when the
>>> header field is added to the Reference set (#2), the
>>> name+value pair is NOT added to the header table. Which of these to
>>> use is entirely up to the sender. So I'm going to decide that foo1 is
>>> going to be With Indexing, while foo2 is going to be Without Indexing.
>>>
>>> (there is another option here between incremental vs. substitution
>>> indexing but I'll get to that in a few minutes)
>>>
>>> At this point, I have a set of instructions of how to modify the
>>> reference set:
>>>
>>>   1. Add Index #0
>>>   2. Add Index #2
>>>   3. Add Literal "foo1 = bar" with Indexing
>>>   4. Add Literal "foo2 = baz" without Indexing
>>>
>>> At this point, I can create the serialized header block that will be
>>> transmitted within the HEADERS frame. This block ONLY contains the
>>> serialized set of instructions for modifying the reference set.
>>>
>>> For instructions 1 and 2 (adding index #0 and index #2), the
>>> serialization is very simple and requires exactly two bytes (0x81 0x83). (I
>>> have to bump up the index values by 1 when serializing for some reason)
>>>
>>> For instruction #3, it's slightly more complicated, but not much... we
>>> have a single prefix byte (0x40) followed by the length-prefixed header
>>> name (0x04 0x66 0x6F 0x6F 0x31) and the length-prefixed header value (0x03
>>> 0x62 0x61 0x72).
>>>
>>> For instruction #4, the encoding is only slightly different. The single
>>> prefix byte is 0x60 followed by the length-prefixed header name
>>> (0x04 0x66 0x6F 0x6F 0x32) and the length-prefixed value (0x03 0x62
>>> 0x61 0x7A).
>>>
>>> Putting it all together, the complete sequence of octets that I will
>>> send within the HEADERS frame is:
>>>
>>>  0x81 0x83 0x40 0x04
>>>  0x66 0x6F 0x6F 0x31
>>>  0x03 0x62 0x61 0x72
>>>  0x60 0x04 0x66 0x6F
>>>  0x6F 0x32 0x03 0x62
>>>  0x61 0x7A
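A minimal sketch of the serialization walked through above, assuming indexes and string lengths small enough to fit in a single octet. The helper names are hypothetical; the prefix values (0x80 for Indexed, 0x40 for Literal With Indexing, 0x60 for Literal Without Indexing) are the ones quoted in this message. The octets are computed from ASCII, so the letter "f" comes out as 0x66.

```python
def indexed(index):
    # High bit marks an Indexed representation; the index is sent as index + 1.
    return bytes([0x80 | (index + 1)])


def length_prefixed(text):
    # One length octet followed by the ASCII octets (short strings only).
    data = text.encode("ascii")
    return bytes([len(data)]) + data


def literal_new_name(name, value, with_indexing):
    prefix = 0x40 if with_indexing else 0x60
    return bytes([prefix]) + length_prefixed(name) + length_prefixed(value)


first_block = (
    indexed(0)                                   # :method = GET
    + indexed(2)                                 # :path = /
    + literal_new_name("foo1", "bar", True)      # With Indexing
    + literal_new_name("foo2", "baz", False)     # Without Indexing
)
print(" ".join("0x%02X" % octet for octet in first_block))
```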
>>>
>>> Once I generate this sequence, a new entry is added to the header table
>>> (#1):
>>>
>>>   3 = foo1 = bar
>>>
>>> And the reference set (#2) has been updated to include the four headers
>>> I sent:
>>>
>>>   :method = GET
>>>   :path = /
>>>   foo1 = bar
>>>   foo2 = baz
>>>
>>> (this is a simplification actually but it's close enough for now)
>>>
>>> I send the serialized instructions to the other endpoint. It receives
>>> those and reverses the process first by parsing the octet sequence to
>>> determine the exact set of instructions it needs to follow to modify its
>>> own view of the reference set. Once it knows what the operations are, it
>>> begins making the necessary changes to the reference set.
>>> First, it sees two Index representations, so it looks those index values
>>> up in the header table (#1). It finds Index #0 and Index #2 in the header
>>> table so it adds those name+value pairs to its reference set. Next, it
>>> finds the Literal With Indexing instruction for "foo1 = bar". Since it is
>>> "With Indexing", the server adds the name+value pair to the reference set
>>> *AND* to the header table. Next it finds the Literal Without Indexing
>>> instructions for "foo2 = baz". Since it is "Without Indexing", the server
>>> only adds the name+value pair to the reference set and notes the fact that
>>> the header was not indexed at all.
>>>
>>> At this point, the sender and receiver should have identical views of
>>> both the header table and reference set.
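The receiving side can be sketched in the same spirit. This is an assumption-laden illustration, not the draft's decoding algorithm: it handles only the Indexed, Literal With Indexing (0x40) and Literal Without Indexing (0x60) forms used in the first request, assumes single-octet indexes and lengths, and models the reference set as a plain set of pairs. The input octets are the first request's block with "f" written as its ASCII value 0x66.

```python
def decode(block, table, reference_set):
    """Apply a serialized header block to the table and the reference set."""
    pos = 0

    def read_string():
        nonlocal pos
        length = block[pos]  # assumes a single length octet
        text = block[pos + 1:pos + 1 + length].decode("ascii")
        pos += 1 + length
        return text

    while pos < len(block):
        octet = block[pos]
        pos += 1
        if octet & 0x80:
            # Indexed: toggle the referenced entry in the reference set.
            entry = table[(octet & 0x7F) - 1]
            if entry in reference_set:
                reference_set.remove(entry)
            else:
                reference_set.add(entry)
            continue
        kind = octet & 0xE0
        if kind not in (0x40, 0x60):
            raise ValueError("representation not covered by this sketch")
        name_index = octet & 0x1F
        name = read_string() if name_index == 0 else table[name_index - 1][0]
        value = read_string()
        if kind == 0x40:
            table.append((name, value))  # With Indexing: also goes in the table
        reference_set.add((name, value))


table = [(":method", "GET"), (":method", "POST"), (":path", "/")]
reference_set = set()
received = bytes.fromhex(
    "81 83 40 04 66 6f 6f 31 03 62 61 72 60 04 66 6f 6f 32 03 62 61 7a")
decode(received, table, reference_set)
print(table)
print(sorted(reference_set))
```

After the call the table holds four entries and the reference set holds the four headers, matching the sender's view.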
>>>
>>> So far so good.
>>>
>>> Let's move on... back on the sending side. Now I want to send another
>>> request to the server. Some of the header fields remain the same, some have
>>> different values, some are no longer used at all.
>>>
>>>   :method = POST
>>>   :path = /
>>>   foo1 = new
>>>   foo3 = new
>>>
>>> When serializing these headers, I'm going to follow the same process,
>>> but since our header table and reference set are both different, the
>>> serialized results are going to be a bit different.
>>>
>>> Going through each of the headers, I first determine that :method = POST
>>> is already contained in the header table at Index #1. I also note that
>>> :path = / is also in the header table using Index #2 (this hasn't changed).
>>>
>>> Also looking at the header table, I see that there is a "foo1" header
>>> with a different value sitting at Index #3. I make note of that index
>>> position so I can use it next.
>>>
>>> Lastly, I determine that the new header field "foo3 = new" is not in
>>> header table at all. I'm going to have to use the Literal representation
>>> for that one.
>>>
>>> Once I've determined the basic type of representation, I need to
>>> determine how the current reference set needs to be changed. As a reminder,
>>> here's what the current reference set looks like:
>>>
>>>   :method = GET
>>>   :path = /
>>>   foo1 = bar
>>>   foo2 = baz
>>>
>>> Immediately, I see that :method = GET needs to be replaced with :method
>>> = POST. Since we know the static index positions for both of those, we can
>>> simply turn one off by referencing its index and turn the other on (also
>>> by referencing its index). That requires a simple two-byte sequence (0x81
>>> 0x82).
>>>
>>> I see that :path = / is already in the reference set. So for that one, I
>>> don't have to do anything! Awesome. Let's move on.
>>>
>>> We have an index position for the header field name foo1 but not for the
>>> full name+value pair "foo1 = new". Remember when I briefly mentioned
>>> Incremental Indexing vs. Substitution Indexing? We need to make a
>>> decision. With Incremental Indexing, I add "foo1 = new" as an entirely
>>> new entry in the header table (#1); with Substitution Indexing, I replace
>>> the value of the existing "foo1 = bar" entry sitting at index #3. The
>>> choice is entirely up to the sender and can be based on a number of
>>> factors... the most important of which is basic memory management. For now,
>>> let's keep things simple and use Incremental Indexing. We want to minimize
>>> the amount of data we're going to send, however, so instead of sending the
>>> header field name as a length-prefixed string, we're going to reference the
>>> existing index position and send along a new value (0x44 0x03 0x6E 0x65
>>> 0x77). (Again, I have to bump up the index value by 1.)
>>>
>>> Moving on, we determine that "foo2 = baz" needs to be removed from the
>>> reference set. We see, however, that "foo2 = baz" was previously sent as a
>>> Literal without any indexing. This means that it will automatically be
>>> removed from the reference set the next time around without the sender
>>> having to do anything! That's helpful.
>>>
>>> Now all we need to deal with is the new "foo3 = new" header field. We'll
>>> do that using Literal With Indexing, like we did originally with the "foo1
>>> = bar" header field in the first request: 0x40 0x04 0x66 0x6F 0x6F 0x33
>>> 0x03 0x6E 0x65 0x77.
>>>
>>> The complete sequence of octets for the second request is:
>>>
>>>   0x81 0x82 0x44 0x03
>>>   0x6E 0x65 0x77 0x40
>>>   0x04 0x66 0x6F 0x6F
>>>   0x33 0x03 0x6E 0x65
>>>   0x77
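The second block can be reproduced with a similar sketch. The helper names are hypothetical, and the same single-octet assumptions apply. The new piece is a literal whose name is referenced by index: the index (plus one) is folded into the 0x40 prefix octet, as in the 0x44 example above. Octets are computed from ASCII, so "f" is 0x66.

```python
def indexed(index):
    # Indexed representation: high bit set, index sent as index + 1.
    return bytes([0x80 | (index + 1)])


def length_prefixed(text):
    data = text.encode("ascii")
    return bytes([len(data)]) + data


def literal_incremental(value, name=None, name_index=None):
    """Literal with Incremental Indexing; name is new or referenced by index."""
    if name_index is not None:
        return bytes([0x40 | (name_index + 1)]) + length_prefixed(value)
    return bytes([0x40]) + length_prefixed(name) + length_prefixed(value)


second_block = (
    indexed(0)                                    # turn :method = GET off
    + indexed(1)                                  # turn :method = POST on
    + literal_incremental("new", name_index=3)    # foo1 = new, name from index 3
    + literal_incremental("new", name="foo3")     # foo3 = new, brand-new name
)
print(" ".join("0x%02X" % octet for octet in second_block))
```

Nothing is emitted for :path = / (already in the reference set) or for foo2 = baz (a non-indexed literal that drops out on its own).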
>>>
>>> After I construct this sequence, my header table has six entries:
>>>
>>>   0 = :method = GET
>>>   1 = :method = POST
>>>   2 = :path = /
>>>   3 = foo1 = bar
>>>   4 = foo1 = new
>>>   5 = foo3 = new
>>>
>>> And my reference set looks like:
>>>
>>>   :method = POST
>>>   :path = /
>>>   foo1 = new
>>>   foo3 = new
>>>
>>> I send the HEADERS frame to the server and it deconstructs the set of
>>> instructions, applies those to its own reference set, updates its header
>>> table, and goes on from there.
>>>
>>> The process can get a bit complicated at times, but that's the basic
>>> operation. There are a few other things to consider:
>>>
>>> The size of the dynamic table can be bound by the recipient. The sender
>>> is not permitted to exceed that size. If adding a new item to the dynamic
>>> table will cause the table to grow beyond this limit, existing name+value
>>> pairs need to be removed, starting with the
>>> name+value pairs that were least-recently written to the table.
>>>
>>> (((Ed. Note: We need clarity on this next part: )))
>>>
>>> That sounds straightforward, however, given the distinction between
>>> Incremental and Substitution indexing, the Least Recently Written rule can
>>> become a bit complicated in the implementation. Using Substitution indexing
>>> resets the Write time for that particular index, meaning that eviction does
>>> not necessarily occur in Index order.
>>>
>>> For example, let's suppose that I send one HEADERS frame that
>>> incrementally adds three items to the dynamic table. Let's call those #1,
>>> #2 and #3. Later, I send a new HEADERS that substitutes a new value for
>>> Index #2. Last, I send a new HEADERS frame that adds two new headers for #4
>>> and #5. The only problem is, adding #4 and #5 to the dynamic table will
>>> cause it to exceed its size limit, so I need to evict some of the existing
>>> entries. Because of the substitution of #2, Items #1 and #3 are the first
>>> to be evicted (in that order). So the dynamic table contents become #2, #4,
>>> and #5.
>>>
>>> (((Ed. Note: Or is it.... that Substitution DOES NOT change the
>>> Least-Recently-Written eviction order? In which case, a more recently
>>> substituted value can be dropped before a less recently written
>>> Incremental value? That doesn't feel right to me...)))
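To make the two readings in the Ed. Notes concrete, here is a small simulation. It is a sketch under stated assumptions: the table limit is counted in entries rather than bytes, write order is tracked with a simple counter, and the class name, the substitution_resets_write_time flag and the run helper are all hypothetical.

```python
import itertools


class HeaderTable:
    def __init__(self, max_entries, substitution_resets_write_time):
        self.max_entries = max_entries
        self.resets = substitution_resets_write_time
        self.entries = {}               # index -> (header, write tick)
        self.clock = itertools.count()  # monotonically increasing write counter
        self.next_index = 1

    def add(self, header):
        """Incremental add; evicts least-recently-written entries if full."""
        evicted = []
        while len(self.entries) >= self.max_entries:
            oldest = min(self.entries, key=lambda i: self.entries[i][1])
            del self.entries[oldest]
            evicted.append(oldest)
        self.entries[self.next_index] = (header, next(self.clock))
        self.next_index += 1
        return evicted

    def substitute(self, index, header):
        """Replace the entry at index, optionally refreshing its write tick."""
        old_tick = self.entries[index][1]
        tick = next(self.clock) if self.resets else old_tick
        self.entries[index] = (header, tick)


def run(resets):
    table = HeaderTable(3, resets)
    for header in ("a", "b", "c"):      # items #1, #2, #3
        table.add(header)
    table.substitute(2, "b2")           # new value for #2
    evicted = table.add("d") + table.add("e")   # items #4 and #5
    return evicted, sorted(table.entries)


reset_result = run(True)    # substitution refreshes the write time
keep_result = run(False)    # substitution leaves the write time alone
print(reset_result)
print(keep_result)
```

Under the first reading the survivors are #2, #4 and #5; under the second, #1 and #2 go first and #3, #4 and #5 remain.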
>>>
>>> Eviction needs to be handled carefully with regards to the current
>>> reference set. Suppose, for instance, that in one request, I add "foo1 =
>>> bar" to the header table and to the current reference set. In the *same
>>> request*, I add "foo2 = baz" to the header table and to the current
>>> reference set. However, adding this second item causes the header table to
>>> exceed its defined size, forcing the previously written "foo1 = bar" to be
>>> evicted...
>>>
>>> On the receiving end, I'll see the instruction to add "foo1 = bar"
>>> first, which will add it to the reference set and the header table, then
>>> I'll see the instruction to add "foo2 = baz", which adds it to the
>>> reference set and the header table, but evicts "foo1 = bar" from the
>>> dynamic table. Despite this eviction, "foo1 = bar" and "foo2 = baz"
>>> are both STILL in the reference set.
>>>
>>> The next time around, "foo1 = bar" ends up being treated as a
>>> non-indexed literal and is dropped automatically from the reference set if
>>> it is not explicitly mentioned in the new set of headers.
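The interaction between eviction and the reference set can be sketched like this, assuming a table limit of one entry (counted in entries, not bytes) and evict-from-front ordering. The function name and the limit are hypothetical.

```python
from collections import deque


def add_with_indexing(table, reference_set, header, max_entries):
    # Evict from the front until there is room, then append the new entry.
    while len(table) >= max_entries:
        table.popleft()
    table.append(header)
    # Eviction from the table does not touch the reference set.
    reference_set.add(header)


table = deque()
reference_set = set()
add_with_indexing(table, reference_set, ("foo1", "bar"), max_entries=1)
add_with_indexing(table, reference_set, ("foo2", "baz"), max_entries=1)

# Entries still backed by the table; anything else behaves like a
# non-indexed literal on the next header block.
still_indexed = {header for header in reference_set if header in table}
print(list(table))
print(sorted(reference_set))
print(still_indexed)
```

Both headers are active for the current request, but only "foo2 = baz" carries over by index.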
>>>
>>> Another consideration that must be carefully weighed is the choice
>>> between Incremental and Substitution indexing. If an implementation always
>>> chooses to use Incremental Indexing, Eviction will become the only reliable
>>> means of memory management. Long lived connections will build up full
>>> compression contexts that tie up a potentially significant amount of
>>> memory. This could be dangerous, especially since there is no explicit
>>> mechanism for deleting items from the header table.
>>>
>>> Substitution Indexing provides only a partial solution to this problem
>>> by making it possible to reuse existing index positions. The problem with
>>> using Substitution indexing, however, is that it's not clear yet how it
>>> impacts the LRU eviction semantics (see the Ed. Note above).
>>>
>>> ...
>>>
>>> Ok, I *think* that's a more accurate description. Do folks agree?
>>>
>>> - James
>>>
>>>
>>>
>
Received on Tuesday, 2 July 2013 06:46:33 UTC