RE: Header Compression Overview from James M Snell on 2013-07-02 (ietf-http-wg@w3.org from July to September 2013)

From: James M Snell <jasnell@gmail.com>
Date: Mon, 1 Jul 2013 23:21:30 -0700
To: Mike Bishop <Michael.Bishop@microsoft.com>
Cc: ietf-http-wg@w3.org
Message-ID: <CABP7RbfauE_w8ESUKS8xFYkH01Ax-2T85tuUSh35PkzqCvH_PQ@mail.gmail.com>
After implementing several variations, I want something even simpler... A
fixed index range, assigned incrementally with rollover at the end and no
substitution. It becomes very simple that way.
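
To make that concrete, here is a rough sketch of what such a table could
look like. The class name, the method names and the table size are made up
for illustration; nothing here comes from a draft.

```python
class RolloverTable:
    """Fixed-size header table: indices are handed out in order and wrap around."""

    def __init__(self, size):
        self.slots = [None] * size  # one (name, value) pair per index
        self.next_index = 0         # index the next insertion will use

    def add(self, name, value):
        index = self.next_index
        # After rollover this overwrites whatever was written to the slot earliest.
        self.slots[index] = (name, value)
        self.next_index = (index + 1) % len(self.slots)
        return index

    def get(self, index):
        return self.slots[index]


table = RolloverTable(size=3)
indices = [table.add("foo%d" % i, "bar") for i in range(4)]
print(indices)       # the fourth insertion reuses index 0
print(table.get(0))
```

With no substitution, the slot to overwrite is always known from the
insertion counter alone, so no per-entry metadata is needed.
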
On Jul 1, 2013 8:29 PM, "Mike Bishop" <Michael.Bishop@microsoft.com> wrote:

> Yes -- we just need to clarify the eviction behavior.  I actually prefer
> the evict-from-front model, because it means there's no need for metadata
> to include a timestamp on each entry in the table.  You know which item
> you're going to evict, because it's clear.
>
> -----Original Message-----
> From: James M Snell [mailto:jasnell@gmail.com]
> Sent: Monday, July 1, 2013 5:41 PM
> To: Mike Bishop
> Cc: ietf-http-wg@w3.org
> Subject: Re: Header Compression Overview
>
> Ok, so I'm taking another look through the current header compression
> draft and I'd swear that it said some things differently the last time I
> read it ;-) ... (which is, of course, impossible... so I'm going to blame
> it on the 110+ degree heat here today and the fact that I've read so many
> different versions of header compression drafts over the last few months
> that my specific recollection of any individual version is a bit gooey in
> the middle). So.. with that said, allow me to make a few corrections to the
> explanation... Let's see if this is a bit closer to reality:
>
> The sender and receiver each maintain their own synchronized view of the
> compression context.
>
> The compression context consists of two components:
>
>   1. A header table (the dynamic table) of name+value pairs
>   2. The current "reference set" of name+value pairs
>
> When the connection is first established the reference set (#2) is empty
> and the header table is prepopulated.
>
> The sender constructs a logical set of header name+value pairs, where each
> name can have multiple values. Each distinct name+value combination is
> treated independently of the others.
>
> For each name+value pair in the set, the sender needs to determine whether
> to use an Indexed or a Literal representation. The Indexed representation
> is appropriate if the name+value pair already exists in the header table
> (#1). The Literal representation is appropriate if the name+value pair
> either does not exist in #1, or if the sender does not want the name+value
> pair stored in the compression context (more on this part later).
>
> For example, let's assume that the header table initially consists of
> three name+value pairs:
>
>   0 = :method = GET
>   1 = :method = POST
>   2 = :path = /
>
> Then, let's assume that I've created a new connection and the first set of
> headers I send to the server contains the header fields:
>
>   :method = GET
>   :path = /
>   foo1 = bar
>   foo2 = baz
>
> When I process this set of header fields, I first check each pair against
> the header table and find that :method = GET uses the index (0) while
> :path = / uses the index (2). Since I don't find foo1 or foo2 in the
> header table, I determine that I have to use the Literal representation
> for those two fields.
>
>   :method = GET ==> Indexed as (0)
>   :path = / ==> Indexed as (2)
>   foo1 = bar ==> Literal
>   foo2 = baz ==> Literal
>
> Now, still on the sending side, once I determine the kind of
> representation, I need to determine what changes need to be made to the
> reference set (#2). The reference set is what the sender and receiver both
> use to determine what set of header fields are currently active. Since this
> is a brand new connection, the reference set is empty, so with this first
> set of headers, the changes to the reference set are straightforward... I'm
> adding everything...
>
> To add an Indexed representation of a header, all I do is reference its
> index... that's easy enough to do...
>
> To add a Literal representation of a header, there are two options:
> With Indexing, or Without Indexing. With Indexing means that when the
> header field is added to the reference set (#2), the name+value pair is
> also added to the header table (#1). Without Indexing means that when the
> header field is added to the reference set (#2), the name+value pair is
> NOT added to the header table. Which of these to use is entirely up to the
> sender. So I'm going to decide that foo1 is going to be With Indexing,
> while foo2 is going to be Without Indexing.
>
> (there is another option here between incremental vs. substitution
> indexing but I'll get to that in a few minutes)
>
> At this point, I have a set of instructions of how to modify the reference
> set:
>
>   1. Add Index #0
>   2. Add Index #2
>   3. Add Literal "foo1 = bar" with Indexing
>   4. Add Literal "foo2 = baz" without Indexing
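
The decision process that produces this list could be sketched roughly as
follows. The function name, the tuple layout and the index_names policy
argument are all illustrative assumptions, not anything defined by the
draft.

```python
# Header table from the example above: position in the list is the index.
header_table = [(":method", "GET"), (":method", "POST"), (":path", "/")]


def plan_instructions(headers, table, index_names):
    """Return one instruction per header: Indexed if the pair is in the table, else Literal."""
    instructions = []
    for name, value in headers:
        if (name, value) in table:
            instructions.append(("indexed", table.index((name, value))))
        else:
            # index_names is a hypothetical sender policy: the names whose
            # literals should be sent With Indexing.
            instructions.append(("literal", name, value, name in index_names))
    return instructions


first_request = [(":method", "GET"), (":path", "/"), ("foo1", "bar"), ("foo2", "baz")]
instructions = plan_instructions(first_request, header_table, index_names={"foo1"})
for step in instructions:
    print(step)
```
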
>
> At this point, I can create the serialized header block that will be
> transmitted within the HEADERS frame. This block ONLY contains the
> serialized set of instructions for modifying the reference set.
>
> For instructions 1 and 2 (adding index #0 and index #2), the serialization
> is very simple and requires exactly two bytes (0x81 0x83). (I have to bump
> up the index values by 1 when serializing for some reason)
>
> For instruction #3, it's slightly more complicated, but not much... we
> have a single prefix byte (0x40) followed by the length-prefixed header
> name (0x04 0x66 0x6F 0x6F 0x31) and the length-prefixed header value (0x03
> 0x62 0x61 0x72).
>
> For instruction #4, the encoding is only slightly different. The single
> prefix byte is 0x60 followed by the length-prefixed header name
> (0x04 0x66 0x6F 0x6F 0x32) and the length-prefixed value (0x03 0x62
> 0x61 0x7A).
>
> Putting it all together, the complete sequence of octets that I will send
> within the HEADERS frame is:
>
>  0x81 0x83 0x40 0x04
>  0x66 0x6F 0x6F 0x31
>  0x03 0x62 0x61 0x72
>  0x60 0x04 0x66 0x6F
>  0x6F 0x32 0x03 0x62
>  0x61 0x7A
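
The serialization just described can be sketched as below. This follows
only the byte layout given in this message (0x80 plus index+1 for Indexed,
0x40 / 0x60 prefixes for Literal With / Without Indexing, one-byte string
lengths); the function names are invented for the example, and longer
strings or larger indexes are deliberately not handled.

```python
def encode_string(text):
    """Length-prefixed ASCII string; this sketch only supports lengths below 128."""
    data = text.encode("ascii")
    if len(data) > 0x7F:
        raise ValueError("string too long for this sketch")
    return bytes([len(data)]) + data


def encode_indexed(index):
    """Indexed representation: high bit set, index bumped up by 1."""
    if not 0 <= index < 0x7F:
        raise ValueError("index out of range for this sketch")
    return bytes([0x80 | (index + 1)])


def encode_literal(name, value, indexing):
    """Literal with a new name: 0x40 With Indexing, 0x60 Without Indexing."""
    prefix = 0x40 if indexing else 0x60
    return bytes([prefix]) + encode_string(name) + encode_string(value)


block = (
    encode_indexed(0)                                   # :method = GET
    + encode_indexed(2)                                 # :path = /
    + encode_literal("foo1", "bar", indexing=True)
    + encode_literal("foo2", "baz", indexing=False)
)
# In ASCII, "f" is 0x66 and "o" is 0x6F, so "foo1" serializes as 66 6F 6F 31.
print(" ".join("0x%02X" % b for b in block))
```
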
>
> Once I generate this sequence, a new entry is added to the header table
> (#1):
>
>   3 = foo1 = bar
>
> And the reference set (#2) has been updated to include the four headers I
> sent:
>
>   :method = GET
>   :path = /
>   foo1 = bar
>   foo2 = baz
>
> (this is a simplification actually but it's close enough for now)
>
> I send the serialized instructions to the other endpoint. It receives
> those and reverses the process, first by parsing the octet sequence to
> determine the exact set of instructions it needs to follow to modify its
> own view of the reference set. Once it knows what the operations are, it
> begins making the necessary changes to the reference set.
> First, it sees two Indexed representations, so it looks those index values
> up in the header table (#1). It finds Index #0 and Index #2 in the header
> table so it adds those name+value pairs to its reference set. Next, it
> finds the Literal With Indexing instruction for "foo1 = bar". Since it is
> "With Indexing", the server adds the name+value pair to the reference set
> *AND* to the header table. Next it finds the Literal Without Indexing
> instruction for "foo2 = baz". Since it is "Without Indexing", the server
> only adds the name+value pair to the reference set and notes the fact that
> the header was not indexed at all.
>
> At this point, the sender and receiver should have identical views of both
> the header table and reference set.
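
A rough sketch of the receiving side for this first header block follows.
It assumes the byte layout described in this message and nothing more:
Substitution Indexing, long strings and eviction are left out, and all
function names are invented for the example.

```python
def read_string(data, pos):
    """Read a one-byte-length-prefixed ASCII string; return (text, next position)."""
    length = data[pos]
    text = data[pos + 1:pos + 1 + length].decode("ascii")
    return text, pos + 1 + length


def decode_block(data, table, reference_set):
    """Apply each instruction in data to the receiver's table and reference set."""
    pos = 0
    while pos < len(data):
        byte = data[pos]
        pos += 1
        if byte & 0x80:
            # Indexed: toggles the referenced pair in or out of the reference set.
            pair = table[(byte & 0x7F) - 1]
            if pair in reference_set:
                reference_set.remove(pair)
            else:
                reference_set.append(pair)
            continue
        kind = byte & 0xE0
        if kind not in (0x40, 0x60):
            raise ValueError("representation not covered by this sketch")
        name_ref = byte & 0x1F  # assumed: 0 means a literal name follows
        if name_ref == 0:
            name, pos = read_string(data, pos)
        else:
            name = table[name_ref - 1][0]
        value, pos = read_string(data, pos)
        reference_set.append((name, value))
        if kind == 0x40:  # With Indexing: also store the pair in the header table
            table.append((name, value))


table = [(":method", "GET"), (":method", "POST"), (":path", "/")]
reference_set = []
first_block = bytes.fromhex(
    "81 83 40 04 66 6f 6f 31 03 62 61 72 60 04 66 6f 6f 32 03 62 61 7a"
)
decode_block(first_block, table, reference_set)
print(table)
print(reference_set)
```
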
>
> So far so good.
>
> Let's move on... back on the sending side. Now I want to send another
> request to the server. Some of the header fields remain the same, some have
> different values, some are no longer used at all.
>
>   :method = POST
>   :path = /
>   foo1 = new
>   foo3 = new
>
> When serializing these headers, I'm going to follow the same process, but
> since our header table and reference set are both different, the serialized
> results are going to be a bit different.
>
> Going through each of the headers, I first determine that :method = POST
> is already contained in the header table at Index #1. I also note that
> :path = / is in the header table using Index #2 (this hasn't changed).
>
> Also looking at the header table, I see that there is a "foo1" header with
> a different value sitting at Index #3. I make note of that index position
> so I can use it next.
>
> Lastly, I determine that the new header field "foo3 = new" is not in the
> header table at all. I'm going to have to use the Literal representation
> for that one.
>
> Once I've determined the basic type of representation, I need to determine
> how the current reference set needs to be changed. As a reminder, here's
> what the current reference set looks like:
>
>   :method = GET
>   :path = /
>   foo1 = bar
>   foo2 = baz
>
> Immediately, I see that :method = GET needs to be replaced with :method =
> POST. Since we know the index positions for both of those, we can simply
> turn one off by referencing its index and turn the other on (also by
> referencing its index). That requires a simple two-byte sequence (0x81
> 0x82).
>
> I see that :path = / is already in the reference set. So for that one, I
> don't have to do anything! Awesome. Let's move on.
>
> We have an index position for the header field name foo1 but not for the
> full name+value pair "foo1 = new". Remember when I briefly mentioned
> Incremental Indexing vs. Substitution Indexing? We need to make a
> decision. With Incremental Indexing, I add "foo1 = new" as an entirely new
> entry in the header table (#1); with Substitution Indexing, I replace the
> value of the existing "foo1 = bar" entry sitting at index #3. The choice is
> entirely up to the sender and can be based on a number of factors... the
> most important of which is basic memory management. For now, let's keep
> things simple and use Incremental Indexing. We want to minimize the amount
> of data we're going to send, however, so instead of sending the header
> field name as a length-prefixed string, we're going to reference the
> existing index position and send along a new value (0x44 0x03 0x6E 0x65
> 0x77). (Again, I have to bump up the index value by 1.)
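
As a small sketch of that name-by-reference form: the code below assumes
the low bits of the 0x40 prefix byte carry the name's index plus one, which
is what 0x44 for index #3 suggests. The function name and the 5-bit limit
are assumptions made for the example.

```python
def encode_literal_indexed_name(name_index, value):
    """Literal With Incremental Indexing whose name is taken from a table entry."""
    if not 0 <= name_index + 1 <= 0x1F:
        # Assumed limit: only the low 5 bits of the prefix byte hold index + 1.
        raise ValueError("name index out of range for this sketch")
    data = value.encode("ascii")
    if len(data) > 0x7F:
        raise ValueError("value too long for this sketch")
    return bytes([0x40 | (name_index + 1), len(data)]) + data


encoded = encode_literal_indexed_name(3, "new")  # name "foo1" lives at index #3
print(" ".join("0x%02X" % b for b in encoded))
```

Referencing the name this way costs five octets here, against ten for the
same field sent with a literal name.
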
>
> Moving on, we determine that "foo2 = baz" needs to be removed from the
> reference set. We see, however, that "foo2 = baz" was previously sent as a
> Literal without any indexing. This means that it will automatically be
> removed from the reference set the next time around without the sender
> having to do anything! That's helpful.
>
> Now all we need to deal with is the new "foo3 = new" header field. We'll
> do that using Literal With Indexing, like we did originally with the "foo1
> = bar" header field in the first request: 0x40 0x04 0x66 0x6F 0x6F 0x33
> 0x03 0x6E 0x65 0x77.
>
> The complete sequence of octets for the second request is:
>
>   0x81 0x82 0x44 0x03
>   0x6E 0x65 0x77 0x40
>   0x04 0x66 0x6F 0x6F
>   0x33 0x03 0x6E 0x65
>   0x77
>
> After I construct this sequence, my header table has six entries:
>
>   0 = :method = GET
>   1 = :method = POST
>   2 = :path = /
>   3 = foo1 = bar
>   4 = foo1 = new
>   5 = foo3 = new
>
> And my reference set looks like:
>
>   :method = POST
>   :path = /
>   foo1 = new
>   foo3 = new
>
> I send the HEADERS frame to the server and it deconstructs the set of
> instructions, applies those to its own reference set, updates its header
> table, and goes on from there.
>
> The process can get a bit complicated at times, but that's the basic
> operation. There are a few other things to consider:
>
> The size of the dynamic table can be bounded by the recipient. The sender
> is not permitted to exceed that size. If adding a new item to the dynamic
> table will cause the table to grow beyond this limit, existing name+value
> pairs need to be removed, starting with the name+value pairs that were
> least-recently written to the table.
>
> (((Ed. Note: We need clarity on this next part: )))
>
> That sounds straightforward, however, given the distinction between
> Incremental and Substitution indexing, the Least Recently Written rule can
> become a bit complicated in the implementation. Using Substitution indexing
> resets the Write time for that particular index, meaning that eviction does
> not necessarily occur in Index order.
>
> For example, let's suppose that I send one HEADERS frame that
> incrementally adds three items to the dynamic table. Let's call those #1,
> #2 and #3. Later, I send a new HEADERS frame that substitutes a new value
> for Index #2. Last, I send a new HEADERS frame that adds two new headers
> for #4 and #5. The only problem is, adding #4 and #5 to the dynamic table
> will cause it to exceed its size limit, so I need to evict some of the
> existing entries. Because of the substitution of #2, Items #1 and #3 are
> the first to be evicted (in that order). So the dynamic table contents
> become #2, #4, and #5.
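
The interpretation in this example, where a substitution resets the write
time, could be modelled roughly as below. This is only one of the two
readings the editorial notes ask about. The class and its names are
hypothetical, and the limit is counted in entries rather than in bytes to
keep the sketch short.

```python
import itertools


class LeastRecentlyWrittenTable:
    def __init__(self, max_entries):
        self.max_entries = max_entries   # simplified limit: entries, not bytes
        self.entries = {}                # index -> (write stamp, name, value)
        self.clock = itertools.count()   # monotonically increasing write stamps
        self.next_index = 1
        self.evicted = []                # indexes in the order they were evicted

    def add(self, name, value):
        """Incremental indexing: evict the least recently written entries to make room."""
        while self.entries and len(self.entries) >= self.max_entries:
            oldest = min(self.entries, key=lambda i: self.entries[i][0])
            del self.entries[oldest]
            self.evicted.append(oldest)
        index = self.next_index
        self.next_index += 1
        self.entries[index] = (next(self.clock), name, value)
        return index

    def substitute(self, index, name, value):
        """Substitution indexing: same index, but the write stamp is refreshed."""
        self.entries[index] = (next(self.clock), name, value)


table = LeastRecentlyWrittenTable(max_entries=3)
for name in ("a", "b", "c"):
    table.add(name, "1")        # items #1, #2, #3
table.substitute(2, "b", "2")   # new value for index #2
table.add("d", "1")             # item #4
table.add("e", "1")             # item #5
print(table.evicted)
print(sorted(table.entries))
```

Under the other reading, substitute() would keep the original write stamp
and the refreshed entry #2 would be evicted before #3.
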
>
> (((Ed. Note: Or is it.... that Substitution DOES NOT change the
> Least-Recently-Written eviction order? In which case, a more recently
> substituted value can be dropped before a less recently written
> Incremental value? That doesn't feel right to me...)))
>
> Eviction needs to be handled carefully with regard to the current
> reference set. Suppose, for instance, that in one request, I add "foo1 =
> bar" to the header table and to the current reference set. In the *same
> request*, I add "foo2 = baz" to the header table and to the current
> reference set. However, adding this second item causes the header table to
> exceed its defined size, forcing the previously written "foo1 = bar" to be
> evicted...
>
> On the receiving end, I'll see the instruction to add "foo1 = bar"
> first, which will add it to the reference set and the header table, then
> I'll see the instruction to add "foo2 = baz", which adds it to the
> reference set and the header table, but evicts "foo1 = bar" from the
> header table. Despite this eviction, "foo1 = bar" and "foo2 = baz" are
> both STILL in the reference set.
>
> The next time around, "foo1 = bar" ends up being treated as a non-indexed
> literal and is dropped automatically from the reference set if it is not
> explicitly mentioned in the new set of headers.
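
The interaction between eviction and the reference set described here can
be sketched as follows. The one-entry limit, the function names and the
carry-over rule as coded are illustrative assumptions based only on this
description.

```python
from collections import deque

MAX_ENTRIES = 1       # hypothetical limit, counted in entries rather than bytes

table = deque()       # header table, oldest entry on the left
reference_set = []    # name+value pairs currently active


def add_with_indexing(pair):
    """Add pair to the reference set and the table, evicting from the table if full."""
    reference_set.append(pair)
    while len(table) >= MAX_ENTRIES:
        table.popleft()  # eviction touches the table only, not the reference set
    table.append(pair)


def carry_over(mentioned):
    """Pairs that remain active for the next header set.

    Pairs still backed by the table stay; pairs without a table entry are
    treated like non-indexed literals and stay only if sent again.
    """
    return [p for p in reference_set if p in table or p in mentioned]


add_with_indexing(("foo1", "bar"))
add_with_indexing(("foo2", "baz"))   # evicts foo1 = bar from the table

print(list(table))
print(reference_set)
print(carry_over(mentioned=set()))
```
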
>
> Another consideration that must be carefully weighed is the choice between
> Incremental and Substitution indexing. If an implementation always chooses
> to use Incremental Indexing, Eviction will become the only reliable means
> of memory management. Long lived connections will build up full compression
> contexts that tie up a potentially significant amount of memory. This could
> be dangerous, especially since there is no explicit mechanism for deleting
> items from the header table.
>
> Substitution Indexing provides only a partial solution to this problem by
> making it possible to reuse existing index positions. The problem with
> using Substitution indexing, however, is that it's not clear yet how it
> impacts the LRU eviction semantics (see the Ed. Note above).
>
> ...
>
> Ok, I *think* that's a more accurate description. Do folks agree?
>
> - James
>
>
>
Received on Tuesday, 2 July 2013 06:21:58 UTC