RE: Header Compression Overview from Mike Bishop on 2013-07-02 (ietf-http-wg@w3.org from July to September 2013)

From: Mike Bishop <Michael.Bishop@microsoft.com>
Date: Tue, 2 Jul 2013 03:29:20 +0000
To: James M Snell <jasnell@gmail.com>
CC: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Message-ID: <756654949a1c4916bdfd6dac497137b4@BY2PR03MB025.namprd03.prod.outlook.com>
Yes -- we just need to clarify the eviction behavior.  I actually prefer the evict-from-front model, because it means there's no need for metadata to include a timestamp on each entry in the table.  You know which item you're going to evict, because it's clear.

-----Original Message-----
From: James M Snell [mailto:jasnell@gmail.com] 
Sent: Monday, July 1, 2013 5:41 PM
To: Mike Bishop
Cc: ietf-http-wg@w3.org
Subject: Re: Header Compression Overview

Ok, so I'm taking another look through the current header compression draft and I'd swear that it said some things differently the last time I read it ;-) ... (which is, of course, impossible... so I'm going to blame it on the 110+ degree heat here today and the fact that I've read so many different versions of header compression drafts over the last few months that my specific recollection of any individual version is a bit gooey in the middle). So.. with that said, allow me to make a few corrections to the explanation... Let's see if this is a bit closer to reality:

The sender and receiver each maintain their own synchronized view of the compression context.

The compression context consists of two components:

  1. A dynamic table of name+value pairs
  2. The current "reference set" of name+value pairs

When the connection is first established the reference set (#2) is empty and the header table is prepopulated.

The sender constructs a logical set of header name+value pairs, where each name can have multiple values. Each distinct name+value combination is treated independently of the others.

For each name+value pair in the set, the sender needs to determine whether or not to use an Indexed or Literal representation. The Indexed representation is appropriate if the name+value pair already exists in either the header table (#1). The Literal representation is appropriate if the name+value pair either does not exist in #1, or if the sender does not want the name+value pair stored in the compression context (more on this part later)

For example, let's assume that the header table initially consists of three name+value pairs:

  0 = :method = GET
  1 = :method = POST
  2 = :path = /

Then, let's assume that I've created a new connection and the first set of headers I send to the server contains the header fields:

  :method = GET
  :path = /
  foo1 = bar
  foo2 = baz

When I process this set of header fields, I first check each pair against the header table and find that :method = GET uses the index
(0) while :path = / uses the index (2). Since I don't find foo1 or
foo2 in the static table, I determine that I have to use the Literal representation for those two fields.

  :method = GET  ==>  Indexed as (0)
  :path = / ==> Indexed as (2)
  foo1 = bar ==> Literal
  foo2 = bar ==> Literal

Now, still on the sending side, once I determine the kind of representation, I need to determine what changes need to be made to the reference set (#2). The reference set is what the sender and receiver both use to determine what set of header fields are currently active. Since this is a brand new connection, the reference set is empty, so with this first set of headers, the changes to the reference set are straightforward... I'm adding everything...

To add an Indexed representation of a header, all I do is reference it's index.. that's easy enough to do...

To add a Literal representation of a header, there are two options:
With Indexing, or Without Indexing. With Indexing means that when the header field is added to the Reference set (#2), the name+value pair is also added to the header table (#1). Without indexing means that when the header field is added to the Reference set (#2) the
name+value pair is NOT added to the dynamic table. Which of these to
use to entirely up to the sender. So I'm going to decide that foo1 is going to be With Indexing, while foo2 is going to be Without Indexing.

(there is another option here between incremental vs. substitution indexing but I'll get to that in a few minutes)

At this point, I have a set of instructions of how to modify the reference set:

  1. Add Index #0
  2. Add Index #2
  3. Add Literal "foo1 = bar" with Indexing
  4. Add Literal "foo2 = baz" without Indexing

At this point, I can create the serialized header block that will be transmitted within the HEADERS frame. This block ONLY contains the serialized set of instructions for modifying the reference set.

For instructions 1 and 2 (adding index #0 and index #2), the serialization is very simple and requires exactly two bytes (0x81 0x83). (I have to bump up the index values by 1 when serializing for some reason)

For instruction #3, it's slightly more complicated, but not much... we have a single prefix byte (0x40) followed by the length prefixed header name (0x04 0x67 0x6F 0x6F 0x31) and the length prefixed header value (0x03 0x62 0x61 0x72).

For instruction #4, the encoding is only slightly different. The single prefix byte is 0x60 followed by the length prefixed header name
(0x04 0x67 0x6F 0x6F 0x32) and the length prefixed value (0x03 0x62
0x61 0x7A).

Putting it all together, the complete sequence of octets that I will send within the HEADERS frame is:

 0x81 0x83 0x40 0x04
 0x67 0x6F 0x6F 0x31
 0x03 0x62 0x61 0x72
 0x60 0x04 0x67 0x6F
 0x6F 0x32 0x03 0x62
 0x61 0x7A

Once I generate this sequence, a new entry is added to the header table (#1):

  3 = foo1 = bar

And the reference set (#2) has been updated to include the four headers I sent:

  :method = GET
  :path = /
  foo1 = bar
  foo2 = baz

(this is a simplification actually but it's close enough for now)

I send the serialized instructions to the other endpoint. It receives those and reverses the process first by parsing the octet sequence to determine the exact set of instructions it needs to follow to modify it's own view of the reference set. Once it knows what the operations are, it begins making the necessary changes to the reference set.
First, it sees two Index representations, so it looks those index values up in the header table (#1). It finds Index #0 and Index #2 in the static table so it adds those name+value pairs to it's reference set. Next, it finds the Literal With Indexing instruction for "foo1 = bar". Since it is "With Indexing", the server adds the name+value pair to the reference set *AND* to the header table. Next it finds the Literal Without Indexing instructions for "foo2 = baz". Since it is "Without Indexing", the server only adds the name+value pair to the reference set and notes the fact that the header was not indexed at all.

At this point, the sender and receiver should have identical views of both the header table and reference set.

So far so good.

Let's move on... back on the sending side. Now I want to send another request to the server. Some of the header fields remain the same, some have different values, some are no longer used at all.

  :method = POST
  :path = /
  foo1 = new
  foo3 = new

When serializing these headers, I'm going to follow the same process, but since our header table and reference set are both different, the serialized results are going to be a bit different.

Going through each of the headers, I first determine that :method = POST is already contained in the header table at Index #1. I also note that :path = / is also in the header table using Index #2 (this hasn't changed).

Also looking at the header table, I see that there is a "foo1" header with a different value sitting at Index #3. I make note of that index position so I can use it next.

Lastly, I determine that the new header field "foo3 = new" is not in header table at all. I'm going to have to use the Literal representation for that one.

Once I've determined the basic type of representation, I need to determine how the current reference set needs to be changed. As a reminder, here's what the current reference set looks like:

  :method = GET
  :path = /
  foo1 = bar
  foo2 = baz

Immediately, I see that :method = GET needs to be replaced with :method = POST, since we know the static index positions for both of those, we can simple turn one off by referencing it's index and turn the other on (also by referencing it's index). That requires a simple two-byte sequence (0x81 0x82).

I see that :path = / is already in the reference set. So for that one, I don't have to do anything! Awesome. Let's move on.

We have an index position for the header field foo1 but not the full
name+value pair "foo1=bar". Remember when I briefly mentioned
Incremental Indexing vs. Substitution indexing? We need to make a decision. When Incremental Indexing, I can add "foo1 = new" as an entirely new entry in the dynamic table (#2) or I can replace the value of the existing "foo1 = bar" entry sitting at index #3. The choice is entirely up to the sender and can be based on a number of factors... the most important of which is basic memory management. For now, let's keep things simple and use Incremental Indexing. We want to minimize the amount of data we're going to send, however, so instead of sending the header field name as a length prefixed string, we're going to reference the existing index position and send along a new value ( 0x44 0x03 0x6E 0x65 0x77 ). (again, I have to bump up the value of index value by 1)

Moving on, we determine that "foo2 = baz" needs to be removed from the reference set. We see, however, that "foo2 = baz" was previously sent as a Literal without any indexing. This means that it will automatically be removed from the reference set the next time around without the sender having to do anything! That's helpful.

Now all we need to deal with is the new "foo2 = new" header field, we'll do that using Literal With Indexing, like we did originally with the "foo1 = bar" header field in the first request..  0x40 0x04 0x67 0x6F 0x6F 0x33 0x03 0x6E 0x65 0x77.

The complete sequence of octets for the second request is:

  0x81 0x82 0x44 0x03
  0x6E 0x65 0x77 0x40
  0x04 0x67 0x6F 0x6F
  0x33 0x03 0x6E 0x65
  0x77

After I construct this sequence, my header table has five entries:

  0 = :method = GET
  1 = :method = POST
  2 = :path = /
  3 = foo1 = bar
  4 = foo1 = new

And my reference set looks like:

  :method = POST
  :path = /
  foo1 = new
  foo3 = new

I send the HEADERS frame to the server and it deconstructs the set of instructions, applies those to it's own reference set, updates it's header table, and goes on from there.

The process can get a bit complicated at times, but that's the basic operation. There are a few other things to consider:

The size of the dynamic table can be bound by the recipient. The sender is not permitted to exceed that size. If adding a new item to the dynamic table will cause the table to grow beyond this limit, existing name+value pairs need to be removed, starting with the
name+value pairs that were least-recently written to the table.

(((Ed. Note: We need clarity on this next part: )))

That sounds straightforward, however, given the distinction between Incremental and Substitution indexing, the Least Recently Written rule can become a bit complicated in the implementation. Using Substitution indexing resets the Write time for that particular index, meaning that eviction does not necessary occur in Index order.

For example, let's suppose that I send one HEADERS frame that incrementally adds three items to the dynamic table. Let's call those #1, #2 and #3. Later, I send a new HEADERS that substitutes a new value for Index #2. Last, I send a new HEADERS frame that adds two new headers for #4 and #5. The only problem is, adding #4 and #5 to the dynamic table will cause it to exceed it's size limit, so I need to evict some of the existing entries. Because of the substitution of #2, Items #1 and #3 are the first to be evicted (in that order). So the dynamic table contents become #2, #4, and #5).

(((Ed. Note: Or is it.... that Substitution DOES NOT change the Least-Recently-Written eviction order? In which case, a more recently substituted value can be dropped before a less-least-recently written Incremental value? That doesn't feel right to me...)))

Eviction needs to be handled carefully with regards to the current reference set. Suppose, for instance, that in one request, I add "foo1 = bar" to the header table and to the current reference set. In the *same request*, I add "foo2 = baz" to the header table and to the current reference set, however, adding this second item causes the header table to exceed it's defined size, forcing the previously written "foo1 = bar" to be evicted...

On the receiving end, I'll see the instruction to add "foo1 = bar"
first, which will add it to the reference set and the header table, then I'll see the instruction to add "foo2 = baz", which adds it to the reference set and the header table, but evicts "foo1 = bar" from the dynamic table. Despite this eviction, both "foo1 = bar" and "foo2 = baz" are both STILL in the reference set.

The next time around, "foo1 = bar" ends up being treated as a non-indexed literal and is dropped automatically from the reference set if it is not explicitly mentioned in the new set of headers.

Another consideration that must be carefully weighed is the choice between Incremental and Substitution indexing. If an implementation always chooses to use Incremental Indexing, Eviction will become the only reliable means of memory management. Long lived connections will build up full compression contexts that tie up a potentially significant amount of memory. This could be dangerous, especially since there is no explicit mechanism for deleting items from the header table.

Substitution Indexing provides only a partial solution to this problem by making it possible to reuse existing index positions. The problem with using Substitution indexing, however, is that it's not clear yet how it impact the LRU eviction semantics (see the Ed. Note above)

...

Ok, I *think* that's a more accurate description. Do folks agree?

- James
Received on Tuesday, 2 July 2013 03:30:19 UTC