Re: Header Compression Overview from James M Snell on 2013-07-02 (ietf-http-wg@w3.org from July to September 2013)

From: James M Snell <jasnell@gmail.com>
Date: Mon, 1 Jul 2013 17:41:04 -0700
To: Mike Bishop <Michael.Bishop@microsoft.com>
Cc: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Message-ID: <CABP7Rbfb074R-Bmd8jmDG2xx_7u22PCD2Xu3BUANhvL+MsUHcA@mail.gmail.com>
Ok, so I'm taking another look through the current header compression
draft and I'd swear that it said some things differently the last time
I read it ;-) ... (which is, of course, impossible... so I'm going to
blame it on the 110+ degree heat here today and the fact that I've
read so many different versions of header compression drafts over the
last few months that my specific recollection of any individual
version is a bit gooey in the middle). So.. with that said, allow me
to make a few corrections to the explanation... Let's see if this is a
bit closer to reality:

The sender and receiver each maintain their own synchronized view of
the compression context.

The compression context consists of two components:

  1. A dynamic table of name+value pairs
  2. The current "reference set" of name+value pairs

When the connection is first established the reference set (#2) is
empty and the header table is prepopulated.

The sender constructs a logical set of header name+value pairs, where
each name can have multiple values. Each distinct name+value
combination is treated independently of the others.

For each name+value pair in the set, the sender needs to determine
whether or not to use an Indexed or Literal representation. The
Indexed representation is appropriate if the name+value pair already
exists in either the header table (#1). The Literal representation is
appropriate if the name+value pair either does not
exist in #1, or if the sender does not want the name+value pair stored
in the compression context (more on this part later)

For example, let's assume that the header table initially consists of
three name+value pairs:

  0 = :method = GET
  1 = :method = POST
  2 = :path = /

Then, let's assume that I've created a new connection and the first
set of headers I send to the server contains the header fields:

  :method = GET
  :path = /
  foo1 = bar
  foo2 = baz

When I process this set of header fields, I first check each pair
against the header table and find that :method = GET uses the index
(0) while :path = / uses the index (2). Since I don't find foo1 or
foo2 in the static table, I determine that I have to use the Literal
representation for those two fields.

  :method = GET  ==>  Indexed as (0)
  :path = / ==> Indexed as (2)
  foo1 = bar ==> Literal
  foo2 = bar ==> Literal

Now, still on the sending side, once I determine the kind of
representation, I need to determine what changes need to be made to
the reference set (#2). The reference set is what the sender and
receiver both use to determine what set of header fields are currently
active. Since this is a brand new connection, the reference set is
empty, so with this first set of headers, the changes to the reference
set are straightforward... I'm adding everything...

To add an Indexed representation of a header, all I do is reference
it's index.. that's easy enough to do...

To add a Literal representation of a header, there are two options:
With Indexing, or Without Indexing. With Indexing means that when the
header field is added to the Reference set (#2), the name+value pair
is also added to the header table (#1). Without indexing means that
when the header field is added to the Reference set (#2) the
name+value pair is NOT added to the dynamic table. Which of these to
use to entirely up to the sender. So I'm going to decide that foo1 is
going to be With Indexing, while foo2 is going to be Without Indexing.

(there is another option here between incremental vs. substitution
indexing but I'll get to that in a few minutes)

At this point, I have a set of instructions of how to modify the reference set:

  1. Add Index #0
  2. Add Index #2
  3. Add Literal "foo1 = bar" with Indexing
  4. Add Literal "foo2 = baz" without Indexing

At this point, I can create the serialized header block that will be
transmitted within the HEADERS frame. This block ONLY contains the
serialized set of instructions for modifying the reference set.

For instructions 1 and 2 (adding index #0 and index #2), the
serialization is very simple and requires exactly two bytes (0x81
0x83). (I have to bump up the index values by 1 when serializing for
some reason)

For instruction #3, it's slightly more complicated, but not much... we
have a single prefix byte (0x40) followed by the length prefixed
header name (0x04 0x67 0x6F 0x6F 0x31) and the length prefixed header
value (0x03 0x62 0x61 0x72).

For instruction #4, the encoding is only slightly different. The
single prefix byte is 0x60 followed by the length prefixed header name
(0x04 0x67 0x6F 0x6F 0x32) and the length prefixed value (0x03 0x62
0x61 0x7A).

Putting it all together, the complete sequence of octets that I will
send within the HEADERS frame is:

 0x81 0x83 0x40 0x04
 0x67 0x6F 0x6F 0x31
 0x03 0x62 0x61 0x72
 0x60 0x04 0x67 0x6F
 0x6F 0x32 0x03 0x62
 0x61 0x7A

Once I generate this sequence, a new entry is added to the header table (#1):

  3 = foo1 = bar

And the reference set (#2) has been updated to include the four headers I sent:

  :method = GET
  :path = /
  foo1 = bar
  foo2 = baz

(this is a simplification actually but it's close enough for now)

I send the serialized instructions to the other endpoint. It receives
those and reverses the process first by parsing the octet sequence to
determine the exact set of instructions it needs to follow to modify
it's own view of the reference set. Once it knows what the operations
are, it begins making the necessary changes to the reference set.
First, it sees two Index representations, so it looks those index
values up in the header table (#1). It finds Index #0 and Index #2 in
the static table so it adds those name+value pairs to it's reference
set. Next, it finds the Literal With Indexing instruction for "foo1 =
bar". Since it is "With Indexing", the server adds the name+value pair
to the reference set *AND* to the header table. Next it finds the
Literal Without Indexing instructions for "foo2 = baz". Since it is
"Without Indexing", the server only adds the name+value pair to the
reference set and notes the fact that the header was not indexed at
all.

At this point, the sender and receiver should have identical views of
both the header table and reference set.

So far so good.

Let's move on... back on the sending side. Now I want to send another
request to the server. Some of the header fields remain the same, some
have different values, some are no longer used at all.

  :method = POST
  :path = /
  foo1 = new
  foo3 = new

When serializing these headers, I'm going to follow the same process,
but since our header table and reference set are both different, the
serialized results are going to be a bit different.

Going through each of the headers, I first determine that :method =
POST is already contained in the header table at Index #1. I also note
that :path = / is also in the header table using Index #2 (this hasn't
changed).

Also looking at the header table, I see that there is a "foo1" header
with a different value sitting at Index #3. I make note of that index
position so I can use it next.

Lastly, I determine that the new header field "foo3 = new" is not in
header table at all. I'm going to have to use the Literal
representation for that one.

Once I've determined the basic type of representation, I need to
determine how the current reference set needs to be changed. As a
reminder, here's what the current reference set looks like:

  :method = GET
  :path = /
  foo1 = bar
  foo2 = baz

Immediately, I see that :method = GET needs to be replaced with
:method = POST, since we know the static index positions for both of
those, we can simple turn one off by referencing it's index and turn
the other on (also by referencing it's index). That requires a simple
two-byte sequence (0x81 0x82).

I see that :path = / is already in the reference set. So for that one,
I don't have to do anything! Awesome. Let's move on.

We have an index position for the header field foo1 but not the full
name+value pair "foo1=bar". Remember when I briefly mentioned
Incremental Indexing vs. Substitution indexing? We need to make a
decision. When Incremental Indexing, I can add "foo1 = new" as an
entirely new entry in the dynamic table (#2) or I can replace the
value of the existing "foo1 = bar" entry sitting at index #3. The
choice is entirely up to the sender and can be based on a number of
factors... the most important of which is basic memory management. For
now, let's keep things simple and use Incremental Indexing. We want to
minimize the amount of data we're going to send, however, so instead
of sending the header field name as a length prefixed string, we're
going to reference the existing index position and send along a new
value ( 0x44 0x03 0x6E 0x65 0x77 ). (again, I have to bump up the
value of index value by 1)

Moving on, we determine that "foo2 = baz" needs to be removed from the
reference set. We see, however, that "foo2 = baz" was previously sent
as a Literal without any indexing. This means that it will
automatically be removed from the reference set the next time around
without the sender having to do anything! That's helpful.

Now all we need to deal with is the new "foo2 = new" header field,
we'll do that using Literal With Indexing, like we did originally with
the "foo1 = bar" header field in the first request..  0x40 0x04 0x67
0x6F 0x6F 0x33 0x03 0x6E 0x65 0x77.

The complete sequence of octets for the second request is:

  0x81 0x82 0x44 0x03
  0x6E 0x65 0x77 0x40
  0x04 0x67 0x6F 0x6F
  0x33 0x03 0x6E 0x65
  0x77

After I construct this sequence, my header table has five entries:

  0 = :method = GET
  1 = :method = POST
  2 = :path = /
  3 = foo1 = bar
  4 = foo1 = new

And my reference set looks like:

  :method = POST
  :path = /
  foo1 = new
  foo3 = new

I send the HEADERS frame to the server and it deconstructs the set of
instructions, applies those to it's own reference set, updates it's
header table, and goes on from there.

The process can get a bit complicated at times, but that's the basic
operation. There are a few other things to consider:

The size of the dynamic table can be bound by the recipient. The
sender is not permitted to exceed that size. If adding a new item to
the dynamic table will cause the table to grow beyond this limit,
existing name+value pairs need to be removed, starting with the
name+value pairs that were least-recently written to the table.

(((Ed. Note: We need clarity on this next part: )))

That sounds straightforward, however, given the distinction between
Incremental and Substitution indexing, the Least Recently Written rule
can become a bit complicated in the implementation. Using Substitution
indexing resets the Write time for that particular index, meaning that
eviction does not necessary occur in Index order.

For example, let's suppose that I send one HEADERS frame that
incrementally adds three items to the dynamic table. Let's call those
#1, #2 and #3. Later, I send a new HEADERS that substitutes a new
value for Index #2. Last, I send a new HEADERS frame that adds two new
headers for #4 and #5. The only problem is, adding #4 and #5 to the
dynamic table will cause it to exceed it's size limit, so I need to
evict some of the existing entries. Because of the substitution of #2,
Items #1 and #3 are the first to be evicted (in that order). So the
dynamic table contents become #2, #4, and #5).

(((Ed. Note: Or is it.... that Substitution DOES NOT change the
Least-Recently-Written eviction order? In which case, a more recently
substituted value can be dropped before a less-least-recently written
Incremental value? That doesn't feel right to me...)))

Eviction needs to be handled carefully with regards to the current
reference set. Suppose, for instance, that in one request, I add "foo1
= bar" to the header table and to the current reference set. In the
*same request*, I add "foo2 = baz" to the header table and to the
current reference set, however, adding this second item causes the
header table to exceed it's defined size, forcing the previously
written "foo1 = bar" to be evicted...

On the receiving end, I'll see the instruction to add "foo1 = bar"
first, which will add it to the reference set and the header table,
then I'll see the instruction to add "foo2 = baz", which adds it to
the reference set and the header table, but evicts "foo1 = bar" from
the dynamic table. Despite this eviction, both "foo1 = bar" and "foo2
= baz" are both STILL in the reference set.

The next time around, "foo1 = bar" ends up being treated as a
non-indexed literal and is dropped automatically from the reference
set if it is not explicitly mentioned in the new set of headers.

Another consideration that must be carefully weighed is the choice
between Incremental and Substitution indexing. If an implementation
always chooses to use Incremental Indexing, Eviction will become the
only reliable means of memory management. Long lived connections will
build up full compression contexts that tie up a potentially
significant amount of memory. This could be dangerous, especially
since there is no explicit mechanism for deleting items from the
header table.

Substitution Indexing provides only a partial solution to this problem
by making it possible to reuse existing index positions. The problem
with using Substitution indexing, however, is that it's not clear yet
how it impact the LRU eviction semantics (see the Ed. Note above)

...

Ok, I *think* that's a more accurate description. Do folks agree?

- James
Received on Tuesday, 2 July 2013 00:41:51 UTC