Header Compression Overview from James M Snell on 2013-07-01 (ietf-http-wg@w3.org from July to September 2013)

From: James M Snell <jasnell@gmail.com>
Date: Mon, 1 Jul 2013 11:32:30 -0700
To: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Message-ID: <CABP7RbfrGwdU_RM+kjdRfLZ-FH8u1ZwuwNZ7YfRujGc=CW+Duw@mail.gmail.com>
Based on some of the questions I've seen posted so far, there seems to
be some confusion still on the header compression mechanism... Let me
see if I can help clear up some of the confusion.

The sender and receiver each maintain their own synchronized view of
the compression context.

The compression context consists of three components:

  1. A static table of common name+value pairs.
  2. A dynamic table of name+value pairs
  3. The current "reference set" of name+value pairs

When the connection is first established, both the dynamic table (#2)
and the reference set (#3) are empty.

The sender constructs a logical set of header name+value pairs, where
each name can have multiple values. Each distinct name+value
combination is treated independently of the others.

For each name+value pair in the set, the sender needs to determine
whether or not to use an Indexed or Literal representation. The
Indexed representation is appropriate if the name+value pair already
exists in either the static (#1) or dynamic (#2) tables. The Literal
representation is appropriate if the name+value pair either does not
exist in #1 or #2, or if the sender does not want the name+value pair
stored in the compression context (more on this part later)

For example, let's assume that the static table consists only of two
name+value pairs:

  0 = :method = GET
  1 = :method = POST
  2 = :path = /

Then, let's assume that I've created a new connection and the first
HEADERS frame I send to the server contains the header fields:

  :method = GET
  :path = /
  foo1 = bar
  foo2 = baz

When I process this set of header fields, I first check each pair
against the static table and find that :method = GET uses the index
(0) while :path = / uses the index (2). Since I don't find foo1 or
foo2 in the static table, I check the dynamic table. Since the
connection is brand new and the dynamic table is empty, I determine
that I have to use the Literal representation for those two fields.

  :method = GET  ==>  Indexed as (0)
  :path = / ==> Indexed as (2)
  foo1 = bar ==> Literal
  foo2 = bar ==> Literal

Now, still on the sending side, once I determine the kind of
representation, I need to determine what changes need to be made to
the reference set (#3). The reference set is what the sender and
receiver both use to determine what set of header fields are currently
active. Since this is a brand new connection, the reference set is
empty, so with this first set of headers, the changes to the reference
set are straightforward... I'm adding everything...

To add an Indexed representation of a header, all I do is reference
it's index.. that's easy enough to do...

To add a Literal representation of a header, there are two options:
With Indexing, or Without Indexing. With Indexing means that when the
header field is added to the Reference set (#3), the name+value pair
is also added to the dynamic table (#2). Without indexing means that
when
the header field is added to the Reference set (#3) the name+value
pair is NOT added to the dynamic table. Which of these to use to
entirely up to the sender. So I'm going to decide that foo1 is going
to be With Indexing, while foo2 is going to be Without Indexing.

(there is another option here between incremental vs. substitution
indexing but I'll get to that in a few minutes)

At this point, I have a set of instructions of how to modify the reference set:

  1. Add Index #0
  2. Add Index #2
  3. Add Literal "foo1 = bar" with Indexing
  4. Add Literal "foo2 = baz" without Indexing

At this point, I can create the serialized header block that will be
transmitted within the HEADERS frame. This block ONLY contains the
serialized set of instructions for modifying the reference set.

For instructions 1 and 2 (adding index #0 and index #1), the
serialization is very simple and requires exactly two bytes (0x80
0x82).

For instruction #3, it's slightly more complicated, but not much... we
have a single prefix byte (0x40) followed by the length prefixed
header name (0x04 0x67 0x6F 0x6F 0x31) and the length prefixed header
value (0x03 0x62 0x61 0x72).

For instruction #4, the encoding is only slightly different. The
single prefix byte is 0x60 followed by the length prefixed header name
(0x04 0x67 0x6F 0x6F 0x32) and the length prefixed value (0x03 0x62
0x61 0x7A).

Putting it all together, the complete sequence of octets that I will
send within the HEADERS frame is:

 0x80 0x82 0x40 0x04
 0x67 0x6F 0x6F 0x31
 0x03 0x62 0x61 0x72
 0x60 0x04 0x67 0x6F
 0x6F 0x32 0x03 0x62
 0x61 0x7A

Once I generate this sequence, a new entry is added to the dynamic table (#2):

  3 = foo1 = bar

And the reference set (#3) has been updated to include the four headers I sent:

  :method = GET
  :path = /
  foo1 = bar
  foo2 = baz

I send the serialized instructions to the other endpoint. It receives
those and reverses the process first by parsing the octet sequence to
determine the exact set of instructions it needs to follow to modify
it's own view of the reference set. Once it knows what the operations
are, it begins making the necessary changes to the reference set.
First, it sees two Index representations, so it looks those index
values up in the Static (#1) and Dynamic tables (#2). It finds Index
#0 and Index #1 in the static table so it adds those name+value pairs
to it's reference set (#3). Next, it finds the Literal With Indexing
instruction for "foo1 = bar". Since it is "With Indexing", the server
adds the name+value pair to the reference set (#3) *AND* to the
dynamic table (#2). Next it finds the Literal Without Indexing
instructions for "foo2 = baz". Since it is "Without Indexing", the
server only adds the name+value pair to the reference set and notes
the fact that the header was not indexed at all.

At this point, the sender and receiver should have identical views of
both the dynamic table (#2) and reference set (#3).

So far so good.

Let's move on... back on the sending side. Now I want to send another
request to the server. Some of the header fields remain the same, some
have different values, some are no longer used at all.

  :method = POST
  :path = /
  foo1 = new
  foo3 = new

When serializing these headers, I'm going to follow the same process,
but since our dynamic table (#2) and reference set (#3) are both no
longer empty, the results are going to be a bit different.

Going through each of the headers, I first determine that :method =
POST is already contained in the static set using Index #1. I also
note that :path = / is also in the static set using Index #2 (this
hasn't changed).

Looking at the dynamic table (#2), I see that there is a "foo1" header
with a different value sitting at Index #3. I make note of that index
position so I can use it next.

Lastly, I determine that the new header field "foo3 = new" is not in
either the static or dynamic set. I'm going to have to use the Literal
representation for that one.

Once I've determined the basic type of representation, I need to
determine how the current reference set needs to be changed. As a
reminder, here's what the current reference set looks like:

  :method = GET
  :path = /
  foo1 = bar
  foo2 = baz

Immediately, I see that :method = GET needs to be replaced with
:method = POST, since we know the static index positions for both of
those, we can simple turn one off by referencing it's index and turn
the other on (also by referencing it's index). That requires a simple
two-byte sequence (0x80 0x81).

I see that :path = / is already in the reference set. So for that one,
I don't have to do anything! Awesome. Let's move on.

We have an index position for the header field foo1 but not the full
name+value pair "foo1=bar". Remember when I briefly mentioned
Incremental Indexing vs. Substitution indexing? We need to make a
decision. When Incremental Indexing, I can add "foo1 = new" as an
entirely new entry in the dynamic table (#2) or I can replace the
value of the existing "foo1 = bar" entry sitting at index #3. The
choice is entirely up to the sender and can be based on a number of
factors... the most important of which is basic memory management. For
now, let's keep things simple and use Incremental Indexing. We want to
minimize the amount of data we're going to send, however, so instead
of sending the header field name as a length prefixed string, we're
going to reference the existing index position and send along a new
value ( 0x43 0x03 0x6E 0x65 0x77 ).

Moving on, we determine that "foo2 = baz" needs to be removed from the
reference set. We see, however, that "foo2 = baz" was previously sent
as a Literal without any indexing. This means that it will
automatically be removed from the reference set the next time around
without the sender having to do anything! That's helpful.

Now all we need to deal with is the new "foo2 = new" header field,
we'll do that using Literal With Indexing, like we did originally with
the "foo1 = bar" header field in the first request..  0x40 0x04 0x67
0x6F 0x6F 0x33 0x03 0x6E 0x65 0x77.

The complete sequence of octets for the second request is:

  0x80 0x81 0x43 0x03
  0x6E 0x65 0x77 0x40
  0x04 0x67 0x6F 0x6F
  0x33 0x03 0x6E 0x65
  0x77

After I construct this sequence, my dynamic table has two entries:

  foo1 = bar
  foo1 = new

And my reference set looks like:

  :method = POST
  :path = /
  foo1 = new
  foo3 = new

I send the HEADERS frame to the server and it deconstructs the set of
instructions, applies those to it's own reference set, updates it's
dynamic table, and goes on from there.

The process can get a bit complicated at times, but that's the basic
operation. There are a few other things to consider:

The size of the dynamic table can be bound by the recipient. The
sender is not permitted to exceed that size. If adding a new item to
the dynamic table will cause the table to grow beyond this limit,
existing name+value pairs need to be removed, starting with the
name+value pairs that were least-recently written to the table.

That sounds straightforward, however, given the distinction between
Incremental and Substitution indexing, the Least Recently Written rule
can become a bit complicated in the implementation. Using Substitution
indexing resets the Write time for that particular index, meaning that
eviction does not necessary occur in Index order.

For example, let's suppose that I send one HEADERS frame that
incrementally adds three items to the dynamic table. Let's call those
#1, #2 and #3. Later, I send a new HEADERS that substitutes a new
value for Index #2. Last, I send a new HEADERS frame that adds two new
headers for #4 and #5. The only problem is, adding #4 and #5 to the
dynamic table will cause it to exceed it's size limit, so I need to
evict some of the existing entries. Because of the substitution of #2,
Items #1 and #3 are the first to be evicted (in that order). So the
dynamic table contents become #2, #4, and #5).

Eviction needs to be handled carefully with regards to the current
reference set. Suppose, for instance, that in one request, I add "foo1
= bar" to the dynamic table and to the current reference set. In the
*same request*, I add "foo2 = baz" to the dynamic table and to the
current reference set, however, adding this second item causes the
dynamic table to exceed it's defined size, forcing the previously
written "foo1 = bar" to be evicted...

On the receiving end, I'll see the instruction to add "foo1 = bar"
first, which will add it to the reference set and the dynamic table,
then I'll see the instruction to add "foo2 = baz", which adds it to
the reference set and the dynamic table, but evicts "foo1 = bar" from
the dynamic table. Despite this eviction, both "foo1 = bar" and "foo2
= baz" are both STILL in the reference set.

The next time around, "foo1 = bar" ends up being treated as a
non-indexed literal and is dropped automatically from the reference
set if it is not explicitly mentioned in the new set of headers.

Another consideration that must be carefully weighed is the choice
between Incremental and Substitution indexing. If an implementation
always chooses to use Incremental Indexing, Eviction will become the
only reliable means of memory management. Long lived connections will
build up full compression contexts that tie up a potentially
significant amount of memory. This could be dangerous, especially
since there is no explicit mechanism for deleting items from the
dynamic table.

Anther challenge with Incremental Indexing is that index values grow
monotonically without any upper limit. Index positions are represented
in the serialization as variable width integers without any maximum
number of bytes. This means that long lived connections could end up
with quite large index positions that consume a large number of bytes.

Substitution Indexing provides only a partial solution to this problem
by making it possible to reuse existing index positions so long as the
header field name remain the same. There still exists the possibility
of tying up memory, however, if there are a large number of different
header names.

Through clever implementation, these memory management issues can be
addressed and mitigated, but implementers need to keep aware of the
issues.

Anyway, I hope that gives a helpful overview of the header compression
process and answers any of the lingering questions :-)

Roberto and Herve: If I missed any points, please speak up! :-)
Received on Monday, 1 July 2013 18:33:18 UTC