Header Serialization Discussion

Ok... so I've implemented HeaderDiff serialization [1] in java [2]...
just the serialization so far, will do deserialization early next
week. The implementation is very rough and definitely needs
improvement / optimization but it's functional enough to start from.

[1] http://tools.ietf.org/html/draft-ruellan-headerdiff-00
[2] https://github.com/jasnell/http2/tree/master/src/snell/http2/headers/headerdiff

I know that Roberto is working on refinements to his own concept of
the delta header serialization and hopefully he'll be sharing those
thoughts soon, but I wanted to get my current thoughts captured for
discussion.

The HeaderDiff approach is straightforward and effective and makes
efficient use of available bits. It's variable length integer syntax
is a bit complicated with the notion of bit-prefix lengths but once
you've got it it's easy to understand what's going on. The
implementation is a bit less complicated than delta, which is good and
I like that there's no "Header Group" notion. Headers are either on or
off per request. There are several concerns, however.

  - The true utility of the common prefix length mechanism is
questionable. Aside from the potential security risks, I questioning
just how effective it's going to be in practice. (What header fields
do we expect to actually use it in practice?)

  - The fact that items are never removed from the Name Table is
concerning and poses a potential security risk. The the decompressor
is forced to maintain a symbol table for every header name it
encounters, a malicious compressor could cause overruns by sending a
high number of junk header names. The compressor ought to be able to
treat the Name Table as an LRU, and ought to be able to place strict
limits on it's size, just as it does with the Header Table. Delta does
not have this issue because it's name indices are already tied to the
LRU.

With HeaderDiff in mind, I'm thinking about how we can bring
HeaderDiff and Delta together and roll in the additional concepts of
Typed Codecs. Here's what I have in mind.

Syntax:

The serialization syntax borrows concepts from both HeaderDiff and
Delta and is divided into two parts: Basic Header Serialization and
Value Serialization.

As with Delta currently, there are four operation codes (opcodes),
represented by two-bit identifiers. Each op code has a single-bit
"ephemeral flag". Operations are serialized into batches of no more
than 32 instances.

  00 0 00000 { HEADER }

The first two most significant bits...

  00 = Index (what Delta calls "Toggl")
  01 = Index Range (what Delta calls "Trang")
  10 = Clone
  11 = Literal (what Delta calls "KVSto")

The third most significant bit is the ephemeral toggle

  0 = Persist (the header value is to be persisted in the LRU table)
  1 = Ephemeral (the header value is not to be persisted in the LRU table)

The remaining five bits specify the number of distinct instances in
this batch. This counter is 0-based, so 0x0 = 1 item in the batch,
0x1F = 32 items in the batch.

The ephemeral flag is ignored for opcodes 00 and 01.

Indexed headers are identified by a single byte index value.

  0 0000000

The first most significant bit indicates whether or not the index is
part of the static header dictionary or lru cache.

  0 = LRU
  1 = Static

This gives us a strict maximum of 128 storage slots for the LRU.
That's limited *on purpose* to place strictly finite constraints on
the amount of state any single compressor can require. This also
ensures that we are always able to encode indices using only a single
byte.

As for state management, we would essentially use the Header Diff
model minus the separate Name Table. That is, there is no persistent
Header Group in use; just the Header Table. State management depends
on the specific opcodes used. State management occurs as the headers
are encountered in the serialized header block.

  00 = Use an indexed header instance from the header table for this request
  01 = Use multiple sequentially indexed header instances from the
header table for this request
  10 = Use a new value for an already existing header name, index
references an existing header from the table, persist the new value if
the ephemeral bit is not set.
  11 = Use a new key and value for this request. Persist the new pair
if the ephemeral bit is not set.

So, for example, the serialization:

  00000000 10000000 =
    Use index #0 from the static dictionary

  00000001 10000000 00000001 =
    Use index #0 from the static dictionary and index #1 from the lru cache

  01000001 10000000 10000001 00000001 00000010 =
    Use indices #0-#1 from the static dictionary and indices #1-#2
from the lru cache

  10000000 10000000 {VALUE}
    Provide a new value for the header name identified by index #0 in
the static dictionary and persist the value in the lru cache

  10100000 10000000 {VALUE}
    Provide a new value for the header name identified by index #0 in
the static dictionary but do not persist the value in the lru cache

Non-ephemeral items are pushed into the next available slot in the
LRU. Once all available slots have been used, the indices rotate. The
LRU can also be limited by storage space. As storage space is
required, the least recently items are dropped. Keeping in mind that
there are only 128 available storage slots, it will be important for
compressors to be selective about which fields are persisted.

(note.. the LRU is really a "least recently written" cache.. not
"least recently used"...)

These values are tracked only for the current message. There is no
Header Group as currently defined by Delta so there is no need to
track whether or not any given header is currently on or off. It's
either referenced for this request or it's not.

If there are more than 32 instances of any single opcode per
serialized header block (which is unlikely), a single opcode may
appear multiple times within the serialized block.

As far as Value serialization is concerned... As I have discussed
previously, I have been looking at support for four distinct value
types, each represented by their own two-bit identifier. As I will
explain further in a second, these types are represented using the two
most significant bits in a byte:

  00 = Text
  01 = Number
  10 = Date Time
  11 = Binary

Text can be either UTF-8 or ISO-8859-1, indicated by a single bit flag
following the type code. All text strings are prefixed by it's length
given as an unsigned variant length integer
Numbers are encoded as unsigned variable length integers.
Date Times are encoded as Numbers representing the number of seconds
that have passed since the epoch (GMT).
Binary data is encoded as raw binary octets prefixed by an unsigned
variable length integer specifying the number of bytes.

Any single header value instance can consist of up to 32 distinct
value tokens. The number of tokens is specified using the 5 least
significant bytes of type type flag.

For instance:

  00000000 00000001 01100001 =
    A single ISO-8859-1 Text value with length = 1

  00100000 00000001 01100001  =
    A single UTF-8 Text value with length = 1

  00100001 00000001 01100001 00000001 01100010 =
    Two UTF-8 Text values, each with length = 1

  01000000 00000001 =
    A single Number value

  10000000 00000001 =
    A single Date value (1 second after the epoch)

  11000000 00000001 00000001 =
    A single Raw Binary value

For ISO-8859-1 Text, the Static Huffman Code used by Delta would be
used for the value. If we can develop an approach to effectively
handling Huffman coding for arbitrary UTF-8, then we can apply Huffman
coding to that as well.

For the Number and DateTime serialization:

  - 16-bit numbers serialize with a strict maximum of 3 bytes
  - 32-bit numbers serialize with a strict maximum of 5 bytes.
  - 64-bit numbers serialize with a strict maximum of 10 bytes.
  - Date Times will serialize with five bytes for the reasonably
relevant future, then six bytes for quite some time after that. Dates
prior to the epoch cannot be represented.

In order to properly deal with the backwards compatibility concerns
for HTTP/1, there are several important rules for use of Typed Codecs
in HTTP headers:

1. Headers must be explicitly defined to use the new header types. All
existing HTTP/1 headers, then, will continue to be required to be
represented as ISO-8859-1 Text unless their standard definitions are
updated. The HTTP/2 specification would update the definition of
specific known headers (e.g. content-length, date, if-modified-since,
etc).

2. Extension headers that use the typed codecs will have specific
normative transformations to ISO-8859-1 defined.
    a. UTF-8 Text will be converted to ISO-8859-1 with extended
characters pct-encoded
    b. Numbers will be converted to their ASCII equivalent values.
    c. Date Times will be converted to their HTTP-Date equivalent values.
    d. Binary fields will be Base64-encoded.

3. There will be no normative transformation from ISO-8859-1 values
into the typed codecs. Implementations are free to apply
transformation where those impls determine it is appropriate, but it
will be perfectly legal for an implementation to pass a text value
through even if it is known that a given header type has a typed codec
equivalent (for instance, Content-Length may come through as a number
or a text value, either will be valid). This means that when
translating from HTTP/1 -> HTTP/2, receiving implementations need to
be prepared to handle either value form.

This approach combines what I feel are the best concepts from
HeaderDiff and Delta and provides simple, straightforward header
serialization and state management. Obviously, lots of testing is
required, however.

As always, thoughts/opinions/gripes are appreciated.

- James

Received on Saturday, 13 April 2013 22:02:11 UTC