HTTP/2 Header Encoding Status Update from James M Snell on 2013-02-27 (ietf-http-wg@w3.org from January to March 2013)

From: James M Snell <jasnell@gmail.com>
Date: Wed, 27 Feb 2013 13:16:16 -0800
To: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Message-ID: <CABP7RbfK9jT=-wXqv8wo6fJr8Wg0g9SYTZ3FeXHC=4yhihdsug@mail.gmail.com>
As I'm not going to be able to meet y'all in Orlando, I wanted to give
a quick status update on where I'm at with the Header Encoding.

After much experimentation, implementation and discussion with
Roberto, I've got an implementation that is a good balance between
Delta and BOHE. It uses the fundamental delta encoding mechanism that
Roberto created (with a few notable changes) but allows for four
distinct types of header values: text, numeric, date and raw-binary.
Every header value is encoded with a single additional byte flag that
indicates the value type. Further, the flags indicate whether text is
ISO-8559-1 or UTF-8 and whether it is huffman-coded or not.  This
scheme gives us a good balance and allows us to achieve maximum
compression ratios and increased long term functionality without
sacrificing too much in complexity or backwards compatibility.

Text values:

  The rules for text values are simple:

  1. Text can be either UTF-8 or ISO-8859-1 Encoded. A single bit in
the flags is used to indicate which it is.
  2. Text can be huffman coded or not. A single bit in the flags is
used to indicate which.
  3. A single text header field may contain up to 0xFF separate text
strings of arbitrary length.
  4. Each individual text string is prefixed by a unsigned, variable
length integer specifying the length of the string.

  For example, assuming UTF-8 and non-huffman coded values...

  The string value "Bar" is encoded as:
    10 00 03 42 61 72

    10 = Flags byte
    00 = Number of values encoded (0-based.. 00 == one value)
    03 = uvarint length of the value
    42 61 72 = UTF-8 bytes

  The multiple string values ["Bar", "Baz"] would encode as:
     10 01 03 42 61 72 03 42 61 7A

Numeric values:

  1. Numeric values are encoded as variable-length, unsigned integers.
  2. Numeric headers can only have a single encoded value (text values
are the only ones that allow multiples)

  For example, value "100" is encoded as 24 64

  25 = Flags byte indicating numeric value
  64 = uvarint encoded value

Date values:

  1. Dates are encoded as the number of seconds since a new epoch
(Midnight GMT, Jan 1 1990)
  2. Encoded as uvarints, just like Numeric values, but with an
additional flag bit set

  For example, the date "2013-02-27T12:51:19.409-08:00" is encoded as:
    26 AA A9 BF DC 02

  26 = Flags byte indicating date value
  AA A9 BF DC 02 = uvarint indicating the date

Binary values:

  1. Binary values are encoded as a raw sequence of octets prefixed by
the length encoded as a uvarint

  For example, the byte sequence 01 02 03 is encoded as:
    20 03 01 02 03

Those four encodings would be all we define within the basic header
encoding. At the HTTP semantics layer, specific headers would be
required to explicitly declare support for numeric, date or binary
encoding. This means that all of the existing HTTP header definitions,
which are currently all text, would be required to use the Text value
encoding unless they are specifically redefined to use the new type
options. Specific headers in the core set (:status, content-length,
date, last-modified, etc) would be have updated definitions to allow
those to use the new more compact encodings right from the start. This
allows us to get some immediate benefit out of the box and gives us a
way of smoothly transitioning headers over to more compact and
optimized encoding options later on without sacrificing backwards
compatibility.

Differences from Delta:

  The scheme I have implemented has a number of technical details that
are different from Roberto's delta implementation. These details are
fairly technical and low level so feel free to ignore them unless you
have a particular interest in implementation details at this point...

  1. For one, I wrote the thing in Java.. which is good, we now have
at least three different language implementations of the basic delta
scheme. (C++, python and Java)
  2. It uses uvarints for all length prefixing rather than fixed width
integer encodings.
  3. It uses all of the above type encodings
  4. It uses a slightly different mechanism for managing and
renumbering the delta storage state index. This is an internal
difference in implementation but all implementations will be required
to implement the same basic scheme for reindexing internal state so I
am evaluating different options to see which is the most efficient.
  5. It introduces an additional "Ephemeral Clone" opcode which is
essentially a combination of Clone and Eref (it's like clone but
doesn't change state).
  6. It implements additional heuristics for determining when to merge
multiple adjacent toggl indices into a single trang operation.
Basically, it just checks to see if converting to a Trang would save
or cost additional bytes before doing the conversion.
  7. It uses eref or eclone operations for :path, referer,
authorization, www-authenticate, proxy-authenticate, date and
last-modified headers. These tend to change very frequently and take
up a significant amount of space in the storage context. Given how
variable these are, it doesn't make sense to store them. The
likelihood of reuse within a given context is minimal compared to the
cost of storage.
  8. It uses a modified version of the default header dictionary used
by Roberto's delta implementation. I've added additional values and
changed it to use the new value encodings.
  9. I have not yet implemented the additional huffman coding for text values.

I'm still working on making the delta state management as efficient as
possible in my implementation. The running time for the
serialize-deserialize roundtrip is still well above what I'm happy
with. Part of that is Java's fault, part of it is the delta mechanism
itself.

Anyway, that's the run down. Still a ways off from having something
that I'd be comfortable calling "complete" but looking pretty good so
far.

- James
Received on Wednesday, 27 February 2013 21:17:03 UTC