Re: HTTP/2 Header Encoding Status Update from Mark Nottingham on 2013-02-27 (ietf-http-wg@w3.org from January to March 2013)

From: Mark Nottingham <mnot@mnot.net>
Date: Thu, 28 Feb 2013 09:45:08 +1100
To: James M Snell <jasnell@gmail.com>
Cc: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Message-Id: <0DE7FB38-9484-4FED-84C7-A034AFBFA8E6@mnot.net>
Hi James,

On 28/02/2013, at 8:16 AM, James M Snell <jasnell@gmail.com> wrote:

> As I'm not going to be able to meet y'all in Orlando, I wanted to give
> a quick status update on where I'm at with the Header Encoding.
> 
> After much experimentation, implementation and discussion with
> Roberto, I've got an implementation that is a good balance between
> Delta and BOHE. It uses the fundamental delta encoding mechanism that
> Roberto created (with a few notable changes) but allows for four
> distinct types of header values: text, numeric, date and raw-binary.
> Every header value is encoded with a single additional byte flag that
> indicates the value type. Further, the flags indicate whether text is
> ISO-8559-1 or UTF-8 and whether it is huffman-coded or not.  This
> scheme gives us a good balance and allows us to achieve maximum
> compression ratios and increased long term functionality without
> sacrificing too much in complexity or backwards compatibility.

Thanks. That sounds good.


> Text values:
> 
>  The rules for text values are simple:
> 
>  1. Text can be either UTF-8 or ISO-8859-1 Encoded. A single bit in
> the flags is used to indicate which it is.
>  2. Text can be huffman coded or not. A single bit in the flags is
> used to indicate which.
>  3. A single text header field may contain up to 0xFF separate text
> strings of arbitrary length.
>  4. Each individual text string is prefixed by a unsigned, variable
> length integer specifying the length of the string.
> 
>  For example, assuming UTF-8 and non-huffman coded values...
> 
>  The string value "Bar" is encoded as:
>    10 00 03 42 61 72
> 
>    10 = Flags byte
>    00 = Number of values encoded (0-based.. 00 == one value)
>    03 = uvarint length of the value
>    42 61 72 = UTF-8 bytes
> 
>  The multiple string values ["Bar", "Baz"] would encode as:
>     10 01 03 42 61 72 03 42 61 7A

So, the biggest concern here, I think, is that the conversion of a UTF-8 value to ASCII/Latin-1 -- to be able to forward the header on a HTTP/1.x hop -- requires knowledge of the header.

Would you want to define a standard way to encode UTF-8 in Latin-1 (e.g., percent-encoding) for headers that use this? It would constrain the headers (and likely rule out any existing headers from using UTF-8), but I don't see how this is going to be viable otherwise.


> Numeric values:
> 
>  1. Numeric values are encoded as variable-length, unsigned integers.
>  2. Numeric headers can only have a single encoded value (text values
> are the only ones that allow multiples)
> 
>  For example, value "100" is encoded as 24 64
> 
>  25 = Flags byte indicating numeric value
>  64 = uvarint encoded value
> 
> Date values:
> 
>  1. Dates are encoded as the number of seconds since a new epoch
> (Midnight GMT, Jan 1 1990)
>  2. Encoded as uvarints, just like Numeric values, but with an
> additional flag bit set
> 
>  For example, the date "2013-02-27T12:51:19.409-08:00" is encoded as:
>    26 AA A9 BF DC 02
> 
>  26 = Flags byte indicating date value
>  AA A9 BF DC 02 = uvarint indicating the date
> 
> Binary values:
> 
>  1. Binary values are encoded as a raw sequence of octets prefixed by
> the length encoded as a uvarint
> 
>  For example, the byte sequence 01 02 03 is encoded as:
>    20 03 01 02 03

Similar problem as with UTF-8, unless you define a single transformation between binary and ASCII (base64?).


> Those four encodings would be all we define within the basic header
> encoding. At the HTTP semantics layer, specific headers would be
> required to explicitly declare support for numeric, date or binary
> encoding. This means that all of the existing HTTP header definitions,
> which are currently all text, would be required to use the Text value
> encoding unless they are specifically redefined to use the new type
> options. Specific headers in the core set (:status, content-length,
> date, last-modified, etc) would be have updated definitions to allow
> those to use the new more compact encodings right from the start. This
> allows us to get some immediate benefit out of the box and gives us a
> way of smoothly transitioning headers over to more compact and
> optimized encoding options later on without sacrificing backwards
> compatibility.
> 
> Differences from Delta:
> 
>  The scheme I have implemented has a number of technical details that
> are different from Roberto's delta implementation. These details are
> fairly technical and low level so feel free to ignore them unless you
> have a particular interest in implementation details at this point...
> 
>  1. For one, I wrote the thing in Java.. which is good, we now have
> at least three different language implementations of the basic delta
> scheme. (C++, python and Java)
>  2. It uses uvarints for all length prefixing rather than fixed width
> integer encodings.
>  3. It uses all of the above type encodings
>  4. It uses a slightly different mechanism for managing and
> renumbering the delta storage state index. This is an internal
> difference in implementation but all implementations will be required
> to implement the same basic scheme for reindexing internal state so I
> am evaluating different options to see which is the most efficient.
>  5. It introduces an additional "Ephemeral Clone" opcode which is
> essentially a combination of Clone and Eref (it's like clone but
> doesn't change state).
>  6. It implements additional heuristics for determining when to merge
> multiple adjacent toggl indices into a single trang operation.
> Basically, it just checks to see if converting to a Trang would save
> or cost additional bytes before doing the conversion.
>  7. It uses eref or eclone operations for :path, referer,
> authorization, www-authenticate, proxy-authenticate, date and
> last-modified headers. These tend to change very frequently and take
> up a significant amount of space in the storage context. Given how
> variable these are, it doesn't make sense to store them. The
> likelihood of reuse within a given context is minimal compared to the
> cost of storage.
>  8. It uses a modified version of the default header dictionary used
> by Roberto's delta implementation. I've added additional values and
> changed it to use the new value encodings.
>  9. I have not yet implemented the additional huffman coding for text values.
> 
> I'm still working on making the delta state management as efficient as
> possible in my implementation. The running time for the
> serialize-deserialize roundtrip is still well above what I'm happy
> with. Part of that is Java's fault, part of it is the delta mechanism
> itself.
> 
> Anyway, that's the run down. Still a ways off from having something
> that I'd be comfortable calling "complete" but looking pretty good so
> far.


All good stuff. Thanks again,

--
Mark Nottingham   http://www.mnot.net/
Received on Wednesday, 27 February 2013 22:45:35 UTC