Re: Header Serialization Discussion from James M Snell on 2013-04-16 (ietf-http-wg@w3.org from April to June 2013)

From: James M Snell <jasnell@gmail.com>
Date: Tue, 16 Apr 2013 14:00:15 -0700
To: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Message-ID: <CABP7RbdMwbORi+-LkLKUM6zAWDkkdKngZWrUj72=OFQK9sxuGg@mail.gmail.com>
Ok... I have a working implementation of the revised compression and
typed codecs I proposed. It is checked into github here [1]

[1] https://github.com/jasnell/http2/tree/master/src/snell/http2/headers/dhe

After pulling, you can test it with the compression_test suite using
the included dhe_compression_test.sh script

./compare_compressors.py -c
fork="{path_to_repo}/dhe_compression_test.sh"
../http_samples/mnot/amazon.com.har

Here are some sample numbers using the amazon set...

* TOTAL: 366 req messages
               size  time | ratio min   max   std
     http1  200,876  0.00 | 1.00  1.00  1.00  0.00
       dhe   51,427  0.02 | 0.26  0.00  0.71  0.16
    delta2   35,850  0.40 | 0.18  0.03  0.67  0.12
headerdiff   38,673  0.23 | 0.19  0.03  0.88  0.16

* TOTAL: 366 res messages
               size  time | ratio min   max   std
     http1  160,435  0.03 | 1.00  1.00  1.00  0.00
       dhe   51,487  0.05 | 0.32  0.01  0.80  0.16
    delta2   42,764  0.42 | 0.27  0.02  0.65  0.10
headerdiff   46,321  0.62 | 0.29  0.04  0.84  0.13

Using the yahoo test set, the numbers are...

* TOTAL: 142 req messages
               size  time | ratio min   max   std
     http1  107,164  0.00 | 1.00  1.00  1.00  0.00
       dhe   45,208  0.01 | 0.42  0.00  1.14  0.25
    delta2   46,529  0.27 | 0.43  0.05  0.79  0.22
headerdiff   52,739  0.05 | 0.49  0.04  0.94  0.26

* TOTAL: 142 res messages
               size  time | ratio min   max   std
     http1   59,718  0.00 | 1.00  1.00  1.00  0.00
       dhe   14,180  0.05 | 0.24  0.00  0.69  0.17
    delta2   16,148  0.17 | 0.27  0.02  0.69  0.17
headerdiff   18,599  0.02 | 0.31  0.04  0.85  0.23

- James

On Sat, Apr 13, 2013 at 3:01 PM, James M Snell <jasnell@gmail.com> wrote:
> Ok... so I've implemented HeaderDiff serialization [1] in java [2]...
> just the serialization so far, will do deserialization early next
> week. The implementation is very rough and definitely needs
> improvement / optimization but it's functional enough to start from.
>
> [1] http://tools.ietf.org/html/draft-ruellan-headerdiff-00
> [2] https://github.com/jasnell/http2/tree/master/src/snell/http2/headers/headerdiff
>
> I know that Roberto is working on refinements to his own concept of
> the delta header serialization and hopefully he'll be sharing those
> thoughts soon, but I wanted to get my current thoughts captured for
> discussion.
>
> The HeaderDiff approach is straightforward and effective and makes
> efficient use of available bits. It's variable length integer syntax
> is a bit complicated with the notion of bit-prefix lengths but once
> you've got it it's easy to understand what's going on. The
> implementation is a bit less complicated than delta, which is good and
> I like that there's no "Header Group" notion. Headers are either on or
> off per request. There are several concerns, however.
>
>   - The true utility of the common prefix length mechanism is
> questionable. Aside from the potential security risks, I questioning
> just how effective it's going to be in practice. (What header fields
> do we expect to actually use it in practice?)
>
>   - The fact that items are never removed from the Name Table is
> concerning and poses a potential security risk. The the decompressor
> is forced to maintain a symbol table for every header name it
> encounters, a malicious compressor could cause overruns by sending a
> high number of junk header names. The compressor ought to be able to
> treat the Name Table as an LRU, and ought to be able to place strict
> limits on it's size, just as it does with the Header Table. Delta does
> not have this issue because it's name indices are already tied to the
> LRU.
>
> With HeaderDiff in mind, I'm thinking about how we can bring
> HeaderDiff and Delta together and roll in the additional concepts of
> Typed Codecs. Here's what I have in mind.
>
> Syntax:
>
> The serialization syntax borrows concepts from both HeaderDiff and
> Delta and is divided into two parts: Basic Header Serialization and
> Value Serialization.
>
> As with Delta currently, there are four operation codes (opcodes),
> represented by two-bit identifiers. Each op code has a single-bit
> "ephemeral flag". Operations are serialized into batches of no more
> than 32 instances.
>
>   00 0 00000 { HEADER }
>
> The first two most significant bits...
>
>   00 = Index (what Delta calls "Toggl")
>   01 = Index Range (what Delta calls "Trang")
>   10 = Clone
>   11 = Literal (what Delta calls "KVSto")
>
> The third most significant bit is the ephemeral toggle
>
>   0 = Persist (the header value is to be persisted in the LRU table)
>   1 = Ephemeral (the header value is not to be persisted in the LRU table)
>
> The remaining five bits specify the number of distinct instances in
> this batch. This counter is 0-based, so 0x0 = 1 item in the batch,
> 0x1F = 32 items in the batch.
>
> The ephemeral flag is ignored for opcodes 00 and 01.
>
> Indexed headers are identified by a single byte index value.
>
>   0 0000000
>
> The first most significant bit indicates whether or not the index is
> part of the static header dictionary or lru cache.
>
>   0 = LRU
>   1 = Static
>
> This gives us a strict maximum of 128 storage slots for the LRU.
> That's limited *on purpose* to place strictly finite constraints on
> the amount of state any single compressor can require. This also
> ensures that we are always able to encode indices using only a single
> byte.
>
> As for state management, we would essentially use the Header Diff
> model minus the separate Name Table. That is, there is no persistent
> Header Group in use; just the Header Table. State management depends
> on the specific opcodes used. State management occurs as the headers
> are encountered in the serialized header block.
>
>   00 = Use an indexed header instance from the header table for this request
>   01 = Use multiple sequentially indexed header instances from the
> header table for this request
>   10 = Use a new value for an already existing header name, index
> references an existing header from the table, persist the new value if
> the ephemeral bit is not set.
>   11 = Use a new key and value for this request. Persist the new pair
> if the ephemeral bit is not set.
>
> So, for example, the serialization:
>
>   00000000 10000000 =
>     Use index #0 from the static dictionary
>
>   00000001 10000000 00000001 =
>     Use index #0 from the static dictionary and index #1 from the lru cache
>
>   01000001 10000000 10000001 00000001 00000010 =
>     Use indices #0-#1 from the static dictionary and indices #1-#2
> from the lru cache
>
>   10000000 10000000 {VALUE}
>     Provide a new value for the header name identified by index #0 in
> the static dictionary and persist the value in the lru cache
>
>   10100000 10000000 {VALUE}
>     Provide a new value for the header name identified by index #0 in
> the static dictionary but do not persist the value in the lru cache
>
> Non-ephemeral items are pushed into the next available slot in the
> LRU. Once all available slots have been used, the indices rotate. The
> LRU can also be limited by storage space. As storage space is
> required, the least recently items are dropped. Keeping in mind that
> there are only 128 available storage slots, it will be important for
> compressors to be selective about which fields are persisted.
>
> (note.. the LRU is really a "least recently written" cache.. not
> "least recently used"...)
>
> These values are tracked only for the current message. There is no
> Header Group as currently defined by Delta so there is no need to
> track whether or not any given header is currently on or off. It's
> either referenced for this request or it's not.
>
> If there are more than 32 instances of any single opcode per
> serialized header block (which is unlikely), a single opcode may
> appear multiple times within the serialized block.
>
> As far as Value serialization is concerned... As I have discussed
> previously, I have been looking at support for four distinct value
> types, each represented by their own two-bit identifier. As I will
> explain further in a second, these types are represented using the two
> most significant bits in a byte:
>
>   00 = Text
>   01 = Number
>   10 = Date Time
>   11 = Binary
>
> Text can be either UTF-8 or ISO-8859-1, indicated by a single bit flag
> following the type code. All text strings are prefixed by it's length
> given as an unsigned variant length integer
> Numbers are encoded as unsigned variable length integers.
> Date Times are encoded as Numbers representing the number of seconds
> that have passed since the epoch (GMT).
> Binary data is encoded as raw binary octets prefixed by an unsigned
> variable length integer specifying the number of bytes.
>
> Any single header value instance can consist of up to 32 distinct
> value tokens. The number of tokens is specified using the 5 least
> significant bytes of type type flag.
>
> For instance:
>
>   00000000 00000001 01100001 =
>     A single ISO-8859-1 Text value with length = 1
>
>   00100000 00000001 01100001  =
>     A single UTF-8 Text value with length = 1
>
>   00100001 00000001 01100001 00000001 01100010 =
>     Two UTF-8 Text values, each with length = 1
>
>   01000000 00000001 =
>     A single Number value
>
>   10000000 00000001 =
>     A single Date value (1 second after the epoch)
>
>   11000000 00000001 00000001 =
>     A single Raw Binary value
>
> For ISO-8859-1 Text, the Static Huffman Code used by Delta would be
> used for the value. If we can develop an approach to effectively
> handling Huffman coding for arbitrary UTF-8, then we can apply Huffman
> coding to that as well.
>
> For the Number and DateTime serialization:
>
>   - 16-bit numbers serialize with a strict maximum of 3 bytes
>   - 32-bit numbers serialize with a strict maximum of 5 bytes.
>   - 64-bit numbers serialize with a strict maximum of 10 bytes.
>   - Date Times will serialize with five bytes for the reasonably
> relevant future, then six bytes for quite some time after that. Dates
> prior to the epoch cannot be represented.
>
> In order to properly deal with the backwards compatibility concerns
> for HTTP/1, there are several important rules for use of Typed Codecs
> in HTTP headers:
>
> 1. Headers must be explicitly defined to use the new header types. All
> existing HTTP/1 headers, then, will continue to be required to be
> represented as ISO-8859-1 Text unless their standard definitions are
> updated. The HTTP/2 specification would update the definition of
> specific known headers (e.g. content-length, date, if-modified-since,
> etc).
>
> 2. Extension headers that use the typed codecs will have specific
> normative transformations to ISO-8859-1 defined.
>     a. UTF-8 Text will be converted to ISO-8859-1 with extended
> characters pct-encoded
>     b. Numbers will be converted to their ASCII equivalent values.
>     c. Date Times will be converted to their HTTP-Date equivalent values.
>     d. Binary fields will be Base64-encoded.
>
> 3. There will be no normative transformation from ISO-8859-1 values
> into the typed codecs. Implementations are free to apply
> transformation where those impls determine it is appropriate, but it
> will be perfectly legal for an implementation to pass a text value
> through even if it is known that a given header type has a typed codec
> equivalent (for instance, Content-Length may come through as a number
> or a text value, either will be valid). This means that when
> translating from HTTP/1 -> HTTP/2, receiving implementations need to
> be prepared to handle either value form.
>
> This approach combines what I feel are the best concepts from
> HeaderDiff and Delta and provides simple, straightforward header
> serialization and state management. Obviously, lots of testing is
> required, however.
>
> As always, thoughts/opinions/gripes are appreciated.
>
> - James
Received on Tuesday, 16 April 2013 21:01:02 UTC