- From: James M Snell <jasnell@gmail.com>
- Date: Tue, 16 Apr 2013 14:00:15 -0700
- To: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Ok... I have a working implementation of the revised compression and
typed codecs I proposed. It is checked into github here [1]
[1] https://github.com/jasnell/http2/tree/master/src/snell/http2/headers/dhe
After pulling, you can test it with the compression_test suite using
the included dhe_compression_test.sh script
./compare_compressors.py -c
fork="{path_to_repo}/dhe_compression_test.sh"
../http_samples/mnot/amazon.com.har
Here are some sample numbers using the amazon set...
* TOTAL: 366 req messages
size time | ratio min max std
http1 200,876 0.00 | 1.00 1.00 1.00 0.00
dhe 51,427 0.02 | 0.26 0.00 0.71 0.16
delta2 35,850 0.40 | 0.18 0.03 0.67 0.12
headerdiff 38,673 0.23 | 0.19 0.03 0.88 0.16
* TOTAL: 366 res messages
size time | ratio min max std
http1 160,435 0.03 | 1.00 1.00 1.00 0.00
dhe 51,487 0.05 | 0.32 0.01 0.80 0.16
delta2 42,764 0.42 | 0.27 0.02 0.65 0.10
headerdiff 46,321 0.62 | 0.29 0.04 0.84 0.13
Using the yahoo test set, the numbers are...
* TOTAL: 142 req messages
size time | ratio min max std
http1 107,164 0.00 | 1.00 1.00 1.00 0.00
dhe 45,208 0.01 | 0.42 0.00 1.14 0.25
delta2 46,529 0.27 | 0.43 0.05 0.79 0.22
headerdiff 52,739 0.05 | 0.49 0.04 0.94 0.26
* TOTAL: 142 res messages
size time | ratio min max std
http1 59,718 0.00 | 1.00 1.00 1.00 0.00
dhe 14,180 0.05 | 0.24 0.00 0.69 0.17
delta2 16,148 0.17 | 0.27 0.02 0.69 0.17
headerdiff 18,599 0.02 | 0.31 0.04 0.85 0.23
- James
On Sat, Apr 13, 2013 at 3:01 PM, James M Snell <jasnell@gmail.com> wrote:
> Ok... so I've implemented HeaderDiff serialization [1] in java [2]...
> just the serialization so far, will do deserialization early next
> week. The implementation is very rough and definitely needs
> improvement / optimization but it's functional enough to start from.
>
> [1] http://tools.ietf.org/html/draft-ruellan-headerdiff-00
> [2] https://github.com/jasnell/http2/tree/master/src/snell/http2/headers/headerdiff
>
> I know that Roberto is working on refinements to his own concept of
> the delta header serialization and hopefully he'll be sharing those
> thoughts soon, but I wanted to get my current thoughts captured for
> discussion.
>
> The HeaderDiff approach is straightforward and effective and makes
> efficient use of available bits. It's variable length integer syntax
> is a bit complicated with the notion of bit-prefix lengths but once
> you've got it it's easy to understand what's going on. The
> implementation is a bit less complicated than delta, which is good and
> I like that there's no "Header Group" notion. Headers are either on or
> off per request. There are several concerns, however.
>
> - The true utility of the common prefix length mechanism is
> questionable. Aside from the potential security risks, I questioning
> just how effective it's going to be in practice. (What header fields
> do we expect to actually use it in practice?)
>
> - The fact that items are never removed from the Name Table is
> concerning and poses a potential security risk. The the decompressor
> is forced to maintain a symbol table for every header name it
> encounters, a malicious compressor could cause overruns by sending a
> high number of junk header names. The compressor ought to be able to
> treat the Name Table as an LRU, and ought to be able to place strict
> limits on it's size, just as it does with the Header Table. Delta does
> not have this issue because it's name indices are already tied to the
> LRU.
>
> With HeaderDiff in mind, I'm thinking about how we can bring
> HeaderDiff and Delta together and roll in the additional concepts of
> Typed Codecs. Here's what I have in mind.
>
> Syntax:
>
> The serialization syntax borrows concepts from both HeaderDiff and
> Delta and is divided into two parts: Basic Header Serialization and
> Value Serialization.
>
> As with Delta currently, there are four operation codes (opcodes),
> represented by two-bit identifiers. Each op code has a single-bit
> "ephemeral flag". Operations are serialized into batches of no more
> than 32 instances.
>
> 00 0 00000 { HEADER }
>
> The first two most significant bits...
>
> 00 = Index (what Delta calls "Toggl")
> 01 = Index Range (what Delta calls "Trang")
> 10 = Clone
> 11 = Literal (what Delta calls "KVSto")
>
> The third most significant bit is the ephemeral toggle
>
> 0 = Persist (the header value is to be persisted in the LRU table)
> 1 = Ephemeral (the header value is not to be persisted in the LRU table)
>
> The remaining five bits specify the number of distinct instances in
> this batch. This counter is 0-based, so 0x0 = 1 item in the batch,
> 0x1F = 32 items in the batch.
>
> The ephemeral flag is ignored for opcodes 00 and 01.
>
> Indexed headers are identified by a single byte index value.
>
> 0 0000000
>
> The first most significant bit indicates whether or not the index is
> part of the static header dictionary or lru cache.
>
> 0 = LRU
> 1 = Static
>
> This gives us a strict maximum of 128 storage slots for the LRU.
> That's limited *on purpose* to place strictly finite constraints on
> the amount of state any single compressor can require. This also
> ensures that we are always able to encode indices using only a single
> byte.
>
> As for state management, we would essentially use the Header Diff
> model minus the separate Name Table. That is, there is no persistent
> Header Group in use; just the Header Table. State management depends
> on the specific opcodes used. State management occurs as the headers
> are encountered in the serialized header block.
>
> 00 = Use an indexed header instance from the header table for this request
> 01 = Use multiple sequentially indexed header instances from the
> header table for this request
> 10 = Use a new value for an already existing header name, index
> references an existing header from the table, persist the new value if
> the ephemeral bit is not set.
> 11 = Use a new key and value for this request. Persist the new pair
> if the ephemeral bit is not set.
>
> So, for example, the serialization:
>
> 00000000 10000000 =
> Use index #0 from the static dictionary
>
> 00000001 10000000 00000001 =
> Use index #0 from the static dictionary and index #1 from the lru cache
>
> 01000001 10000000 10000001 00000001 00000010 =
> Use indices #0-#1 from the static dictionary and indices #1-#2
> from the lru cache
>
> 10000000 10000000 {VALUE}
> Provide a new value for the header name identified by index #0 in
> the static dictionary and persist the value in the lru cache
>
> 10100000 10000000 {VALUE}
> Provide a new value for the header name identified by index #0 in
> the static dictionary but do not persist the value in the lru cache
>
> Non-ephemeral items are pushed into the next available slot in the
> LRU. Once all available slots have been used, the indices rotate. The
> LRU can also be limited by storage space. As storage space is
> required, the least recently items are dropped. Keeping in mind that
> there are only 128 available storage slots, it will be important for
> compressors to be selective about which fields are persisted.
>
> (note.. the LRU is really a "least recently written" cache.. not
> "least recently used"...)
>
> These values are tracked only for the current message. There is no
> Header Group as currently defined by Delta so there is no need to
> track whether or not any given header is currently on or off. It's
> either referenced for this request or it's not.
>
> If there are more than 32 instances of any single opcode per
> serialized header block (which is unlikely), a single opcode may
> appear multiple times within the serialized block.
>
> As far as Value serialization is concerned... As I have discussed
> previously, I have been looking at support for four distinct value
> types, each represented by their own two-bit identifier. As I will
> explain further in a second, these types are represented using the two
> most significant bits in a byte:
>
> 00 = Text
> 01 = Number
> 10 = Date Time
> 11 = Binary
>
> Text can be either UTF-8 or ISO-8859-1, indicated by a single bit flag
> following the type code. All text strings are prefixed by it's length
> given as an unsigned variant length integer
> Numbers are encoded as unsigned variable length integers.
> Date Times are encoded as Numbers representing the number of seconds
> that have passed since the epoch (GMT).
> Binary data is encoded as raw binary octets prefixed by an unsigned
> variable length integer specifying the number of bytes.
>
> Any single header value instance can consist of up to 32 distinct
> value tokens. The number of tokens is specified using the 5 least
> significant bytes of type type flag.
>
> For instance:
>
> 00000000 00000001 01100001 =
> A single ISO-8859-1 Text value with length = 1
>
> 00100000 00000001 01100001 =
> A single UTF-8 Text value with length = 1
>
> 00100001 00000001 01100001 00000001 01100010 =
> Two UTF-8 Text values, each with length = 1
>
> 01000000 00000001 =
> A single Number value
>
> 10000000 00000001 =
> A single Date value (1 second after the epoch)
>
> 11000000 00000001 00000001 =
> A single Raw Binary value
>
> For ISO-8859-1 Text, the Static Huffman Code used by Delta would be
> used for the value. If we can develop an approach to effectively
> handling Huffman coding for arbitrary UTF-8, then we can apply Huffman
> coding to that as well.
>
> For the Number and DateTime serialization:
>
> - 16-bit numbers serialize with a strict maximum of 3 bytes
> - 32-bit numbers serialize with a strict maximum of 5 bytes.
> - 64-bit numbers serialize with a strict maximum of 10 bytes.
> - Date Times will serialize with five bytes for the reasonably
> relevant future, then six bytes for quite some time after that. Dates
> prior to the epoch cannot be represented.
>
> In order to properly deal with the backwards compatibility concerns
> for HTTP/1, there are several important rules for use of Typed Codecs
> in HTTP headers:
>
> 1. Headers must be explicitly defined to use the new header types. All
> existing HTTP/1 headers, then, will continue to be required to be
> represented as ISO-8859-1 Text unless their standard definitions are
> updated. The HTTP/2 specification would update the definition of
> specific known headers (e.g. content-length, date, if-modified-since,
> etc).
>
> 2. Extension headers that use the typed codecs will have specific
> normative transformations to ISO-8859-1 defined.
> a. UTF-8 Text will be converted to ISO-8859-1 with extended
> characters pct-encoded
> b. Numbers will be converted to their ASCII equivalent values.
> c. Date Times will be converted to their HTTP-Date equivalent values.
> d. Binary fields will be Base64-encoded.
>
> 3. There will be no normative transformation from ISO-8859-1 values
> into the typed codecs. Implementations are free to apply
> transformation where those impls determine it is appropriate, but it
> will be perfectly legal for an implementation to pass a text value
> through even if it is known that a given header type has a typed codec
> equivalent (for instance, Content-Length may come through as a number
> or a text value, either will be valid). This means that when
> translating from HTTP/1 -> HTTP/2, receiving implementations need to
> be prepared to handle either value form.
>
> This approach combines what I feel are the best concepts from
> HeaderDiff and Delta and provides simple, straightforward header
> serialization and state management. Obviously, lots of testing is
> required, however.
>
> As always, thoughts/opinions/gripes are appreciated.
>
> - James
Received on Tuesday, 16 April 2013 21:01:02 UTC