- From: James M Snell <jasnell@gmail.com>
- Date: Tue, 16 Apr 2013 14:00:15 -0700
- To: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Ok... I have a working implementation of the revised compression and typed codecs I proposed. It is checked into github here [1] [1] https://github.com/jasnell/http2/tree/master/src/snell/http2/headers/dhe After pulling, you can test it with the compression_test suite using the included dhe_compression_test.sh script ./compare_compressors.py -c fork="{path_to_repo}/dhe_compression_test.sh" ../http_samples/mnot/amazon.com.har Here are some sample numbers using the amazon set... * TOTAL: 366 req messages size time | ratio min max std http1 200,876 0.00 | 1.00 1.00 1.00 0.00 dhe 51,427 0.02 | 0.26 0.00 0.71 0.16 delta2 35,850 0.40 | 0.18 0.03 0.67 0.12 headerdiff 38,673 0.23 | 0.19 0.03 0.88 0.16 * TOTAL: 366 res messages size time | ratio min max std http1 160,435 0.03 | 1.00 1.00 1.00 0.00 dhe 51,487 0.05 | 0.32 0.01 0.80 0.16 delta2 42,764 0.42 | 0.27 0.02 0.65 0.10 headerdiff 46,321 0.62 | 0.29 0.04 0.84 0.13 Using the yahoo test set, the numbers are... * TOTAL: 142 req messages size time | ratio min max std http1 107,164 0.00 | 1.00 1.00 1.00 0.00 dhe 45,208 0.01 | 0.42 0.00 1.14 0.25 delta2 46,529 0.27 | 0.43 0.05 0.79 0.22 headerdiff 52,739 0.05 | 0.49 0.04 0.94 0.26 * TOTAL: 142 res messages size time | ratio min max std http1 59,718 0.00 | 1.00 1.00 1.00 0.00 dhe 14,180 0.05 | 0.24 0.00 0.69 0.17 delta2 16,148 0.17 | 0.27 0.02 0.69 0.17 headerdiff 18,599 0.02 | 0.31 0.04 0.85 0.23 - James On Sat, Apr 13, 2013 at 3:01 PM, James M Snell <jasnell@gmail.com> wrote: > Ok... so I've implemented HeaderDiff serialization [1] in java [2]... > just the serialization so far, will do deserialization early next > week. The implementation is very rough and definitely needs > improvement / optimization but it's functional enough to start from. > > [1] http://tools.ietf.org/html/draft-ruellan-headerdiff-00 > [2] https://github.com/jasnell/http2/tree/master/src/snell/http2/headers/headerdiff > > I know that Roberto is working on refinements to his own concept of > the delta header serialization and hopefully he'll be sharing those > thoughts soon, but I wanted to get my current thoughts captured for > discussion. > > The HeaderDiff approach is straightforward and effective and makes > efficient use of available bits. It's variable length integer syntax > is a bit complicated with the notion of bit-prefix lengths but once > you've got it it's easy to understand what's going on. The > implementation is a bit less complicated than delta, which is good and > I like that there's no "Header Group" notion. Headers are either on or > off per request. There are several concerns, however. > > - The true utility of the common prefix length mechanism is > questionable. Aside from the potential security risks, I questioning > just how effective it's going to be in practice. (What header fields > do we expect to actually use it in practice?) > > - The fact that items are never removed from the Name Table is > concerning and poses a potential security risk. The the decompressor > is forced to maintain a symbol table for every header name it > encounters, a malicious compressor could cause overruns by sending a > high number of junk header names. The compressor ought to be able to > treat the Name Table as an LRU, and ought to be able to place strict > limits on it's size, just as it does with the Header Table. Delta does > not have this issue because it's name indices are already tied to the > LRU. > > With HeaderDiff in mind, I'm thinking about how we can bring > HeaderDiff and Delta together and roll in the additional concepts of > Typed Codecs. Here's what I have in mind. > > Syntax: > > The serialization syntax borrows concepts from both HeaderDiff and > Delta and is divided into two parts: Basic Header Serialization and > Value Serialization. > > As with Delta currently, there are four operation codes (opcodes), > represented by two-bit identifiers. Each op code has a single-bit > "ephemeral flag". Operations are serialized into batches of no more > than 32 instances. > > 00 0 00000 { HEADER } > > The first two most significant bits... > > 00 = Index (what Delta calls "Toggl") > 01 = Index Range (what Delta calls "Trang") > 10 = Clone > 11 = Literal (what Delta calls "KVSto") > > The third most significant bit is the ephemeral toggle > > 0 = Persist (the header value is to be persisted in the LRU table) > 1 = Ephemeral (the header value is not to be persisted in the LRU table) > > The remaining five bits specify the number of distinct instances in > this batch. This counter is 0-based, so 0x0 = 1 item in the batch, > 0x1F = 32 items in the batch. > > The ephemeral flag is ignored for opcodes 00 and 01. > > Indexed headers are identified by a single byte index value. > > 0 0000000 > > The first most significant bit indicates whether or not the index is > part of the static header dictionary or lru cache. > > 0 = LRU > 1 = Static > > This gives us a strict maximum of 128 storage slots for the LRU. > That's limited *on purpose* to place strictly finite constraints on > the amount of state any single compressor can require. This also > ensures that we are always able to encode indices using only a single > byte. > > As for state management, we would essentially use the Header Diff > model minus the separate Name Table. That is, there is no persistent > Header Group in use; just the Header Table. State management depends > on the specific opcodes used. State management occurs as the headers > are encountered in the serialized header block. > > 00 = Use an indexed header instance from the header table for this request > 01 = Use multiple sequentially indexed header instances from the > header table for this request > 10 = Use a new value for an already existing header name, index > references an existing header from the table, persist the new value if > the ephemeral bit is not set. > 11 = Use a new key and value for this request. Persist the new pair > if the ephemeral bit is not set. > > So, for example, the serialization: > > 00000000 10000000 = > Use index #0 from the static dictionary > > 00000001 10000000 00000001 = > Use index #0 from the static dictionary and index #1 from the lru cache > > 01000001 10000000 10000001 00000001 00000010 = > Use indices #0-#1 from the static dictionary and indices #1-#2 > from the lru cache > > 10000000 10000000 {VALUE} > Provide a new value for the header name identified by index #0 in > the static dictionary and persist the value in the lru cache > > 10100000 10000000 {VALUE} > Provide a new value for the header name identified by index #0 in > the static dictionary but do not persist the value in the lru cache > > Non-ephemeral items are pushed into the next available slot in the > LRU. Once all available slots have been used, the indices rotate. The > LRU can also be limited by storage space. As storage space is > required, the least recently items are dropped. Keeping in mind that > there are only 128 available storage slots, it will be important for > compressors to be selective about which fields are persisted. > > (note.. the LRU is really a "least recently written" cache.. not > "least recently used"...) > > These values are tracked only for the current message. There is no > Header Group as currently defined by Delta so there is no need to > track whether or not any given header is currently on or off. It's > either referenced for this request or it's not. > > If there are more than 32 instances of any single opcode per > serialized header block (which is unlikely), a single opcode may > appear multiple times within the serialized block. > > As far as Value serialization is concerned... As I have discussed > previously, I have been looking at support for four distinct value > types, each represented by their own two-bit identifier. As I will > explain further in a second, these types are represented using the two > most significant bits in a byte: > > 00 = Text > 01 = Number > 10 = Date Time > 11 = Binary > > Text can be either UTF-8 or ISO-8859-1, indicated by a single bit flag > following the type code. All text strings are prefixed by it's length > given as an unsigned variant length integer > Numbers are encoded as unsigned variable length integers. > Date Times are encoded as Numbers representing the number of seconds > that have passed since the epoch (GMT). > Binary data is encoded as raw binary octets prefixed by an unsigned > variable length integer specifying the number of bytes. > > Any single header value instance can consist of up to 32 distinct > value tokens. The number of tokens is specified using the 5 least > significant bytes of type type flag. > > For instance: > > 00000000 00000001 01100001 = > A single ISO-8859-1 Text value with length = 1 > > 00100000 00000001 01100001 = > A single UTF-8 Text value with length = 1 > > 00100001 00000001 01100001 00000001 01100010 = > Two UTF-8 Text values, each with length = 1 > > 01000000 00000001 = > A single Number value > > 10000000 00000001 = > A single Date value (1 second after the epoch) > > 11000000 00000001 00000001 = > A single Raw Binary value > > For ISO-8859-1 Text, the Static Huffman Code used by Delta would be > used for the value. If we can develop an approach to effectively > handling Huffman coding for arbitrary UTF-8, then we can apply Huffman > coding to that as well. > > For the Number and DateTime serialization: > > - 16-bit numbers serialize with a strict maximum of 3 bytes > - 32-bit numbers serialize with a strict maximum of 5 bytes. > - 64-bit numbers serialize with a strict maximum of 10 bytes. > - Date Times will serialize with five bytes for the reasonably > relevant future, then six bytes for quite some time after that. Dates > prior to the epoch cannot be represented. > > In order to properly deal with the backwards compatibility concerns > for HTTP/1, there are several important rules for use of Typed Codecs > in HTTP headers: > > 1. Headers must be explicitly defined to use the new header types. All > existing HTTP/1 headers, then, will continue to be required to be > represented as ISO-8859-1 Text unless their standard definitions are > updated. The HTTP/2 specification would update the definition of > specific known headers (e.g. content-length, date, if-modified-since, > etc). > > 2. Extension headers that use the typed codecs will have specific > normative transformations to ISO-8859-1 defined. > a. UTF-8 Text will be converted to ISO-8859-1 with extended > characters pct-encoded > b. Numbers will be converted to their ASCII equivalent values. > c. Date Times will be converted to their HTTP-Date equivalent values. > d. Binary fields will be Base64-encoded. > > 3. There will be no normative transformation from ISO-8859-1 values > into the typed codecs. Implementations are free to apply > transformation where those impls determine it is appropriate, but it > will be perfectly legal for an implementation to pass a text value > through even if it is known that a given header type has a typed codec > equivalent (for instance, Content-Length may come through as a number > or a text value, either will be valid). This means that when > translating from HTTP/1 -> HTTP/2, receiving implementations need to > be prepared to handle either value form. > > This approach combines what I feel are the best concepts from > HeaderDiff and Delta and provides simple, straightforward header > serialization and state management. Obviously, lots of testing is > required, however. > > As always, thoughts/opinions/gripes are appreciated. > > - James
Received on Tuesday, 16 April 2013 21:01:02 UTC