- From: James M Snell <jasnell@gmail.com>
- Date: Sat, 13 Apr 2013 15:01:24 -0700
- To: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Ok... so I've implemented HeaderDiff serialization [1] in Java [2]... just the serialization so far, will do deserialization early next week. The implementation is very rough and definitely needs improvement / optimization but it's functional enough to start from.

[1] http://tools.ietf.org/html/draft-ruellan-headerdiff-00
[2] https://github.com/jasnell/http2/tree/master/src/snell/http2/headers/headerdiff

I know that Roberto is working on refinements to his own concept of the delta header serialization and hopefully he'll be sharing those thoughts soon, but I wanted to get my current thoughts captured for discussion.

The HeaderDiff approach is straightforward and effective and makes efficient use of the available bits. Its variable-length integer syntax is a bit complicated with the notion of bit-prefix lengths, but once you've got it, it's easy to understand what's going on. The implementation is a bit less complicated than Delta, which is good, and I like that there's no "Header Group" notion: headers are either on or off per request.

There are several concerns, however.

- The true utility of the common prefix length mechanism is questionable. Aside from the potential security risks, I question just how effective it's going to be in practice. (What header fields do we expect to actually use it in practice?)

- The fact that items are never removed from the Name Table is concerning and poses a potential security risk. Because the decompressor is forced to maintain a symbol table for every header name it encounters, a malicious compressor could cause overruns by sending a high number of junk header names. The compressor ought to be able to treat the Name Table as an LRU, and ought to be able to place strict limits on its size, just as it does with the Header Table. Delta does not have this issue because its name indices are already tied to the LRU.

With HeaderDiff in mind, I'm thinking about how we can bring HeaderDiff and Delta together and roll in the additional concepts of Typed Codecs. Here's what I have in mind.

Syntax:

The serialization syntax borrows concepts from both HeaderDiff and Delta and is divided into two parts: Basic Header Serialization and Value Serialization.

As with Delta currently, there are four operation codes (opcodes), represented by two-bit identifiers. Each opcode has a single-bit "ephemeral flag". Operations are serialized into batches of no more than 32 instances.

  00 0 00000 { HEADER }

The first two most significant bits give the opcode:

  00 = Index (what Delta calls "Toggl")
  01 = Index Range (what Delta calls "Trang")
  10 = Clone
  11 = Literal (what Delta calls "KVSto")

The third most significant bit is the ephemeral toggle:

  0 = Persist (the header value is to be persisted in the LRU table)
  1 = Ephemeral (the header value is not to be persisted in the LRU table)

The remaining five bits specify the number of distinct instances in this batch. This counter is 0-based, so 0x0 = 1 item in the batch and 0x1F = 32 items in the batch. The ephemeral flag is ignored for opcodes 00 and 01.

Indexed headers are identified by a single-byte index value:

  0 0000000

The most significant bit indicates whether the index refers to the static header dictionary or the LRU cache:

  0 = LRU
  1 = Static

This gives us a strict maximum of 128 storage slots for the LRU. That's limited *on purpose* to place strictly finite constraints on the amount of state any single compressor can require. This also ensures that we are always able to encode indices using only a single byte.
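To make the bit layout above concrete, here is a rough Java sketch (class and method names are mine, not from either draft) of how the batch prefix byte and the single-byte index might be packed:

  final class Prefix {
    // Opcodes: the two most significant bits of the batch prefix byte.
    static final int INDEX = 0;        // 00
    static final int INDEX_RANGE = 1;  // 01
    static final int CLONE = 2;        // 10
    static final int LITERAL = 3;      // 11

    // Pack opcode (2 bits) + ephemeral flag (1 bit) + zero-based count (5 bits).
    static byte batch(int opcode, boolean ephemeral, int count) {
      if (count < 1 || count > 32)
        throw new IllegalArgumentException("1..32 instances per batch");
      return (byte) ((opcode << 6) | ((ephemeral ? 1 : 0) << 5) | (count - 1));
    }

    // Pack the index byte: high bit = static dictionary (1) or LRU (0),
    // low seven bits = the index itself (hence the hard cap of 128 LRU slots).
    static byte index(boolean fromStatic, int idx) {
      if (idx < 0 || idx > 127)
        throw new IllegalArgumentException("index must fit in 7 bits");
      return (byte) (((fromStatic ? 1 : 0) << 7) | idx);
    }
  }

Prefix.batch(Prefix.INDEX, false, 1) followed by Prefix.index(true, 0) yields the bytes 00000000 10000000, which is exactly the first worked example in the state management discussion below.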
As for state management, we would essentially use the HeaderDiff model minus the separate Name Table. That is, there is no persistent Header Group in use; just the Header Table. State management depends on the specific opcodes used and occurs as the headers are encountered in the serialized header block.

  00 = Use an indexed header instance from the header table for this request
  01 = Use multiple sequentially indexed header instances from the header table for this request
  10 = Use a new value for an already existing header name; the index references an existing header from the table; persist the new value if the ephemeral bit is not set
  11 = Use a new key and value for this request; persist the new pair if the ephemeral bit is not set

So, for example, the serialization:

  00000000 10000000
    = Use index #0 from the static dictionary

  00000001 10000000 00000001
    = Use index #0 from the static dictionary and index #1 from the LRU cache

  01000001 10000000 10000001 00000001 00000010
    = Use indices #0-#1 from the static dictionary and indices #1-#2 from the LRU cache

  10000000 10000000 {VALUE}
    = Provide a new value for the header name identified by index #0 in the static dictionary and persist the value in the LRU cache

  10100000 10000000 {VALUE}
    = Provide a new value for the header name identified by index #0 in the static dictionary but do not persist the value in the LRU cache

Non-ephemeral items are pushed into the next available slot in the LRU. Once all available slots have been used, the indices rotate. The LRU can also be limited by storage space: as storage space is required, the least recently written items are dropped. Keeping in mind that there are only 128 available storage slots, it will be important for compressors to be selective about which fields are persisted. (Note: the LRU is really a "least recently written" cache, not "least recently used".)

These values are tracked only for the current message. There is no Header Group as currently defined by Delta, so there is no need to track whether or not any given header is currently on or off. It's either referenced for this request or it's not. If there are more than 32 instances of any single opcode per serialized header block (which is unlikely), a single opcode may appear multiple times within the serialized block.

As far as Value Serialization is concerned... as I have discussed previously, I have been looking at support for four distinct value types, each represented by its own two-bit identifier. As I will explain further in a second, these types are represented using the two most significant bits of a byte:

  00 = Text
  01 = Number
  10 = Date Time
  11 = Binary

Text can be either UTF-8 or ISO-8859-1, indicated by a single-bit flag following the type code. All text strings are prefixed by their length, given as an unsigned variable-length integer.

Numbers are encoded as unsigned variable-length integers.

Date Times are encoded as Numbers representing the number of seconds that have passed since the epoch (GMT).

Binary data is encoded as raw binary octets prefixed by an unsigned variable-length integer specifying the number of bytes.

Any single header value instance can consist of up to 32 distinct value tokens. The number of tokens is specified using the 5 least significant bits of the type flag.
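Just to make that type flag concrete as well, here is a hedged Java sketch of how the value-prefix byte might be packed (the constant and method names are mine); the worked examples just below show the same byte patterns on the wire:

  final class ValuePrefix {
    static final int TEXT = 0;       // 00
    static final int NUMBER = 1;     // 01
    static final int DATE_TIME = 2;  // 10
    static final int BINARY = 3;     // 11

    // Two high bits: value type. For TEXT, the next bit selects UTF-8 (1)
    // vs ISO-8859-1 (0). Low five bits: zero-based token count (1..32).
    static byte pack(int type, boolean utf8, int tokens) {
      if (tokens < 1 || tokens > 32)
        throw new IllegalArgumentException("1..32 tokens per value");
      int textFlag = (type == TEXT && utf8) ? 1 : 0;
      return (byte) ((type << 6) | (textFlag << 5) | (tokens - 1));
    }
  }

For example, ValuePrefix.pack(ValuePrefix.TEXT, true, 2) produces 00100001, the two-token UTF-8 case shown in the examples that follow.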
For instance:

  00000000 00000001 01100001 = A single ISO-8859-1 Text value with length = 1
  00100000 00000001 01100001 = A single UTF-8 Text value with length = 1
  00100001 00000001 01100001 00000001 01100010 = Two UTF-8 Text values, each with length = 1
  01000000 00000001 = A single Number value
  10000000 00000001 = A single Date value (1 second after the epoch)
  11000000 00000001 00000001 = A single Raw Binary value

For ISO-8859-1 Text, the Static Huffman Code used by Delta would be used for the value. If we can develop an approach to effectively handle Huffman coding for arbitrary UTF-8, then we can apply Huffman coding to that as well.

For the Number and Date Time serialization:

- 16-bit numbers serialize with a strict maximum of 3 bytes.
- 32-bit numbers serialize with a strict maximum of 5 bytes.
- 64-bit numbers serialize with a strict maximum of 10 bytes.
- Date Times will serialize with five bytes for the reasonably relevant future, then six bytes for quite some time after that. Dates prior to the epoch cannot be represented.

In order to properly deal with the backwards compatibility concerns for HTTP/1, there are several important rules for the use of Typed Codecs in HTTP headers:

1. Headers must be explicitly defined to use the new header types. All existing HTTP/1 headers, then, will continue to be required to be represented as ISO-8859-1 Text unless their standard definitions are updated. The HTTP/2 specification would update the definitions of specific known headers (e.g. content-length, date, if-modified-since, etc.).

2. Extension headers that use the typed codecs will have specific normative transformations to ISO-8859-1 defined:
   a. UTF-8 Text will be converted to ISO-8859-1 with extended characters pct-encoded.
   b. Numbers will be converted to their ASCII equivalent values.
   c. Date Times will be converted to their HTTP-Date equivalent values.
   d. Binary fields will be Base64-encoded.

3. There will be no normative transformation from ISO-8859-1 values into the typed codecs. Implementations are free to apply transformations where they determine it is appropriate, but it will be perfectly legal for an implementation to pass a text value through even if it is known that a given header has a typed codec equivalent (for instance, Content-Length may come through as a number or as a text value; either will be valid). This means that when translating from HTTP/1 -> HTTP/2, receiving implementations need to be prepared to handle either value form.

This approach combines what I feel are the best concepts from HeaderDiff and Delta and provides simple, straightforward header serialization and state management. Obviously, lots of testing is required, however.

As always, thoughts/opinions/gripes are appreciated.

- James
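P.S. For anyone who wants to sanity-check the 3/5/10 byte maxima above: assuming a plain 7-bits-per-byte continuation encoding for the unsigned variable-length integers (that's my assumption for illustration; the bit-prefixed variant in the HeaderDiff draft differs in the details), a minimal writer looks like this:

  import java.io.ByteArrayOutputStream;

  final class UVarInt {
    // Emit seven bits per byte, setting the high bit on every byte except
    // the last. ceil(16/7) = 3, ceil(32/7) = 5 and ceil(64/7) = 10, which
    // is where the per-width maximums come from.
    static byte[] encode(long value) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      while ((value & ~0x7FL) != 0) {
        out.write((int) ((value & 0x7F) | 0x80));
        value >>>= 7;
      }
      out.write((int) value);
      return out.toByteArray();
    }
  }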