- From: James M Snell <jasnell@gmail.com>
- Date: Wed, 27 Feb 2013 13:16:16 -0800
- To: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
As I'm not going to be able to meet y'all in Orlando, I wanted to give a quick status update on where I'm at with the Header Encoding. After much experimentation, implementation and discussion with Roberto, I've got an implementation that is a good balance between Delta and BOHE. It uses the fundamental delta encoding mechanism that Roberto created (with a few notable changes) but allows for four distinct types of header values: text, numeric, date and raw-binary. Every header value is encoded with a single additional byte flag that indicates the value type. Further, the flags indicate whether text is ISO-8559-1 or UTF-8 and whether it is huffman-coded or not. This scheme gives us a good balance and allows us to achieve maximum compression ratios and increased long term functionality without sacrificing too much in complexity or backwards compatibility. Text values: The rules for text values are simple: 1. Text can be either UTF-8 or ISO-8859-1 Encoded. A single bit in the flags is used to indicate which it is. 2. Text can be huffman coded or not. A single bit in the flags is used to indicate which. 3. A single text header field may contain up to 0xFF separate text strings of arbitrary length. 4. Each individual text string is prefixed by a unsigned, variable length integer specifying the length of the string. For example, assuming UTF-8 and non-huffman coded values... The string value "Bar" is encoded as: 10 00 03 42 61 72 10 = Flags byte 00 = Number of values encoded (0-based.. 00 == one value) 03 = uvarint length of the value 42 61 72 = UTF-8 bytes The multiple string values ["Bar", "Baz"] would encode as: 10 01 03 42 61 72 03 42 61 7A Numeric values: 1. Numeric values are encoded as variable-length, unsigned integers. 2. Numeric headers can only have a single encoded value (text values are the only ones that allow multiples) For example, value "100" is encoded as 24 64 25 = Flags byte indicating numeric value 64 = uvarint encoded value Date values: 1. Dates are encoded as the number of seconds since a new epoch (Midnight GMT, Jan 1 1990) 2. Encoded as uvarints, just like Numeric values, but with an additional flag bit set For example, the date "2013-02-27T12:51:19.409-08:00" is encoded as: 26 AA A9 BF DC 02 26 = Flags byte indicating date value AA A9 BF DC 02 = uvarint indicating the date Binary values: 1. Binary values are encoded as a raw sequence of octets prefixed by the length encoded as a uvarint For example, the byte sequence 01 02 03 is encoded as: 20 03 01 02 03 Those four encodings would be all we define within the basic header encoding. At the HTTP semantics layer, specific headers would be required to explicitly declare support for numeric, date or binary encoding. This means that all of the existing HTTP header definitions, which are currently all text, would be required to use the Text value encoding unless they are specifically redefined to use the new type options. Specific headers in the core set (:status, content-length, date, last-modified, etc) would be have updated definitions to allow those to use the new more compact encodings right from the start. This allows us to get some immediate benefit out of the box and gives us a way of smoothly transitioning headers over to more compact and optimized encoding options later on without sacrificing backwards compatibility. Differences from Delta: The scheme I have implemented has a number of technical details that are different from Roberto's delta implementation. These details are fairly technical and low level so feel free to ignore them unless you have a particular interest in implementation details at this point... 1. For one, I wrote the thing in Java.. which is good, we now have at least three different language implementations of the basic delta scheme. (C++, python and Java) 2. It uses uvarints for all length prefixing rather than fixed width integer encodings. 3. It uses all of the above type encodings 4. It uses a slightly different mechanism for managing and renumbering the delta storage state index. This is an internal difference in implementation but all implementations will be required to implement the same basic scheme for reindexing internal state so I am evaluating different options to see which is the most efficient. 5. It introduces an additional "Ephemeral Clone" opcode which is essentially a combination of Clone and Eref (it's like clone but doesn't change state). 6. It implements additional heuristics for determining when to merge multiple adjacent toggl indices into a single trang operation. Basically, it just checks to see if converting to a Trang would save or cost additional bytes before doing the conversion. 7. It uses eref or eclone operations for :path, referer, authorization, www-authenticate, proxy-authenticate, date and last-modified headers. These tend to change very frequently and take up a significant amount of space in the storage context. Given how variable these are, it doesn't make sense to store them. The likelihood of reuse within a given context is minimal compared to the cost of storage. 8. It uses a modified version of the default header dictionary used by Roberto's delta implementation. I've added additional values and changed it to use the new value encodings. 9. I have not yet implemented the additional huffman coding for text values. I'm still working on making the delta state management as efficient as possible in my implementation. The running time for the serialize-deserialize roundtrip is still well above what I'm happy with. Part of that is Java's fault, part of it is the delta mechanism itself. Anyway, that's the run down. Still a ways off from having something that I'd be comfortable calling "complete" but looking pretty good so far. - James
Received on Wednesday, 27 February 2013 21:17:03 UTC