- From: Mark Nottingham <mnot@mnot.net>
- Date: Thu, 28 Feb 2013 09:45:08 +1100
- To: James M Snell <jasnell@gmail.com>
- Cc: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Hi James, On 28/02/2013, at 8:16 AM, James M Snell <jasnell@gmail.com> wrote: > As I'm not going to be able to meet y'all in Orlando, I wanted to give > a quick status update on where I'm at with the Header Encoding. > > After much experimentation, implementation and discussion with > Roberto, I've got an implementation that is a good balance between > Delta and BOHE. It uses the fundamental delta encoding mechanism that > Roberto created (with a few notable changes) but allows for four > distinct types of header values: text, numeric, date and raw-binary. > Every header value is encoded with a single additional byte flag that > indicates the value type. Further, the flags indicate whether text is > ISO-8559-1 or UTF-8 and whether it is huffman-coded or not. This > scheme gives us a good balance and allows us to achieve maximum > compression ratios and increased long term functionality without > sacrificing too much in complexity or backwards compatibility. Thanks. That sounds good. > Text values: > > The rules for text values are simple: > > 1. Text can be either UTF-8 or ISO-8859-1 Encoded. A single bit in > the flags is used to indicate which it is. > 2. Text can be huffman coded or not. A single bit in the flags is > used to indicate which. > 3. A single text header field may contain up to 0xFF separate text > strings of arbitrary length. > 4. Each individual text string is prefixed by a unsigned, variable > length integer specifying the length of the string. > > For example, assuming UTF-8 and non-huffman coded values... > > The string value "Bar" is encoded as: > 10 00 03 42 61 72 > > 10 = Flags byte > 00 = Number of values encoded (0-based.. 00 == one value) > 03 = uvarint length of the value > 42 61 72 = UTF-8 bytes > > The multiple string values ["Bar", "Baz"] would encode as: > 10 01 03 42 61 72 03 42 61 7A So, the biggest concern here, I think, is that the conversion of a UTF-8 value to ASCII/Latin-1 -- to be able to forward the header on a HTTP/1.x hop -- requires knowledge of the header. Would you want to define a standard way to encode UTF-8 in Latin-1 (e.g., percent-encoding) for headers that use this? It would constrain the headers (and likely rule out any existing headers from using UTF-8), but I don't see how this is going to be viable otherwise. > Numeric values: > > 1. Numeric values are encoded as variable-length, unsigned integers. > 2. Numeric headers can only have a single encoded value (text values > are the only ones that allow multiples) > > For example, value "100" is encoded as 24 64 > > 25 = Flags byte indicating numeric value > 64 = uvarint encoded value > > Date values: > > 1. Dates are encoded as the number of seconds since a new epoch > (Midnight GMT, Jan 1 1990) > 2. Encoded as uvarints, just like Numeric values, but with an > additional flag bit set > > For example, the date "2013-02-27T12:51:19.409-08:00" is encoded as: > 26 AA A9 BF DC 02 > > 26 = Flags byte indicating date value > AA A9 BF DC 02 = uvarint indicating the date > > Binary values: > > 1. Binary values are encoded as a raw sequence of octets prefixed by > the length encoded as a uvarint > > For example, the byte sequence 01 02 03 is encoded as: > 20 03 01 02 03 Similar problem as with UTF-8, unless you define a single transformation between binary and ASCII (base64?). > Those four encodings would be all we define within the basic header > encoding. At the HTTP semantics layer, specific headers would be > required to explicitly declare support for numeric, date or binary > encoding. This means that all of the existing HTTP header definitions, > which are currently all text, would be required to use the Text value > encoding unless they are specifically redefined to use the new type > options. Specific headers in the core set (:status, content-length, > date, last-modified, etc) would be have updated definitions to allow > those to use the new more compact encodings right from the start. This > allows us to get some immediate benefit out of the box and gives us a > way of smoothly transitioning headers over to more compact and > optimized encoding options later on without sacrificing backwards > compatibility. > > Differences from Delta: > > The scheme I have implemented has a number of technical details that > are different from Roberto's delta implementation. These details are > fairly technical and low level so feel free to ignore them unless you > have a particular interest in implementation details at this point... > > 1. For one, I wrote the thing in Java.. which is good, we now have > at least three different language implementations of the basic delta > scheme. (C++, python and Java) > 2. It uses uvarints for all length prefixing rather than fixed width > integer encodings. > 3. It uses all of the above type encodings > 4. It uses a slightly different mechanism for managing and > renumbering the delta storage state index. This is an internal > difference in implementation but all implementations will be required > to implement the same basic scheme for reindexing internal state so I > am evaluating different options to see which is the most efficient. > 5. It introduces an additional "Ephemeral Clone" opcode which is > essentially a combination of Clone and Eref (it's like clone but > doesn't change state). > 6. It implements additional heuristics for determining when to merge > multiple adjacent toggl indices into a single trang operation. > Basically, it just checks to see if converting to a Trang would save > or cost additional bytes before doing the conversion. > 7. It uses eref or eclone operations for :path, referer, > authorization, www-authenticate, proxy-authenticate, date and > last-modified headers. These tend to change very frequently and take > up a significant amount of space in the storage context. Given how > variable these are, it doesn't make sense to store them. The > likelihood of reuse within a given context is minimal compared to the > cost of storage. > 8. It uses a modified version of the default header dictionary used > by Roberto's delta implementation. I've added additional values and > changed it to use the new value encodings. > 9. I have not yet implemented the additional huffman coding for text values. > > I'm still working on making the delta state management as efficient as > possible in my implementation. The running time for the > serialize-deserialize roundtrip is still well above what I'm happy > with. Part of that is Java's fault, part of it is the delta mechanism > itself. > > Anyway, that's the run down. Still a ways off from having something > that I'd be comfortable calling "complete" but looking pretty good so > far. All good stuff. Thanks again, -- Mark Nottingham http://www.mnot.net/
Received on Wednesday, 27 February 2013 22:45:35 UTC