Re: bohe and delta experimentation... from Mark Nottingham on 2013-01-16 (ietf-http-wg@w3.org from January to March 2013)

From: Mark Nottingham <mnot@mnot.net>
Date: Thu, 17 Jan 2013 09:28:59 +1100
To: James M Snell <jasnell@gmail.com>
Cc: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Message-Id: <2FD0BBE1-59C6-4E49-ACCE-60C1A895FB7D@mnot.net>

On 17/01/2013, at 9:07 AM, James M Snell <jasnell@gmail.com> wrote:

> After going through a number of scenarios with bohe using a variety of stream-compression approaches, it's painfully obvious that there is really no way around the CRIME issue when using stream compression. So with that, I'm turning my attention to Roberto's delta encoding and exploring whether or not binary-optimized values can make a significant difference (as opposed to simply dropping in Huffman-encoded text everywhere).
> 
> I'm starting with dates first...
> 
> Right now, dates in http/1 requests are rather inefficient. The existing date-time format wastes a significant amount of space, albeit across relatively few headers. On the plus side, these tend to compress well, but because dates change frequently from request to request, they will be short-lived in the delta context.
> 
> Given this, I decided to run a test scenario comparing compression of RFC3339 dates as text against a compact binary encoding. I generated a sample of 100k random RFC3339 timestamps that variably include milliseconds and timezone offsets, used it to build a date-time-specific symbol map, and applied static Huffman coding. Then, on a further sample of 100k randomly generated timestamps, the compressed date value averaged 12-13 bytes (the average length of the uncompressed timestamp is 24 bytes). So, pretty good compression using a symbol tree specifically optimized for date-times.
> 
> By comparison, I devised a simple binary coding for dates using the following format:
> 
> +-+---+---+-------------------+
> |M|TZH|TZM|   year (16-bit)   |
> +-+---+---+-----+-------------+
> | month (4-bit) | day (5-bit) |
> +---------------+-------------+
> | hour (5-bit)  | minute (6)  |
> +---------------+-------------+
> | second (6 bit)| millis (31) |
> +---------------+-------------+
> |d|tz hrs (5 bit)| tz min (6) |
> +-----------------------------+
> 
> M, TZH and TZM are single bit flags. When M is set, the value includes a 31-bit millisecond field. When TZH is set, it includes timezone offset hours, and when TZM is set, it includes timezone offset minutes. The d field (last row) is a single bit indicating positive or negative timezone offset.
> 
> The minimum possible binary encoding is 6 bytes, which includes the three flag bits, year, month, day, hour, minute and second. The maximum possible encoding is 11 bytes, which includes the full timezone offset and milliseconds. This gives an average encoding of 8 bytes over any sample size of randomly generated timestamps.
> 
> While the binary encoding is certainly more efficient, I'm not yet certain whether those 4 bytes are worth the effort, but it does improve the overall compression ratio for the message as a whole.
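
As a rough illustration of the layout quoted above, here is a minimal Python sketch. It assumes the fields are packed most-significant-bit first in the order drawn, that the d bit is present whenever either timezone flag is set, and that the result is zero-padded to a whole octet. The function name pack_date and those packing rules are assumptions made for the example, not part of the proposal.

```python
def pack_date(year, month, day, hour, minute, second,
              millis=None, tz_hours=None, tz_minutes=None, tz_negative=False):
    """Pack a timestamp into the quoted bit layout (hypothetical helper)."""
    bits = 0    # accumulated bit string, held as an integer
    nbits = 0   # number of bits written so far

    def put(value, width):
        nonlocal bits, nbits
        if not 0 <= value < (1 << width):
            raise ValueError(f"{value} does not fit in {width} bits")
        bits = (bits << width) | value
        nbits += width

    # Three single-bit flags: M, TZH, TZM.
    put(1 if millis is not None else 0, 1)
    put(1 if tz_hours is not None else 0, 1)
    put(1 if tz_minutes is not None else 0, 1)

    # Mandatory fields, with the widths given in the diagram.
    put(year, 16)
    put(month, 4)
    put(day, 5)
    put(hour, 5)
    put(minute, 6)
    put(second, 6)

    # Optional fields, present only when the matching flag is set.
    if millis is not None:
        put(millis, 31)
    if tz_hours is not None or tz_minutes is not None:
        put(1 if tz_negative else 0, 1)   # d: offset sign (assumed placement)
    if tz_hours is not None:
        put(tz_hours, 5)
    if tz_minutes is not None:
        put(tz_minutes, 6)

    # Assumption: pad with zero bits up to the next octet boundary.
    pad = (-nbits) % 8
    return (bits << pad).to_bytes((nbits + pad) // 8, "big")
```

Under these assumptions the mandatory fields total 45 bits, which rounds up to the 6-byte minimum, and the full form totals 88 bits, which is exactly the 11-byte maximum.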

Just for comparison - in the simple encoding, I get it to 8 bytes (at least before year 2038), and it's textual (just the hex for the number of seconds since the epoch).

E.g.,

HTTP/1: Sat, 03 Nov 2012 13:05:03 GMT
Simple: 5095167f
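
A minimal Python sketch of that conversion, assuming the value is the lowercase hex of whole seconds since the Unix epoch in UTC; the function names are made up for the example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def http_date_to_hex(value):
    """Convert an HTTP/1 date string to hex seconds since the epoch."""
    dt = parsedate_to_datetime(value)   # timezone-aware for a "GMT" date
    return format(int(dt.timestamp()), "x")


def hex_to_http_date(value):
    """Convert hex seconds since the epoch back to an HTTP/1 date string."""
    dt = datetime.fromtimestamp(int(value, 16), tz=timezone.utc)
    return format_datetime(dt, usegmt=True)


print(http_date_to_hex("Sat, 03 Nov 2012 13:05:03 GMT"))  # 5095167f
print(hex_to_http_date("5095167f"))  # Sat, 03 Nov 2012 13:05:03 GMT
```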


> Either way, regardless of whether we huffman code or binary code the date values, we should require that RFC3339/ISO8601 timestamps be used for all date headers within the http/2 header encoding as those are going to compress much better than the current http/1 date format.
> 
> Entity Tags are another area where binary values may be useful. Currently, ETag values generally tend to be hex or base64 encoded binary data.

That's a big assumption!


> By simply allowing the etag to be dropped in as a set of bytes in the encoded header we can cut the transmitted size of those tags in half. The format I'm considering for these is:
> 
>    +-+------+-----------+
>    |W|len(7)| octets... |
>    +-+------+-----------+
> 
> Where W is a single-bit flag indicating whether the tag is weak, and len is the number of encoded octets in the entity tag. (I'm wondering, tho, whether or not we could get away with dropping the entire concept of a "weak entity tag".)

We decided not to drop it in HTTPbis, so assume that it'll stay (given we need to be able to convert 1 to 2 and back).
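
For reference, a minimal Python sketch of the quoted W/len/octets layout, assuming W is the high bit of the first octet and len occupies the remaining 7 bits (so at most 127 octets). The function names and the sample tag value are hypothetical, and the sketch only applies when the tag really is hex-encoded binary.

```python
def encode_etag(octets, weak=False):
    """Encode raw entity-tag octets as W(1) | len(7) | octets (assumed bit order)."""
    if len(octets) > 0x7F:
        raise ValueError("entity tag longer than 127 octets")
    first = (0x80 if weak else 0x00) | len(octets)
    return bytes([first]) + bytes(octets)


def decode_etag(buf):
    """Return (weak, octets) from a buffer produced by encode_etag."""
    weak = bool(buf[0] & 0x80)
    length = buf[0] & 0x7F
    return weak, bytes(buf[1:1 + length])


# Hypothetical tag whose text form is 18 hex characters.
raw = bytes.fromhex("686897696a7c876b7e")
encoded = encode_etag(raw, weak=True)
print(len(encoded))  # 10 octets: 1 for W/len plus 9 of payload
```

Keeping the W bit in the encoding means a weak tag survives a round trip between the two formats.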


> By optimizing dates and entity tags this way, we end up with optimized encodings for a good number of commonly used headers (date, last-modified, expires, etag, if-none-match, if-match, if-modified-since, etc), and we can eliminate the need for doing any compression on those values at all.
> 
> Another set of headers we can optimize within delta are the numeric values for Content-Length, :status, Expires, etc. Rather than encoding those as ascii strings, we would simply encode them as their numeric value.
> 
> Will be turning my attention to cookie values next. I'm considering whether or not we should produce a code-tree that is specific to cookie headers and/or allow for purely binary values.

I could imagine setting a parameter on Set-Cookie that indicates its content is encoded in a certain way, which can be replayed as binary data. However, that information would also need to be in Cookie, which I *think* necessitates a new request header -- maybe Bookie?


--
Mark Nottingham   http://www.mnot.net/
Received on Wednesday, 16 January 2013 22:29:28 UTC