bohe and delta experimentation...

After going through a number of scenarios with bohe using a variety of
stream-compression approaches, it's painfully obvious that there is really
no way around the CRIME issue when using stream-compression. So with that,
I'm turning my attention to the use of Roberto's delta encoding and
exploring whether or not binary optimized values can make a significant
difference (as opposed to simply dropping in huffman-encoded text
everywhere).

I'm starting with dates first...

Right now, dates in http/1 requests are rather inefficient. The existing
date-time format wastes a significant amount of space, albeit across only
relatively few headers. On the plus side, these tend to compress well, but
given that the dates change frequently from request to request, they will
be short-lived in the delta context.

Given this, I decided to run a test scenario for compressing RFC3339 dates
as text vs. using a compact binary encoding. I generated a sample of 100k
random RFC3339 timestamps that variably include milliseconds and timezone
offsets, then used that sample to generate a date-time specific symbol map
for a static huffman coding. Given a second sample of 100k randomly
generated timestamps, the huffman-coded date value averaged 12-13 bytes,
against an average of 24 bytes for the uncompressed timestamp. So, pretty
good compression using a symbol tree specifically optimized for
date-times.
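
For anyone who wants to reproduce that kind of measurement, here is a rough
sketch in Python. It is not the code I ran: the timestamp generator, the
year range, the 50/50 odds for milliseconds and offsets, the 10k sample
size (to keep it quick) and all function names are assumptions made for
illustration. It builds a Huffman code from one sample and reports the
average raw and coded sizes for a second sample.

```python
import heapq
import random
from collections import Counter


def random_timestamp(rng):
    """Random RFC 3339 timestamp; millis and a numeric offset are optional."""
    text = "%04d-%02d-%02dT%02d:%02d:%02d" % (
        rng.randint(1970, 2037), rng.randint(1, 12), rng.randint(1, 28),
        rng.randint(0, 23), rng.randint(0, 59), rng.randint(0, 59))
    if rng.random() < 0.5:                      # assumed odds for millis
        text += ".%03d" % rng.randint(0, 999)
    if rng.random() < 0.5:                      # assumed odds for an offset
        text += "%s%02d:%02d" % (
            rng.choice("+-"), rng.randint(0, 13), rng.choice((0, 30)))
    else:
        text += "Z"
    return text


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman tree over freqs."""
    # Each heap entry is (weight, unique tiebreak, {symbol: depth so far}).
    heap = [(count, i, {sym: 0}) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in left.items()}
        merged.update({sym: depth + 1 for sym, depth in right.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def average_sizes(train, test):
    """Average raw length and Huffman-coded length (whole bytes) of test."""
    lengths = huffman_code_lengths(Counter("".join(train)))
    raw = sum(len(t) for t in test) / len(test)
    coded = sum((sum(lengths[ch] for ch in t) + 7) // 8 for t in test) / len(test)
    return raw, coded


rng = random.Random(1)
train = [random_timestamp(rng) for _ in range(10000)]
test = [random_timestamp(rng) for _ in range(10000)]
raw, coded = average_sizes(train, test)
print("average raw bytes: %.1f, average coded bytes: %.1f" % (raw, coded))
```

The exact numbers depend on how the sample is generated, so treat the
output as indicative only.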

By comparison, I devised a simple binary coding for dates using the
following format:

+-+---+---+-------------------+
|M|TZH|TZM|   year (16-bit)   |
+-+---+---+-----+-------------+
| month (4-bit) | day (5-bit) |
+---------------+-------------+
| hour (5-bit)  | minute (6)  |
+---------------+-------------+
| second (6 bit)| millis (31) |
+---------------+-------------+
|d|tz hrs (5 bit)| tz min (6) |
+-----------------------------+

M, TZH and TZM are single bit flags. When M is set, the value includes a
31-bit millisecond field. When TZH is set, it includes timezone offset
hours, and when TZM is set, it includes timezone offset minutes. The d
field (last row) is a single bit indicating positive or negative timezone
offset.

The minimum possible binary encoding is 6 bytes, which includes the first
three flag bits, year, month, day, hour, minute and second (45 bits, padded
to a byte boundary). The maximum possible encoding is 11 bytes, which
includes full timezone offset and milliseconds (88 bits). This gives an
average encoding of 8 bytes over any sample size of randomly generated
timestamps.
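
To make the layout concrete, here is a minimal Python sketch of an encoder
for it. A few details are assumptions on my part and not settled by the
diagram: fields are packed most-significant-bit first, the d bit is
written whenever either timezone flag is set, and the result is padded
with zero bits to a whole number of bytes. The function name encode_date
and its parameters are hypothetical.

```python
def encode_date(year, month, day, hour, minute, second,
                millis=None, tz_sign=1, tz_hours=None, tz_minutes=None):
    """Pack a timestamp into the bit layout above and return bytes."""
    bits = 0    # accumulated bit string, held as an integer
    nbits = 0   # number of bits written so far

    def put(value, width):
        nonlocal bits, nbits
        if not 0 <= value < (1 << width):
            raise ValueError("%d does not fit in %d bits" % (value, width))
        bits = (bits << width) | value
        nbits += width

    # Flag bits: M, TZH, TZM.
    put(1 if millis is not None else 0, 1)
    put(1 if tz_hours is not None else 0, 1)
    put(1 if tz_minutes is not None else 0, 1)

    # Mandatory fields: 16 + 4 + 5 + 5 + 6 + 6 = 42 bits.
    put(year, 16)
    put(month, 4)
    put(day, 5)
    put(hour, 5)
    put(minute, 6)
    put(second, 6)

    # Optional fields, present only when the matching flag is set.
    if millis is not None:
        put(millis, 31)
    if tz_hours is not None or tz_minutes is not None:
        put(0 if tz_sign >= 0 else 1, 1)   # d: offset direction (assumed 1 = negative)
    if tz_hours is not None:
        put(tz_hours, 5)
    if tz_minutes is not None:
        put(tz_minutes, 6)

    # Assumed padding: zero bits up to the next byte boundary.
    pad = -nbits % 8
    return (bits << pad).to_bytes((nbits + pad) // 8, "big")


shortest = encode_date(2013, 1, 16, 22, 8, 10)
longest = encode_date(2013, 1, 16, 22, 8, 10, millis=123,
                      tz_sign=-1, tz_hours=8, tz_minutes=0)
print(len(shortest), len(longest))   # 6 and 11 bytes
```

Under those assumptions the sizes line up with the 6-byte minimum and
11-byte maximum given above.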

While the binary encoding is certainly more efficient, I'm not yet certain
if the 4 or so bytes saved per date are worth the effort, but it does
improve the overall compression ratio for the message as a whole.

Either way, regardless of whether we huffman code or binary code the date
values, we should require that RFC3339/ISO8601 timestamps be used for all
date headers within the http/2 header encoding as those are going to
compress much better than the current http/1 date format.

Entity Tags are another area where binary values may be useful. Currently,
ETag values generally tend to be hex or base64 encoded binary data. By
simply allowing the etag to be dropped in as a set of bytes in the encoded
header we can cut the transmitted size of hex-encoded tags in half (and
base64-encoded tags by about a quarter). The format I'm considering for
these is:

   +-+------+-----------+
   |W|len(7)| octets... |
   +-+------+-----------+

Where W is a single bit flag indicating whether the tag is weak, and len is
the number of encoded octets for the entity tag. (I'm wondering, tho,
whether or not we could get away with dropping the entire concept of a
"weak entity tag".)
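
A minimal Python sketch of that format follows. The names encode_etag and
decode_etag are hypothetical, the sample tag value is made up, and I'm
assuming W occupies the high bit of the first octet with len in the low
seven bits, which caps a tag at 127 octets.

```python
def encode_etag(octets, weak=False):
    """Encode an entity tag as one W/len octet followed by the raw octets."""
    if len(octets) > 127:
        raise ValueError("entity tag longer than 127 octets")
    first = (0x80 if weak else 0x00) | len(octets)
    return bytes([first]) + octets


def decode_etag(data):
    """Return (weak, octets) from the encoding produced by encode_etag."""
    weak = bool(data[0] & 0x80)
    length = data[0] & 0x7F
    return weak, data[1:1 + length]


# A made-up tag that would normally be sent as 18 hex characters plus quotes.
hex_tag = "686897696a7c876b7e"
encoded = encode_etag(bytes.fromhex(hex_tag))
print(len(hex_tag), len(encoded))   # 18 characters of text vs. 10 octets
```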

By optimizing dates and entity tags this way, we end up with optimized
encodings for a good number of commonly used headers (date, last-modified,
expires, etag, if-none-match, if-match, if-modified-since, etc), and we can
eliminate the need for doing any compression on those values at all.

Another set of headers we can optimize within delta are the numeric values
for Content-Length, :status, Expires, etc. Rather than encoding those as
ascii strings, we would simply encode them as their numeric value.
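
I haven't settled on a wire format for the numeric values. As a sketch
only, assuming a minimal-length big-endian unsigned integer (with whatever
length prefix or framing the header encoding supplies left out), the
savings look like this in Python. The names encode_uint and decode_uint
are hypothetical.

```python
def encode_uint(value):
    """Encode a non-negative integer in the fewest big-endian octets."""
    if value < 0:
        raise ValueError("only non-negative values are supported")
    length = max(1, (value.bit_length() + 7) // 8)
    return value.to_bytes(length, "big")


def decode_uint(data):
    """Inverse of encode_uint."""
    return int.from_bytes(data, "big")


# :status 200 is 3 ASCII digits as text but 1 octet as a number;
# a Content-Length of 1048576 is 7 ASCII digits but 3 octets.
for text in ("200", "1048576"):
    encoded = encode_uint(int(text))
    print(text, len(text), len(encoded))
```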

Will be turning my attention to cookie values next. I'm considering whether
or not we should produce a code-tree that is specific to cookie headers
and/or allow for purely binary values.

- James

Received on Wednesday, 16 January 2013 22:08:10 UTC