Re: bohe and delta experimentation... from Frédéric Kayser on 2013-01-17 (ietf-http-wg@w3.org from January to March 2013)

From: Frédéric Kayser <f.kayser@free.fr>
Date: Thu, 17 Jan 2013 21:16:25 +0100 (CET)
To: Roberto Peon <grmocg@gmail.com>, ietf-http-wg@w3.org
Message-ID: <1189053143.221044883.1358453785232.JavaMail.root@zimbra71-e12.priv.proxad.net>
Hello,
State-of-the-art compressors undo base64 encoding before compression and redo it during decompression (this would be a nice feature to have when dealing with HTML/CSS files that embed files using the data URI scheme, RFC 2397, but that is not in the headers).

I find BOHE interesting, but I don't like the idea of mixing what is basically ASCII text (7-bit values, with most of the C0 control codes never used) and binary data (full 8-bit values; the way it is used in BOHE also means a lot of values at the low end, near zero).
An entropy coder will have to cope with a larger alphabet than just ASCII, since it is "polluted" by binary data (a symbol that appears infrequently may end up with a code longer than 8 bits).

I would like to keep BOHE's main principles (build a pre-agreed dictionary of the most common terms, record numerical values in "binary" rather than text) but without introducing 8-bit values (avoiding the [00..1F] range would be nice too). This could lead to smaller Huffman tables (possibly with order-1 context modelling) and less header overhead. A very fast transformation that packs eight 7-bit values into seven 8-bit bytes could also be used instead of a full-featured entropy coder (Huffman or arithmetic coding); it could be useful for HTTP servers to have different compression options.
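
To make the 8-to-7 packing concrete, here is a minimal sketch in Python. The function names pack7 and unpack7 are made up for this illustration and are not part of BOHE or any draft; the bit order (most significant bit first) and the zero padding of the last byte are assumptions as well.

```python
def pack7(values: bytes) -> bytes:
    """Pack 7-bit values (0..127) into bytes: eight values fit in seven bytes."""
    acc = 0      # bit accumulator
    nbits = 0    # number of pending bits held in acc
    out = bytearray()
    for v in values:
        if v > 0x7F:
            raise ValueError("not a 7-bit value")
        acc = (acc << 7) | v
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1          # keep only the pending bits
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # zero-pad the last byte
    return bytes(out)


def unpack7(packed: bytes, count: int) -> bytes:
    """Inverse of pack7; count is the number of 7-bit values to recover."""
    acc = 0
    nbits = 0
    out = bytearray()
    for b in packed:
        acc = (acc << 8) | b
        nbits += 8
        while nbits >= 7 and len(out) < count:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1
    return bytes(out)


packed = pack7(b"Accept: ")
print(len(b"Accept: "), len(packed))
print(unpack7(packed, 8))
```

The receiver needs the value count (or a terminator) to tell padding bits from data, which is one design detail such a scheme would have to settle.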

Another point that has not been mentioned here is internationalisation:
http://www.w3.org/International/articles/idn-and-iri/

Why not get rid of Punycode and percent-escaping in the headers and convert them to plain UTF-8 (which could be followed by a UTF-8 to 7-bit transformation)?
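
As a rough illustration of the sizes involved, the following Python sketch decodes one Punycode host name and one percent-escaped path to UTF-8 and prints the byte counts. The host name and path are invented examples, and a single pair of strings says nothing about savings in general.

```python
from urllib.parse import unquote

# Hypothetical internationalised host name in its Punycode (ACE) form.
host_ace = b"xn--bcher-kva.example"
host_utf8 = host_ace.decode("idna").encode("utf-8")

# Hypothetical percent-escaped path.
path_escaped = "/caf%C3%A9"
path_utf8 = unquote(path_escaped).encode("utf-8")

print(len(host_ace), len(host_utf8))
print(len(path_escaped), len(path_utf8))
```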

Regards

----- Original message -----
From: "Roberto Peon" <grmocg@gmail.com>
To: "Nico Williams" <nico@cryptonector.com>
Cc: "James M Snell" <jasnell@gmail.com>, ietf-http-wg@w3.org
Sent: Thursday, 17 January 2013 00:13:24
Subject: Re: bohe and delta experimentation...


Everything ends up as binary on either side today. So long as it arrives in the form that was transmitted, it doesn't matter what the encoding is. 
Cookies today are all mostly encoded in something from that binary, with the most likely being base-64 encoding. Base-64 encoding is highly compressible (using entropy coding). Encryption makes LZ77 and its ilk not efficient, but that is a separate thingie entirely :) 
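
The point about base-64 and entropy coding can be checked with a small Python sketch: base-64 uses only 64 distinct symbols, so each 8-bit character carries at most 6 bits of information. The helper name entropy_bits_per_symbol and the use of seeded pseudo-random bytes as a stand-in for an opaque cookie value are assumptions made for this example.

```python
import base64
import math
import random
from collections import Counter


def entropy_bits_per_symbol(data: bytes) -> float:
    """Empirical order-0 entropy of data, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


rng = random.Random(0)  # fixed seed so the run is repeatable
raw = bytes(rng.getrandbits(8) for _ in range(3000))  # stand-in for opaque data
encoded = base64.b64encode(raw)

print(len(raw), len(encoded))
print(round(entropy_bits_per_symbol(encoded), 2))
```

An ideal order-0 entropy coder would therefore bring the base-64 text back to roughly three quarters of its size, which is the size of the underlying binary and no better.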


-=R 



On Wed, Jan 16, 2013 at 2:47 PM, Nico Williams < nico@cryptonector.com > wrote: 



On Wed, Jan 16, 2013 at 4:07 PM, James M Snell < jasnell@gmail.com > wrote: 
> After going through a number of scenarios with bohe using a variety of 
> stream-compression scenarios it's painfully obvious that there is really no 
> way around the CRIME issue when using stream-compression. So with that, I'm 
> turning my attention to the use of Roberto's delta encoding and exploring 
> whether or not binary optimized values can make a significant difference (as 
> opposed to simply dropping in huffman-encoded text everywhere). 

Well, we could pursue enhanced session continuation (Phillip's term 
for using MACs taken over nonces, among other things, instead of or in 
addition to cookies). 

But I think it'd be nice to explore compression of header names and of 
some header values. So let's, 


> I'm starting with dates first... 
> 
>[...] 

> 
> By comparison, I devised a simple binary coding for dates using the 
> following format: 
> 
> +-+---+---+-------------------+ 
> |M|TZH|TZM| year (16-bit) | 
> +-+---+---+-----+-------------+ 
> | month (4-bit) | day (5-bit) | 
> +---------------+-------------+ 
> | hour (5-bit) | minute (6) | 
> +---------------+-------------+ 
> | second (6 bit)| millis (31) | 
> +---------------+-------------+ 
> |d|tz hrs (5 bit)| tz min (6) | 
> +-----------------------------+ 
> 
> M, TZH and TZM are single bit flags. When M is set, the value includes a 
> 31-bit millisecond field. When TZH is set, it includes timezone offset 
> hours, and when TZM is set, it includes timezone offset minutes. The d field 
> (last row) is a single bit indicating positive or negative timezone offset. 

You don't need 31 bits for milliseconds; 10 will do! But sure, it's 
nice to be able to get to microseconds, in which case 20 bits should 
suffice, or nanoseconds, in which case 30 bits should suffice. In no 
case do we need 31 bits for fractions of seconds. But at best we save 
21 bits -- two bytes, or, if we're lucky, three. 
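
The bit counts above are easy to verify. A small Python sketch, where bits_for is a name chosen for this example:

```python
def bits_for(distinct_values: int) -> int:
    """Smallest number of bits that can represent 0 .. distinct_values - 1."""
    return (distinct_values - 1).bit_length()


for name, per_second in [("milliseconds", 10**3),
                         ("microseconds", 10**6),
                         ("nanoseconds", 10**9)]:
    print(name, bits_for(per_second))
```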


> The minimum possible binary encoding is 6-bytes, which includes the first 
> three flag bits, year, month, day, hour, minute and second. The maximum 
> possible encoding is 11-bytes which includes full timezone offset and 
> milliseconds. Giving an average encoding of 8-bytes over any sample size of 
> randomly generated timestamps. 
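
The 6-byte and 11-byte figures follow from the field widths in the diagram. The Python sketch below adds them up; encoded_bytes is a name invented here, and the rule that the sign bit d is present whenever either timezone flag is set is an assumption, since the message does not spell it out.

```python
# Always-present fields: 3 flag bits, year, month, day, hour, minute, second.
BASE_BITS = 3 + 16 + 4 + 5 + 5 + 6 + 6


def encoded_bytes(millis: bool = False,
                  tz_hours: bool = False,
                  tz_minutes: bool = False) -> int:
    """Size in whole bytes of one encoded date for a given set of flags."""
    bits = BASE_BITS
    if millis:
        bits += 31                 # millisecond field as given in the diagram
    if tz_hours or tz_minutes:
        bits += 1                  # sign bit d (assumed tied to either TZ flag)
    if tz_hours:
        bits += 5
    if tz_minutes:
        bits += 6
    return (bits + 7) // 8         # round up to whole bytes


print(encoded_bytes())                       # minimum
print(encoded_bytes(True, True, True))       # maximum
```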

But if everyone chooses to send the max then it's 11 vs. the 12 you 
got with date string compression. Too trivial a gain? 

Of course, an encoding that uses, say, 44 bits for twos-complement (do 
we need negative dates for this?) seconds since the Unix epoch + 20 
for microseconds would always be 8 bytes, but we'd get no TZ 
information, and TZ info would require at least two more bytes so... 
we're back to about 10-12 bytes. If we could do with just 34 bits for 
seconds w/o negative dates we're getting closer to always 8 bytes. 
And if we could do with just 33 bits for seconds ... we'd get to 
exactly 8 bytes but at the price of a year-2242 problem. 
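
For reference, the range of an unsigned seconds counter starting at the Unix epoch can be estimated as follows. This is a sketch only: rollover_year is a made-up name, and it uses the mean Gregorian year length rather than exact calendar arithmetic.

```python
SECONDS_PER_YEAR = 31_556_952  # mean Gregorian year, 365.2425 days


def rollover_year(bits: int, epoch_year: int = 1970) -> int:
    """Approximate calendar year in which an unsigned seconds counter wraps."""
    return epoch_year + (2 ** bits) // SECONDS_PER_YEAR


for bits in (33, 34):
    print(bits, rollover_year(bits))
```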

What if we use julian day? Then we'd need 31 bits for days (which 
allows us to go 1000 years into the future), 16 bits for seconds and 
milliseconds, and now we're at 6 bytes + two more for TZ data. And if 
we encode TZ offset in terms of 15 minute increments then we get down 
to just 7 bytes for the whole thing. Seven bytes is pretty good, but 
is it good enough to bother with this? 
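
The 15-minute timezone encoding could look like the Python sketch below. The names encode_tz and decode_tz and the choice of a 7-bit two's-complement field are assumptions for illustration; the message only says that offsets would be counted in 15-minute increments.

```python
def encode_tz(offset_minutes: int) -> int:
    """Encode a UTC offset as a 7-bit two's-complement count of quarter hours."""
    if offset_minutes % 15:
        raise ValueError("offset is not a multiple of 15 minutes")
    quarters = offset_minutes // 15
    if not -64 <= quarters <= 63:
        raise ValueError("offset does not fit in 7 bits")
    return quarters & 0x7F


def decode_tz(code: int) -> int:
    """Inverse of encode_tz; returns the offset in minutes."""
    quarters = code - 128 if code & 0x40 else code
    return quarters * 15


for minutes in (-720, 0, 330, 345, 840):   # -12:00, UTC, +05:30, +05:45, +14:00
    print(minutes, encode_tz(minutes), decode_tz(encode_tz(minutes)))
```

Seven bits cover -16:00 to +15:45, so the whole offset fits in a single byte with one bit to spare.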

We can do slightly better if we don't allow dates in the past, set a 
new epoch, and limit how far into the future our dates will go (we can 
always allow for encoding far-future dates with many more bytes). I 
think we can probably get down to 6 bytes for dates, including TZ 
information and milliseconds for the next few decades then go up to 7 
bytes and so on. 


> Will be turning my attention to cookie values next. I'm considering whether 
> or not we should produce a code-tree that is specific to cookie headers 
> and/or allow for purely binary values. 

Where cookies bear encrypted session state you won't be able to 
compress them at all. And it's not like the server can't do the 
effort to set maximally-compressed cookies -- it should! IMO: leave 
cookies alone. 

Nico 
--
Received on Thursday, 17 January 2013 20:16:57 UTC