Re: bohe and delta experimentation... from Frédéric Kayser on 2013-01-18 (ietf-http-wg@w3.org from January to March 2013)

From: Frédéric Kayser <f.kayser@free.fr>
Date: Fri, 18 Jan 2013 09:11:48 +0100
To: ietf-http-wg@w3.org, Nico Williams <nico@cryptonector.com>
Cc: Roberto Peon <grmocg@gmail.com>
Message-Id: <4E80B9B4-1B3D-4452-A1B1-D7C1959C2755@free.fr>

Well, it's basic information theory, the header in its text form can be seen as an information source (with it's own alphabet and own probability distribution), binary data is a different one (bigger alphabet) with a different probability distribution (for instance with a bias toward zero if references to common strings are numbered starting from zero).

If you treat these two different information sources as a single source (since they end up in the same file -header in this case-) you'll have to cope with a bigger alphabet and a probability distribution that never really fits either of the sources (this can be seen like two players speaking different languages trying to play Scrabble forming words in their mother tongue but the game set is in fact designed for a third language or rather something like Esperanto, it will work but only up to a point and with a limited gaming experience).

For better compression both sources should be handled separately and this could be done with BOHE since fields are being tagged as text or binary (we could even imagine handling it as 3 sources:  text fields, binary fields and BOHE structure), but for relatively small data samples (hopefully headers are smaller than 4 kB) providing 2 or 3 Huffman tables could prove itself inefficient because of the overhead involved.
Dynamic Huffman could be a solution but it implies more computation and may take to much time to reflect the underlying probability distribution (slow start).
Another way would be to mimic one information source characteristics with the other one, this works if data coming from one source outnumbers the other one (for instance 2 or 3 times more text than binary data) in this case binary data could be transformed to look like text (limited to 7-bits, avoid C0 control codes) it works even better if a bias in one source can be remapped in the other one (i.e. binary values near zero become lowercase letters).

If the compression method cannot use LZ77/LZW/BWT (to hinder CRIME attacks) we should pay special attention to data modeling and the entropy coders to achieve interesting gains.

Frédéric

Le 17 janv. 2013 à 21:50, Nico Williams a écrit :

> On Thu, Jan 17, 2013 at 2:16 PM, Frédéric Kayser <f.kayser@free.fr> wrote:
>> I find BOHE interesting but I don't like the idea of mixing what is basically ASCII text (7 bits values with most of the C0 control codes never used) and binary data (full 8 bits wide values, the way it's used in BOHE it also means a lot of values in the low end -near zero-).
> 
> This is silly: we already have text and binary mixed all over the
> place, such as in DNS, PKIX, Kerberos, SSHv2, and so on, and that's
> never been a problem.  And how else should we encode textual values in
> binary protocols anyways?  (I mean, besides using a Unicode encoding
> rather than ASCII.)
> 
>> An entropy coder will have to cope with a larger alphabet than just ASCII since it is "polluted" by binary data (if it appears infrequently it may end up taking more than 8-bits).
> 
> Well, yes, so what?  We already apply generic compression algorithms
> to so much mixed text and binary.  Clearly information theory does not
> say it's impossible.
> 
> Nico
> --
>

Received on Friday, 18 January 2013 08:12:18 UTC