Re: delta encoding and state management

On Thu, Jan 17, 2013 at 2:52 PM, Roberto Peon <grmocg@gmail.com> wrote:
> Can you support that with data? The data I've seen thusfar strongly
> disproves that assertion...

No need: it follows from basic information theory that a generic
compression algorithm cannot do better than a minimal encoding of the
same data.  The key is that the encoding has to be truly minimal,
otherwise all bets are off.

Of course, the encoding may not really be always minimal.  A varint
encoding of seconds since epoch is not minimal if the epoch is far in
the past, but resetting the epoch has its costs, and gzip and friends
won't know anything about that, -- in time a datetime encoding like
seconds-since-epoch will become less than minimal.

We probably cannot get truly minimal encodings of everything we care
about, but as we saw with the datetime encoding/compression
sub-thread, in some cases we can do much better than generic encoding.
 The thing about generic compression is that it has to look for
repeating patterns, but some patterns are hard to detect, and
dictionaries add overhead, which is why we stand a very good chance of
doing at least a satisfactory job with minimal encodings.  And if we
can avoid long-term per-connection compression state then that's a big
architectural win.  It doesn't have to be the case that we compress
better this way than the other, just well enough.

Nico
--

Received on Thursday, 17 January 2013 21:19:32 UTC