Re: UTF-8 or ASCII Header Names?

I view it as liberating: the compressor is freed from worrying about
normalization and the like, which, if done at all, should be done at a
higher layer.

There is currently exactly one field that the compressor makes assumptions
about, and we could change that by requiring that the HTTP layer do the
transformation of cookie into cookie-crumbs instead of having the
compressor do it. Right now the compressor knows nothing, semantically,
about anything else.
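
To make the cookie point concrete, here is a minimal sketch of what the HTTP layer might do before handing values to the compressor. The function names are hypothetical, and splitting on ';' is an assumption about how crumbs are delimited, not something the compressor draft specifies here.

```python
from typing import List


def split_cookie_into_crumbs(cookie_value: bytes) -> List[bytes]:
    """Split one Cookie header value into crumbs (individual name=value pairs)."""
    # Assumption: crumbs are separated by ';' and surrounding whitespace
    # is not significant.
    parts = (part.strip() for part in cookie_value.split(b";"))
    return [part for part in parts if part]


def join_crumbs(crumbs: List[bytes]) -> bytes:
    """Rebuild a single Cookie header value from its crumbs."""
    return b"; ".join(crumbs)


crumbs = split_cookie_into_crumbs(b"sid=abc123; theme=dark; lang=en")
print(crumbs)
print(join_crumbs(crumbs))
```

With this done above the compressor, each crumb is just another opaque byte string, and the compressor no longer needs to know what a cookie is.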

The Huffman encoder that we had, and will likely add back, worked on bytes.
It mostly encountered ASCII, so the frequency table was skewed to compress
ASCII better than other input, but it could still handle UTF-8, raw binary,
whatever.
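
A rough sketch of that behaviour, not the encoder we actually had: build Huffman code lengths over all 256 byte values from an ASCII-heavy sample, giving every byte a nonzero count so nothing is unencodable. The sample text, the +1 smoothing, and the helper names are all assumptions made for illustration.

```python
import heapq
from collections import Counter


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a symbol -> weight mapping."""
    # Heap entries are (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in left.items()}
        merged.update({sym: depth + 1 for sym, depth in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


# Hypothetical training data: header-like ASCII text.
sample = b"accept-encoding: gzip, deflate\r\nuser-agent: example/1.0\r\n" * 50
counts = Counter(sample)
# Every byte value gets weight >= 1, so UTF-8 and raw binary stay encodable.
freqs = {byte: counts.get(byte, 0) + 1 for byte in range(256)}
lengths = huffman_code_lengths(freqs)


def encoded_bits(data: bytes) -> int:
    """Number of bits needed to encode data with the table above."""
    return sum(lengths[byte] for byte in data)


print(encoded_bits(b"accept"), "bits for 6 ASCII bytes")
print(encoded_bits(bytes([0x80, 0xFF, 0x00, 0x9C, 0xE2, 0x82])), "bits for 6 non-ASCII bytes")
```

The table works purely on byte frequencies: common ASCII bytes get short codes, bytes it never saw get long ones, and nothing is rejected.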

I could certainly see an eventual future where some values are just raw
binary. Sure, the Huffman-based encoder would not compress that very well,
but that is OK: the binary representation should already be fairly small
compared with the B64 encoding we do today (I'd rather have the data remain
the same size than get a 30% decrease after a 4/3x expansion, which is what
would happen today...), and an escape valve of not having to use the
Huffman encoding has always been the plan.
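
For a sense of the sizes involved, a small worked example. Base64 maps every 3 input bytes to 4 output characters; the 300-byte value below is just a stand-in, and the "best case" line assumes the underlying binary is itself incompressible.

```python
import base64

# Stand-in for an opaque binary header value (300 bytes).
raw = bytes(i % 256 for i in range(300))
b64 = base64.b64encode(raw)

# Base64 emits 4 characters for every 3 input bytes.
expansion = len(b64) / len(raw)

# Each Base64 character carries only 6 bits of payload in an 8-bit byte,
# so an ideal entropy coder over the Base64 text can at best claw back
# the expansion (assuming the raw bytes are incompressible).
best_case_after_coding = len(b64) * 6 / 8

print(len(raw), len(b64), round(expansion, 3), best_case_after_coding)
```

In other words, compressing the Base64 form mostly undoes the expansion, whereas sending the raw bytes avoids it in the first place.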

We could still allow compressors to do things with semantic knowledge,
but there is no need to *require* it by declaring the type of all values
a priori. Simply require that any transformation the compressor does must
not change the semantic meaning of the value. Problem solved, I think.

-=R


On Fri, Aug 16, 2013 at 9:19 AM, Martin Thomson <martin.thomson@gmail.com> wrote:

> On 16 August 2013 08:44, Roberto Peon <grmocg@gmail.com> wrote:
> > The keys should be ASCII, and the values bytes.
>
> That's a fairly narrow view.  If the values were (for example) ASCII,
> then you'd have an opportunity to compress better.  At worst, you can
> wipe the high order bit from every octet.
>
> At some level you are going to need to either make assumptions about
> the properties of values, or rely on specific knowledge about them if
> you are going to compress effectively.  Even if it were the case that
> the bytes were UTF-8, you could still make some gains over pure bytes
> (even just by exploiting the fact that certain byte sequences are not
> possible in UTF-8).
>

Received on Friday, 16 August 2013 16:30:25 UTC