Re: UTF-8 or ASCII Header Names?

On Fri, Aug 16, 2013 at 9:29 AM, Roberto Peon <grmocg@gmail.com> wrote:
> I view it as liberating-- as the compressor is now freed from worrying about
> normalization, etc. which, if done, should be done at a higher layer.
>

FWIW, I don't believe anyone had said anything about normalization...
valid UTF-8 octets, yes, but not normalization. The compression
mechanism is really not affected by whether or not we say UTF-8
here...

- James

> There is currently exactly one field that the compressor makes assumptions
> about and we could change that by requiring that the HTTP-layer do the
> transformation of cookie into cookie-crumbs instead of having the compressor
> do it. Semantically, the compressor currently knows nothing about anything
> else.
>
> The Huffman encoder that we had, and will likely add back, worked on bytes.
> It mostly encountered ASCII, so the frequency table was skewed to compress
> ASCII better than other things, but it could still handle UTF-8, raw binary,
> whatever.
>
> I could certainly see an eventual future where some values are just raw
> binary.
> Sure, the Huffman-based encoder would not compress that very well, but that
> is OK-- the binary representation should already be fairly small in
> comparison to the Base64 encoding we do today (I'd rather have the data
> remain the same size than get a 30% decrease after a 4/3X expansion, which
> is what would happen today...), and an escape valve of not having to use the
> Huffman encoding has always been the plan.
>
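The size argument above can be tried out with a small sketch. This is not the encoder discussed in the thread: the training text, the add-one smoothing, and the names `TRAINING`, `huffman_code_lengths`, `huffman_bits`, and `best_size_bits` are all assumptions made for illustration. It builds a byte-level Huffman code from an ASCII-heavy sample, then compares an ASCII header value with a raw binary value, both as Base64 and with an escape to raw octets.

```python
import base64
import heapq
from collections import Counter

# Hypothetical training sample: ASCII-heavy header text, repeated so that its
# bytes dominate the frequency table (an assumption made for illustration).
TRAINING = b"accept-encoding: gzip, deflate\r\nuser-agent: Mozilla/5.0\r\n" * 50


def huffman_code_lengths(sample: bytes) -> dict:
    """Build a byte-level Huffman code and return {byte value: length in bits}.

    Every one of the 256 byte values gets a count of at least 1, so the code
    can represent any octet, not only the ones seen in `sample`.
    """
    freq = Counter(range(256))
    freq.update(sample)
    # Heap entries are (weight, unique tiebreak, {byte: depth so far}).
    heap = [(w, sym, {sym: 0}) for sym, w in freq.items()]
    heapq.heapify(heap)
    tiebreak = 256
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in left.items()}
        merged.update((s, d + 1) for s, d in right.items())
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


LENGTHS = huffman_code_lengths(TRAINING)


def huffman_bits(data: bytes) -> int:
    """Size of `data` in bits when coded with the ASCII-skewed table."""
    return sum(LENGTHS[b] for b in data)


def best_size_bits(data: bytes) -> int:
    """Escape valve: fall back to raw octets when Huffman would be larger."""
    return min(huffman_bits(data), 8 * len(data))


header = b"accept-encoding: gzip, deflate"
binary = bytes(range(128, 256))  # stand-in for a raw binary value
b64 = base64.b64encode(binary)

print("ASCII header:", 8 * len(header), "bits raw,", huffman_bits(header), "bits Huffman")
print("binary value:", len(binary), "octets raw,", len(b64), "octets as Base64")
print("binary value:", huffman_bits(binary), "bits Huffman,", best_size_bits(binary), "bits with escape")
```

With a table skewed this way, the ASCII header shrinks, the binary value grows under Huffman coding, and the escape keeps it at its raw size, which is still smaller than its Base64 form.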
> We could still allow for compressors to do things with semantic knowledge,
> but there is no need to *require* it by declaring the type of all values a
> priori.
> Simply require that any transformation the compressor does must not change
> the semantic meaning of the value. Problem solved, I think.
>
> -=R
>
>
> On Fri, Aug 16, 2013 at 9:19 AM, Martin Thomson <martin.thomson@gmail.com>
> wrote:
>>
>> On 16 August 2013 08:44, Roberto Peon <grmocg@gmail.com> wrote:
>> > The keys should be ASCII, and the values bytes.
>>
>> That's a fairly narrow view.  If the values were (for example) ASCII,
>> then you'd have an opportunity to compress better.  At worst, you can
>> wipe the high order bit from every octet.
>>
>> At some level you are going to need to either make assumptions about
>> the properties of values, or rely on specific knowledge about them if
>> you are going to compress effectively.  Even if it were the case that
>> the bytes were UTF-8, you could still make some gains over pure bytes
>> (even just by exploiting the fact that certain byte sequences are not
>> possible in UTF-8).
>
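The two gains mentioned above can be sketched as follows. Nothing here is taken from a draft or from anyone's implementation: `pack_ascii_7bit`, `unpack_ascii_7bit`, and `unused_utf8_octets` are hypothetical helper names. The first pair drops the high-order bit of each ASCII octet, and the last function finds, by brute force over every encodable code point, which octet values never appear in valid UTF-8.

```python
def pack_ascii_7bit(data: bytes) -> bytes:
    """Pack ASCII octets into 7 bits each, zero-padded to a whole octet."""
    if any(b > 0x7F for b in data):
        raise ValueError("input is not pure ASCII")
    acc = 0
    for b in data:
        acc = (acc << 7) | b
    nbits = 7 * len(data)
    pad = (-nbits) % 8  # padding bits needed to reach an octet boundary
    return (acc << pad).to_bytes((nbits + pad) // 8, "big")


def unpack_ascii_7bit(packed: bytes, count: int) -> bytes:
    """Inverse of pack_ascii_7bit; `count` is the original number of octets."""
    acc = int.from_bytes(packed, "big") >> ((-7 * count) % 8)
    return bytes((acc >> (7 * (count - 1 - i))) & 0x7F for i in range(count))


def unused_utf8_octets() -> list:
    """Return the octet values that occur in no valid UTF-8 encoding."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        used.update(chr(cp).encode("utf-8"))
    return sorted(set(range(256)) - used)


value = b"text/html"
packed = pack_ascii_7bit(value)
print(len(value), "octets ->", len(packed), "octets after dropping the high bit")
print("octets never seen in UTF-8:", [hex(b) for b in unused_utf8_octets()])
```

The 7-bit packing saves one octet in eight at best, and the unused octet values are only a small slice of the code space, so both gains are modest next to a frequency-based coder.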
>

Received on Friday, 16 August 2013 16:50:03 UTC