Re: Delta Compression and UTF-8 Header Values

I'll point out that the only reason we're talking about a static
Huffman table is that it isn't safe to have a constantly mutating Huffman
encoding over the lifetime of the session.
Perhaps it is reasonable to negotiate a different table in the first
exchange, dunno, but it is certainly technically feasible.

That being said, dealing with UTF-8 and Unicode in metadata that isn't
being shown to the user seems silly to me.
For data that is shown to the user, however, picking one single encoding
for it would be nice, even if it is UTF-8 :)
-=R


On Sun, Feb 10, 2013 at 12:17 AM, Nico Williams <nico@cryptonector.com> wrote:

> On Saturday, February 9, 2013, "Martin J. Dürst" wrote:
>
>> Hello James, others,
>>
>> On 2013/02/09 4:28, James M Snell wrote:
>>
>>> One key challenge with allowing UTF-8 values, however, is that it
>>> conflicts with the use of the static huffman encoding in the proposed
>>> Delta Encoding for header compression. If we allow for non-ascii
>>> characters, the static huffman coding simply becomes too inefficient
>>> and unmanageable to be useful. There are a few ways around it but none
>>> of the strategies are all that attractive.
>>
>>
> Wait, what?  If you have non-English (worse, non-European) text in some
> ASCII encoding like punycode, or base64-encoded UTF-8, then static Huffman
> will not be useful for compression anyways (assuming Huffman coding is
> based on English -say- letter frequencies).
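
A rough sketch of that point, under loud assumptions: the training sentence, the add-one smoothing over printable ASCII, and the helper names `build_model` and `bits_per_char` are all made up for illustration, and the ideal code length (-log2 p) stands in for a real static Huffman table. It compares the per-character cost of English-like text with that of base64-encoded UTF-8 under the same English-derived model.

```python
import base64
import math
from collections import Counter

# Hypothetical training text standing in for "English letter frequencies".
ENGLISH_SAMPLE = (
    "the server sends the content type and the cache control header to the client"
)
# Symbol set assumed for the static table: printable ASCII only.
PRINTABLE_ASCII = [chr(c) for c in range(0x20, 0x7F)]


def build_model(sample):
    """Add-one smoothed symbol probabilities over printable ASCII."""
    counts = Counter(sample)
    total = len(sample) + len(PRINTABLE_ASCII)
    return {ch: (counts[ch] + 1) / total for ch in PRINTABLE_ASCII}


def bits_per_char(text, model):
    """Average ideal code length (-log2 p) per character of text."""
    return sum(-math.log2(model[ch]) for ch in text) / len(text)


model = build_model(ENGLISH_SAMPLE)
english = "the client sends the header"
# Three Japanese characters (U+65E5 U+672C U+8A9E), UTF-8 encoded, then base64.
encoded = base64.b64encode("\u65e5\u672c\u8a9e".encode("utf-8")).decode("ascii")

print("english bits/char:", round(bits_per_char(english, model), 2))
print("base64  bits/char:", round(bits_per_char(encoded, model), 2))
```

With this toy model the base64 text costs noticeably more bits per character than the English text, because its digits and upper-case letters are rare or absent in the training sample.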
>
>
>> [If somebody has pointers to actual code, that would be appreciated. I
>> can't work on it for the next two weeks, but after that, I should be able
>> to use a day or two to see what's possible.]
>>
>> For a static Huffman encoding, you have to decide what symbols you accept
>> as input, give every symbol a probability (these have to add up to 1) and
>> then you get the 'optimal' "comma-free" encoding using the algorithm
>> devised by Huffman. Optimal is under the assumptions that the probabilities
>> are correct (and independent) and that you have to use an integral number
>> of bits per symbol. Arithmetic coding gets rid of the second restriction;
>> to get rid of the first, one creates a more complex model. Comma-free just
>> means you don't have to guess where the bits for one symbol end and those
>> for the next symbol start.
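
The construction described above can be sketched roughly as follows. This is a minimal illustration, not code from any header compression proposal: the function names (`huffman_code`, `encode`, `decode`) and the four-symbol probability table are invented for the example. What the message calls "comma-free" is more commonly called a prefix-free code; the decoder below relies on exactly that property.

```python
import heapq
from itertools import count


def huffman_code(probabilities):
    """Build a prefix-free code from a {symbol: probability} mapping."""
    if len(probabilities) == 1:
        # Degenerate case: a lone symbol still needs one bit.
        return {symbol: "0" for symbol in probabilities}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees, prefixing their codes.
        p0, _, low = heapq.heappop(heap)
        p1, _, high = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in low.items()}
        merged.update({s: "1" + c for s, c in high.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]


def encode(text, code):
    """Concatenate the code words; no separators are needed."""
    return "".join(code[ch] for ch in text)


def decode(bits, code):
    """Read bits until they match a code word, emit it, start over."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


# Hypothetical probabilities; they add up to 1.
table = huffman_code({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
bits = encode("abacad", table)
print(table, bits, decode(bits, table))
```

For this table the code words come out 1, 2, 3 and 3 bits long, matching the integral-bits-per-symbol restriction: each length is exactly -log2 of the symbol's probability only because the probabilities here are powers of two.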
>
>
> Right.  I hope I put it more succinctly above.
>
> The fact is that Huffman coding for all our scripts at once just isn't
> possible.  Static Huffman coding is not a good reason to not want UTF-8 or
> any other Unicode encoding.
>
> Nico
> --
>

Received on Sunday, 10 February 2013 09:00:47 UTC