Re: Delta Compression and UTF-8 Header Values

On Saturday, February 9, 2013, "Martin J. Dürst" wrote:

> Hello James, others,
>
> On 2013/02/09 4:28, James M Snell wrote:
>
>> One key challenge with allowing UTF-8 values, however, is that it
>> conflicts with the use of the static huffman encoding in the proposed
>> Delta Encoding for header compression. If we allow for non-ascii
>> characters, the static huffman coding simply becomes too inefficient
>> and unmanageable to be useful. There are a few ways around it but none
>> of the strategies are all that attractive.
>
>
Wait, what?  If you have non-English (worse, non-European) text in some
ASCII-compatible encoding such as Punycode or base64-encoded UTF-8, then
static Huffman will not be useful for compression anyway (assuming the
Huffman table is based on, say, English letter frequencies).
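
A rough sketch of that point, not taken from the Delta Encoding proposal:
build a static Huffman table from a small English-looking header sample,
then cost an English header value, raw UTF-8 bytes, and the base64 form of
the same UTF-8 bytes against it. The training sample, the add-one
smoothing, the example values, and all function names are assumptions made
for illustration.

```python
import base64
import heapq
from collections import Counter


def huffman_code_lengths(weights):
    """Return {symbol: code length in bits} for a Huffman code over weights."""
    # Heap entries: (subtree weight, unique tie-breaker, {symbol: depth}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(sorted(weights.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def static_table(training: bytes):
    """Static table over all 256 byte values (add-one smoothing, an assumption)."""
    counts = Counter(training)
    return huffman_code_lengths({b: counts.get(b, 0) + 1 for b in range(256)})


def total_bits(table, data: bytes) -> int:
    return sum(table[b] for b in data)


# Made-up, English-flavoured training sample (assumption, not real traffic).
TRAINING = (
    b"accept: text/html,application/xhtml+xml;q=0.9 "
    b"accept-language: en-us,en;q=0.8 "
    b"cache-control: no-cache, max-age=0 "
    b"user-agent: the quick brown fox jumps over the lazy dog "
) * 50

table = static_table(TRAINING)

english = b"content-disposition: attachment; filename=annual-report.txt"
raw_utf8 = "年次報告書.txt".encode("utf-8")
b64 = base64.b64encode(raw_utf8)

print("english  bits/byte:", total_bits(table, english) / len(english))
print("raw utf8 bits/byte:", total_bits(table, raw_utf8) / len(raw_utf8))
print("base64   bits/char:", total_bits(table, b64) / len(b64))
print("base64 bits per original byte:", total_bits(table, b64) / len(raw_utf8))
```

Under these assumptions the base64 form should cost more bits than simply
sending the raw bytes uncompressed, because base64 output is spread over
upper-case letters and digits that the table rarely or never saw. A table
trained on real header traffic would differ in the details.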


> [If somebody has pointers to actual code, that would be appreciated. I
> can't work on it for the next two weeks, but after that, I should be able
> to use a day or two to see what's possible.]
>
> For a static Huffman encoding, you have to decide what symbols you accept
> as input, give every symbol a probability (these have to add up to 1) and
> then you get the 'optimal' "comma-free" encoding using the algorithm
> devised by Huffman. Optimal is under the assumptions that the probabilities
> are correct (and independent) and that you have to use an integral number
> of bits per symbol. Arithmetic coding gets rid of the second restriction;
> to get rid of the first, one creates a more complex model. Comma-free just
> means you don't have to guess where the bits for one symbol end and those
> for the next symbol start.


Right.  I hope I put it more succinctly above.
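
The integral-bits caveat in the quoted paragraph can be seen with a toy
example. The sketch below builds Huffman code lengths for two made-up
probability tables (both are assumptions, as are the function names) and
compares the expected code length with the Shannon entropy.

```python
import heapq
import math


def huffman_code_lengths(probs):
    """Return {symbol: code length in bits} for the given probabilities."""
    # Heap entries: (subtree probability, unique tie-breaker, symbols in subtree).
    heap = [(p, i, [sym]) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in probs}
    next_id = len(heap)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        # Each merge adds one bit to every symbol in the two merged subtrees.
        for sym in s1 + s2:
            lengths[sym] += 1
        heapq.heappush(heap, (p1 + p2, next_id, s1 + s2))
        next_id += 1
    return lengths


def expected_bits(probs, lengths):
    return sum(p * lengths[sym] for sym, p in probs.items())


def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


# Hypothetical distributions, chosen only to illustrate the two cases.
dyadic = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
skewed = {"a": 0.9, "b": 0.05, "c": 0.03, "d": 0.02}

for name, probs in (("dyadic", dyadic), ("skewed", skewed)):
    lengths = huffman_code_lengths(probs)
    print(name, lengths, expected_bits(probs, lengths), entropy_bits(probs))
```

When every probability is a power of 1/2, the Huffman code meets the
entropy exactly. With the skewed table the most likely symbol still costs
a whole bit, so the expected length stays well above the entropy; that gap
is what arithmetic coding removes.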

The fact is that no single static Huffman table can be efficient for all
our scripts at once.  So static Huffman coding is not a good reason to
reject UTF-8 or any other Unicode encoding.

Nico
--

Received on Sunday, 10 February 2013 08:17:29 UTC