Re: Delta Compression and UTF-8 Header Values

Hello James, others,

On 2013/02/09 4:28, James M Snell wrote:
> One key challenge with allowing UTF-8 values, however, is that it
> conflicts with the use of the static huffman encoding in the proposed
> Delta Encoding for header compression. If we allow for non-ascii
> characters, the static huffman coding simply becomes too inefficient
> and unmanageable to be useful. There are a few ways around it but none
> of the strategies are all that attractive.

[If somebody has pointers to actual code, that would be appreciated. I 
can't work on it for the next two weeks, but after that, I should be 
able to use a day or two to see what's possible.]

For a static Huffman encoding, you have to decide which symbols you 
accept as input, give every symbol a probability (these have to add up 
to 1), and then you get the 'optimal' "comma-free" (more commonly 
called prefix-free) encoding using the algorithm devised by Huffman. 
Optimality holds under the assumptions that the probabilities are 
correct (and the symbols independent) and that you have to use an 
integral number of bits per symbol. Arithmetic coding removes the 
second restriction; to remove the first, one builds a more complex 
model. Prefix-free just means you don't have to guess where the bits 
for one symbol end and those for the next symbol start.
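
To make the mechanics concrete, here's a minimal Huffman-table builder 
in Python. The probabilities are made up for illustration; they're not 
from any proposal, and a real spec table would also canonicalize code 
lengths and cap the maximum length:

    import heapq
    from itertools import count

    def huffman_code(probs):
        """Build {symbol: bitstring} from {symbol: probability}."""
        tie = count()  # tie-breaker so heapq never has to compare dicts
        heap = [(p, next(tie), {sym: ""}) for sym, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            # Repeatedly merge the two least probable subtrees.
            p0, _, c0 = heapq.heappop(heap)
            p1, _, c1 = heapq.heappop(heap)
            merged = {s: "0" + bits for s, bits in c0.items()}
            merged.update({s: "1" + bits for s, bits in c1.items()})
            heapq.heappush(heap, (p0 + p1, next(tie), merged))
        return heap[0][2]

    # Toy probabilities, just for illustration:
    print(huffman_code({"e": 0.5, "t": 0.25, "q": 0.15, "Q": 0.10}))
    # -> rarer symbols get longer codes; no code is a prefix of another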

I'm not sure how the proposals came up with the probabilities, but I 
guess they were based on some data. Even though HTTP headers aren't 
English, I'd assume that 'e' has a significantly higher probability 
than 'Q' :-).

To make sure this works for UTF-8, the easiest way is to take bytes as 
input symbols. Some byte values can never appear in well-formed UTF-8 
(0xC0, 0xC1, and 0xF5-0xFF), so these can be excluded; the snippet 
below checks this by brute force. I'm not sure what the current 
proposals do, but I guess they also exclude many of the byte values in 
the C0 range (0x00-0x1F), or at least give them very low probabilities.
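
Here is that brute-force check (my own, not from any draft): encode 
every Unicode scalar value and see which byte values never show up.

    # Collect every byte value that appears when encoding all scalar values.
    valid = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:   # surrogates are not encodable
            continue
        valid.update(chr(cp).encode("utf-8"))
    print(sorted(hex(b) for b in set(range(256)) - valid))
    # -> ['0xc0', '0xc1', '0xf5', '0xf6', ..., '0xff']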

A somewhat more complicated, but more efficient, model would take the 
state of the UTF-8 'engine' into account and have separate Huffman 
encoders for the various states, i.e. first byte of a character, second 
byte of a two-byte character, second byte of a three-byte character, 
third byte of a three-byte character, and second/third/fourth byte of a 
four-byte character. Most if not all of the latter states can use the 
same Huffman encoder, and can most probably just assume a flat 
probability distribution and encode six raw bits (i.e. just eliminate 
the leading 0b10 in trailing UTF-8 bytes); see the sketch below.
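
A sketch of that state machine, assuming a lead_code table 
(hypothetical; it could be built with huffman_code above) for the 
'first byte' state, and six raw bits for every continuation byte:

    def encode_utf8_aware(data, lead_code):
        """Hypothetical state-aware encoder: Huffman for ASCII/lead
        bytes, six raw bits for continuation bytes."""
        out = []
        trailing = 0  # continuation bytes still expected
        for b in data:
            if trailing:
                assert 0x80 <= b <= 0xBF, "malformed UTF-8"
                out.append(format(b & 0x3F, "06b"))  # drop the 0b10 prefix
                trailing -= 1
            else:
                out.append(lead_code[b])  # ASCII or lead byte, Huffman-coded
                if b >= 0xF0:
                    trailing = 3          # 4-byte character
                elif b >= 0xE0:
                    trailing = 2          # 3-byte character
                elif b >= 0xC0:
                    trailing = 1          # 2-byte character
        return "".join(out)

The decoder tracks the same state, so it always knows whether the next 
six bits are raw or the start of a Huffman code.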

If we think that ASCII is far more probable than non-ASCII UTF-8, then 
the only thing that will happen to the ASCII codes is that one of the 
longest bit sequences, the one for the least probable ASCII byte, will 
get one bit longer (all the non-ASCII lead bytes together form a 
low-probability subtree, which Huffman's algorithm pairs up with the 
rarest ASCII symbol near the bottom of the tree). Of course, if we 
think that UTF-8 data is more probable, then there will be more 
changes, but those changes would be duly deserved.
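
To illustrate with the toy table from above (toy numbers again, so the 
effect is exaggerated): lumping all non-ASCII lead bytes into a single 
pseudo-symbol with 2% probability lengthens only the rarest code.

    ascii_probs = {"e": 0.5, "t": 0.25, "q": 0.15, "Q": 0.10}
    print(huffman_code(ascii_probs))
    # -> Q gets the longest code, 3 bits

    mixed = dict(ascii_probs, NONASCII=0.02)
    total = sum(mixed.values())
    mixed = {s: p / total for s, p in mixed.items()}  # renormalize to 1
    print(huffman_code(mixed))
    # -> Q and NONASCII are now 4 bits; e, t, and q are unchanged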

Regards,    Martin.
