Re: Delta Compression and UTF-8 Header Values

The code checked in for the delta encoder either already encodes binary
values, or can be set to do so easily. The c++ version did-- the "unused"
symbols ended up with 32-bit outputs.
-=R


On Sat, Feb 9, 2013 at 8:31 PM, "Martin J. Dürst" <duerst@it.aoyama.ac.jp>wrote:

> Hello James, others,
>
>
> On 2013/02/09 4:28, James M Snell wrote:
>
>> One key challenge with allowing UTF-8 values, however, is that it
>> conflicts with the use of the static huffman encoding in the proposed
>> Delta Encoding for header compression. If we allow for non-ascii
>> characters, the static huffman coding simply becomes too inefficient
>> and unmanageable to be useful. There are a few ways around it but none
>> of the strategies are all that attractive.
>>
>
> [If somebody has pointers to actual code, that would be appreciated. I
> can't work on it for the next two weeks, but after that, I should be able
> to use a day or two to see what's possible.]
>
> For a static Huffman encoding, you have to decide what symbols you accept
> as input, give every symbol a probability (these have to add up to 1) and
> then you get the 'optimal' "comma-free" encoding using the algorithm
> devised by Huffman. Optimal is under the assumptions that the probabilities
> are correct (and independent) and that you have to use an integral number
> of bits per symbol. Arithmetic coding gets rid of the second restriction,
> to get rid of the first, one creates a more complex model. Comma-free just
> means you don't have to guess where the bits for one symbol end and those
> for the next symbol start.
>
> I'm not sure how the proposals came up with the probabilities, but I guess
> that was based on some data. Even though HTTP headers aren't English, I'd
> assume that 'e' has significantly higher probability than 'Q' :-).
>
> To make sure this works for UTF-8, the easiest way is to take bytes as
> input symbols. Some byte values with the high bit set can't appear in
> UTF-8, so these can be excluded. I'm not sure what the current proposals
> do, but I guess they also exclude many of the byte values in the C0 range
> (0x00-0x1F), or at least give them very low probabilities.
>
> A somewhat more complicated, but more efficient model would take the state
> of the UTF-8 'engine' into account, and have separate Huffman encoders for
> the various states, i.e. first byte of a character, second byte of a
> two-byte character, second byte of a three-byte character, third byte of a
> three-byte character, second/third/fourth byte of a four-byte character.
> Most if not all of the later ones can use the same Huffman encoder, and
> most probably can just assume a flat probability distribution and encode
> six raw bits (i.e. just eliminate the leading 0b10 in trailing UTF-8 bytes).
>
> If we think that ASCII is way more probable than UTF-8, then the only
> thing that will happen to ASCII is that one of the longest bit sequences,
> the one for the least probable ASCII byte, will get one bit longer. Of
> course, if we think that UTF-8 data is more probable, then there will be
> more changes, but these changes would be duly deserved.
>
> Regards,    Martin.
>
>

Received on Sunday, 10 February 2013 05:19:21 UTC