Re: Delta Compression and UTF-8 Header Values

When I initially tested it locally with a range of extended characters it
blew up on me. I will put together another set of tests and track down the
problem.
On Feb 9, 2013 9:18 PM, "Roberto Peon" <grmocg@gmail.com> wrote:

> The code checked in for the delta encoder either already encodes binary
> values, or can be set to do so easily. The c++ version did-- the "unused"
> symbols ended up with 32-bit outputs.
> -=R
>
>
> On Sat, Feb 9, 2013 at 8:31 PM, "Martin J. Dürst" <duerst@it.aoyama.ac.jp>wrote:
>
>> Hello James, others,
>>
>>
>> On 2013/02/09 4:28, James M Snell wrote:
>>
>>> One key challenge with allowing UTF-8 values, however, is that it
>>> conflicts with the use of the static huffman encoding in the proposed
>>> Delta Encoding for header compression. If we allow for non-ascii
>>> characters, the static huffman coding simply becomes too inefficient
>>> and unmanageable to be useful. There are a few ways around it but none
>>> of the strategies are all that attractive.
>>>
>>
>> [If somebody has pointers to actual code, that would be appreciated. I
>> can't work on it for the next two weeks, but after that, I should be able
>> to use a day or two to see what's possible.]
>>
>> For a static Huffman encoding, you have to decide what symbols you accept
>> as input, give every symbol a probability (these have to add up to 1) and
>> then you get the 'optimal' "comma-free" encoding using the algorithm
>> devised by Huffman. Optimal is under the assumptions that the probabilities
>> are correct (and independent) and that you have to use an integral number
>> of bits per symbol. Arithmetic coding gets rid of the second restriction,
>> to get rid of the first, one creates a more complex model. Comma-free just
>> means you don't have to guess where the bits for one symbol end and those
>> for the next symbol start.
>>
>> I'm not sure how the proposals came up with the probabilities, but I
>> guess that was based on some data. Even though HTTP headers aren't English,
>> I'd assume that 'e' has significantly higher probability than 'Q' :-).
>>
>> To make sure this works for UTF-8, the easiest way is to take bytes as
>> input symbols. Some byte values with the high bit set can't appear in
>> UTF-8, so these can be excluded. I'm not sure what the current proposals
>> do, but I guess they also exclude many of the byte values in the C0 range
>> (0x00-0x1F), or at least give them very low probabilities.
>>
>> A somewhat more complicated, but more efficient model would take the
>> state of the UTF-8 'engine' into account, and have separate Huffman
>> encoders for the various states, i.e. first byte of a character, second
>> byte of a two-byte character, second byte of a three-byte character, third
>> byte of a three-byte character, second/third/fourth byte of a four-byte
>> character. Most if not all of the later ones can use the same Huffman
>> encoder, and most probably can just assume a flat probability distribution
>> and encode six raw bits (i.e. just eliminate the leading 0b10 in trailing
>> UTF-8 bytes).
>>
>> If we think that ASCII is way more probable than UTF-8, then the only
>> thing that will happen to ASCII is that one of the longest bit sequences,
>> the one for the least probable ASCII byte, will get one bit longer. Of
>> course, if we think that UTF-8 data is more probable, then there will be
>> more changes, but these changes would be duly deserved.
>>
>> Regards,    Martin.
>>
>>
>

Received on Sunday, 10 February 2013 06:51:33 UTC