Re: Delta Compression and UTF-8 Header Values

You'll want to check that the table is full for entries above 127, and that
the EOF terminal is 256 as opposed to 128. I've gone back and forth so many
times on that part... :)
-=R


On Sat, Feb 9, 2013 at 10:51 PM, James M Snell <jasnell@gmail.com> wrote:

> When I initially tested it locally with a range of extended characters it
> blew up on me. I will put together another set of tests and track down the
> problem.
> On Feb 9, 2013 9:18 PM, "Roberto Peon" <grmocg@gmail.com> wrote:
>
>> The code checked in for the delta encoder either already encodes binary
>> values, or can be set to do so easily. The c++ version did-- the "unused"
>> symbols ended up with 32-bit outputs.
>> -=R
>>
>>
>> On Sat, Feb 9, 2013 at 8:31 PM, "Martin J. Dürst" <duerst@it.aoyama.ac.jp
>> > wrote:
>>
>>> Hello James, others,
>>>
>>>
>>> On 2013/02/09 4:28, James M Snell wrote:
>>>
>>>> One key challenge with allowing UTF-8 values, however, is that it
>>>> conflicts with the use of the static huffman encoding in the proposed
>>>> Delta Encoding for header compression. If we allow for non-ascii
>>>> characters, the static huffman coding simply becomes too inefficient
>>>> and unmanageable to be useful. There are a few ways around it but none
>>>> of the strategies are all that attractive.
>>>>
>>>
>>> [If somebody has pointers to actual code, that would be appreciated. I
>>> can't work on it for the next two weeks, but after that, I should be able
>>> to use a day or two to see what's possible.]
>>>
>>> For a static Huffman encoding, you have to decide what symbols you
>>> accept as input, give every symbol a probability (these have to add up to
>>> 1) and then you get the 'optimal' "comma-free" encoding using the algorithm
>>> devised by Huffman. Optimal is under the assumptions that the probabilities
>>> are correct (and independent) and that you have to use an integral number
>>> of bits per symbol. Arithmetic coding gets rid of the second restriction,
>>> to get rid of the first, one creates a more complex model. Comma-free just
>>> means you don't have to guess where the bits for one symbol end and those
>>> for the next symbol start.
>>>
>>> I'm not sure how the proposals came up with the probabilities, but I
>>> guess that was based on some data. Even though HTTP headers aren't English,
>>> I'd assume that 'e' has significantly higher probability than 'Q' :-).
>>>
>>> To make sure this works for UTF-8, the easiest way is to take bytes as
>>> input symbols. Some byte values with the high bit set can't appear in
>>> UTF-8, so these can be excluded. I'm not sure what the current proposals
>>> do, but I guess they also exclude many of the byte values in the C0 range
>>> (0x00-0x1F), or at least give them very low probabilities.
>>>
>>> A somewhat more complicated, but more efficient model would take the
>>> state of the UTF-8 'engine' into account, and have separate Huffman
>>> encoders for the various states, i.e. first byte of a character, second
>>> byte of a two-byte character, second byte of a three-byte character, third
>>> byte of a three-byte character, second/third/fourth byte of a four-byte
>>> character. Most if not all of the later ones can use the same Huffman
>>> encoder, and most probably can just assume a flat probability distribution
>>> and encode six raw bits (i.e. just eliminate the leading 0b10 in trailing
>>> UTF-8 bytes).
>>>
>>> If we think that ASCII is way more probable than UTF-8, then the only
>>> thing that will happen to ASCII is that one of the longest bit sequences,
>>> the one for the least probable ASCII byte, will get one bit longer. Of
>>> course, if we think that UTF-8 data is more probable, then there will be
>>> more changes, but these changes would be duly deserved.
>>>
>>> Regards,    Martin.
>>>
>>>
>>

Received on Sunday, 10 February 2013 07:03:09 UTC