- From: Roberto Peon <grmocg@gmail.com>
- Date: Sat, 9 Feb 2013 21:18:53 -0800
- To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Cc: James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
- Message-ID: <CAP+FsNc0Owd9eRR91fNEiECb7Kxe=UhZMoKS-DphZ94xYB6KXw@mail.gmail.com>
The code checked in for the delta encoder either already encodes binary
values or can be set to do so easily. The C++ version did -- the "unused"
symbols ended up with 32-bit outputs.

-=R

On Sat, Feb 9, 2013 at 8:31 PM, "Martin J. Dürst" <duerst@it.aoyama.ac.jp> wrote:

> Hello James, others,
>
> On 2013/02/09 4:28, James M Snell wrote:
>
>> One key challenge with allowing UTF-8 values, however, is that it
>> conflicts with the use of the static Huffman encoding in the proposed
>> Delta Encoding for header compression. If we allow for non-ASCII
>> characters, the static Huffman coding simply becomes too inefficient
>> and unmanageable to be useful. There are a few ways around it, but
>> none of the strategies is all that attractive.
>
> [If somebody has pointers to actual code, that would be appreciated. I
> can't work on it for the next two weeks, but after that, I should be
> able to use a day or two to see what's possible.]
>
> For a static Huffman encoding, you have to decide what symbols you
> accept as input, give every symbol a probability (these have to add up
> to 1), and then you get the 'optimal' prefix-free encoding using the
> algorithm devised by Huffman. "Optimal" holds under the assumptions
> that the probabilities are correct (and independent) and that you have
> to use an integral number of bits per symbol. Arithmetic coding gets
> rid of the second restriction; to get rid of the first, one builds a
> more complex model. Prefix-free just means you don't have to guess
> where the bits for one symbol end and those for the next symbol start.
>
> I'm not sure how the proposals came up with the probabilities, but I
> guess that was based on some data. Even though HTTP headers aren't
> English, I'd assume that 'e' has a significantly higher probability
> than 'Q' :-).
>
> To make sure this works for UTF-8, the easiest way is to take bytes as
> input symbols. Some byte values with the high bit set can't appear in
> UTF-8, so these can be excluded. I'm not sure what the current
> proposals do, but I guess they also exclude many of the byte values in
> the C0 range (0x00-0x1F), or at least give them very low probabilities.
>
> A somewhat more complicated, but more efficient, model would take the
> state of the UTF-8 'engine' into account and have separate Huffman
> encoders for the various states, i.e. first byte of a character,
> second byte of a two-byte character, second byte of a three-byte
> character, third byte of a three-byte character, second/third/fourth
> byte of a four-byte character. Most if not all of the continuation-byte
> states can use the same Huffman encoder, and most probably can just
> assume a flat probability distribution and encode six raw bits (i.e.
> just eliminate the leading 0b10 in trailing UTF-8 bytes).
>
> If we think that ASCII is far more probable than UTF-8, then the only
> thing that will happen to ASCII is that one of the longest bit
> sequences, the one for the least probable ASCII byte, will get one bit
> longer. Of course, if we think that UTF-8 data is more probable, then
> there will be more changes, but those changes would be duly deserved.
>
> Regards, Martin.
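
As a concrete illustration of the construction Martin describes (fixed
symbol probabilities in, prefix-free code lengths out), here is a minimal
C++ sketch of Huffman's algorithm. It is not the draft's checked-in code;
the node pool and helper names are invented for this example, and the
probabilities it would be fed are whatever byte frequencies one measures.

```cpp
// Minimal sketch of static Huffman code construction over a fixed symbol
// set. Hypothetical helper, not the delta encoder's actual implementation.
#include <queue>
#include <utility>
#include <vector>

struct Node {
  double prob;
  int symbol;       // >= 0 for a leaf, -1 for an internal node
  int left, right;  // indices into the node pool, -1 for leaves
};

// Returns the code length in bits for each symbol; a canonical prefix-free
// code (what a static table would actually ship) can be built from these.
std::vector<int> HuffmanLengths(const std::vector<double>& probs) {
  std::vector<Node> pool;
  auto cmp = [&pool](int a, int b) { return pool[a].prob > pool[b].prob; };
  std::priority_queue<int, std::vector<int>, decltype(cmp)> heap(cmp);
  for (int s = 0; s < (int)probs.size(); ++s) {
    pool.push_back({probs[s], s, -1, -1});
    heap.push(s);
  }
  // Repeatedly merge the two least probable subtrees (Huffman's algorithm).
  while (heap.size() > 1) {
    int a = heap.top(); heap.pop();
    int b = heap.top(); heap.pop();
    pool.push_back({pool[a].prob + pool[b].prob, -1, a, b});
    heap.push((int)pool.size() - 1);
  }
  // Each leaf's depth in the finished tree is its code length.
  std::vector<int> lengths(probs.size(), 0);
  std::vector<std::pair<int, int>> stack{{heap.top(), 0}};
  while (!stack.empty()) {
    auto [n, depth] = stack.back();
    stack.pop_back();
    if (pool[n].symbol >= 0) {
      lengths[pool[n].symbol] = depth;
    } else {
      stack.push_back({pool[n].left, depth + 1});
      stack.push_back({pool[n].right, depth + 1});
    }
  }
  return lengths;
}
```

Fed the usable byte values with measured frequencies, this yields exactly
the per-symbol lengths a static table would ship; rare symbols simply end
up with long codes, which is the behavior discussed at the end of the mail.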
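
The byte-exclusion point can be made exact: per RFC 3629, thirteen byte
values can never occur anywhere in well-formed UTF-8, so a byte-symbol
table can drop them from the alphabet outright. A sketch (the function
name is ours, not from any proposal):

```cpp
// Bytes that can never appear in well-formed UTF-8 (RFC 3629): 0xC0/0xC1
// would start overlong two-byte sequences, and 0xF5-0xFF would encode
// code points above U+10FFFF. That leaves 243 usable byte symbols.
#include <cstdint>

bool CanAppearInUtf8(uint8_t b) {
  if (b == 0xC0 || b == 0xC1) return false;  // overlong lead bytes
  if (b >= 0xF5) return false;               // beyond U+10FFFF
  return true;
}
```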
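
The flat-distribution shortcut for trailing bytes is likewise mechanical:
a continuation byte is always 0b10xxxxxx, so instead of Huffman-coding it
one can emit the six payload bits raw and restore the fixed prefix on
decode. A self-contained sketch, with a toy BitBuffer standing in for
whatever bit-level I/O the real encoder uses:

```cpp
// Sketch of encoding UTF-8 continuation bytes as six raw bits, as suggested
// above. BitBuffer is a stand-in for real bit I/O, not any draft's API.
#include <cstdint>
#include <vector>

struct BitBuffer {
  std::vector<bool> bits;
  size_t pos = 0;
  void Write(uint32_t value, int n) {  // append n bits, MSB first
    for (int i = n - 1; i >= 0; --i) bits.push_back((value >> i) & 1);
  }
  uint32_t Read(int n) {
    uint32_t v = 0;
    for (int i = 0; i < n; ++i) v = (v << 1) | (bits[pos++] ? 1u : 0u);
    return v;
  }
};

// A continuation byte is 0b10xxxxxx: drop the constant prefix on encode,
// restore it on decode. Six bits per trailing byte instead of eight.
void EncodeContinuation(uint8_t b, BitBuffer& out) { out.Write(b & 0x3F, 6); }
uint8_t DecodeContinuation(BitBuffer& in) { return 0x80 | in.Read(6); }
```
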
Received on Sunday, 10 February 2013 05:19:21 UTC