- From: James M Snell <jasnell@gmail.com>
- Date: Sat, 9 Feb 2013 22:51:06 -0800
- To: Roberto Peon <grmocg@gmail.com>
- Cc: ietf-http-wg@w3.org
- Message-ID: <CABP7RbdBi5-FBnhumaOJ48ioKFSpoTYhaW6V3NdJHxGGJcfkUg@mail.gmail.com>
When I initially tested it locally with a range of extended characters, it
blew up on me. I will put together another set of tests and track down the
problem.

On Feb 9, 2013 9:18 PM, "Roberto Peon" <grmocg@gmail.com> wrote:

> The code checked in for the delta encoder either already encodes binary
> values, or can be set to do so easily. The C++ version did -- the "unused"
> symbols ended up with 32-bit outputs.
> -=R
>
> On Sat, Feb 9, 2013 at 8:31 PM, "Martin J. Dürst" <duerst@it.aoyama.ac.jp> wrote:
>
>> Hello James, others,
>>
>> On 2013/02/09 4:28, James M Snell wrote:
>>
>>> One key challenge with allowing UTF-8 values, however, is that it
>>> conflicts with the use of the static Huffman encoding in the proposed
>>> Delta Encoding for header compression. If we allow for non-ASCII
>>> characters, the static Huffman coding simply becomes too inefficient
>>> and unmanageable to be useful. There are a few ways around it, but none
>>> of the strategies are all that attractive.
>>
>> [If somebody has pointers to actual code, that would be appreciated. I
>> can't work on it for the next two weeks, but after that, I should be able
>> to use a day or two to see what's possible.]
>>
>> For a static Huffman encoding, you have to decide what symbols you
>> accept as input, give every symbol a probability (these have to add up
>> to 1), and then you get the 'optimal' "comma-free" encoding using the
>> algorithm devised by Huffman. Optimal is under the assumptions that the
>> probabilities are correct (and independent) and that you have to use an
>> integral number of bits per symbol. Arithmetic coding gets rid of the
>> second restriction; to get rid of the first, one creates a more complex
>> model. Comma-free just means you don't have to guess where the bits for
>> one symbol end and those for the next symbol start.
>>
>> I'm not sure how the proposals came up with the probabilities, but I
>> guess that was based on some data. Even though HTTP headers aren't
>> English, I'd assume that 'e' has a significantly higher probability
>> than 'Q' :-).
>>
>> To make sure this works for UTF-8, the easiest way is to take bytes as
>> input symbols. Some byte values with the high bit set can't appear in
>> UTF-8, so these can be excluded. I'm not sure what the current
>> proposals do, but I guess they also exclude many of the byte values in
>> the C0 range (0x00-0x1F), or at least give them very low probabilities.
>>
>> A somewhat more complicated, but more efficient, model would take the
>> state of the UTF-8 'engine' into account and have separate Huffman
>> encoders for the various states, i.e. first byte of a character, second
>> byte of a two-byte character, second byte of a three-byte character,
>> third byte of a three-byte character, second/third/fourth byte of a
>> four-byte character. Most if not all of the latter ones can use the
>> same Huffman encoder, and most probably can just assume a flat
>> probability distribution and encode six raw bits (i.e. just eliminate
>> the leading 0b10 in trailing UTF-8 bytes).
>>
>> If we think that ASCII is way more probable than UTF-8, then the only
>> thing that will happen to ASCII is that one of the longest bit
>> sequences, the one for the least probable ASCII byte, will get one bit
>> longer. Of course, if we think that UTF-8 data is more probable, then
>> there will be more changes, but these changes would be duly deserved.
>>
>> Regards, Martin.
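For readers following along, here is a minimal sketch of the static Huffman construction Martin describes: fixed symbol weights in, a prefix-free ("comma-free") bit code out. This is illustrative only, not the checked-in delta-encoder code; the name `build_huffman` and the byte-indexed weight vector are invented for the example, and a real encoder would typically derive canonical codes from the resulting lengths rather than keep bit strings.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Build a Huffman code over byte symbols: one bit string per symbol;
// symbols with zero weight get no code at all.
std::vector<std::string> build_huffman(const std::vector<uint64_t>& weights) {
    struct Node { uint64_t w; int sym; int left; int right; };  // sym < 0: internal
    std::vector<Node> nodes;
    using Item = std::pair<uint64_t, int>;  // (weight, node index), min-heap
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    for (int s = 0; s < (int)weights.size(); ++s) {
        if (weights[s] > 0) {
            heap.push({weights[s], (int)nodes.size()});
            nodes.push_back({weights[s], s, -1, -1});
        }
    }
    // Huffman's algorithm: repeatedly merge the two least probable nodes.
    while (heap.size() > 1) {
        auto [wa, a] = heap.top(); heap.pop();
        auto [wb, b] = heap.top(); heap.pop();
        heap.push({wa + wb, (int)nodes.size()});
        nodes.push_back({wa + wb, -1, a, b});
    }
    std::vector<std::string> codes(weights.size());
    if (nodes.empty()) return codes;
    // Walk the tree, appending "0"/"1" per branch; leaves get the codes.
    std::vector<std::pair<int, std::string>> stack{{(int)nodes.size() - 1, ""}};
    while (!stack.empty()) {
        auto [i, bits] = stack.back(); stack.pop_back();
        if (nodes[i].sym >= 0) {
            codes[nodes[i].sym] = bits.empty() ? "0" : bits;
        } else {
            stack.push_back({nodes[i].left, bits + "0"});
            stack.push_back({nodes[i].right, bits + "1"});
        }
    }
    return codes;
}
```

Because the merge always picks the two least probable remaining nodes, rare symbols end up deep in the tree with long codes, which is exactly why adding many low-probability non-ASCII symbols lengthens the worst ASCII code by at most a bit or so, as Martin notes at the end of his message.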
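The "byte values with the high bit set that can't appear in UTF-8" are pinned down by RFC 3629: 0xC0 and 0xC1 could only start overlong two-byte sequences, and 0xF5 through 0xFF would encode code points above U+10FFFF. A one-line check (the helper name is hypothetical) that could be used to drop those symbols, for instance by zeroing their weights before building the table:

```cpp
#include <cstdint>

// True if byte value b can occur somewhere in well-formed UTF-8.
// Per RFC 3629: 0xC0/0xC1 would only begin overlong encodings, and
// 0xF5-0xFF would encode code points beyond U+10FFFF.
bool possible_in_utf8(uint8_t b) {
    return b != 0xC0 && b != 0xC1 && b < 0xF5;
}
```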
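And the flat-distribution shortcut for trailing bytes, encoding six raw bits after stripping the leading 0b10, might look like the sketch below. The `BitWriter` here is an invented stand-in for whatever bit I/O the real coder uses, included only to keep the example self-contained.

```cpp
#include <cstdint>
#include <vector>

// A tiny MSB-first bit buffer (an assumed helper, not either proposal's API).
struct BitWriter {
    std::vector<uint8_t> bytes;
    int used = 0;  // bits already filled in the last byte
    void write_bits(uint32_t value, int n) {
        for (int i = n - 1; i >= 0; --i) {
            if (used == 0) bytes.push_back(0);
            bytes.back() |= ((value >> i) & 1u) << (7 - used);
            used = (used + 1) % 8;
        }
    }
};

// Every UTF-8 trailing byte matches 0b10xxxxxx, so instead of giving all
// 64 of them Huffman codes, emit the low six bits verbatim; the decoder
// re-attaches the fixed prefix with (0x80 | six_bits).
void encode_continuation(BitWriter& out, uint8_t b) {
    out.write_bits(b & 0x3F, 6);  // drop the constant 0b10 prefix
}
```

This only works because the per-state model Martin describes already knows, from the lead byte, that the next symbol must be a continuation byte, so the constant two-bit prefix carries no information and can be omitted entirely.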
Received on Sunday, 10 February 2013 06:51:33 UTC