- From: Nico Williams <nico@cryptonector.com>
- Date: Sun, 10 Feb 2013 02:17:04 -0600
- To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Cc: James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
- Message-ID: <CAK3OfOhFFHymH1x7t7bAnTEzE34PyWO1moOC5p3opC4qcHzA2Q@mail.gmail.com>
On Saturday, February 9, 2013, "Martin J. Dürst" wrote:

> Hello James, others,
>
> On 2013/02/09 4:28, James M Snell wrote:
>
>> One key challenge with allowing UTF-8 values, however, is that it
>> conflicts with the use of the static huffman encoding in the proposed
>> Delta Encoding for header compression. If we allow for non-ascii
>> characters, the static huffman coding simply becomes too inefficient
>> and unmanageable to be useful. There are a few ways around it but none
>> of the strategies are all that attractive.

Wait, what? If you have non-English (worse, non-European) text in some ASCII encoding like punycode, or base64-encoded UTF-8, then static Huffman will not be useful for compression anyway (assuming the Huffman coding is based on, say, English letter frequencies).

> [If somebody has pointers to actual code, that would be appreciated. I
> can't work on it for the next two weeks, but after that, I should be able
> to use a day or two to see what's possible.]
>
> For a static Huffman encoding, you have to decide what symbols you accept
> as input, give every symbol a probability (these have to add up to 1) and
> then you get the 'optimal' "comma-free" encoding using the algorithm
> devised by Huffman. Optimal is under the assumptions that the probabilities
> are correct (and independent) and that you have to use an integral number
> of bits per symbol. Arithmetic coding gets rid of the second restriction,
> to get rid of the first, one creates a more complex model. Comma-free just
> means you don't have to guess where the bits for one symbol end and those
> for the next symbol start.

Right. I hope I put it more succinctly above. The fact is that Huffman coding for all our scripts at once just isn't possible. Static Huffman coding is not a good reason to not want UTF-8 or any other Unicode encoding.

Nico
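
For anyone who wants to poke at this, here is a minimal Python sketch of the static Huffman construction described in the quoted text. It is not taken from the Delta Encoding proposal or from any real header-compression table: the four-symbol alphabet and its probabilities are made up for illustration, and the function names `huffman_code_lengths` and `expected_bits` are hypothetical. It only computes code lengths, which is enough to see what happens when a table tuned for one distribution is applied to input that does not follow it.

```python
import heapq
from itertools import count


def huffman_code_lengths(probs):
    """Return {symbol: code length in bits} for a Huffman code built from probs."""
    if len(probs) == 1:
        # A single symbol still needs one bit to be emitted.
        return {sym: 1 for sym in probs}
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), {sym: 0}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees; every symbol in them
        # moves one level deeper, i.e. gains one bit.
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in left.items()}
        merged.update({s: d + 1 for s, d in right.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]


def expected_bits(probs, lengths):
    """Average bits per symbol when input follows probs but is coded with lengths."""
    return sum(p * lengths[s] for s, p in probs.items())


# Illustrative (made-up) distribution the static table is tuned for.
tuned = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = huffman_code_lengths(tuned)

# Input that matches the model compresses below the 2-bit fixed-length cost.
matched = expected_bits(tuned, lengths)

# Input that does not match the model (here: uniform, standing in for
# base64- or punycode-like data) costs more than 2 bits per symbol.
uniform = {s: 0.25 for s in tuned}
mismatched = expected_bits(uniform, lengths)

print(lengths, matched, mismatched)
```

With these assumed numbers the table averages 1.75 bits per symbol on input that matches the model and 2.25 bits per symbol on uniform input, which is worse than a plain 2-bit fixed-length code. That is the effect mentioned above for base64-encoded or punycode text fed through a table built from English letter frequencies.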
Received on Sunday, 10 February 2013 08:17:29 UTC