
Re: Delta Compression and UTF-8 Header Values

From: Nico Williams <nico@cryptonector.com>
Date: Sun, 10 Feb 2013 02:17:04 -0600
Message-ID: <CAK3OfOhFFHymH1x7t7bAnTEzE34PyWO1moOC5p3opC4qcHzA2Q@mail.gmail.com>
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Cc: James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
On Saturday, February 9, 2013, "Martin J. Dürst" wrote:

> Hello James, others,
>
> On 2013/02/09 4:28, James M Snell wrote:
>
>> One key challenge with allowing UTF-8 values, however, is that it
>> conflicts with the use of the static huffman encoding in the proposed
>> Delta Encoding for header compression. If we allow for non-ascii
>> characters, the static huffman coding simply becomes too inefficient
>> and unmanageable to be useful. There are a few ways around it but none
>> of the strategies are all that attractive.
>
>
Wait, what?  If you have non-English (worse, non-European) text in some
ASCII encoding like punycode, or base64-encoded UTF-8, then static Huffman
will not be useful for compression anyway (assuming the Huffman code is
based on, say, English letter frequencies).
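
A rough way to check this claim is to build a Huffman code from English
letter counts and compare its cost on an English value with its cost on
base64-encoded UTF-8. The sketch below is only an illustration: the
training text, the printable-ASCII alphabet, the add-one smoothing and
both sample strings are assumptions made for this example, not the table
from the Delta Encoding proposal.

```python
import base64
import heapq


def huffman_code_lengths(weights):
    """Return {symbol: code length in bits} for a Huffman code over `weights`."""
    # Heap entries: (subtree weight, unique tie-breaker, {symbol: depth in subtree}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(weights.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical training text standing in for "English letter frequencies".
TRAINING_TEXT = (
    "the quick brown fox jumps over the lazy dog while the server "
    "sends plain english header values to the client "
) * 4

# Static table over printable ASCII; the +1 gives unseen symbols a (long) code.
ALPHABET = [chr(c) for c in range(32, 127)]
LENGTHS = huffman_code_lengths({c: 1 + TRAINING_TEXT.count(c) for c in ALPHABET})


def bits_per_char(text):
    """Average number of coded bits per input character under the static table."""
    return sum(LENGTHS[c] for c in text) / len(text)


english = "static huffman coding works well for english text"
japanese = "こんにちは、世界。これは日本語のヘッダー値です"
raw = japanese.encode("utf-8")
encoded = base64.b64encode(raw).decode("ascii")

print(f"English value:       {bits_per_char(english):.2f} bits/char")
print(f"base64(UTF-8) value: {bits_per_char(encoded):.2f} bits/char")
# Cost relative to the original bytes, which would take 8 bits each if sent raw.
total_bits = sum(LENGTHS[c] for c in encoded)
print(f"base64(UTF-8) value: {total_bits / len(raw):.2f} bits per UTF-8 byte")
```

With a table like this, the base64 text is dominated by digits and
uppercase letters that the English-trained code considers rare, so it
should cost noticeably more bits per character than the English value.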


> [If somebody has pointers to actual code, that would be appreciated. I
> can't work on it for the next two weeks, but after that, I should be able
> to use a day or two to see what's possible.]
>
> For a static Huffman encoding, you have to decide what symbols you accept
> as input, give every symbol a probability (these have to add up to 1) and
> then you get the 'optimal' "comma-free" encoding using the algorithm
> devised by Huffman. Optimal is under the assumptions that the probabilities
> are correct (and independent) and that you have to use an integral number
> of bits per symbol. Arithmetic coding gets rid of the second restriction;
> to get rid of the first, one creates a more complex model. Comma-free just
> means you don't have to guess where the bits for one symbol end and those
> for the next symbol start.
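
The construction described in the quoted paragraph can be sketched in a
few lines. This is a minimal illustration, not code from any proposal:
the four-symbol alphabet and its probabilities are made up, and the
function names are hypothetical. It shows the two properties mentioned
above: the average code length matches the entropy when the
probabilities are powers of 1/2, and the bit stream decodes without any
separators between symbols.

```python
import heapq
from math import log2


def huffman_codes(probabilities):
    """Map each symbol to its Huffman codeword (a string of '0' and '1')."""
    # Heap entries: (subtree probability, unique tie-breaker, {symbol: codeword}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(probabilities.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        p0, _, zeros = heapq.heappop(heap)
        p1, _, ones = heapq.heappop(heap)
        # The two least likely subtrees become siblings under a new node.
        merged = {s: "0" + c for s, c in zeros.items()}
        merged.update({s: "1" + c for s, c in ones.items()})
        heapq.heappush(heap, (p0 + p1, tie, merged))
        tie += 1
    return heap[0][2]


def encode(text, codes):
    """Concatenate codewords with nothing in between."""
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    """Read bits until they match a codeword; no codeword is a prefix of another."""
    reverse = {c: s for s, c in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    return "".join(out)


# Illustrative probabilities (assumed for this example); they add up to 1.
PROBS = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
CODES = huffman_codes(PROBS)

average_bits = sum(p * len(CODES[s]) for s, p in PROBS.items())
entropy = -sum(p * log2(p) for p in PROBS.values())

message = "abacabad"
bits = encode(message, CODES)
print(CODES)
print(f"average {average_bits} bits/symbol, entropy {entropy} bits/symbol")
print(bits, "->", decode(bits, CODES))
```

With probabilities that are not powers of 1/2, the average length would
exceed the entropy by up to one bit per symbol, which is the integral
number of bits restriction that arithmetic coding removes.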


Right.  I hope I put it more succinctly above.

The fact is that a single static Huffman code that works well for all our
scripts at once just isn't possible.  Static Huffman coding is not a good
reason to reject UTF-8 or any other Unicode encoding.

Nico
--
Received on Sunday, 10 February 2013 08:17:29 GMT
