Re: Optimizing Huffman inputs

On 13 April 2014 02:44, Amos Jeffries <squid3@treenet.co.nz> wrote:
> Therefore selectively URL-encoding characters whose huffman
> representation is over 17 bits in length will result in a compression
> savings.
> Question is whether a decoder can be written with a special case to also
> perform url-decode when locating the % symbol and still be fast overall ?


That would work, but not as well as what I did, which (as Ilari
calculates) is close to ideal, losing only the fractional
bit-per-symbol that a more efficient coding could recover.  Most of
the octets with 17-bit codes are illegal in those strings anyway.
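
Roughly, the per-octet decision reduces to a length comparison: an
octet is worth %-encoding when its own code is longer than '%' plus
two hex digits.  Something like this Python sketch, where the
code-length table is whatever table you want to test (the lengths
below are placeholders, not the actual draft table):

    HEX = "0123456789ABCDEF"

    def percent_cost(octet, bits):
        # cost of emitting '%' plus two uppercase hex digits
        hi, lo = HEX[octet >> 4], HEX[octet & 0xF]
        return bits['%'] + bits[hi] + bits[lo]

    def should_percent_encode(octet, bits):
        # 'bits' maps each character (octet as a code point) to its
        # assumed Huffman code length
        return bits[chr(octet)] > percent_cost(octet, bits)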

(You can run the experiment yourself by adding one of the high-valued
octets to the set of valid "characters".  I'm using Unicode code
points for octet values so that they fit in strings.  You'll see that
the encoder avoids using the octet as much as possible, in favour of
'0', '1', and other values with shorter encodings.)
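
For example, with made-up lengths where '%' and the hex digits get 6
bits and a high-valued octet gets 22 bits, the check above prefers the
%-encoded form:

    toy = {c: 6 for c in "%0123456789ABCDEF"}  # illustrative lengths only
    toy[chr(0xFA)] = 22                        # long code for the raw octet
    print(should_percent_encode(0xFA, toy))    # True: 6 + 6 + 6 = 18 < 22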

> Secondary question; What if the Huffman symbols were further optimized
> to reduce the bit length of the URL-encoding characters, and in
> particular the %.

See above.  Yes, it would make %-encoding more worthwhile, but it
would also penalize existing strings.  Existing strings outnumber the
ones that would benefit, so the net effect would be negative.
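
Put differently, the net effect is the frequency-weighted average code
length: a prefix code can't shorten '%' and the hex digits without
lengthening other codes (the lengths still have to satisfy the Kraft
inequality), and those other characters dominate real header values.
A toy way to score a candidate length table against a corpus of header
strings, again just a sketch with made-up inputs:

    from collections import Counter

    def mean_bits_per_octet(corpus, bits):
        # average assumed code length over all octets in the corpus
        counts = Counter("".join(corpus))
        total = sum(counts.values())
        return sum(n * bits[c] for c, n in counts.items()) / total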
