Re: Delta Compression and UTF-8 Header Values

The header names are almost completely handled by the pre-seeded
dictionary, so they barely affect the character frequency counts, and
thus barely affect the huffman encoding.

Arithmetic coding gets better compression ratios, at the expense of gobs of
CPU and complexity. I don't think that is a good tradeoff :/

So far we're proposing to encode with the static huffman table and, if the
result is larger than the original text, just send the original text. Of
course, one could skip the huffman-encoding step given a good idea
that this would be the case, but hopefully we get close enough that the
static huffman is still of benefit. The way of doing this selection is
exactly what you propose: use up a bit to indicate that the encoding isn't
done with huffman. There are a couple of obvious ways of doing this:
1) Use a flag in the opcode byte. The main advantage of doing this is that
it saves bits elsewhere, but there is a disadvantage: if you end up wanting
to encode strings in two different ways, you must emit two different
opcodes of the same type, and each opcode ends up consuming two bytes (one
for opcode+flags, one for the number of operations of that type).
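
To make the selection concrete, here is a rough sketch in Python. It is not
the draft's actual encoder: the function names, the sample corpus, and the
way the table is built are all made up for illustration, and the overhead
function assumes one opcode per distinct encoding, as in option 1 above.

```python
import heapq
from collections import Counter

def build_code_lengths(sample):
    """Huffman code lengths in bits for all 256 byte values.

    Stand-in for the static table: built from a sample corpus, with every
    byte value given a starting count of 1 so that each one gets a code.
    """
    freq = Counter(range(256))
    freq.update(sample)
    # Heap entries are (weight, tiebreak, symbols). The tiebreak is the
    # smallest symbol in the group, which is unique across groups.
    heap = [(w, sym, [sym]) for sym, w in freq.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(range(256), 0)
    while len(heap) > 1:
        w1, t1, s1 = heapq.heappop(heap)
        w2, t2, s2 = heapq.heappop(heap)
        # Every merge pushes the merged symbols one level deeper.
        for sym in s1 + s2:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, min(t1, t2), s1 + s2))
    return lengths

def huffman_size_bytes(text, lengths):
    """Size of the huffman-coded text, padded up to a whole byte."""
    bits = sum(lengths[b] for b in text)
    return (bits + 7) // 8

def choose_encoding(text, lengths):
    """Return (use_huffman, payload_size_in_bytes).

    Falls back to the original text when the huffman output is larger;
    the boolean is the one bit that has to be signalled to the decoder.
    """
    size = huffman_size_bytes(text, lengths)
    if size > len(text):
        return False, len(text)
    return True, size

def opcode_overhead(flags):
    """Opcode bytes needed when the flag lives in the opcode byte.

    Assumption: one opcode per distinct encoding in use, two bytes each
    (opcode+flags, then the count of operations).
    """
    return 2 * len(set(flags))

if __name__ == "__main__":
    table = build_code_lengths(b"accept-encoding: gzip, deflate" * 50)
    for value in (b"gzip, deflate", bytes(range(0x80, 0x90))):
        use_huffman, size = choose_encoding(value, table)
        print(len(value), "->", size, "huffman" if use_huffman else "raw")
```

The point of opcode_overhead is only that mixing huffman and raw strings
under one opcode type costs a second two-byte opcode, which is the
disadvantage described in option 1.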

-=R


On Mon, Feb 11, 2013 at 1:34 PM, James Cloos <cloos@jhcloos.com> wrote:

> >>>>> "JMS" == James M Snell <jasnell@gmail.com> writes:
>
> JMS> we'll be able to huffman code anything that is flagged
> JMS> as ASCII, and won't be able to touch the rest.
>
> Would that really be an issue?  The static huffman can only really be
> for the common strings, yes?  Which mostly means the header names and
> not the header values?  So even if the headers were limited to ascii
> the tables wouldn't help much for most of the values?
>
> (As an aside, Would arithmetic be of any better value than huffman, here?)
>
> Using one bit for each string to specify utf8-text blob vs binary blob,
> and using the former for everything known to be text, seems the best
> overall choice.  And if any non-ascii utf8 sequences become common
> enough, they can be added to future revisions of the static table just
> as easily as 7-bit strings can be.
>
> -JimC
> --
> James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6
>
>

Received on Monday, 11 February 2013 22:53:29 UTC