Re: First cut of Huffman encoding in compression document.

That is probably what would happen.
There is an issue (in github already) about inverting the prefix sense
(i.e. 1->0 and vice versa) so that padding is zeros, however the part one
must be very careful about is that this is not guaranteed to happen for any
entry.
-=R


On Thu, Oct 17, 2013 at 5:38 PM, Fred Akalin <akalin@google.com> wrote:

> Random proposal, not sure if it's feasible.
>
> Would it be possible to explicitly map EOS to '0000000', and therefore we
> can just say to pad with 0s until the next byte boundary?
>
>
> On Thu, Oct 17, 2013 at 5:29 PM, Roberto Peon <grmocg@gmail.com> wrote:
>
>> I know, I could instead state that the padding must be any code >= 7
>> bits, but then implementations have to choose which character is to be used
>> for this, and that might vary based on the frequency table. Bleh.
>>
>> By using EOS, we guarantee that when the table changes, then the symbol
>> one encodes as padding is always the same symbol.
>>
>> Do you want to reopen the issue so we discuss it at Vancouver, or do you
>> find that persuasive?
>>
>> -=R
>>
>>
>> On Thu, Oct 17, 2013 at 5:14 PM, Ilari Liusvaara <
>> ilari.liusvaara@elisanet.fi> wrote:
>>
>>> On Thu, Oct 17, 2013 at 02:32:19PM -0700, Roberto Peon wrote:
>>> > On Thu, Oct 17, 2013 at 2:15 PM, Fred Akalin <akalin@google.com>
>>> wrote:
>>> > >
>>> > > - I may be missing something, but I'm not sure why we need a string
>>> > > literal to be both length-delimited and have an end marker. I'd
>>> prefer just
>>> > > having the length and assigning the short encoding for EOS to
>>> something
>>> > > else.
>>> > >
>>> >
>>> > Since length is in bytes instead of bits, but huffman encoded things
>>> are
>>> > bit based, you need either to represent length in bits (which makes the
>>> > lengths bigger), or you need to ensure that the padding used to get to
>>> the
>>> > next byte boundary at the end of the string cannot be interpreted as a
>>> > valid huffman-code.
>>>
>>> You don't need EOS code for this.
>>>
>>> This is because there are more than 128 source symbols and target
>>> alphabet
>>> is binary, and thus there exists 8+ bit code. Prefix of that can act as
>>> padding, it can't be mixed up with valid symbol.
>>>
>>> -Ilari
>>>
>>
>>
>

Received on Friday, 18 October 2013 00:44:55 UTC