Re: [EXI] String Encoding in case of a string table miss from Jaakko Kangasharju on 2008-03-25 (public-exi@w3.org from March 2008)

From: Jaakko Kangasharju <jkangash@hiit.fi>
Date: Tue, 25 Mar 2008 11:44:29 +0200
To: santhana@huawei.com
Cc: public-exi@w3.org
Message-ID: <tt3r6dz87eq.fsf@lugburz.hiit.fi>

santhanakrishnan <santhana@huawei.com> writes:

>        In case of a string table miss and value table miss we encode the
> string or value as 
>
> Length prefixed, length incremented by 1 string
>
> Length prefixed, length incremented by 2 string
>
> Can anybody explain the reason why the length of the string or value is
> incremented by 1 or 2.

The encoding needs to indicate in some manner whether there is a
string table hit or miss.  The string encoding always begins with a
non-negative integer that can be used to determine this.  Some values
of this integer are reserved purely as indicators of a hit or a miss,
and the remaining values directly encode the case that the partition
is optimized for.  In the case of the
local-name and value partitions, which you are asking about, the
optimization is for the frequent use of string literals, so the
reserved integer values are for string table hits.

In the local-name case there is only one partition, so only one
integer (0) is required to indicate a string table hit and the rest,
from 1 upwards, are used for string table misses.  As noted, these
other integers are already a part of the encoding of the string, that
is, they denote the encoded string's length.  Since all strings,
including the empty string of length 0, still need to be
representable, 1 must be subtracted from the encoded value to get the
length back into the range from 0 upwards.  Viewed from the encoder
side, this translates into having to add 1 to the string length when
encoding.

In the value case, there are two partitions, local and global, so two
integer values are needed for string table hits, 0 for local values
and 1 for global values.  Thus the available range to indicate a
string literal length is from 2 upwards, so the decoder has to
subtract 2 to get it into the range from 0 upwards, and the encoder
correspondingly adds 2 to the string length.
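
To make the arithmetic concrete, here is a minimal sketch of the
leading integer for the local-name and value cases.  The function and
constant names are my own for illustration and are not taken from the
specification; the only thing modelled is the shift of the length past
the reserved hit indicators, not the rest of the encoding.

```python
# Illustrative sketch only: names are hypothetical, not from the EXI spec.
# Number of leading-integer values reserved for string table hits.
LOCAL_NAME_RESERVED = 1   # 0 = hit
VALUE_RESERVED = 2        # 0 = local hit, 1 = global hit


def encode_miss_length(s, reserved):
    """Leading integer for a miss: the string length shifted past
    the values reserved for hits."""
    return len(s) + reserved


def decode_leading_int(n, reserved):
    """Classify a leading integer as a hit or a miss.

    Returns ("hit", indicator) for a reserved value, or
    ("miss", length) with the real string length recovered.
    """
    if n < reserved:
        return ("hit", n)
    return ("miss", n - reserved)
```

With this, the empty string in the local-name partition gets the
leading integer 1 rather than 0, so it cannot be confused with a hit,
and a three-character value gets 5.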

The same applies to the partitions optimized for frequent use of
compact identifiers, except there the reserved value 0 is used to
indicate a string table miss, that is, the 0 is followed by a normal
length-prefixed encoding of a string.  Again, values from 1 upwards
denote compact identifiers, and 1 must be subtracted to get them into
the range from 0 upwards.
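
The mirrored case can be sketched the same way.  Again the names are
hypothetical, and the partition is assumed to be a plain list whose
indices serve as the compact identifiers; how the literal after the 0
is written is left out.

```python
# Illustrative sketch only: names are hypothetical, not from the EXI spec.

def encode_in_id_partition(table, s):
    """Return (leading_int, literal) for a partition optimized for hits.

    On a hit the compact identifier is shifted up by 1 and no literal
    follows; on a miss the reserved 0 is followed by the string, which
    would then get a normal length-prefixed encoding.
    """
    if s in table:
        return (table.index(s) + 1, None)
    return (0, s)


def decode_id_leading_int(n):
    """Classify a leading integer: 0 is a miss, anything else is a
    hit whose compact identifier is n - 1."""
    if n == 0:
        return ("miss", None)
    return ("hit", n - 1)
```

Here the first entry in the partition, identifier 0, is written as 1,
so that 0 stays free to announce a literal string.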

Ultimately, the reason for having the partitions optimized for either
hits or misses is to achieve better compactness, since this technique
avoids the use of indicator values in the optimized-for case.

Hope this helps,

-- 
Jaakko Kangasharju, Helsinki University of Technology
Paperi soveltuu vain piirusteluun ja pyyhkimiseen

Received on Tuesday, 25 March 2008 09:45:25 UTC