Re: [EXI] String Encoding in case of a string table miss from Jaakko Kangasharju on 2008-03-25 (public-exi@w3.org from March 2008)

From: Jaakko Kangasharju <jkangash@hiit.fi>
Date: Tue, 25 Mar 2008 16:15:07 +0200
To: santhana@huawei.com
Cc: public-exi@w3.org
Message-ID: <tt3hceu99g4.fsf@lugburz.hiit.fi>

santhanakrishnan <santhana@huawei.com> writes:

> Thanks for your explanation.  I understood it well.
> When a partition is optimized for frequent use of compact identifiers
> and we "hit" the uri or prefix, we encode the compact identifier
> incremented by 1.  When we miss, we encode the length-prefixed string
> as such.
> When a partition is optimized for frequent use of string literals and
> we "miss" the localname or value, we encode the length of the string
> incremented by 1 or 2.  When we hit, we encode 0 or 1 followed by the
> compact identifier.
> But how exactly does this encoding help in optimizing for frequent
> use of compact identifiers (uri or prefix) or string literals
> (localname or value)?

It helps with compactness.  Say that we have the string "hello" and
the string table partition used for it has 10 entries in it.  Now, if
the partition is optimized for frequent use of compact identifiers,
the string will be encoded as follows:
hit:  compact identifier + 1 in 4 bits = 4 bits total
miss: 0 in 4 bits + length in 8 bits + 5 characters x 8 bits = 52 bits total

On the other hand, if the partition is optimized for frequent use of
string literals, the string will be encoded as follows:
hit:  0 (or 1) in 8 bits + compact identifier in 4 bits = 12 bits total
miss: length + 1 (or 2) in 8 bits + 5 characters x 8 bits = 48 bits total

(The 4-bit parts come from the partition having 10 entries, which
requires 4 bits for indexing into it.  The 8-bit parts, including
each character of the string, come from the unsigned integer encoding,
which represents an integer as a variable-length sequence of octets.
Here the integers are small enough to fit in one octet each, but a
sufficiently long string or non-ASCII characters would require more
octets.)
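
The arithmetic above can be reproduced with a small Python sketch.  It
is only an illustration of the counting, not an EXI encoder: the
function names are made up for this example, and it assumes that the
string length and each character are written with the unsigned integer
encoding (7 payload bits per octet) and that identifiers use the
smallest fixed width that can index the partition.

```python
def uint_bits(value):
    """Bits used by the unsigned integer encoding (7 payload bits per octet)."""
    octets = 1
    while value >= 128:
        value >>= 7
        octets += 1
    return 8 * octets


def index_bits(distinct_values):
    """Smallest fixed width, in bits, that can tell `distinct_values` values apart."""
    return (distinct_values - 1).bit_length()


def chars_bits(s):
    """Assumes each character is written as the unsigned integer of its code point."""
    return sum(uint_bits(ord(c)) for c in s)


def compact_id_optimized(s, entries):
    """(hit, miss) bit counts for a partition optimized for compact identifiers."""
    # Values 0..entries share one field: 0 marks a miss, id + 1 marks a hit.
    n = index_bits(entries + 1)
    return n, n + uint_bits(len(s)) + chars_bits(s)


def literal_optimized(s, entries, offset=1):
    """(hit, miss) bit counts for a partition optimized for string literals.

    offset is the 1 (or 2) added to the length on a miss.
    """
    hit = uint_bits(0) + index_bits(entries)       # marker octet + identifier
    miss = uint_bits(len(s) + offset) + chars_bits(s)
    return hit, miss


if __name__ == "__main__":
    print(compact_id_optimized("hello", 10))  # (4, 52)
    print(literal_optimized("hello", 10))     # (12, 48)
```

Changing the string or the number of entries shows how the counts move,
for example once the partition grows past 16 entries or the string
contains characters outside ASCII.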

As you can see, the case that each partition is optimized for is
encoded in fewer bits than the other case, and also in fewer bits than
it would take if every encoding started with just an indicator of hit
or miss (with a one-bit indicator, for instance, the hit above would
take 5 bits and the miss 49 bits).  The partition type for each kind
of string is selected according to which case is expected to be more
common in documents, so that the common case gets the smaller
encoding.
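
To see why the expected frequency matters, here is a rough Python
sketch that averages the bit counts for "hello" over an assumed hit
rate.  The one-bit-flag scheme is a hypothetical alternative used only
for comparison (1 flag bit, then either a 4-bit identifier or the
length and characters), and the names and hit rates are illustrative
assumptions, not measurements.

```python
# (hit bits, miss bits) for "hello" in a 10-entry partition.
COSTS = {
    "compact-id optimized": (4, 52),
    "literal optimized": (12, 48),
    # Hypothetical: 1 flag bit, then a 4-bit identifier on a hit,
    # or the 8-bit length plus 40 bits of characters on a miss.
    "one-bit flag (hypothetical)": (1 + 4, 1 + 8 + 40),
}


def expected_bits(hit_bits, miss_bits, hit_rate):
    """Average bits per string when a fraction `hit_rate` of lookups hit."""
    return hit_rate * hit_bits + (1.0 - hit_rate) * miss_bits


def cheapest(hit_rate):
    """Name of the scheme with the lowest expected cost at this hit rate."""
    return min(COSTS, key=lambda name: expected_bits(*COSTS[name], hit_rate))


if __name__ == "__main__":
    for rate in (0.1, 0.9):
        print(rate, cheapest(rate))
```

Under these assumptions the compact-identifier partition comes out
cheapest when hits dominate and the string-literal partition when
misses dominate, which is the reason for having two partition types.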

-- 
Jaakko Kangasharju, Helsinki University of Technology
Miksi valita pienempi paha?  Cthulhu presidentiksi!

Received on Tuesday, 25 March 2008 14:15:58 UTC