Profile-specific URI/string tables -- Modulo compaction

Hello,

Thanks for the work on EXI, it looks very good!  If I am correct, there
are two opportunities for compaction that have been missed.  I hope I am
not repeating earlier discussions, but by all means point me to them if
I've failed to find any.


PRELOADED STRING TABLES:
EXI makes good use of static knowledge such as Schemas.  It is smart
enough to preload a number of URIs and other strings.
The reflective nature of RDF triples stretches the requirements on EXI,
by placing information into URI fields rather than schema elements.  A
similar thing applies to applications that avoid using attributes and
instead set up reflective attributes holding string values that are
application-interpreted.
Imagine wanting to compress RDF, and use it within an application that
mostly uses a certain set of URIs.  This would ideally preload a
custom-defined set of URIs for that specific application profile.  A
similar thing would apply to string tables.  As far as I could tell,
this is not supported, although it might be practical.
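
To make the idea concrete, here is a rough Python sketch of a string
table that starts out with a profile's URIs already loaded.  The class
name, the hit/miss tokens and the example profile are all made up for
illustration; this is not EXI's actual string table encoding, only the
general shape of what I have in mind.

```python
# Hypothetical example profile: URIs the application is known to use.
EXAMPLE_PROFILE = [
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "http://xmlns.com/foaf/0.1/name",
]


class PreloadedTable:
    """Toy string table: profile entries get fixed IDs, new strings are appended."""

    def __init__(self, preloaded=()):
        self.strings = []   # ID -> string
        self.ids = {}       # string -> ID
        for s in preloaded:
            self.add(s)

    def add(self, s):
        """Register s if it is new and return its ID."""
        if s not in self.ids:
            self.ids[s] = len(self.strings)
            self.strings.append(s)
        return self.ids[s]

    def encode(self, s):
        """Return ("hit", id) for a known string, else learn it and return ("miss", s)."""
        if s in self.ids:
            return ("hit", self.ids[s])
        self.add(s)
        return ("miss", s)

    def decode(self, token):
        """Inverse of encode; learns missed strings in the same order as the encoder."""
        kind, payload = token
        if kind == "hit":
            return self.strings[payload]
        self.add(payload)
        return payload
```

With both ends constructed from the same profile, a profile URI is sent
as a small fixed identifier from its very first occurrence, and the
application can compare against that identifier directly.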

Pros:
* Better compaction.
* Apps do not need to match URIs/strings, or the dynamic identifiers
assigned to them in document occurrence order, to their fixed set of
used URIs.
* As a result, RDF/EXI profiles can be better equipped for embedded
applications.

Cons:
* More parameterisation of the conversion process (a URI table and/or
string table) similar to Schema preloading.
* Profiles would need to be described / standardised and perhaps recognised.


MODULO COMPACTION:
When a grammar could produce (say) 0, 1.0, 1.1 and 2, the number of bits
for the first term is 2.  This reserves unused space for value 3.  One
might instead consider multiplying the combined value of the terms
following it by 3, and adding the first value (0, 1 or 2).  Reversing
this action would require splitting the value with DIV and MOD
operations, which are often paired.
The number of following values could be capped to fit in 32 bits, or
any other practical boundary.  This means that reserved unused space
only occurs once per boundary, instead of once per term.
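
As a rough sketch of the arithmetic (Python, with function names I made
up for illustration; this is not how EXI currently encodes event codes),
each term is treated as one digit of a mixed-radix number:

```python
import math


def pack(values, radices):
    """Combine values[i] in range(radices[i]) into one integer, first term lowest."""
    packed = 0
    for value, radix in zip(reversed(values), reversed(radices)):
        if not 0 <= value < radix:
            raise ValueError("value out of range for its radix")
        packed = packed * radix + value
    return packed


def unpack(packed, radices):
    """Split a packed integer back into its terms with paired DIV/MOD steps."""
    values = []
    for radix in radices:
        packed, value = divmod(packed, radix)
        values.append(value)
    return values


def bits_needed(radices):
    """Bits required to hold any packed value for the given radices."""
    return (math.prod(radices) - 1).bit_length()
```

For instance, pack([2, 0, 1], [3, 3, 3]) gives 11 and unpack(11, [3, 3, 3])
returns [2, 0, 1].  In the favourable case of twenty 3-way choices,
bits_needed([3] * 20) is 32, where 2 bits per term would take 40.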

Pros:
* Better compaction, ballpark figure 10% to 20% improvement?

Cons:
* Tedious to implement on 8-bit platforms.  Some 32-bit platforms might
only support DIVMOD on 16-bit fragments?
* Even worse readability of binary code.


Again, if these things were discussed and I failed to find them, then I
apologise.  I am only writing in the hope of improving your highly
interesting work!


Cheers,

Rick van Rein
ARPA2.net

Received on Tuesday, 23 June 2015 15:18:10 UTC