RE: Profile-specific URI/string tables -- Modulo compaction from Takuki Kamiya on 2015-07-23 (public-exi-comments@w3.org from July 2015)

From: Takuki Kamiya <tkamiya@us.fujitsu.com>
Date: Thu, 23 Jul 2015 14:44:58 -0700
To: Rick van Rein <rick@openfortress.nl>, "public-exi-comments@w3.org" <public-exi-comments@w3.org>
Message-ID: <23204FACB677D84EBD57175AB7B5A71C02A7FDBC261A@FMSAMAIL.fmsa.local>
Hi Rick,

Thank you for your comments.

With regards to the opportunity "PRELOADED STRING TABLES" we do see the 
rational of having shared string values between sender and receiver. This is 
something we plan to re-consider for the next version of EXI.

One simple trick that may be useful already now is to make use of enumerated 
values.

Note: this is not what you were asking for but may be useful also.

Let's suppose you have to represent the brand name of a car.
One way to do so is to type the brand name as xs:string.

Another way could be to type it as enumerated value.

    <xs:element name="brandName">
        <xs:simpleType>
            <xs:restriction base="xs:string">
                <xs:enumeration value="Audi"/>
                <xs:enumeration value="Golf"/>
                <xs:enumeration value="BMW"/>
            </xs:restriction>
        </xs:simpleType>
    </xs:element>

In the case the actual brand is part of the enumerated values, the value is 
encoded as n-bit Unsigned Integer [1]. If this is not successful one can 
still use schema-invalid productions.

As for MODULO COMPACTION, Using the following event codes as an example, 
I calculated the numbers in a manner you suggested.

  0      0                 =  0
  1.0    1 + 3 * 0         =  1
  1.1    1 + 3 * 1         =  4
  2.0    2 + 3 * 0         =  2
  2.1    2 + 3 * 1         =  5
  2.2    2 + 3 * 2         =  8
  2.3    2 + 3 * 3         = 11
  2.4.0  2 + 3 * 4 + 8 * 0 = 14
  2.4.1  2 + 3 * 4 + 8 * 1 = 22
  2.4.2  2 + 3 * 4 + 8 * 2 = 30
  2.5    2 + 3 * 5         = 17
  2.6    2 + 3 * 6         = 20
  2.7.0  2 + 3 * 7 + 8 * 0 = 23
  2.7.1  2 + 3 * 7 + 8 * 1 = 31

In order to differentiate values ranging from 0 to 31, we need 5 bits.
Then it means it even requires 5 bits for representing the most
likely occurrence of event code 0 in 5 bits instead of 2 bits. Similarly, 
1.0 takes 5 bits instead of 3 bits.

Besides this observation, there is expected some slightly extra computation 
involved to do mod and div operation, which is an additional cost.

Hope this helps,

Takuki Kamiya

[1] http://www.w3.org/TR/exi/#encodingEnumerations


-----Original Message-----
From: Rick van Rein [mailto:rick@openfortress.nl] 
Sent: Tuesday, June 23, 2015 8:18 AM
To: public-exi-comments@w3.org
Subject: Profile-specific URI/string tables -- Modulo compaction

Hello,

Thanks for the work on EXI, it looks very good!  If I am correct there
are two opportunities for compaction that have been missed.  I hope I am
not repeating earlier discussions, but by all means refer me if I've
failed at finding any.


PRELOADED STRING TABLES:
EXI makes good use of static knowledge such as Schemas.  It is smart
enough to preload a number of URIs and other strings.
The reflective nature of RDF triples stretches the requirements to EXI,
by placing information into URI fields, rather than schema elements.  A
similar thing applies to applications that avoid using attributes but
instead setup reflective attributes holding string values that are
application-interpreted.
Imagine wanting to compress RDF, and use it within an application that
mostly uses a certain set of URIs.  This would ideally preload a
custom-defined set of URIs for that specific application profile.  A
similar thing would apply to string tables.  As far as I could tell,
this is not supported, although it might be practical.

Pros:
* Better compaction.
* Apps do not need to match the URI/string and/or assign dynamic
identifiers to them based on document occurrence order to their fixed
set of used URIs.
* As a result, RDF/EXI profiles can be better equiped for embedded
applications.

Cons:
* More parameterisation of the conversion process (a URI table and/or
string table) similar to Schema preloading.
* Profiles would need to be described / standardised and perhaps recognised.


MODULO COMPACTION:
When a grammar could produce (say) 0, 1.0, 1.1 and 2 the number of bits
for the first term is 2.  This reserves unused space for value 3.  One
might consider multiplying the values following it by 3, and adding the
first value (0, 1 or 2).  Reversing this action would require splitting
the value with DIV and MOD operations, which are often paired.
The amount of following values could be capped off to fit in 32 bits, or
any other practical boundary.  This means that reserved unused space
only occurs once per boundary, instead of multiple times.

Pros:
* Better compaction, ballpark figure 10% to 20% improvement?

Cons:
* Tedious to implement on 8-bit platforms.  Some 32-bit platforms might
only support DIVMOD into 16 bit fragments?
* Even worse readability of binary code.


Again, if these things were discussed and I failed finding them, then I
apologise.  I am only writing in the hope to improve your highly
interesting work!


Cheers,

Rick van Rein
ARPA2.net
Received on Thursday, 23 July 2015 21:45:54 UTC