Re: RDF Thrift : A binary format for RDF data from Andy Seaborne on 2014-08-16 (semantic-web@w3.org from August 2014)

From: Andy Seaborne <andy@apache.org>
Date: Sat, 16 Aug 2014 13:16:18 +0100
To: semantic-web@w3.org
Message-ID: <53EF4B92.7020704@apache.org>
On 16/08/14 07:28, Eric Prud'hommeaux wrote:

> That looks pretty cool. Any idea how this compares with HDT
> <http://www.w3.org/Submission/2011/SUBM-HDT-20110330/> or Sesame's
> binary formats for RDF and SPARQL results?

RDF Thrift is a syntax; for RDF, it's sort of N-Quads, in binary, with 
prefixes.  For SPARQL results, it's a CSV/TSV like table encoding of 
terms, in binary, with prefixes.

HDT solves a different problem - it's a compact distribution mechanism. 
It includes custom compression built-in and includes a data access 
mechanism so it is more like a database.

Writing HDT does a lot of work to facilitate the compression.  Creating
the dictionary requires seeing the whole of the data, as does allocating
ids: you need to know whether a term is used only as a subject or as
both subject and object, so you can't just allocate a number on first
encounter.  With a goal of being read many times, upfront work to
achieve high compression is a reasonable trade-off.

RDF Thrift is a streaming syntax, which is very important when working
at scale (and even moderate scale).  This is true for both RDF
graphs/datasets and also SPARQL result sets.

RDF Thrift compresses well with gzip if you want to store it - you get 
x8-x10 just like N-Triples, and a bit less if you use prefixes (but the 
raw stream is smaller as well so the compressed size ends up similar).

In fact, if you're using it for point-to-point transfer, that is, a
one-time operation where client and server are near each other, gzip
is a bad choice because its compression stage is expensive.  Snappy, or
nothing, is better.

RDF Thrift makes that choice orthogonal and, for example, controllable 
via HTTP Accept-Encoding.
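
To illustrate keeping compression separate from the format, here is a
minimal Python sketch of a server picking the body encoding from the
Accept-Encoding header.  The function names are made up for this example,
only gzip is handled (Snappy is not in the Python standard library), and
the header parsing is deliberately simplified.

```python
import gzip

def accepts_gzip(accept_encoding: str) -> bool:
    """Return True if the header lists gzip with a non-zero q-value.
    Simplified parsing: wildcards and ranking by q-value are ignored."""
    for part in accept_encoding.split(","):
        fields = [f.strip().lower() for f in part.split(";")]
        if fields[0] != "gzip":
            continue
        q = 1.0
        for f in fields[1:]:
            if f.startswith("q="):
                q = float(f[2:])
        return q > 0
    return False

def encode_body(payload: bytes, accept_encoding: str):
    """Return (body, content_encoding); the payload bytes are opaque here."""
    if accepts_gzip(accept_encoding):
        return gzip.compress(payload), "gzip"
    return payload, "identity"
```

The payload is whatever the serializer produced; nothing in the encoding
step needs to know that it is RDF Thrift.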

One such one-time use is SPARQL result sets.  The standard text formats 
are streaming but reading text formats is slower than binary (you have 
to hunt around for end markers; in some languages that induces an extra 
copy as well; it's CPU cache unfriendly).

The Sesame binary format described is for RDF graphs with context and
presumably the SPARQL result form is similar.  It has a facility for
dictionary encoding, like RDF HDT, but the dictionary is inline, is not
required, and does not have to be at the start.  There are no prefix names;
prefix declarations are carried with the data.

As a wire encoding, it is more like Thrift TBinaryProtocol, not 
TCompactProtocol.

RDF Thrift prefix rules are more relaxed than Turtle (just concat the
two parts - no validity rules) so prefixes can be used for tokenizing URIs.
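
A minimal Python sketch of the concatenation rule, assuming prefixes are
held in a plain dict.  The function names and the example namespaces are
made up for illustration; this is not the Jena API.

```python
def expand(prefixes: dict, prefix: str, local: str) -> str:
    """Expand a prefixed name by plain concatenation; no validity checks."""
    return prefixes[prefix] + local

def tokenize(uri: str, prefixes: dict):
    """Split a URI on the longest declared namespace it starts with.
    Returns (prefix, local), or (None, uri) if no namespace matches."""
    best = None
    for name, namespace in prefixes.items():
        if uri.startswith(namespace):
            if best is None or len(namespace) > len(prefixes[best]):
                best = name
    if best is None:
        return None, uri
    return best, uri[len(prefixes[best]):]

prefixes = {"e": "http://example.org/", "ex": "http://example.org/data/"}
uri = "http://example.org/data/item/42#x"
prefix, local = tokenize(uri, prefixes)
```

Here the local part is "item/42#x", which a Turtle local name could not
carry without escaping, but plain concatenation round-trips it.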

Knowing when to place a term in the dictionary is a problem that does 
not have a single answer.  Dictionaries are state (as are prefixes).  If 
a system always adds a dictionary entry for every RDF term (there is no 
way to unset a dictionary entry), the receiver will end up with a 
dictionary the size of all distinct terms in the data.  Dictionaries 
only grow over the transfer.  That makes it possible for a large server 
to blow up a small client!
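
To make the state cost concrete, here is a small Python sketch of a
receiver for a hypothetical inline-dictionary stream.  The event shapes
("define" and "ref") are invented for this example and are not the Sesame
wire format.

```python
class InlineDictionaryReader:
    """Receiver side of a hypothetical inline-dictionary stream."""

    def __init__(self):
        self.terms = {}  # id -> term; there is no way to remove an entry

    def read(self, event):
        kind, term_id, term = event
        if kind == "define":
            # Sender introduces a term together with its id.
            self.terms[term_id] = term
            return term
        # kind == "ref": look up a term defined earlier in the stream.
        return self.terms[term_id]

reader = InlineDictionaryReader()
# A sender that defines every term it meets: 1000 distinct terms.
for i in range(1000):
    reader.read(("define", i, "http://example.org/term/%d" % i))
first = reader.read(("ref", 0, None))
```

After the loop the receiver holds one entry per distinct term, and that
number is decided entirely by the sender.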

Prefixes can also generate the same problem but typically there are only 
a small number, not comparable to the number of terms in the data.

RDF Thrift provides "REPEAT" as a specific dictionary-like term because
in SPARQL results, and RDF graphs to a large extent, the data is often
the "same as this column in the previous row/triple/quad".  This requires
keeping only one slot of state regardless of data size.
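
A rough Python sketch of the idea, with rows as tuples of terms.  The
REPEAT marker object and the encode/decode functions are stand-ins for
illustration, not the RDF Thrift encoding itself.

```python
REPEAT = object()  # stand-in marker for the REPEAT term

def encode(rows):
    """Replace a value with REPEAT when it equals the same column
    of the previous row.  Only the previous row is remembered."""
    prev = None
    for row in rows:
        row = tuple(row)
        if prev is None:
            yield row
        else:
            yield tuple(REPEAT if cur == old else cur
                        for cur, old in zip(row, prev))
        prev = row

def decode(encoded):
    """Inverse of encode: fill each REPEAT from the previous decoded row."""
    prev = None
    for row in encoded:
        if prev is not None:
            row = tuple(old if cur is REPEAT else cur
                        for cur, old in zip(row, prev))
        prev = row
        yield row

triples = [
    ("ex:s1", "rdf:type", "ex:Person"),
    ("ex:s1", "ex:name", '"Alice"'),
    ("ex:s2", "ex:name", '"Bob"'),
]
wire = list(encode(triples))
```

Both directions stream: each side holds one previous row, however many
rows pass through.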

The advantages of Apache Thrift are that there are lots of
implementations and it is heavily used and hence tuned.  Getting I/O
fast is more about managing buffering than it is about clever formats.
RDF Thrift is only using the encoding part of Thrift, not the service
model.

At the moment, there is no widespread deployment of the format so 
improvements now are easily done. Things that could be added to RDF 
Thrift are:

Inline number values : For number-rich data, directly including the
number as a variable length integer (e.g. Zigzag integers) or as a binary
64 bit floating point number saves both on the datatype URI and on string
space for the lexical form.  These would lose the exact representation, or
require it be done "long style" if it is to be kept: 0001 is the same
value as +1.
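
For reference, a small Python sketch of zigzag plus variable-length
encoding for 64 bit integers, as used by Thrift's compact protocol and
Protocol Buffers.  The function names are just for this example, and this
is not something RDF Thrift does today.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def zigzag_encode(n: int) -> int:
    """Map a signed 64 bit integer to unsigned: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> 63)) & MASK64

def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)

def varint(z: int) -> bytes:
    """Encode an unsigned integer 7 bits at a time, low bits first;
    the high bit of each byte says whether more bytes follow."""
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

# Different lexical forms, same value, same bytes on the wire.
a = varint(zigzag_encode(int("0001")))
b = varint(zigzag_encode(int("+1")))
```

Both lexical forms come out as the same single byte, which is the saving
and also the loss: the receiver cannot tell which form was written.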

Dictionaries themselves could be added but I'd like to see the facts and 
figures as to whether the extra work on the writing/sending side does 
not impact the raw point-to-point speed and whether the effect on 
whole-system robustness and state cost of dictionaries is acceptable.

And finally, Apache Thrift, the RDF Thrift design and the implementation
for Jena of RDF Thrift all use the Apache License, so it's business
friendly, with a license that covers both copyright and IP matters.

 Andy
Received on Saturday, 16 August 2014 12:16:48 UTC