- From: Andy Seaborne <andy@apache.org>
- Date: Sat, 16 Aug 2014 13:16:18 +0100
- To: semantic-web@w3.org
On 16/08/14 07:28, Eric Prud'hommeaux wrote:
> That looks pretty cool. Any idea how this compares with HDT
> <http://www.w3.org/Submission/2011/SUBM-HDT-20110330/> or Sesame's
> binary formats for RDF and SPARQL results?

RDF Thrift is a syntax; for RDF, it is essentially N-Quads, in binary, with prefixes. For SPARQL results, it is a CSV/TSV-like table encoding of terms, in binary, with prefixes.

HDT solves a different problem: it is a compact distribution mechanism. It includes custom compression built in, and it includes a data access mechanism, so it is more like a database. Writing HDT does a lot of work to facilitate the compression. Creating the dictionary requires seeing the whole of the data, as does allocating ids, because you need to know whether a term is used as subject or subject-object, so you can't just allocate a number on first encounter. With a goal of being read many times, upfront work to achieve high compression is a reasonable trade-off.

RDF Thrift is a streaming syntax, which is very important when working at scale (and even at moderate scale). This is true both for RDF graphs/datasets and for SPARQL result sets. RDF Thrift compresses well with gzip if you want to store it: you get x8-x10, just like N-Triples, and a bit less if you use prefixes (but the raw stream is smaller as well, so the compressed size ends up similar).

In fact, if you're using it for point-to-point transfer, that is, a one-time operation when client and server are near each other, gzip is a bad choice because its compression stage is expensive. Snappy, or nothing, is better. RDF Thrift makes that choice orthogonal and, for example, controllable via HTTP Accept-Encoding. One such one-time use is SPARQL result sets. The standard text formats are streaming, but reading text formats is slower than reading binary (you have to hunt around for end markers; in some languages that induces an extra copy as well; it is CPU-cache unfriendly).
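The end-marker point can be illustrated with a small sketch (Python purely for illustration; the function names and the 4-byte length prefix below are invented for the example, not RDF Thrift's actual wire layout, which uses Thrift's encodings):

```python
import struct

# A text format must scan for a delimiter to find where each term ends.
def read_text_term(buf, pos):
    end = buf.index(b' ', pos)        # hunt for the end marker
    return buf[pos:end], end + 1      # slicing often forces an extra copy

# A binary, length-prefixed format reads the length, then jumps straight
# to the end of the term: no scanning, no byte-by-byte inspection.
def read_binary_term(buf, pos):
    (n,) = struct.unpack_from('>I', buf, pos)  # 4-byte length prefix
    start = pos + 4
    return buf[start:start + n], start + n

text = b'<http://example/s> <http://example/p> "o" '
term, pos = read_text_term(text, 0)
assert term == b'<http://example/s>'

binary = struct.pack('>I', 18) + b'<http://example/s>'
term, pos = read_binary_term(binary, 0)
assert term == b'<http://example/s>' and pos == 22
```

The binary reader touches the length field and then does one bounded copy, whereas the text reader must inspect every byte looking for the delimiter.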
The Sesame binary format described is for RDF graphs with context, and presumably the SPARQL result form is similar. It has a facility for dictionary encoding, like HDT, but the dictionary is inline, is not required, and does not have to be at the start. There are no prefixed names; prefix declarations are carried with the data. As a wire encoding, it is more like Thrift's TBinaryProtocol than TCompactProtocol.

RDF Thrift's prefix rules are more relaxed than Turtle's (just concatenate the two parts; no validity rules), so prefixes can be used for tokenizing URIs.

Knowing when to place a term in the dictionary is a problem that does not have a single answer. Dictionaries are state (as are prefixes). If a system always adds a dictionary entry for every RDF term (there is no way to unset a dictionary entry), the receiver will end up with a dictionary the size of all distinct terms in the data. Dictionaries only grow over the transfer. That makes it possible for a large server to blow up a small client! Prefixes can generate the same problem, but typically there are only a small number of them, not comparable to the number of terms in the data.

RDF Thrift provides "REPEAT" as a specific dictionary-like term because in SPARQL results, and to a large extent in RDF graphs, the data is "the same as this column in the row/triple/quad last time". This requires keeping only one slot of state regardless of data size.

The advantages of Apache Thrift are that there are lots of implementations and it is heavily used, hence well tuned. Getting I/O fast is more about managing buffering than about clever formats. RDF Thrift uses only the encoding part of Thrift, not the service model. At the moment there is no widespread deployment of the format, so improvements are easily made now.

Things that could be added to RDF Thrift:

Inline number values: for number-rich data, directly including the number as a variable-length integer (e.g. zigzag integers) or a binary 64-bit floating point number saves both the datatype URI and the string space for the lexical form. These would lose the exact representation, or require it be kept "long style" if it matters: 0001 is the same value as +1.

Dictionaries themselves could be added, but I'd like to see the facts and figures as to whether the extra work on the writing/sending side impacts the raw point-to-point speed, and whether the effect on whole-system robustness and the state cost of dictionaries is acceptable.

And finally, Apache Thrift, the RDF Thrift design, and the implementation of RDF Thrift for Jena all use the Apache License, so it's business friendly, with a license that covers both copyright and IP matters.

Andy
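The zigzag integers mentioned above for inline number values are the standard scheme used by Thrift's TCompactProtocol (and Protocol Buffers): signed integers are remapped so that values of small magnitude, positive or negative, encode into few varint bytes. A minimal sketch of the idea (illustrative Python, not RDF Thrift implementation code):

```python
def zigzag_encode(n: int) -> int:
    # Maps 0,-1,1,-2,2,... to 0,1,2,3,4,... so that small magnitudes
    # stay numerically small; assumes a 64-bit signed input range.
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

def varint(z: int) -> bytes:
    # Base-128 varint: 7 payload bits per byte, high bit set = "more".
    out = bytearray()
    while True:
        b = z & 0x7F
        z >>= 7
        if z:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

assert zigzag_decode(zigzag_encode(-1)) == -1
assert len(varint(zigzag_encode(-1))) == 1   # one byte, not eight
assert len(varint(zigzag_encode(300))) == 2
```

Without the zigzag step, a small negative number such as -1 has its sign bit set and would occupy the maximum number of varint bytes; with it, the whole value fits in one byte.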
Received on Saturday, 16 August 2014 12:16:48 UTC