Re: RDF Thrift : A binary format for RDF data

Hi all,

I'm Javier, from the HDT team. From our own experience, there is an
increasing interest in efficient, binary RDF management. Dataset
backups, transfers between servers or processing nodes, RDF streaming,
and self-contained triplestores are just a few examples of real use
cases for which we receive feedback and requests.

Certainly, these scenarios have very different requirements, and the
choice of RDF serialization has to take into account parameters such
as compactness, processing speed and the retrieval operations
supported, to name but a few of the most important ones for these
cases.

In this sense, I am really glad to see more work on binary RDF, such
as the RDF Apache Thrift proposal, with its focus on simplicity and
write/parse speed.

I totally agree with Ruben regarding the differences with HDT (BTW,
thanks for all the references): HDT addresses the complementary
problem of providing a highly compressed, indexed binary format that
serves fast retrieval operations. While people are not massively
publishing HDT files yet, it is also true that HDT is gaining its
place as a self-contained compressed repository, used the way Ruben
describes: able to answer queries efficiently with a reduced memory
footprint. In addition to our C++ and Java libraries for managing HDT,
one can deploy HDT files within Jena/Jena Fuseki
(http://www.rdfhdt.org/manual-of-hdt-integration-with-jena/), making
use of all their well-known features.
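As an aside, the core idea behind HDT's compactness (splitting the
data into a dictionary of terms plus an ID-based triples component)
can be sketched in a few lines. The snippet below is only an
illustration of that principle in Python, not the actual HDT encoding;
the example terms are made up:

```python
# Illustrative sketch (not the real HDT binary encoding): replace
# repeated RDF terms with integer IDs, so each triple becomes three
# small integers and each term string is stored only once.

def dictionary_encode(triples):
    """Map every distinct term to an ID and rewrite triples as ID tuples."""
    term_to_id = {}
    encoded = []
    for s, p, o in triples:
        encoded.append(tuple(
            term_to_id.setdefault(term, len(term_to_id) + 1)
            for term in (s, p, o)
        ))
    return term_to_id, encoded

triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:bob",   "foaf:knows", "ex:alice"),
]
dictionary, ids = dictionary_encode(triples)
print(ids)  # [(1, 2, 3), (1, 4, 5), (3, 2, 1)]
```

Real HDT additionally sorts and compresses both components and builds
indexes over the ID triples, which is what enables the fast retrieval
operations mentioned above.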

Besides the very interesting Linked Data Fragments proposal, I
personally also see great potential for HDT as a self-contained engine
to retrieve RDF information on mobile devices. We will present a demo
on this at ISWC'14
(http://dataweb.infor.uva.es/wp-content/uploads/2014/08/iswc14.pdf).

Finally, I would like to point to the RDF Stream Processing Community
Group (RSP), in which we have started to look at efficient RDF
serializations, including binary ones
(https://www.w3.org/community/rsp/wiki/RSP_Serialization_Group). Any
feedback is also welcome!

All the best,

Javier D. Fernández
Postdoc at Sapienza - Università di Roma

On Mon, Aug 18, 2014 at 7:06 PM, Michel Dumontier
<michel.dumontier@gmail.com> wrote:
> On Mon, Aug 18, 2014 at 2:16 AM, Ruben Verborgh <ruben.verborgh@ugent.be> wrote:
>> Hi Andy,
>>
>>> How much is HDT used for real?
>>
>> We use it to enable client-side SPARQL query execution with 99.9% availability.
>> Here is an online demo: http://client.linkeddatafragments.org/.
>>
>> The HDT files are used to run the server at http://data.linkeddatafragments.org/.
>> Details on why HDT is a good format for this are here [1].
>>
>>> By whom?
>>
>> We (Ghent University – iMinds) use it to host high-availability queryable datasets.
>> The software that enables this is available as open source [2],
>> so anybody else can use it to do the same.
>>
>>> I couldn't find HDT files.
>>
>> For the same reason you won't find Virtuoso db files: we use it on the server.
> actually, you can! The Bio2RDF project makes their indexed Virtuoso
> dbs available.
>
> http://download.bio2rdf.org/release/3/
>
> we also provide gzipped nquads, and we'd be interested in providing an
> alternative binary, indexed format.
>
> m.
>
>> As you said, Thrift and HDT have different design goals.
>> Thrift files are meant to be “found”, HDT files not necessarily.
>>
>> BTW you can find HDT files here: http://www.rdfhdt.org/datasets/
>> And the tools to make them yourself: http://www.rdfhdt.org/download/
>>
>> Ruben
>>
>> PS I might be interested to look at a JavaScript/Node.js implementation of Thrift.
>> Are there any plans (or code) in that direction already? Pointers to start?
>>
>> [1] http://linkeddatafragments.org/publications/iswc2014.pdf
>> [2] https://github.com/LinkedDataFragments/
>



-- 
Javier D. Fernández García
jfergar83(at)gmail.com

Received on Tuesday, 19 August 2014 08:31:29 UTC