Re: Questions about RDF.rb and RDF::Sesame from Arto Bendiken on 2010-10-30 (public-rdf-ruby@w3.org from October 2010)

From: Arto Bendiken <arto@datagraph.org>
Date: Sat, 30 Oct 2010 15:58:00 +0200
To: Riccardo Giomi <giomi@netseven.it>
Cc: public-rdf-ruby@w3.org
Message-ID: <AANLkTin6u844fa6kLo3-J3uKY=f+QohyHwLjxmvxPrbE@mail.gmail.com>

Hi Riccardo,

On Sat, Oct 30, 2010 at 1:23 PM, Riccardo Giomi <giomi@netseven.it> wrote:
> Hi,
> I have been toying with the idea of transitioning from activeRDF to
> RDF.rb (using Sesame) in a project I'm working on with my company. I'm
> positively impressed by both code and community, so far. I'd have one
> question though:
>
> in rdf.rubyforge.org/sesame, under limitations, it says: "not yet
> optimized for RDF.rb 0.2.x's bulk-operation APIs". I could not find
> anything about such an API in the RDF.rb code, though. What does "bulk
> operations" mean? I was looking for optimized operations, mostly to
> write and delete big graphs, and considering how slow Sesame usually
> is.

It just means that the current RDF::Sesame::Repository [1]
implementation will insert statements into Sesame one at a time. This
is suboptimal for loading a large dataset into Sesame using RDF.rb,
hence the warning in the README.

The RDF::Sesame implementation as it is today works fine for querying
Sesame once you have already imported your dataset into Sesame by
other means. But loading, say, 100K triples into Sesame using
RDF::Sesame::Repository.load(file) would currently make 100K separate
requests to Sesame - not something you want to do unless you have a
coffee break coming up.

Now, this could be significantly improved by having RDF::Sesame
implement RDF.rb's bulk-operations API, which means having the
RDF::Sesame::Repository class override the
RDF::Repository#insert_statements method instead of just
#insert_statement (note the plural in the former).

The implementation of #insert_statements [2] should accept an
arbitrary-length RDF::Enumerable as its argument, and then iterate
through the given statements, buffering up a reasonable number of
statements before issuing a new Sesame request; for instance, it could
insert 5,000 statements at a time, which would mean that inserting
100K statements would take only 20 requests to Sesame instead of the
100K requests currently required.
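
To make the batching idea concrete, here is a rough, self-contained
sketch in plain Ruby. It does not use RDF.rb or talk to Sesame at all:
the BatchingRepository class, its send_batch method and the
request_count counter are hypothetical stand-ins invented for
illustration, and a real RDF::Sesame::Repository#insert_statements
would need to serialize each batch and send it to the server instead.

```ruby
# Hypothetical stand-in for RDF::Sesame::Repository; NOT the real class.
class BatchingRepository
  # Batch size suggested above; an assumption, tune as needed.
  BATCH_SIZE = 5_000

  # Number of (simulated) requests issued so far.
  attr_reader :request_count

  def initialize(batch_size: BATCH_SIZE)
    @batch_size    = batch_size
    @request_count = 0
  end

  # Accepts any Enumerable of statements and hands them on in
  # fixed-size batches rather than one statement at a time.
  def insert_statements(statements)
    statements.each_slice(@batch_size) do |batch|
      send_batch(batch)
    end
    self
  end

  private

  # Placeholder for a single request to Sesame carrying the whole
  # batch. Here it only counts how many requests would be made.
  def send_batch(batch)
    @request_count += 1
  end
end

repo = BatchingRepository.new
# Integers stand in for RDF statements in this sketch.
repo.insert_statements(1..100_000)
puts repo.request_count # => 20
```

Since each_slice pulls statements from the enumerable as it goes, only
one batch needs to be buffered in memory at any time.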

We're not ourselves actively using or developing RDF::Sesame much at
present, as we ended up developing our own custom RDF storage solution
instead. But if you'd like to improve RDF::Sesame on this front,
contributed features and bug fixes are certainly very welcome -
particularly so in the form of easy-to-merge GitHub pull requests.

Best regards,
Arto

[1] http://rdf.rubyforge.org/sesame/RDF/Sesame/Repository.html
[2] http://rdf.rubyforge.org/RDF/Writable.html#insert_statements-instance_method

PS. For RDF.rb-related work published as open source, I'd also be
happy to provide you with a discounted consulting rate in case you
need any RDF::Sesame particulars improved.

-- 
Arto Bendiken | @bendiken @datagraph

Received on Saturday, 30 October 2010 15:51:13 UTC