Re: SPARQL performance for ORDER BY on large datasets from Graham Klyne on 2009-09-06 (semantic-web@w3.org from September 2009)

From: Graham Klyne <GK-lists@ninebynine.org>
Date: Sun, 06 Sep 2009 11:40:46 +0100
To: Semantic Web <semantic-web@w3.org>
CC: "Seaborne, Andy" <andy.seaborne@hp.com>
Message-ID: <4AA391AE.9070701@ninebynine.org>

Seaborne, Andy wrote:
>> I would expect a more RDF-centric way where I can define indexes on
>> subsets of triples, e.g. grouped by properties, etc. Would this be
>> possible to implement for, let's say, Jena/TDB?
> 
> Yes - it's possible to implement.  It's something I have wanted to do for a while now.

This is something I've been thinking about recently.  I think it's do-able 
reasonably easily - i.e. with very modest enhancements to existing code.

<background>
By way of background, I've been using SPARQLite 
(http://code.google.com/p/sparqlite/ - which is based on Andy's ARQ and LARQ 
packages) to support a search and browse application over data from 4 diverse 
data sources.  The triple store is based on TDB and comes in at somewhere around 
10 million triples.  We have managed to avoid writing any new code specific to 
our project for the runtime system: this is important to us for sustainability 
reasons.

We are, in part, replicating functionality that is already provided for one of 
the data sources in isolation using a relational database, so the performance 
bottlenecks of using a triple store compared with RDB are quite starkly exposed.
Many queries work really well, but others, mainly involving some kind of
ordering, are not providing the performance we need.
</background>

Based on study of our running system, I am confident that a modest enhancement
to the LARQ component could deliver the performance our application needs.
Currently, as provided, LARQ supports a single index linked to the
'pf:textMatch' property in SPARQL queries.  All the machinery for linking
properties in SPARQL queries to an index is present in ARQ/LARQ, but it needs
new application code to exploit it.  What is not provided is a facility to
configure multiple such properties and link them to different Lucene indexes.
That is the development I'd like to use to boost our performance, and one that
I think is relatively easy to implement.
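
To make the idea concrete, here is a minimal sketch of "several query
properties, each linked to its own index".  It is an illustration only and not
LARQ's or Lucene's API: the TextIndex and IndexRegistry classes, the
example.org property URIs and the sample data are all hypothetical, and a toy
in-memory inverted index stands in for each Lucene index.

```python
from collections import defaultdict


class TextIndex:
    """Toy stand-in for one Lucene index: token -> set of subject URIs."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, subject, text):
        # Index each whitespace-separated token, case-insensitively.
        for token in text.lower().split():
            self._postings[token].add(subject)

    def search(self, term):
        # Return matching subjects in a stable (sorted) order.
        return sorted(self._postings.get(term.lower(), set()))


class IndexRegistry:
    """Maps a query property URI to its own index (hypothetical design)."""

    def __init__(self):
        self._indexes = {}

    def register(self, property_uri):
        # Create the index for this property on first use, then reuse it.
        return self._indexes.setdefault(property_uri, TextIndex())

    def match(self, property_uri, term):
        # Dispatch the lookup to the index configured for this property.
        if property_uri not in self._indexes:
            raise KeyError("no index registered for " + property_uri)
        return self._indexes[property_uri].search(term)


# Hypothetical property URIs, one per index.
TITLE_MATCH = "http://example.org/ns#titleMatch"
PLACE_MATCH = "http://example.org/ns#placeMatch"

registry = IndexRegistry()
registry.register(TITLE_MATCH).add("http://example.org/item/1",
                                   "Roman pottery fragment")
registry.register(PLACE_MATCH).add("http://example.org/item/2",
                                   "Roman villa at Chedworth")

title_hits = registry.match(TITLE_MATCH, "roman")
place_hits = registry.match(PLACE_MATCH, "roman")
```

The same search term gives different results depending on which property is
used, because each property is answered from its own index.  In a real
ARQ/LARQ setting the dispatch step would presumably be done by whatever
mechanism already ties 'pf:textMatch' to its single index.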

This approach doesn't automatically speed up arbitrary queries, but I think it 
will allow us to design queries that perform well to extract any required 
information for the end-user application.  I'd count that as a very big win for
modest effort.

SO: has anyone done anything like this that can be used out of the box?

(I did notice the message about Parliament, which seems to do something like 
this, but I'm concerned about the overhead and learning curve of switching to a 
different back-end.  And it's not immediately obvious to me if it supports 
free-text queries, which we use extensively.  A solution based on ARQ/LARQ would 
be favourite for me.)

#g

Received on Sunday, 6 September 2009 10:42:47 UTC