Fwd: SPARQL performance for ORDER BY on large datasets from Dan Brickley on 2009-08-26 (public-sparql-dev@w3.org from July to September 2009)

From: Dan Brickley <danbri@danbri.org>
Date: Wed, 26 Aug 2009 20:30:34 +0200
To: public-sparql-dev@w3.org
Message-ID: <eb19f3360908261130w236e4846vabfac1b396fea670@mail.gmail.com>

In case anyone missed this query on the semantic-web list...

Dan

---------- Forwarded message ----------
From: Niklas Lindström <lindstream@gmail.com>
Date: 2009/8/26
Subject: SPARQL performance for ORDER BY on large datasets
To: Semantic Web <semantic-web@w3.org>

Hi all!

I have a straightforward use case which seems really hard for triple
stores to handle using SPARQL (on huge datasets).

The case: select a set of resources (based on simple criteria such as
type) and order them based on a property value (limiting the results
to a decent batch size).

Data for test: the Library of Congress RDF dump of subject headings at
<http://id.loc.gov/authorities/search/> (~365 MB of RDF/XML, 3 703 621
statements).

The query:

   PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
   PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
   PREFIX dct: <http://purl.org/dc/terms/>

   SELECT * WHERE {
       ?resource a skos:Concept;
           dct:modified ?modified .
   }
   ORDER BY DESC(?modified)
   LIMIT 100
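
To spell out what the query asks for, here is a rough sketch in plain
Python. It is illustrative only: the tuple layout, the prefixed-name
strings and the function name latest_concepts are made up for this
example, and it says nothing about how Sesame, Virtuoso or AllegroGraph
actually evaluate the query. It also assumes the dct:modified values are
timestamps in one uniform format, so that comparing them as strings
gives chronological order.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical string forms of the terms used in the query.
RDF_TYPE = "rdf:type"
SKOS_CONCEPT = "skos:Concept"
DCT_MODIFIED = "dct:modified"

Triple = Tuple[str, str, str]


def latest_concepts(triples: Iterable[Triple], limit: int = 100) -> List[Tuple[str, str]]:
    """Return (resource, modified) pairs, newest first, at most `limit` of them."""
    triples = list(triples)
    # Subjects matching `?resource a skos:Concept`.
    concepts = {s for s, p, o in triples if p == RDF_TYPE and o == SKOS_CONCEPT}
    # Join with `?resource dct:modified ?modified`.
    rows = ((s, o) for s, p, o in triples if p == DCT_MODIFIED and s in concepts)
    # ORDER BY DESC(?modified) LIMIT n: keep only the n largest values
    # instead of sorting every matching row.
    return heapq.nlargest(limit, rows, key=lambda row: row[1])
```

Even in this form every matching row still has to be visited once; only
the full sort is avoided.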

I tried this with Sesame (native file store), Virtuoso (opensource
edition), and AllegroGraph (free edition), and got terrible results
(on a 2.8 GHz Intel Core 2 Duo). More minutes than worth counting;
skipping the type match gets it down to about 20 seconds *at best*.

Sure, it's understandable (AFAIK) that this specific case is much
easier to handle with an SQL DB (or something like CouchDB). But I was
surprised that it was *this* terrible. It seems unfeasible to build
e.g. chronological feeds from larger RDF stores using SPARQL with
current tools.

Is this -- ORDER BY performance -- a commonly known problem, and
considered an issue of importance (for academia and implementers
alike)?

Or am I missing something really obvious in the setup of these stores,
or in my query? I welcome *any* suggestions, such as "use triple store
X", "for X, make sure to configure indexing on Y". Or do RDF-using
service builders in general opt for indexing in something else
entirely in these cases?
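
To make that last option concrete, a minimal sketch of a side index
kept outside the triple store, in plain Python. The class name
ModifiedIndex and its methods are hypothetical, the index lives only in
memory, and timestamps are again assumed to sort correctly as strings;
a real service would more likely use a database or search index for
this.

```python
import bisect
from typing import List, Tuple


class ModifiedIndex:
    """Keeps (modified, resource) pairs sorted by modification timestamp."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, str]] = []

    def add(self, resource: str, modified: str) -> None:
        # insort places the pair at its sorted position on every insert.
        bisect.insort(self._entries, (modified, resource))

    def latest(self, limit: int = 100) -> List[Tuple[str, str]]:
        # The newest entries sit at the end of the sorted list.
        newest = self._entries[-limit:] if limit > 0 else []
        # Return them newest first, as (resource, modified) pairs.
        return [(resource, modified) for modified, resource in reversed(newest)]
```

Reading the newest batch is then a slice of the list rather than a sort
over all resources, at the cost of keeping the index in sync with the
store.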

(It seems queries like this are present in the Berlin SPARQL Benchmark
(e.g. #8), but I haven't analyzed this correlation, or what it might
mean, in depth.)

Best regards,
Niklas Lindström

Received on Wednesday, 26 August 2009 18:31:09 UTC