- From: Dan Brickley <danbri@danbri.org>
- Date: Wed, 26 Aug 2009 20:30:34 +0200
- To: public-sparql-dev@w3.org
In case anyone missed this query on the semantic-web list...

Dan

---------- Forwarded message ----------
From: Niklas Lindström <lindstream@gmail.com>
Date: 2009/8/26
Subject: SPARQL performance for ORDER BY on large datasets
To: Semantic Web <semantic-web@w3.org>

Hi all!

I have a straightforward use case which seems really hard for triple stores to perform for using SPARQL (on huge datasets).

The case: select a set of resources (based on simple criteria such as type) and order them based on a property value (limiting the results to a decent batch size).

Data for test: the Library of Congress RDF dump of subject headings at <http://id.loc.gov/authorities/search/> (~365 Mb of RDF/XML, 3 703 621 statements).

The query:

    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX dct: <http://purl.org/dc/terms/>

    SELECT * WHERE {
        ?resource a skos:Concept;
            dct:modified ?modified .
    }
    ORDER BY DESC(?modified)
    LIMIT 100

I tried this with Sesame (native file store), Virtuoso (opensource edition), and AllegroGraph (free edition), and got terrible results (on a 2.8 GHz Intel Core 2 Duo). More minutes than worth counting; skipping the type match gets it down to about 20 seconds *at best*.

Sure, it's understandable (AFAIK) that this specific case is much easier to use an SQL DB for (or something like CouchDB). But I was surprised that it was *this* terrible. It seems unfeasible to build e.g. chronological feeds from larger RDF stores using SPARQL with current tools.

Is this -- ORDER BY performance -- a commonly known problem, and considered an issue of importance (for academia and implementers alike)?

Or am I missing something really obvious in the setup of these stores, or in my query? I welcome *any* suggestions, such as "use triple store X", "for X, make sure to configure indexing on Y". Or do RDF-using service builders in general opt out to indexing in something else entirely in these cases?

(It seems queries like this are present in the Berlin SPARQL Benchmark (e.g. #8), but I haven't analyzed this correlation and possible meanings of it in depth.)

Best regards,
Niklas Lindström
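
[Editorial illustration, not part of the original message.] The operation the query asks for (filter by type, order by dct:modified descending, keep the first 100) can be sketched in plain Python as a top-k selection over in-memory triples. This is only a minimal sketch of the semantics, not how Sesame, Virtuoso, or AllegroGraph evaluate the query. The function name `latest_concepts` and the tuple layout are hypothetical, and it assumes the modification values are ISO 8601 strings in a single timezone, so that string order matches chronological order.

```python
import heapq

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS_CONCEPT = "http://www.w3.org/2004/02/skos/core#Concept"
DCT_MODIFIED = "http://purl.org/dc/terms/modified"


def latest_concepts(triples, limit=100):
    """Return up to `limit` (resource, modified) pairs, newest first.

    `triples` is an iterable of (subject, predicate, object) string tuples.
    Hypothetical helper for illustration only.
    """
    concepts = set()   # subjects typed as skos:Concept
    modified = []      # (subject, timestamp) pairs from dct:modified
    for s, p, o in triples:
        if p == RDF_TYPE and o == SKOS_CONCEPT:
            concepts.add(s)
        elif p == DCT_MODIFIED:
            modified.append((s, o))

    # Join the two patterns on the subject, putting the timestamp first
    # so that tuple comparison orders by modification time.
    candidates = ((m, s) for s, m in modified if s in concepts)

    # Top-k selection keeps a heap of size `limit` rather than sorting
    # every matching row.
    return [(s, m) for m, s in heapq.nlargest(limit, candidates)]
```

Even in this toy form, every dct:modified value has to be visited once when no index already ordered by that property exists, which gives a rough sense of where the cost of such a query can come from.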
Received on Wednesday, 26 August 2009 18:31:09 UTC