- From: Peter Ansell <ansell.peter@gmail.com>
- Date: Thu, 27 Aug 2009 07:16:04 +1000
- To: Niklas Lindström <lindstream@gmail.com>
- Cc: Semantic Web <semantic-web@w3.org>
2009/8/27 Niklas Lindström <lindstream@gmail.com>:
> Hi all!
>
> I have a straightforward use case which seems really hard for triple
> stores to perform for using SPARQL (on huge datasets).
>
> The case: select a set of resources (based on simple criteria such as
> type) and order them based on a property value (limiting the results
> to a decent batch size).
>
> Data for test: the Library of Congress RDF dump of subject headings at
> <http://id.loc.gov/authorities/search/> (~365 Mb of RDF/XML, 3 703 621
> statements).
>
> The query:
>
>    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
>    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
>    PREFIX dct: <http://purl.org/dc/terms/>
>
>    SELECT * WHERE {
>        ?resource a skos:Concept;
>            dct:modified ?modified .
>    }
>    ORDER BY DESC(?modified)
>    LIMIT 100
>
> I tried this with Sesame (native file store), Virtuoso (opensource
> edition), and AllegroGraph (free edition), and got terrible results
> (on a 2.8 GHz Intel Core 2 Duo). More minutes than worth counting;
> skipping the type match gets it down to about 20 seconds *at best*.
>
> Sure, it's understandable (AFAIK) that this specific case is much
> easier to use an SQL DB for (or something like CouchDB). But I was
> surprised that it was *this* terrible. It seems unfeasible to build
> e.g. chronological feeds from larger RDF stores using SPARQL with
> current tools.
>
>
> Is this -- ORDER BY performance -- a commonly known problem, and
> considered an issue of importance (for academia and implementers
> alike)?
>
> Or am I missing something really obvious in the setup of these stores,
> or in my query? I welcome *any* suggestions, such as "use triple store
> X", "for X, make sure to configure indexing on Y". Or do RDF-using
> service builders in general opt out to indexing in something else
> entirely in these cases?
>
>
> (It seems queries like this are present in the Berlin SPARQL Benchmark
> (e.g. #8), but I haven't analyzed this correlation and possible
> meanings of it in depth.)
>
> Best regards,
> Niklas Lindström
>

This is made worse by the fact that ORDER BY is technically required
whenever you use LIMIT/OFFSET (without it, the order of the results,
and so the contents of each page, is not guaranteed). In practice,
though, ordered queries are so slow that I can't include ORDER BY in
any of my LIMIT/OFFSET queries: they would never return results within
the 30-second window my application allows, because pretty much every
dataset I use has about that many triples or more. (Some are 3 or 4
orders of magnitude larger.)

Cheers,

Peter
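
A minimal sketch of one possible client-side workaround for the case discussed above, not something proposed in the thread: run the query without ORDER BY, stream the unordered bindings, and keep only the newest 100 in a bounded heap. It assumes the application can afford to read every matching binding, and that all dct:modified values are ISO 8601 strings in the same timezone form, so that string comparison matches chronological order. The function name `latest` and the example.org IRIs are hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple

# One query solution: (resource IRI, dct:modified value as an ISO 8601 string).
Binding = Tuple[str, str]


def latest(bindings: Iterable[Binding], limit: int = 100) -> List[Binding]:
    """Return the `limit` most recently modified bindings, newest first.

    heapq.nlargest consumes the iterable once and holds at most `limit`
    items at a time, so the full result set is never sorted or stored.
    Assumption: timestamps share one format/timezone, so comparing the
    strings gives chronological order.
    """
    return heapq.nlargest(limit, bindings, key=lambda b: b[1])


# Made-up sample data standing in for unordered SPARQL results.
rows = [
    ("http://example.org/a", "2009-08-01T10:00:00Z"),
    ("http://example.org/b", "2009-08-26T21:16:39Z"),
    ("http://example.org/c", "2009-07-15T08:30:00Z"),
]
top_two = latest(rows, limit=2)
```

Whether this beats a server-side ORDER BY depends on how quickly the store can return the unordered solutions, which the thread does not measure.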
Received on Wednesday, 26 August 2009 21:16:39 UTC