Re: SPARQL performance for ORDER BY on large datasets

2009/8/27 Niklas Lindström <lindstream@gmail.com>:
> Hi all!
>
> I have a straightforward use case which seems really hard for triple
> stores to handle using SPARQL (on huge datasets).
>
> The case: select a set of resources (based on simple criteria such as
> type) and order them based on a property value (limiting the results
> to a decent batch size).
>
> Test data: the Library of Congress RDF dump of subject headings at
> <http://id.loc.gov/authorities/search/> (~365 MB of RDF/XML, 3 703 621
> statements).
>
> The query:
>
>    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
>    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
>    PREFIX dct: <http://purl.org/dc/terms/>
>
>    SELECT * WHERE {
>        ?resource a skos:Concept;
>            dct:modified ?modified .
>    }
>    ORDER BY DESC(?modified)
>    LIMIT 100
>
> I tried this with Sesame (native file store), Virtuoso (opensource
> edition), and AllegroGraph (free edition), and got terrible results
> (on a 2.8 GHz Intel Core 2 Duo). More minutes than worth counting;
> skipping the type match gets it down to about 20 seconds *at best*.
>
> Sure, it's understandable (AFAIK) that this specific case is much
> easier to use an SQL DB for (or something like CouchDB). But I was
> surprised that it was *this* terrible. It seems unfeasible to build
> e.g. chronological feeds from larger RDF stores using SPARQL with
> current tools.
>
>
> Is this -- ORDER BY performance -- a commonly known problem, and
> considered an issue of importance (for academia and implementers
> alike)?
>
> Or am I missing something really obvious in the setup of these stores,
> or in my query? I welcome *any* suggestions, such as "use triple store
> X", "for X, make sure to configure indexing on Y". Or do builders of
> RDF-based services generally opt to index these cases in something
> else entirely?
>
>
> (It seems queries like this are present in the Berlin SPARQL Benchmark
> (e.g. #8), but I haven't analyzed this correlation and possible
> meanings of it in depth.)
>
> Best regards,
> Niklas Lindström
>
>

This is made worse by the fact that ORDER BY is effectively required
whenever you page with LIMIT/OFFSET, since without it the order of the
results is not guaranteed to be predictable. In practice, though,
ordered queries are so slow that I can't include ORDER BY in any of my
LIMIT/OFFSET queries: they would never return results within the
30-second window my application allows, because pretty much every
dataset I use has about that many triples or more. (Some are 3 or 4
orders of magnitude larger.)
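
As a rough illustration of why ORDER BY DESC(?modified) LIMIT 100 need
not imply sorting every match: only the top 100 rows have to be kept,
which a bounded heap can do in one pass. The sketch below is plain
Python, not a description of how Sesame, Virtuoso, or AllegroGraph
evaluate the query. The function name top_k_recent, the example.org
URIs, and the generated timestamps are all made up for illustration.

```python
import heapq
from datetime import datetime, timedelta


def top_k_recent(rows, k):
    """Return the k rows with the latest 'modified' value, newest first.

    rows is any iterable of (resource, modified) pairs, standing in for
    the solutions of the WHERE clause. heapq.nlargest keeps at most k
    candidates at a time, so the cost is roughly O(n log k) rather than
    the O(n log n) of sorting every solution.
    """
    return heapq.nlargest(k, rows, key=lambda row: row[1])


# Hypothetical stand-in data: 10,000 resources with distinct timestamps
# (7919 is coprime to 10000, so the minute offsets never repeat).
base = datetime(2009, 1, 1)
rows = [
    ("http://example.org/concept/%d" % i,
     base + timedelta(minutes=(i * 7919) % 10000))
    for i in range(10000)
]

# Equivalent in spirit to: ORDER BY DESC(?modified) LIMIT 100
page = top_k_recent(rows, 100)
```

This only helps with the first page. A large OFFSET still forces the
engine to rank offset + limit rows, and whether a given store applies
any such shortcut is something to check per product.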

Cheers,

Peter

Received on Wednesday, 26 August 2009 21:16:39 UTC