RE: SPARQL performance for ORDER BY on large datasets from Seaborne, Andy on 2009-08-26 (semantic-web@w3.org from August 2009)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Wed, 26 Aug 2009 22:11:13 +0000
To: Niklas Lindström <lindstream@gmail.com>, Semantic Web <semantic-web@w3.org>
Message-ID: <B6CF1054FDC8B845BF93A6645D19BEA3693CC70798@GVW1118EXC.americas.hpqcorp.net>

I tried your query and data with Jena/ARQ/TDB on a 32-bit machine with 1G of RAM devoted to the query task.
Even ice cold (i.e. from system reboot), the worst I can get the system to perform at is 99 seconds rather than minutes.  I would not expect Jena/TDB to be fundamentally faster than any of the systems you mention in this situation.

In a more normal setup, 28% of the time goes on sorting the dateTime values (20 seconds); the rest is pattern matching and reading data into the JVM.  Query times are around 55s from fairly cold (new JVM, some filesystem caching, which I guess accounts for the 99s->55s).  Second runs in the same JVM are faster still, but the sort time is the same.  There is no query results caching.  Without the ORDER BY, the query with LIMIT 100 takes 0.02s.

This is on Windows Vista / 32 bit Java on consumer-grade hardware, 7200rpm disk; not a portable, nor server-class hardware.  Java 1.6.0_15; TDB from development SVN.  What is your system setup?

I tried a 64-bit machine as well and it's faster in the pattern matching (27s, 47s overall), but the sort time only decreases to a little below 20s (18s), which is just because that machine has a faster CPU.

What is happening is that, to do the ORDER BY, it has to retrieve all 342684 possibilities, so the ORDER BY affects the pattern matching part as well as incurring the sort cost.
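
To illustrate why LIMIT alone is cheap while ORDER BY plus LIMIT is not, here is a minimal Python sketch. It is not how ARQ/TDB is implemented; `matches`, `limit_only` and `order_by_desc_limit` are hypothetical names, and the integer sort key is a stand-in for the dct:modified value. It only counts how many pattern-matching rows each plan has to pull before it can answer.

```python
from itertools import islice

TOTAL_MATCHES = 342684  # number of candidate rows quoted above


def matches(n, consumed):
    """Hypothetical stand-in for pattern matching: yields rows lazily."""
    for i in range(n):
        consumed[0] += 1  # count every row actually pulled from the store
        # Scrambled integer key, so rows do not arrive in sorted order.
        yield (f"resource{i}", (i * 7919) % n)


def limit_only(rows, k):
    """LIMIT k without ORDER BY: stop after the first k rows."""
    return list(islice(rows, k))


def order_by_desc_limit(rows, k):
    """ORDER BY DESC + LIMIT k: every row must be seen before any is returned."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:k]


pulled_limit = [0]
first_100 = limit_only(matches(TOTAL_MATCHES, pulled_limit), 100)

pulled_sorted = [0]
top_100 = order_by_desc_limit(matches(TOTAL_MATCHES, pulled_sorted), 100)

print(pulled_limit[0], pulled_sorted[0])
```

The first plan stops after 100 rows; the second cannot return anything until the whole candidate set has been read and sorted.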

Jena/TDB does not assume the dateTime formatting is legal and it checks the xsd:dateTime for correctness to make the sorting strict SPARQL; it also uses Java's sort routines.  A system that yielded results from pattern matching in an order that is dateTime sorted would be much faster (around the 0.02s) but it's a tradeoff of generality and data validity assumptions.
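
As a rough illustration of why the values are checked and compared as dateTimes rather than as strings, here is a small Python sketch. It is not Jena's code: `parse_xsd_datetime` and `order_by_desc` are hypothetical helpers, `datetime.fromisoformat` is only an approximation of xsd:dateTime validation, and rejecting values without a timezone is a simplification made for this example. The second literal below is invented to contrast with the one from the data.

```python
from datetime import datetime


def parse_xsd_datetime(lexical):
    """Return a timezone-aware datetime, or None if the lexical form is rejected.

    Approximation only: fromisoformat is not a full xsd:dateTime validator,
    and this sketch treats timezone-less values as invalid for simplicity.
    """
    try:
        value = datetime.fromisoformat(lexical)
    except ValueError:
        return None
    return value if value.tzinfo is not None else None


def order_by_desc(literals):
    """Sort valid literals by instant, newest first; return invalid ones separately."""
    parsed = [(parse_xsd_datetime(s), s) for s in literals]
    valid = [(v, s) for v, s in parsed if v is not None]
    invalid = [s for v, s in parsed if v is None]
    valid.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in valid], invalid


a = "2001-01-29T00:00:00-04:00"    # 04:00 UTC
b = "2001-01-29T01:00:00+01:00"    # 00:00 UTC, so earlier than a
bad = "2001-02-30T00:00:00-04:00"  # 30 February does not exist

by_value, rejected = order_by_desc([b, a, bad])
by_string = sorted([a, b], reverse=True)
print(by_value, rejected, by_string)
```

A plain string sort puts `b` first, whereas comparing the parsed instants puts `a` first. That gap is why a store cannot simply rely on lexical order unless it assumes all values are well-formed and share one timezone.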


Talking of data validity - in the data, all the timezones, regardless of time of year, are -04:00, which is strange.  The data was generated in June 2009.  Hmm.

e.g. "2001-01-29T00:00:00-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>

 Andy

> -----Original Message-----
> From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org] On
> Behalf Of Niklas Lindström
> Sent: 26 August 2009 19:16
> To: Semantic Web
> Subject: SPARQL performance for ORDER BY on large datasets
> 
> Hi all!
> 
> I have a straightforward use case which seems really hard for triple
> stores to perform for using SPARQL (on huge datasets).
> 
> The case: select a set of resources (based on simple criteria such as
> type) and order them based on a property value (limiting the results
> to a decent batch size).
> 
> Data for test: the Library of Congress RDF dump of subject headings at
> <http://id.loc.gov/authorities/search/> (~365 Mb of RDF/XML, 3 703 621
> statements).
> 
> The query:
> 
>     PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
>     PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
>     PREFIX dct: <http://purl.org/dc/terms/>
> 
>     SELECT * WHERE {
>         ?resource a skos:Concept;
>             dct:modified ?modified .
>     }
>     ORDER BY DESC(?modified)
>     LIMIT 100
> 
> I tried this with Sesame (native file store), Virtuoso (opensource
> edition), and AllegroGraph (free edition), and got terrible results
> (on a 2.8 GHz Intel Core 2 Duo). More minutes than worth counting;
> skipping the type match gets it down to about 20 seconds *at best*.
> 
> Sure, it's understandable (AFAIK) that this specific case is much
> easier to use an SQL DB for (or something like CouchDB). But I was
> surprised that it was *this* terrible. It seems unfeasible to build
> e.g. chronological feeds from larger RDF stores using SPARQL with
> current tools.
> 
> 
> Is this -- ORDER BY performance -- a commonly known problem, and
> considered an issue of importance (for academia and implementers
> alike)?
> 
> Or am I missing something really obvious in the setup of these stores,
> or in my query? I welcome *any* suggestions, such as "use triple store
> X", "for X, make sure to configure indexing on Y". Or do RDF-using
> service builders in general opt out to indexing in something else
> entirely in these cases?
> 
> 
> (It seems queries like this are present in the Berlin SPARQL Benchmark
> (e.g. #8), but I haven't analyzed this correlation and possible
> meanings of it in depth.)
> 
> Best regards,
> Niklas Lindström
Received on Wednesday, 26 August 2009 22:12:46 UTC