RE: SPARQL performance for ORDER BY on large datasets from Benjamin Nowack on 2009-08-27 (semantic-web@w3.org from August 2009)

From: Benjamin Nowack <bnowack@semsol.com>
Date: Thu, 27 Aug 2009 21:26:19 +0200
To: "Niklas Lindström" <lindstream@gmail.com>
Cc: "Seaborne, Andy" <andy.seaborne@hp.com>, Semantic Web <semantic-web@w3.org>
Message-ID: <PM-GA.20090827212619.A0966.3.1D@semsol.com>
Hi,

I "just" (it took 2 hours ;) loaded the LCSH into ARC2 and tried the query, too.

Without ORDER BY, the query is quite fast (less than 0.1 sec). With ORDER BY,
the MySQL part of the processing takes 57 seconds after a server restart,
then 27 seconds for repeated queries. ARC supports custom indexes for the triple
table(s), but I'm not sure if that would help in this case.
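
To make the index question a bit more concrete, here is a minimal sketch in
Python with sqlite3. It is not ARC2's schema and not MySQL: the `triple` table,
the `idx_p_o` index, the example.org subjects and the generated timestamps are
all made up for illustration. It only shows the general idea that a composite
index on (predicate, object) gives an SQL engine the option of serving
"ORDER BY ... LIMIT" for a single predicate from the index. Whether that
carries over to the two-pattern query in this thread is a separate question.

```python
import sqlite3

DCT_MODIFIED = "http://purl.org/dc/terms/modified"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS_CONCEPT = "http://www.w3.org/2004/02/skos/core#Concept"


def build_store(n=1000):
    """Create a hypothetical single-table triple store with n subjects."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triple (s TEXT, p TEXT, o TEXT)")
    rows = []
    for i in range(n):
        subject = f"http://example.org/concept/{i}"
        # Made-up, unique timestamps that all share one offset, so the
        # lexical order of the strings matches chronological order.
        year = 2000 + i // (28 * 12)
        month = 1 + (i // 28) % 12
        day = 1 + i % 28
        modified = f"{year:04d}-{month:02d}-{day:02d}T00:00:00-04:00"
        rows.append((subject, RDF_TYPE, SKOS_CONCEPT))
        rows.append((subject, DCT_MODIFIED, modified))
    conn.executemany("INSERT INTO triple VALUES (?, ?, ?)", rows)
    return conn


def latest(conn, limit=100):
    """Return the `limit` most recently modified subjects, newest first."""
    sql = "SELECT s, o FROM triple WHERE p = ? ORDER BY o DESC LIMIT ?"
    return conn.execute(sql, (DCT_MODIFIED, limit)).fetchall()


def add_index(conn):
    """Add a composite index so rows for one predicate are stored in object order."""
    conn.execute("CREATE INDEX idx_p_o ON triple (p, o)")


conn = build_store()
before = latest(conn)   # no index: scan the table, then sort
add_index(conn)
after = latest(conn)    # same answer, now with an index available
print(before == after, after[0])
```

The answer is the same either way; only the amount of work can differ. Note
that the sketch compares the timestamps as plain strings, which is exactly the
kind of data validity assumption a strict SPARQL engine cannot make.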
 
ARC2 is a rather simple PHP/MySQL system. The test was done on a 3-year-old
MacBook (1.83 GHz Intel Core Duo), with MySQL default settings (something like
8 or 16 MB index memory, not sure off-hand) and Skype/Twitter/IRC client apps
running during the queries, so really not something tuned. I'm working on
"partition-by-predicate", but I couldn't test it yet.

So, while ORDER BY indeed impacts the response time quite painfully, I second 
Andy's thought that the delays you encounter *might* be related to some 
configuration and can probably be improved by tweaking some preferences (don't
ask me which, though ;).
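
As an aside on why ORDER BY hurts even with LIMIT 100: the engine still has to
look at every match before it knows which 100 are the latest. A rough Python
sketch of that point follows. The names `parse_modified` and `top_k_latest`
and the sample rows are made up, and this is not a description of what ARC2,
MySQL or TDB do internally. It shows that a bounded heap can avoid the full
sort, but not the full scan or the per-value dateTime parsing.

```python
import heapq
from datetime import datetime


def parse_modified(lexical):
    """Parse an xsd:dateTime lexical form that carries a numeric offset,
    e.g. "2001-01-29T00:00:00-04:00", into a timezone-aware datetime."""
    return datetime.fromisoformat(lexical)


def top_k_latest(rows, k=100):
    """rows: iterable of (resource, modified_lexical) pairs.

    Returns the k most recently modified rows, newest first. Every row is
    still visited and parsed once, but only k candidates are kept in the heap.
    """
    return heapq.nlargest(k, rows, key=lambda row: parse_modified(row[1]))


# Hypothetical sample data in the same shape as the query results.
rows = [
    ("ex:a", "2001-01-29T00:00:00-04:00"),
    ("ex:b", "2009-06-15T00:00:00-04:00"),
    ("ex:c", "2005-03-02T12:30:00-04:00"),
]
print(top_k_latest(rows, k=2))
```

With roughly 340,000 matches and k = 100, the heap stays tiny, so the remaining
cost is mostly fetching and parsing the values.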

benji

--
Benjamin Nowack
http://bnode.org/
http://semsol.com/

On 26.08.2009 22:11:13, Seaborne, Andy wrote:
>I tried your query and data with Jena/ARQ/TDB on a 32 bit machine with 1G of RAM
>devoted to the query task. 
>Even ice cold (i.e. from system reboot) the worst I can get the system to perform
>at is 99 seconds rather than minutes.  I would not expect Jena/TDB to be
>fundamentally faster than any of the systems you mention in this situation.
>
>In a more normal setup, 28% of the time goes on the sort of datetime information
>(20seconds), the rest is the pattern matching and reading data into the JVM and the
>query times are around 55s from fairly cold (new JVM, some filesystem caching which
>I guess is the 99s->55s).  Second runs in the same JVM are faster still but the
>sort time is the same.  There is no query results caching. Without the ORDER BY, it
>takes 0.02s to do the query with LIMIT 100.
>
>This is on Windows Vista / 32 bit Java on consumer-grade hardware, 7200rpm disk;
>not a portable, nor server-class hardware.  Java 1.6.0_15; TDB from development
>SVN.  What is your system setup?
>
>I tried a 64 bit machine as well and it's faster in the pattern matching (27s, 47s
>overall) but the sort time only decreases to a little below 20s (18s), which is
>just because it's a machine with a faster CPU.
>
>What is happening is that to do the ORDER BY, it has to retrieve all the 342684
>possibilities so the ORDER BY affects the pattern matching part and incurs the sort
>cost.
>
>Jena/TDB does not assume the dateTime formatting is legal and it checks the
>xsd:dateTime for correctness to make the sorting strict SPARQL; it also uses Java's
>sort routines.  A system that yielded results from pattern matching in an order
>that is dateTime sorted would be much faster (around the 0.02s) but it's a tradeoff
>of generality and data validity assumptions.
>
>
>Talking of data validity - in the data, all the timezones, regardless of time of
>year, are -04:00, which is a strangeness.  The data was generated June 2009.  Hmm.
>
>e.g. "2001-01-29T00:00:00-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>
>
>	Andy
>
>> -----Original Message-----
>> From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org] On
>> Behalf Of Niklas Lindström
>> Sent: 26 August 2009 19:16
>> To: Semantic Web
>> Subject: SPARQL performance for ORDER BY on large datasets
>> 
>> Hi all!
>> 
>> I have a straightforward use case which seems really hard for triple
>> stores to handle using SPARQL (on huge datasets).
>> 
>> The case: select a set of resources (based on simple criteria such as
>> type) and order them based on a property value (limiting the results
>> to a decent batch size).
>> 
>> Data for test: the Library of Congress RDF dump of subject headings at
>> <http://id.loc.gov/authorities/search/> (~365 MB of RDF/XML, 3 703 621
>> statements).
>> 
>> The query:
>> 
>>     PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
>>     PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
>>     PREFIX dct: <http://purl.org/dc/terms/>
>> 
>>     SELECT * WHERE {
>>         ?resource a skos:Concept;
>>             dct:modified ?modified .
>>     }
>>     ORDER BY DESC(?modified)
>>     LIMIT 100
>> 
>> I tried this with Sesame (native file store), Virtuoso (opensource
>> edition), and AllegroGraph (free edition), and got terrible results
>> (on a 2.8 GHz Intel Core 2 Duo). More minutes than worth counting;
>> skipping the type match gets it down to about 20 seconds *at best*.
>> 
>> Sure, it's understandable (AFAIK) that this specific case is much
>> easier to use an SQL DB for (or something like CouchDB). But I was
>> surprised that it was *this* terrible. It seems unfeasible to build
>> e.g. chronological feeds from larger RDF stores using SPARQL with
>> current tools.
>> 
>> 
>> Is this -- ORDER BY performance -- a commonly known problem, and
>> considered an issue of importance (for academia and implementers
>> alike)?
>> 
>> Or am I missing something really obvious in the setup of these stores,
>> or in my query? I welcome *any* suggestions, such as "use triple store
>> X", "for X, make sure to configure indexing on Y". Or do RDF-using
>> service builders in general opt for indexing in something else
>> entirely in these cases?
>> 
>> 
>> (It seems queries like this are present in the Berlin SPARQL Benchmark
>> (e.g. #8), but I haven't analyzed this correlation and possible
>> meanings of it in depth.)
>> 
>> Best regards,
>> Niklas Lindström
>
Received on Thursday, 27 August 2009 19:26:58 UTC