Re: SPARQL performance for ORDER BY on large datasets from Andreas Langegger on 2009-09-01 (semantic-web@w3.org from September 2009)

From: Andreas Langegger <al@jku.at>
Date: Tue, 01 Sep 2009 23:20:01 +0200
To: Brian Stamper <stamper.10@osu.edu>
CC: semantic-web@w3.org
Message-ID: <4A9D9001.9010009@jku.at>

Hello,

it depends on the query processor and the sort algorithm used. The main 
bottleneck is probably memory causing the operating system to swap to 
disk and start trashing and as a consequence becoming even slower.

Jena ARQ uses Java's Arrays.sort() [1,2]: "The sorting algorithm is a 
modified mergesort [...] This algorithm offers guaranteed n*log(n) 
performance." - For really large datasets a sort will always be expensive.

Side note: it's the same with RDBMS - just do an SQL ORDER BY over an 
attribute which is not indexed!

Regards AndyL

[1] 
http://jena.svn.sourceforge.net/viewvc/jena/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/iterator/QueryIterSort.java?view=markup

[2] 
http://java.sun.com/j2se/1.5.0/docs/api/java/util/Arrays.html#sort(T[],%20java.util.Comparator)

Brian Stamper schrieb:
> 
> A curious student-coder-in-training asks:
> 
> How does the number of triples correlate to the processing time of ORDER 
> BY; e.g. is it logarithmic, quadratic, exponential, etc.? Is this an 
> issue of the sort algorithm, or the structure of the data, or..?
> 
> Brian Stamper
> 
>

Received on Tuesday, 1 September 2009 21:20:51 UTC