Re: SPARQL performance for ORDER BY on large datasets

On 2009-09-01, Brian Stamper wrote:

> How does the number of triples correlate to the processing time of 
> ORDER BY; e.g. is it logarithmic, quadratic, exponential, etc.? Is 
> this an issue of the sort algorithm, or the structure of the data, 
> or..?

In the optimal and average cases, the cost is n*log(n) in the size of the 
input to be sorted. If the number of returned tuples/triples can be 
bounded by a constant, hashing can bring that down to n.

In general, and in the worst case, it depends very much on the storage 
organization, and in particular the constant factor in front of the 
expression can change drastically. If there is some serious optimization 
going on, you can bring the partial cost down to the amount of data that 
needs to be inspected in order to make a decision -- which can sometimes 
be below linear -- plus the linear cost of the data actually delivered, 
give or take compressibility (which in the offline case can also yield 
below-linear costs in the optimum, but never on average unless your data 
is known to be unevenly distributed).

The precise query plan can also yield surprises, because the 
asymptotically optimal plan rarely coincides with the optimum plan 
against a finite set of data (and optimizer statistics). This means that 
in the subasymptotic regime, changes in plan can drastically change the 
total complexity, and so cause serious discontinuities in the partial 
costs, until the total cost approaches the asymptotic regime in all 
parameters simultaneously.
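
As a rough illustration of the bounded case above (a sketch, not taken 
from any particular engine; the function and data names are made up): if 
the sort key is known to take only a constant number of distinct values, 
the solutions can be hashed into per-key buckets in a single linear pass 
and emitted in key order, avoiding a general n*log(n) sort.

    from collections import defaultdict

    def bucket_order_by(bindings, key, key_domain):
        """Order solution mappings by a sort key with a bounded domain.

        One O(n) pass hashes each binding into its key's bucket; the
        buckets are then emitted in key order. Sorting the domain itself
        is O(1) because its size is a constant.
        """
        buckets = defaultdict(list)
        for b in bindings:                # linear pass over the solutions
            buckets[b[key]].append(b)
        ordered = []
        for k in sorted(key_domain):      # constant-sized key domain
            ordered.extend(buckets.get(k, []))
        return ordered

    # Hypothetical usage: ordering bindings by a small status code.
    rows = [{"s": "ex:a", "status": 3}, {"s": "ex:b", "status": 1},
            {"s": "ex:c", "status": 3}, {"s": "ex:d", "status": 2}]
    print(bucket_order_by(rows, "status", key_domain={1, 2, 3}))

An engine can get a similar win when ORDER BY is combined with a small 
LIMIT, since only a bounded prefix of the sorted output ever needs to be 
materialized.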
-- 
Sampo Syreeni, aka decoy - decoy@iki.fi, http://decoy.iki.fi/front
+358-50-5756111, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2

Received on Tuesday, 1 September 2009 21:50:35 UTC