RE: SPARQL performance for ORDER BY on large datasets

On 2009-08-26, Seaborne, Andy wrote:

> What is happening is that to do the ORDER BY, it has to retrieve all 
> the 342684 possibilities so the ORDER BY affects the pattern matching 
> part and incurs the sort cost.

Sort of, but not quite. The engine could instead work backwards, 
beginning with the already-ordered timestamps and going from there. As a 
relational kinda guy, I see this as an optimization problem: one I'd 
easily solve in an RDBMS, and one that would require end-to-end 
optimization across the RDF layer in the case of a triple store.

The way to go with a relational database would be to have a 
partitioned-by-type relation of instances indexed on time (or a 
composite index on type major, time minor). The execution plan would 
translate into a lookup covering only the partition holding this 
specific type of object, plus a linear index scan in temporal order, 
cut off after the first 100 tuples. Whatever else you wanted would then 
probably be translated into a nested-loop pipelined join against the 
relation data proper.
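That plan can be sketched concretely. The following is illustrative only 
(schema, table, and index names are invented), using SQLite in Python to 
show that a composite (type, time) index lets the planner satisfy the 
equality predicate and the ORDER BY from the index alone, stopping after 
100 rows instead of sorting all matches:

```python
import sqlite3

# Hypothetical relation standing in for the instance data, with a
# composite index: type major, time minor.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instances (id INTEGER PRIMARY KEY, "
    "type TEXT, time INTEGER, data TEXT)"
)
conn.execute("CREATE INDEX idx_type_time ON instances (type, time)")

conn.executemany(
    "INSERT INTO instances (type, time, data) VALUES (?, ?, ?)",
    [("Observation" if i % 2 else "Other", (i * 7) % 100000, "payload")
     for i in range(10000)],
)

# The plan mentions idx_type_time rather than a temp b-tree sort:
# the index scan already yields rows in time order for a fixed type.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id, time FROM instances "
    "WHERE type = ? ORDER BY time LIMIT 100",
    ("Observation",),
).fetchall()

rows = conn.execute(
    "SELECT id, time FROM instances "
    "WHERE type = ? ORDER BY time LIMIT 100",
    ("Observation",),
).fetchall()
```

With the index in place the LIMIT terminates the scan early; without it, 
SQLite would materialize and sort every matching row first, which is 
exactly the cost being complained about.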

In theory, a columnar organization by attribute, with an index on the 
table holding the timestamps, should optimize reasonably well across 
various RDF layers as well. And, per Stonebraker's comments, doubly so 
if the underlying database is column-oriented to start with.
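For the vertically partitioned case, a minimal sketch (again with 
invented names, one two-column table per property in the style of 
Abadi et al.'s SW-Store work): the time property gets its own indexed 
table, so the ordered scan plus a pipelined join back to the other 
property tables can again stop at 100 bindings:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One two-column (subject, value) table per RDF property; only the
# time table carries a value index.
conn.execute("CREATE TABLE p_time (s INTEGER, v INTEGER)")
conn.execute("CREATE TABLE p_label (s INTEGER, v TEXT)")
conn.execute("CREATE INDEX idx_time_v ON p_time (v)")

conn.executemany("INSERT INTO p_time VALUES (?, ?)",
                 [(i, (i * 13) % 5000) for i in range(2000)])
conn.executemany("INSERT INTO p_label VALUES (?, ?)",
                 [(i, f"item{i}") for i in range(2000)])

# Index scan over p_time in value order, nested-loop join to p_label
# on subject, cut off after the first 100 bindings.
rows = conn.execute(
    "SELECT t.s, t.v, l.v FROM p_time t "
    "JOIN p_label l ON l.s = t.s "
    "ORDER BY t.v LIMIT 100"
).fetchall()
```

The point is only that the ORDER BY never forces a full sort: the time 
column's index supplies the order, and the join is driven row by row 
from it.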
-- 
Sampo Syreeni, aka decoy - decoy@iki.fi, http://decoy.iki.fi/front
+358-50-5756111, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2

Received on Wednesday, 26 August 2009 23:41:16 UTC