Re: indexing and ordered searches from Andy Seaborne on 2013-01-22 (semantic-web@w3.org from January 2013)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Tue, 22 Jan 2013 16:36:31 +0000
To: semantic-web@w3.org
Message-ID: <50FEC00F.7020605@epimorphics.com>

On 22/01/13 00:11, Jeremy J Carroll wrote:
 >
 > What triple stores offer any optimization for integer or other numeric
 > queries, e.g. with the following query
 >
 > SELECT  ?a
 > WHERE { ?a  eg:prop ?i .
 >          FILTER ( 10 < ?i && ?i < 15 ) }
 >
 >
 > a naive approach would be to find all eg:prop triples and compute the
 > filter for each one (O(N) complexity); if the store only has say
 > integers as objects of eg:prop then a Predicate/Object index could
 > achieve the same result in O(log(N)) by a numeric binary chop on the
 > Object part of the index.
 >
 > I looked at the TDB code and got the impression that the indices do in
 > fact preserve ordering of at least some types, but could not see FILTERs
 > being treated in the way indicated here … then I thought I would ask 
here.
 >
 > Jeremy

The TDB indexes do preserve ordering, sort of. Numbers (integer, 
decimals), as well as dates and datetimes, are stored in the index 
itself if possible. An integer value has to fit in 56 bits (signed). 
Indication of datatype is preserved.  Otherwise, oversized literals are 
stored in a different table.  Doubles aren't optimized currently as it 
would loose precision (if node id size increases or becomes variable 
(compression) then they will be).

In the shipped indexing configuration there is a POS index, as a B+Tree, 
so it is ordered sufficiently for this query if restricted to integers 
and decimals.

Your example would be to start at
    (eg:prop, 10^^integer) -> (eg:prop, 15^^integer)
and also look at:
    (eg:prop, 10^^decimal) -> (eg:prop, 15^^decimal).
and it's all index -access only if you assume it's within the 56 bit limit.

If you have to cope with very large integers, or decimals with large 
absolute values and a lot of fraction value, this will miss those.

Each leaf block has 100-200 triples worth of id so in the integer case 
likely to be one block access.  If the data is loaded with tdbloader2, 
the leaf disk blocks are very likely adjacent on disk so likely to be 
fast access even on rotational disk.

The encoding inline into the index gives a very significant increase in 
performance over needing to go to a node table to get the value for the 
FILTER test.  My understanding is that there going to be observable 
benefits when there are very large numbers of eg:prop with values 
outside that range.

	Andy

Received on Tuesday, 22 January 2013 16:37:01 UTC