Re: Query Execution Speed - Sesame Native Store from Jeen Broekstra on 2013-08-19 (public-sparql-dev@w3.org from July to September 2013)

From: Jeen Broekstra <jeen.broekstra@gmail.com>
Date: Mon, 19 Aug 2013 13:02:28 +1200
To: public-sparql-dev@w3.org
Message-ID: <52116EA4.5070300@gmail.com>
On 16/08/13 06:21, Souza, Renan F. S. wrote:
> Hello, all!
> I have some questions regarding 3 aspects of Sesame influencing Query
> Execution Speed.

You cross-posted to three mailing lists at once, yet omitted the one that is 
actually meant for questions about Sesame.

I am cutting this down to public-sparql-dev only, since that is probably the 
most applicable of the ones you selected, but I would suggest that if you 
want to continue this discussion, you steer it towards the Sesame 
discussion mailing list. See 
https://lists.sourceforge.net/lists/listinfo/sesame-general .

> 1) Indexes
>
> I am using Sesame 2.7 Native Store with the indexes
> spoc,sopc,psoc,posc,opsc,ospc.
> I am not using Contexts so I just used all combinations of indexes for
> (s,p,o) because, according to Sesame User Guide
> <http://www.openrdf.org/doc/sesame2/users/ch07.html#section-native-store-config>,
> the more indexes the faster the query execution will be. I care much
> more about query speed than loading time and space on disk.
>
> Question: Is my approach right? Using all those indexes together will
> really speed up my queries? Or may it somehow slow down depending on
> something that I am not taking into consideration?

If you do not use contexts, you do not need this many indexes. Consider: 
an index is meant to retrieve complete triples given one or more known 
values for S, P, or O. To have full index coverage, you really only need 
three indexes: (1) SPOC, (2) OSPC, and (3) POSC. Index 1 will cover all 
cases where S, or S and P, are known. Index 2 will cover all cases where O, 
or O and S, are known, and index 3 will cover all cases where P, or P and 
O, are known. This covers all valid partial triple patterns. The two corner 
cases (all values known or no values known) can be solved by any index.
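
To make the coverage argument concrete, here is a small Python sketch. It 
is not Sesame code, and the function name covering_index is made up for 
illustration. It assumes a simplified model in which an index can serve a 
pattern when the known positions form a leading prefix of the index's 
field order, and it checks every partial pattern over S, P, and O against 
the three indexes named above.

```python
from itertools import combinations

# The three indexes suggested above, as field-order strings.
INDEXES = ["spoc", "ospc", "posc"]

def covering_index(known, indexes=INDEXES):
    """Return the first index whose leading fields are exactly the
    known positions (simplified prefix model), or None if none fits."""
    known = set(known)
    for index in indexes:
        prefix = index[:len(known)]
        if set(prefix) == known:
            return index
    return None

# Every partial pattern: one or two of s, p, o known.
patterns = [c for n in (1, 2) for c in combinations("spo", n)]
coverage = {"".join(p): covering_index(p) for p in patterns}

for pattern, index in coverage.items():
    print(f"known={pattern:<2} -> {index}")
```

Under this model each of the six partial patterns maps to one of the 
three indexes, and the two corner cases (nothing known, or S, P, and O all 
known) are matched by the first index in the list.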

> 2) Contexts, named graphs.
>
> As I understood, contexts within my Sesame RDF repository work more as
> an "identifier" of a subset of the dataset.  I am not using contexts at
> all so I just ignore the "c" part of the quads and I only include triples.
>
> Should I also think about modeling my problem considering contexts?
> Hence, include more indexes, maybe all combinations of the four letters
> (s,p,o,c)? Will it speed up my queries if I include contexts in my
> statements?

It might, but it's impossible to say without knowing more about your 
problem domain or what kind of queries you do. Contexts (or named graphs) 
are an organization/grouping mechanism; their primary purpose is 
cataloguing rather than performance improvement. But of course if you 
catalogue correctly, retrieval may be improved simply because you can 
write simpler queries.

> 3) Loading interruption
>
> While loading data, my application may be interrupted (because of power
> outage or connection loss, for example). May that cause any serious
> damage to the repository, corrupting the indexes hence slowing down my
> queries?

Yes. But it is relatively unlikely as there are transaction failsafes 
and recovery mechanisms in place.

> PS: I have around 1M triples. I just dropped my whole database and
> started building a new one all over to discard any problems related to
> database corruptions and size. (Before dropping I had over 70M triples
> and the queries were as slow as now with 1M triples).

The native store should easily cope with 70M triples, let alone just 1M. 
But it's impossible to tell what is going wrong without knowing more 
about your setup. I suggest you mail the Sesame discussion list with 
additional details.

Regards,

Jeen
Received on Monday, 19 August 2013 01:02:59 UTC