- From: Bob MacGregor <macgregor@ISI.EDU>
- Date: Fri, 3 Jan 2003 09:13:16 -0800
- To: www-rdf-interest@w3.org
- Message-Id: <A5E65596-1F3E-11D7-8B4D-000A27DC4AB0@isi.edu>
Our experience with Jena has exposed some glaring performance weaknesses
in its implementation of reified statements. We hope that these problems
will be rectified in the upcoming Jena 2.0. However, the issues that
surface will have to be addressed by any triple store implementation.
In some of our applications the truth of a statement (triple) is 
relative
instead of absolute, ranked according to a probability or to a degree of
trust. The basic processing loop retrieves all statements that match a
particular pattern, and then sifts through the retrieved statements to
pick out the winner according to some metric. In Jena, a reified
statement may or may not be indexed. If its not, then our processing 
loop
will not find it. Hence, for our applications ALL statements are 
indexed.
In API terms, this means that all statements must contained by (added 
to)
a model, whether or not they are reified, and whether or not they can be
considered to be 'true'. Effectively, this means that the Jena 'bit' 
that
records which statements have been added to a model is useless.
So the first lesson is that all statements should be indexed (note:
heavyweight KR systems -- CycL, Epikit, Epilog, SNePS, Loom, PowerLoom 
--
already do this).
Secondly, there should be a 'bit' that API users can use to mark
statements as true or not. However, it really should be 'wider' than a
single 'bit'. Give us enough bits (e.g., make it a resource), and we can
use such an attachment to build our own context mechanisms.
Next, consider two basic triple store operations (currently missing in
Jena): 'deleteResource' and 'renameResource'. To delete a resource R 
from
a model M means to eliminate all statements in M that reference R. 
Sounds
simple, right? Retrieve all statements, with R in subject position and
delete them. Do the same for R in predicate and object positions. Now
recurse: for every deleted statement, if it appears in subject or object
position (i.e., it its reified), deleted the statement containing it. 
And
so on.
In Jena this operation can be performed semi-efficiently only if all
statements are indexed in M. If some statements are reified but not 
added
to the model ('reifiedOnly' in Jena terms), then a linear scan of all
reified only statements is needed to search for statements that 
reference
R (to some level of nesting). In the worst case, this makes our delete
operation take quadratic time. In our implementation of deleteResource,
we don't bother to scan for 'reifiedOnly' statements, since the
performance would be unaccepable, and as we indicated in our opening, we
have other reasons for avoiding 'reifiedOnly' statements.
Note that I used the term 'semi-efficiently'. For the most common triple
store applications, reified statements form a small percentage of
statements in a model. Suppose our resource R appears in 10 statements,
none of which are reified. Then the algorithm outlined above will make 
11
probes/queries to the triple store to eliminate those statements. 
Suppose
the triple store API provided a 'bit' (i.e., a quick test) to determine
whether or not a statement is reified. Then instead of 11 probes, our
delete operation would require only one probe. Now its efficient.
Unfortunately, Jena provides a 'reifiedOnly' test, but does not provide
an 'isReified' test. So, another suggestion would be to reverse that
particular decision. Note that having a fast 'isReified' test would also
speed up applications such as those alluded to at the opening, that
attach probabilities or whatever to statements. If most statements are
not reified, the availability of an 'isReified' test can eliminate the
occurrence of additional probes that look for probability statements 
that
aren't there (i.e., our 'basic processing loop' ceases to be a loop
most of the time).
Note: The algorithm to implement a 'renameResource' method is nearly
identical to 'deleteResource'.
Finally, I don't mean to pick on Jena, which currently is the one of the
greatest things to come along since 'sliced bread' (not sure what the
British equivalent of sliced bread might be). I would imagine that other
triple stores might have as many or more problems with reified
statements, but we haven't tried out any other systems.
Cheers, Bob
Attachments
- text/enriched attachment: stored
Received on Friday, 3 January 2003 12:11:29 UTC