- From: Bob MacGregor <macgregor@ISI.EDU>
- Date: Fri, 3 Jan 2003 09:13:16 -0800
- To: www-rdf-interest@w3.org
- Message-Id: <A5E65596-1F3E-11D7-8B4D-000A27DC4AB0@isi.edu>
Our experience with Jena has exposed some glaring performance weaknesses in its implementation of reified statements. We hope that these problems will be rectified in the upcoming Jena 2.0. However, the issues that surface will have to be addressed by any triple store implementation.

In some of our applications the truth of a statement (triple) is relative instead of absolute, ranked according to a probability or to a degree of trust. The basic processing loop retrieves all statements that match a particular pattern, and then sifts through the retrieved statements to pick out the winner according to some metric. In Jena, a reified statement may or may not be indexed. If it's not, then our processing loop will not find it. Hence, for our applications ALL statements are indexed. In API terms, this means that all statements must be contained by (added to) a model, whether or not they are reified, and whether or not they can be considered to be 'true'. Effectively, this means that the Jena 'bit' that records which statements have been added to a model is useless.

So the first lesson is that all statements should be indexed (note: heavyweight KR systems -- CycL, Epikit, Epilog, SNePS, Loom, PowerLoom -- already do this). Secondly, there should be a 'bit' that API users can use to mark statements as true or not. However, it really should be 'wider' than a single 'bit'. Give us enough bits (e.g., make it a resource), and we can use such an attachment to build our own context mechanisms.

Next, consider two basic triple store operations (currently missing in Jena): 'deleteResource' and 'renameResource'. To delete a resource R from a model M means to eliminate all statements in M that reference R. Sounds simple, right? Retrieve all statements with R in subject position and delete them. Do the same for R in predicate and object positions.
Now recurse: for every deleted statement, if it appears in subject or object position of another statement (i.e., it is reified), delete the statement containing it. And so on.

In Jena this operation can be performed semi-efficiently only if all statements are indexed in M. If some statements are reified but not added to the model ('reifiedOnly' in Jena terms), then a linear scan of all 'reifiedOnly' statements is needed to search for statements that reference R (to some level of nesting). In the worst case, this makes our delete operation take quadratic time. In our implementation of deleteResource, we don't bother to scan for 'reifiedOnly' statements, since the performance would be unacceptable, and as we indicated in our opening, we have other reasons for avoiding 'reifiedOnly' statements.

Note that I used the term 'semi-efficiently'. For the most common triple store applications, reified statements form a small percentage of statements in a model. Suppose our resource R appears in 10 statements, none of which are reified. Then the algorithm outlined above will make 11 probes/queries to the triple store to eliminate those statements: one to retrieve the statements that reference R, and one per deleted statement to look for statements that reference it. Suppose the triple store API provided a 'bit' (i.e., a quick test) to determine whether or not a statement is reified. Then instead of 11 probes, our delete operation would require only one probe. Now it's efficient. Unfortunately, Jena provides a 'reifiedOnly' test, but does not provide an 'isReified' test. So, another suggestion would be to reverse that particular decision.

Note that having a fast 'isReified' test would also speed up applications such as those alluded to at the opening, that attach probabilities or whatever to statements. If most statements are not reified, the availability of an 'isReified' test can eliminate the occurrence of additional probes that look for probability statements that aren't there (i.e., our 'basic processing loop' ceases to be a loop most of the time).
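To make the probe counting concrete, here is a minimal sketch of the delete operation in Python. It is not Jena code: ToyStore, find_mentioning, is_reified and delete_resource are hypothetical names invented for this illustration, and the linear scan inside find_mentioning merely stands in for what a real store would answer from its indexes. I also assume, as in the count above, that retrieving every statement that mentions a node is a single probe.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Hashable


@dataclass(frozen=True)
class Statement:
    subject: Hashable
    predicate: Hashable
    obj: Hashable


class ToyStore:
    """Hypothetical in-memory triple store; this is not Jena's API."""

    def __init__(self):
        self.statements = set()
        self.ref_count = Counter()  # stored statements that mention each statement
        self.probes = 0             # number of queries issued against the store

    def add(self, subject, predicate, obj):
        st = Statement(subject, predicate, obj)
        if st not in self.statements:
            self.statements.add(st)
            for node in (subject, obj):
                if isinstance(node, Statement):
                    self.ref_count[node] += 1
        return st

    def remove(self, st):
        if st not in self.statements:
            return
        self.statements.remove(st)
        for node in (st.subject, st.obj):
            if isinstance(node, Statement):
                self.ref_count[node] -= 1

    def find_mentioning(self, node):
        """One probe: every statement with `node` in any position.
        The scan is a stand-in for an indexed lookup."""
        self.probes += 1
        return [st for st in self.statements
                if node in (st.subject, st.predicate, st.obj)]

    def is_reified(self, st):
        """The quick test: is `st` still mentioned by a stored statement?"""
        return self.ref_count[st] > 0


def delete_resource(store, resource, use_is_reified=True):
    """Delete every statement that references `resource`, then recurse
    into statements that reference the deleted statements."""
    pending = [resource]
    while pending:
        target = pending.pop()
        # With the quick test, an unreified statement costs no probe at all.
        if (use_is_reified and isinstance(target, Statement)
                and not store.is_reified(target)):
            continue
        for st in store.find_mentioning(target):
            store.remove(st)
            pending.append(st)
```

With R appearing in 10 unreified statements, this sketch issues 11 probes when use_is_reified is False and 1 probe when it is True, which matches the arithmetic above. The reference count is only one possible way to back such a test; a store could equally keep a flag on each statement.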
Note: The algorithm to implement a 'renameResource' method is nearly identical to 'deleteResource'.

Finally, I don't mean to pick on Jena, which currently is one of the greatest things to come along since 'sliced bread' (not sure what the British equivalent of sliced bread might be). I would imagine that other triple stores might have as many or more problems with reified statements, but we haven't tried out any other systems.

Cheers, Bob
Received on Friday, 3 January 2003 12:11:29 UTC