Re: Use cases for Reification in RDF Triple stores

Useful comments - thank you. I'm copying this response over to jena-dev as well,
which is the better place for Jena-specific discussions.

The first point to clarify is that what you are talking about here is the
Jena-specific reification "shortcut". This attempts to give you some of the
features of reification, without the overhead of asserting four triples for
every reified statement. However, there is nothing to stop you using the full
official RDF reification approach with Jena - assert all four reification
triples, manipulate them accordingly, and ignore the shortcut stuff like
"isReified". I believe that would allow you to implement a non-quadratic
deleteResource.
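
To make the full-reification route concrete, here is a minimal sketch in plain
Python rather than the Jena API. ToyStore, by_term and delete_resource are
hypothetical names invented for illustration; only the rdf: vocabulary terms
are standard. It assumes a store that indexes every triple by each of its
terms, so each step of the delete is an indexed lookup and the cascade through
reified statements never needs a linear scan.

```python
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_TYPE, RDF_STATEMENT = RDF + "type", RDF + "Statement"
RDF_SUBJECT = RDF + "subject"
RDF_PREDICATE = RDF + "predicate"
RDF_OBJECT = RDF + "object"
QUAD_LINKS = (RDF_SUBJECT, RDF_PREDICATE, RDF_OBJECT)


class ToyStore:
    """Hypothetical in-memory triple store (not Jena), indexed by term."""

    def __init__(self):
        self.triples = set()
        self.by_term = defaultdict(set)  # term -> triples mentioning it

    def add(self, s, p, o):
        t = (s, p, o)
        self.triples.add(t)
        for term in t:
            self.by_term[term].add(t)

    def remove(self, t):
        self.triples.discard(t)
        for term in t:
            self.by_term[term].discard(t)

    def reify(self, node, s, p, o):
        """Assert the four standard reification triples describing (s, p, o)."""
        self.add(node, RDF_TYPE, RDF_STATEMENT)
        self.add(node, RDF_SUBJECT, s)
        self.add(node, RDF_PREDICATE, p)
        self.add(node, RDF_OBJECT, o)

    def delete_resource(self, resource):
        """Remove every triple mentioning resource, cascading through
        any statement nodes that describe a removed statement."""
        pending, seen = [resource], set()
        while pending:
            term = pending.pop()
            if term in seen:
                continue
            seen.add(term)
            for t in list(self.by_term[term]):
                s, p, _ = t
                # (s, rdf:subject/predicate/object, term) means statement
                # node s describes a statement that mentions term, so s
                # must be deleted as well.
                if p in QUAD_LINKS and s != term:
                    pending.append(s)
                self.remove(t)
```

Usage is just store.reify("ex:s1", "ex:alice", "ex:knows", "ex:bob") followed
by store.delete_resource("ex:alice"). The work done is proportional to the
number of triples actually removed, at the cost of storing four triples per
reified statement.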

Secondly, it is not completely true that reified statements are not indexed in
Jena. In JenaRDB (Jena-BDB doesn't support the reification shortcut anyway) all
statements are indexed whether reified or not and whether asserted or
reified_only (yes, there are effectively two "bits", not just one). However, it
is true that this indexing isn't easily accessible through the normal API. 

In designing Jena2 we debated whether to drop the reification shortcut and just
support explicit reification quads but decided to attempt to preserve it. It
sounds like your use case would be a good test of the new API design - I'll
forward it to jena-devel.

> Secondly, there should be a 'bit' that API users can use to mark
> statements as true or not. However, it really should be 'wider' than a
> single 'bit'. Give us enough bits (e.g., make it a resource), and we can
> use such an attachment to build our own context mechanisms.

Not sure about this. In RDF, statements are only asserted. The semantics of an
RDF graph is just the conjunction of the individual statements. There is no
notion of a not-asserted statement. Reification, according to the current model
theory, is just a way of referring to a triple that is asserted in some other
RDF document (i.e. it is a stating, not a reference to some abstract Statement).
Personally, I've got no problems with an application choosing to use reification
as a way of separating statements from their truth status but I'm not sure this
should be built in to APIs like Jena. If the working group had said reification
was about statements not statings I'd be happier with this, but it didn't. 

Furthermore, there would be some nasty interactions between this statement
"truth status field" and Jena models - the truth status should presumably be a
property of the pair [statement, model] rather than just a property of the
statement. This then suggests a different design approach for you ...

Why not just use Jena Models to provide your context? For example, use one Model
to contain all your statements of unknown truth status and a separate Model to
contain the current world view - i.e. the current set of "asserted" statements.
In the first model you could include all your trust and probability information
using reification and now you can use the reification shortcut without any loss
of searchability. Personally I find this explicit separation of Models more
intuitive than switching the reification status of statements within a single
Model. Indeed a custom Model implementation could presumably choose to record the
model->statement map by putting the Model into a field on Statement instead of
the other way round (at a cost to structure sharing). This could still conform
to the Jena API but in implementation terms would be very similar to your
generalized context bit. 
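
As a rough illustration of the two-Model idea, here is a minimal sketch in
plain Python, not the Jena API. ToyModel and promote_best are hypothetical
names, and the per-triple trust dictionary is only a stand-in for the trust
information that would really be attached via reification in the first model.

```python
class ToyModel:
    """Hypothetical stand-in for a Jena Model: a plain set of triples."""

    def __init__(self):
        self.triples = set()
        self.trust = {}  # triple -> score; stands in for reified annotations

    def add(self, s, p, o, trust=None):
        t = (s, p, o)
        self.triples.add(t)
        if trust is not None:
            self.trust[t] = trust
        return t

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None is a wildcard.
        A linear scan keeps the sketch short; a real store would index."""
        pattern = (s, p, o)
        return [t for t in self.triples
                if all(want is None or want == got
                       for want, got in zip(pattern, t))]


def promote_best(candidates, world_view, s, p):
    """Pick the most trusted candidate for (s, p, ?) and assert it
    in the world-view model. Returns the chosen triple, or None."""
    matches = candidates.match(s=s, p=p)
    if not matches:
        return None
    best = max(matches, key=lambda t: candidates.trust.get(t, 0.0))
    world_view.add(*best)
    return best
```

Every candidate stays findable through the normal query path of the first
model, and "asserted" simply means "present in the world-view model", so no
per-statement truth bit is needed.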

Dave

Bob MacGregor wrote:
> 
> Our experience with Jena has exposed some glaring performance weaknesses
> in its implementation of reified statements. We hope that these problems
> will be rectified in the upcoming Jena 2.0. However, the issues that
> surface will have to be addressed by any triple store implementation.
> 
> In some of our applications the truth of a statement (triple) is relative
> instead of absolute, ranked according to a probability or to a degree of
> trust. The basic processing loop retrieves all statements that match a
> particular pattern, and then sifts through the retrieved statements to
> pick out the winner according to some metric. In Jena, a reified
> statement may or may not be indexed. If it's not, then our processing loop
> will not find it. Hence, for our applications ALL statements are indexed.
> In API terms, this means that all statements must be contained by (added to)
> a model, whether or not they are reified, and whether or not they can be
> considered to be 'true'. Effectively, this means that the Jena 'bit' that
> records which statements have been added to a model is useless.
> 
> So the first lesson is that all statements should be indexed (note:
> heavyweight KR systems -- CycL, Epikit, Epilog, SNePS, Loom, PowerLoom --
> already do this).
> 
> Secondly, there should be a 'bit' that API users can use to mark
> statements as true or not. However, it really should be 'wider' than a
> single 'bit'. Give us enough bits (e.g., make it a resource), and we can
> use such an attachment to build our own context mechanisms.
> 
> Next, consider two basic triple store operations (currently missing in
> Jena): 'deleteResource' and 'renameResource'. To delete a resource R from
> a model M means to eliminate all statements in M that reference R. Sounds
> simple, right? Retrieve all statements, with R in subject position and
> delete them. Do the same for R in predicate and object positions. Now
> recurse: for every deleted statement, if it appears in subject or object
> position (i.e., if it is reified), delete the statement containing it. And
> so on.
> 
> In Jena this operation can be performed semi-efficiently only if all
> statements are indexed in M. If some statements are reified but not added
> to the model ('reifiedOnly' in Jena terms), then a linear scan of all
> reified only statements is needed to search for statements that reference
> R (to some level of nesting). In the worst case, this makes our delete
> operation take quadratic time. In our implementation of deleteResource,
> we don't bother to scan for 'reifiedOnly' statements, since the
> performance would be unacceptable, and as we indicated in our opening, we
> have other reasons for avoiding 'reifiedOnly' statements.
> 
> Note that I used the term 'semi-efficiently'. For the most common triple
> store applications, reified statements form a small percentage of
> statements in a model. Suppose our resource R appears in 10 statements,
> none of which are reified. Then the algorithm outlined above will make 11
> probes/queries to the triple store to eliminate those statements. Suppose
> the triple store API provided a 'bit' (i.e., a quick test) to determine
> whether or not a statement is reified. Then instead of 11 probes, our
> delete operation would require only one probe. Now it's efficient.
> 
> Unfortunately, Jena provides a 'reifiedOnly' test, but does not provide
> an 'isReified' test. So, another suggestion would be to reverse that
> particular decision. Note that having a fast 'isReified' test would also
> speed up applications such as those alluded to at the opening, that
> attach probabilities or whatever to statements. If most statements are
> not reified, the availability of an 'isReified' test can eliminate the
> occurrence of additional probes that look for probability statements that
> aren't there (i.e., our 'basic processing loop' ceases to be a loop
> most of the time).
> 
> Note: The algorithm to implement a 'renameResource' method is nearly
> identical to 'deleteResource'.
> 
> Finally, I don't mean to pick on Jena, which currently is one of the
> greatest things to come along since 'sliced bread' (not sure what the
> British equivalent of sliced bread might be). I would imagine that other
> triple stores might have as many or more problems with reified
> statements, but we haven't tried out any other systems.
> 
> Cheers, Bob

Received on Monday, 6 January 2003 07:25:34 UTC