Efficient context mechanisms from Bob MacGregor on 2003-01-10 (www-rdf-interest@w3.org from January 2003)

From: Bob MacGregor <macgregor@ISI.EDU>
Date: Fri, 10 Jan 2003 11:02:15 -0800
To: Dave Reynolds <der@hplb.hpl.hp.com>
Cc: www-rdf-interest@w3.org, jena-dev@yahoogroups.com
Message-Id: <5.1.1.6.0.20030110104721.00b66778@tnt.isi.edu>
Dave,

I'm switching to a new thread, because issue of contexts is getting
confused with the reification issue, and to some extent they are
separable.

On Friday, January 10, 2003, at 02:57 AM, Dave Reynolds wrote:
      Bob,
      Most of your points are probably best addressed by Pat, or others, 
but there
      were two side comments that I wanted to respond to.
          We would be happy to send you the source code for Loom or
          PowerLoom if you want to see what an efficient (and very clever) 
context mechanism
      looks  Interesting. Do you have any references to papers on this 
(other than the source
      code itself)? I've read some of the papers on the Loom classifier but 
not seen
      any on the context system.

I'm sorry, but the answer is no, we don't have any write-ups on the Loom 
context
mechanism(s). However, I will give you some highlights here. The core 
implementations
of contexts in PowerLoom and Loom are virtually identical, so for the 
sequel I'll just use the term
Loom.

In Loom, contexts reside in a tangled hierarchy. Statements themselves have no
assigned truth values. If memory serves, Loom uses a three-valued logic (T, 
F, and U),
and PowerLoom uses a 5-valued logic (adding default T and default F).
PowerLoom can also allow numeric/probabilistic values to be associated with 
statements.
The context mechanism doesn't care how many truth values are being 
manipulated;
the machinery is orthogonal to that issue.

I will use the term 'statement' to refer to a logical sentence, and
'assertion' to refer to a statement having an assigned truth value. 
Assertions residing in a
context A are inherited by all contexts that are descendants of A.
These assertions can be overridden, so if context B inherits an assertion 
stating that S1 is true, B
may contain an assertion that in it (and its descendants) the truth value 
of S1 is false,
or unknown, or whatever.

This context mechanism, then, is about as general as you might imagine along
certain dimensions. It was invented, as I mentioned earlier, by Austin 
Tate's group
in Edinburgh. They needed it because their planning system, OPLAN, was 
designed
to explore large numbers of possible worlds/models, and they needed a very
efficient context switching mechanism. They were so enamored of their context
mechanism that they envisioned re-engineering it in silicon. However, I 
don't think
that ever happened.

Unfortunately, Austin's group didn't ever publish their context mechanism. 
That
was OK with us, because they gave us their source code, and it was only about
a page of C code, easily decipherable. We generalized their algorithm in 
two ways --
we generalized it to handle multiple inheritance, and we invented a notion 
of a
'home context' that significantly reduces the space requirements in the 
average
case (the case where a truth value for a statement is asserted in one and 
only one
context). Unfortunately, we never wrote up our algorithm either. Its 
written in
Common Lisp, and its only a couple of pages, but certain aspects of it are
counter intuitive.  Furthermore, some of the stuff is implemented as a part
of the language itself, rather than as a part of the application (I have a 
third
implementation of contexts where the code is pulled back out of the
language; its not that much different).
I would guess that Tom Russ, Loom's caretaker, would be more
than happy to elaborate on the details if you cared to contact him.

Now for the good part. The Loom context mechanism is VERY fast. I believe
that the average asymptotic time to lookup the truth value of a statement 
within a context
hierarchy is O(n), where n is the average number of truth values per 
statement.
The average number of truth values per statement in a normal application is 
probably
a number between 1 and 3, so that means that context sensitive lookup is 
basically
a constant-time operation, independent of the number of contexts.

The caveat is that the algorithm I've just alluded to is a main-memory based
algorithm. I would guess that some variant of it can be used when one moves
to a persistent storage model, but none of us have tried to work out the 
details
(yet). Its possible that we would hit some kind of snag that would 
negatively impact
the asymptotic performance. When looking at Jena internals, I've been keeping
the context machinery in the back of my mind, which is partly why I'm 
paying attention
to things like the 'reifiedOnly' bit.

If you ever decide to get serious about contexts, it would probably be 
worth your
while to assign someone the task of deciphering our context algorithm (or 
Austin's).
It may be overkill for RDF, but some of the lessons learned probably apply to
simpler context systems.

     But the fact is, if the "shortcut" reification that now exists in Jena
     were modified only slightly, we could get significant performance 
improvements
     for reasoning with the kinds of examples listed above.
     I'd like to second Brian's suggestion that you work with us tool 
developers on
     this to find a pragmatic solution (perhaps over on jena-dev or 
jena-devel rather
     than rdf-interest). If I understand our earlier discussion then it 
still seems
     plausible to find a solution. Setting aside the wider semantic issues, 
and
     thinking just in implementation terms, then the implementation costs of
     switching from a "statement" to a "statings" view of reification might be
     mitigated by optimizing the case where an application only wants to 
have one
     "reification" for each statement.

Ah yes, statement v. statings. After getting a better understanding of this 
issue
(I hear what they are saying, but still can't believe what I'm hearing), it 
has occurred
to me that I may be able to finesse the issue by pretending that the terms
'statement' and 'stating' are synonymous within my own applications. I would
do just what you appear to be suggesting -- mandate that there be only one
stating per statement, and then use the stating wherever I really intended
to use a 'reified statement' (here I'm using the traditional sense of this 
phrase,
not RDF's misuse of it) . If this turns out to be a viable option, then my 
interest
in the internals of how Jena manages "shortcut" reification (i.e., nested
statements) will continue to be large.

Note: with the exception of scrutinizing the database storage layout for Jena
RDB models, I've confined myself to looking hard at the Jena API but not
looking below it (i.e., I haven't looked at the code that implements it). 
That could
put me somewhat at a disadvantage when discussing implementation
issues.  And, of course, I haven't seen Jena 2.0.  Are their threads other
than 'jena-dev' where implementations issues are discussed?

Cheers, Bob
Received on Friday, 10 January 2003 14:02:20 UTC