- From: Bob MacGregor <macgregor@ISI.EDU>
- Date: Wed, 15 Jan 2003 12:29:22 -0800
- To: Dave Reynolds <der@hplb.hpl.hp.com>
- Cc: www-rdf-interest@w3.org, tar@ISI.EDU, hans@ISI.EDU
Warning: This answer to Dave's comment makes reference to another response by Tom Russ. I couldn't easily refer to both at the same time, so readers will have to do a bit of thread jumping to follow things. Sorry.

At 05:47 PM 1/15/2003 +0000, Dave Reynolds wrote:

>Bob,
>
> > >What is the space complexity of the algorithm like?
> >
> > The contexts are organized into a hierarchy, and numbered so
> > that a parent's number is always smaller than the numbers of its
> > descendants. Some sort of data structure is needed that can quickly
> > determine, given two contexts C1 and C2, whether C1 inherits C2.
> > One could imagine that the space requirements for this would be
> > superlinear in the worst case, but they should be linear (in the number
> > of contexts) for the average case.
> >
> > Each statement/triple points to a list of contexts where it has
> > appeared within an assertion. If a triple [S P O] is asserted to be true in
> > contexts C1, C2, and C3, then there is a triple object that points
> > to a list containing C1, C2, and C3 (ordered according to the
> > numbers they are assigned, largest number first).
>
>So you have structure sharing in the sense that there is only one triple
>object which can appear in multiple contexts. In the case where there are
>lots of triples (N) and lots of contexts (M) then the worst case cost is
>dominated by those membership lists which would be worst case O(NxM).
>However, I imagine the lists also structure share, and that the ordering
>helps with that, so the average case will be *much* better than that.
>
>Presumably there is also an additional index structure of some sort to allow
>you to get the constant-time truth value lookup - that also needs to have
>good scaling properties.

Tom Russ is saying that we don't have constant-time lookup. Strictly speaking, he is correct. Given a context, our algorithm runs up the parent links in the hierarchy looking for a match with an ancestor context. This part of the code runs so fast that optimizing out the non-constant factor isn't worth it in practice. If our context trees got very deep, then we would need to redo this aspect of the code.

However, we have lost sight of the two numbers that are most relevant. One number is P, the average number of contexts per statement. The other is D, the average depth of the context tree. P is typically a small number between 1 and 2, usually much closer to 1 than 2. D is usually a number between 2 and 10, usually much closer to 2. The algorithm we implement in Loom runs in time something like P + D to evaluate the truth of a statement in a context. If D became large (say > 20), we might choose to optimize away the lookup in the context tree, yielding a time of P. However, the constant factor for a single lookup would increase by quite a bit. So, our actual implementation chooses a simpler algorithm that optimizes for the average-case running time rather than the worst case. Empirically, we can observe that it's been the correct choice for all applications we've seen so far.
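To make the data structure and the roughly P + D lookup concrete, here is a minimal Python sketch of the scheme as described above. It is not the Loom implementation (none of Loom's code appears in this thread), and the names (Context, Triple, true_in) are invented for illustration; it assumes single-parent contexts numbered so that parents precede their descendants, and per-triple context lists kept in descending number order.

```python
class Context:
    """Contexts form a hierarchy; a parent is always numbered lower than its descendants."""
    _counter = 0

    def __init__(self, parent=None):
        Context._counter += 1
        self.number = Context._counter   # parent was created earlier, so parent.number < self.number
        self.parent = parent


class Triple:
    """A statement that records the contexts in which it has been asserted."""

    def __init__(self, s, p, o):
        self.spo = (s, p, o)
        self.contexts = []               # asserting contexts, largest number first

    def assert_in(self, context):
        if context not in self.contexts:
            self.contexts.append(context)
            self.contexts.sort(key=lambda c: c.number, reverse=True)


def true_in(triple, query_context):
    """Is `triple` asserted in `query_context` or in one of its ancestors?

    The ancestor chain (following parent links) and the triple's context
    list are both in decreasing-number order, so a two-pointer merge
    touches at most P + D contexts.
    """
    ctx, i = query_context, 0
    while ctx is not None and i < len(triple.contexts):
        asserted = triple.contexts[i]
        if asserted.number == ctx.number:
            return True                  # found an asserting context at or above the query context
        elif asserted.number > ctx.number:
            i += 1                       # number too high to be an ancestor of ctx; skip it
        else:
            ctx = ctx.parent             # ctx is not in the list; move up the hierarchy
    return False


# Tiny usage example: a triple asserted in a parent context is true in its children.
root = Context()
child = Context(parent=root)
t = Triple("s", "p", "o")
t.assert_in(root)
assert true_in(t, child) and not true_in(Triple("x", "y", "z"), child)
```

Because each loop iteration advances either the ancestor pointer or the list pointer, the worst case per query is the P + D figure mentioned above rather than P x D.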
> > If we were using a quad representation, then memory is linear in
> > the number of quads.
>
>Yes ... but presumably the number of quads is NxM not just N. In your case
>it sounds like M could grow quite large which might be a problem.

I think we might have a disconnect here regarding how all of this gets used. Suppose we have 100 contexts (it would be rare to have that many, but if we do something like use contexts to represent different time slices, then we could get this many quite easily). Suppose there are 10,000 statements. We would normally expect that most of the statements are asserted only once. A few of the statements (say 100) might get asserted more than once, in different contexts, so the number of quads might be something like 10,100. That means that P is about 1.01.

If we used quads as a way of REPLICATING statements, so they appear in lots of different contexts, then the number of quads might get much larger than N. However, that's where context inheritance comes into play. If a group of statements needs to be used in lots of different contexts, we put that "group" into its own context, and inherit it wherever it's needed. That allows the number of quads to be not much larger than N, independent of M, the number of contexts.
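Here is a second, self-contained Python toy illustrating that inheritance point (again hypothetical; `parent`, `assertions`, and `visible_in` are made-up names, not Loom APIs). Fifty statements needed in 100 contexts are asserted once in a shared "group" context that the other contexts inherit, so only 50 (statement, context) pairs are stored rather than 50 x 100.

```python
# Hypothetical toy: sharing a group of statements via inheritance
# instead of replicating them as quads in every context.

parent = {"group": None}                                     # context -> parent context
parent.update({"ctx%d" % i: "group" for i in range(100)})    # 100 contexts inherit "group"

# 50 shared statements, each asserted exactly once, in the "group" context.
assertions = {("doc%d" % i, "status", "approved"): {"group"} for i in range(50)}

def visible_in(triple, context):
    """Walk up the parent chain looking for a context where the triple was asserted."""
    while context is not None:
        if context in assertions.get(triple, set()):
            return True
        context = parent[context]
    return False

# Every statement is visible in all 100 inheriting contexts...
assert all(visible_in(t, c) for t in assertions for c in parent if c != "group")
# ...yet only 50 (statement, context) pairs are stored: roughly N, not N x M.
assert sum(len(ctxs) for ctxs in assertions.values()) == 50
```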
Cheers, Bob

Received on Wednesday, 15 January 2003 15:40:11 UTC