- From: Bob MacGregor <macgregor@ISI.EDU>
- Date: Wed, 15 Jan 2003 12:29:22 -0800
- To: Dave Reynolds <der@hplb.hpl.hp.com>
- Cc: www-rdf-interest@w3.org, tar@ISI.EDU, hans@ISI.EDU
Warning: This answer to Dave's comment makes reference to another response by Tom Russ. I couldn't easily refer to both at the same time, so readers will have to do a bit of thread jumping to follow things. Sorry.

At 05:47 PM 1/15/2003 +0000, Dave Reynolds wrote:

>Bob,
>
> > >What is the space complexity of the algorithm like?
> >
> > The contexts are organized into a hierarchy, and numbered so
> > that a parent's number is always smaller than the numbers of its
> > descendants. Some sort of data structure is needed that can quickly
> > determine, given two contexts C1 and C2, whether C1 inherits C2.
> > One could imagine that the space requirements for this would be
> > superlinear in the worst case, but they should be linear (in the number
> > of contexts) for the average case.
> >
> > Each statement/triple points to a list of contexts where it has
> > appeared within an assertion. If a triple [S P O] is asserted to be true in
> > contexts C1, C2, and C3, then there is a triple object that points
> > to a list containing C1, C2, and C3 (ordered according to the
> > numbers they are assigned, largest number first).
>
>So you have structure sharing in the sense that there is only one triple
>object which can appear in multiple contexts. In the case where there are
>lots of triples (N) and lots of contexts (M) then the worst case cost is
>dominated by those membership lists which would be worst case O(NxM).
>However, I imagine the lists also structure share, and that the ordering
>helps with that, so the average case will be *much* better than that.
>
>Presumably there is also an additional index structure of some sort to allow
>you to get the constant-time truth value lookup - that also needs to have
>good scaling properties.

Tom Russ is saying that we don't have constant-time lookup. Strictly speaking, he is correct. Given a context, our algorithm runs up the parent links in the hierarchy looking for a match with an ancestor context. This part of the code runs so fast that optimizing out the non-constant factor isn't worth it in practice. If our context trees got very deep, then we would need to redo this aspect of the code.

However, we have lost sight of the two numbers that are most relevant. One number is P, the average number of contexts per statement. The other is D, the average depth of the context tree. P is typically a small number between 1 and 2, usually much closer to 1 than 2. D is usually a number between 2 and 10, usually much closer to 2. The algorithm we implement in Loom runs in time something like P + D to evaluate the truth of a statement in a context. If D became large (say > 20), we might choose to optimize away the lookup in the context tree, yielding a time of P. However, the constant factor for a single lookup would increase by quite a bit. So, our actual implementation chooses a simpler algorithm that optimizes for the average-case running time rather than the worst case. Empirically, we can observe that it's been the correct choice for all applications we've seen so far.
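To make the data structure and the roughly P + D lookup concrete, here is a minimal Python sketch of the scheme as described above. It is not the Loom implementation (none of Loom's code appears in this thread), and the names (Context, Triple, true_in) are invented for illustration; it assumes single-parent contexts numbered so that parents precede their descendants, and per-triple context lists kept in descending number order.

```python
class Context:
    """Contexts form a hierarchy; a parent is always numbered lower than its descendants."""
    _counter = 0

    def __init__(self, parent=None):
        Context._counter += 1
        self.number = Context._counter   # parent was created earlier, so parent.number < self.number
        self.parent = parent


class Triple:
    """A statement that records the contexts in which it has been asserted."""

    def __init__(self, s, p, o):
        self.spo = (s, p, o)
        self.contexts = []               # asserting contexts, largest number first

    def assert_in(self, context):
        if context not in self.contexts:
            self.contexts.append(context)
            self.contexts.sort(key=lambda c: c.number, reverse=True)


def true_in(triple, query_context):
    """Is `triple` asserted in `query_context` or in one of its ancestors?

    The ancestor chain (following parent links) and the triple's context
    list are both in decreasing-number order, so a two-pointer merge
    touches at most P + D contexts.
    """
    ctx, i = query_context, 0
    while ctx is not None and i < len(triple.contexts):
        asserted = triple.contexts[i]
        if asserted.number == ctx.number:
            return True                  # found an asserting context at or above the query context
        elif asserted.number > ctx.number:
            i += 1                       # number too high to be an ancestor of ctx; skip it
        else:
            ctx = ctx.parent             # ctx is not in the list; move up the hierarchy
    return False


# Tiny usage example: a triple asserted in a parent context is true in its children.
root = Context()
child = Context(parent=root)
t = Triple("s", "p", "o")
t.assert_in(root)
assert true_in(t, child) and not true_in(Triple("x", "y", "z"), child)
```

Because each loop iteration advances either the ancestor pointer or the list pointer, the worst case per query is the P + D figure mentioned above rather than P x D.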
> > If we were using a quad representation, then memory is linear in
> > the number of quads.
>
>Yes ... but presumably the number of quads is NxM not just N. In your case
>it sounds like M could grow quite large which might be a problem.

I think we might have a disconnect here regarding how all of this gets used. Suppose we have 100 contexts (it would be rare to have that many, but if we do something like use contexts to represent different time slices, then we could get this many quite easily). Suppose there are 10,000 statements. We would normally expect that most of the statements are asserted only once. A few of the statements (say 100) might get asserted more than once, in different contexts, so the number of quads might be something like 10,100. That means that P is about 1.01.

If we used quads as a way of REPLICATING statements, so they appear in lots of different contexts, then the number of quads might get much larger than N. However, that's where context inheritance comes into play. If a group of statements needs to be used in lots of different contexts, we put that "group" into its own context, and inherit it wherever it's needed. That allows the number of quads to be not much larger than N, independent of M, the number of contexts.
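Here is a second, self-contained Python toy illustrating that inheritance point (again hypothetical; `parent`, `assertions`, and `visible_in` are made-up names, not Loom APIs). Fifty statements needed in 100 contexts are asserted once in a shared "group" context that the other contexts inherit, so only 50 (statement, context) pairs are stored rather than 50 x 100.

```python
# Hypothetical toy: sharing a group of statements via inheritance
# instead of replicating them as quads in every context.

parent = {"group": None}                                     # context -> parent context
parent.update({"ctx%d" % i: "group" for i in range(100)})    # 100 contexts inherit "group"

# 50 shared statements, each asserted exactly once, in the "group" context.
assertions = {("doc%d" % i, "status", "approved"): {"group"} for i in range(50)}

def visible_in(triple, context):
    """Walk up the parent chain looking for a context where the triple was asserted."""
    while context is not None:
        if context in assertions.get(triple, set()):
            return True
        context = parent[context]
    return False

# Every statement is visible in all 100 inheriting contexts...
assert all(visible_in(t, c) for t in assertions for c in parent if c != "group")
# ...yet only 50 (statement, context) pairs are stored: roughly N, not N x M.
assert sum(len(ctxs) for ctxs in assertions.values()) == 50
```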
Cheers, Bob

Received on Wednesday, 15 January 2003 15:40:11 UTC