Re: RDF-Entailment: Remove duplicate anonymous resources - looking for an algorithm from Reto Bachmann-Gmür on 2005-11-23 (semantic-web@w3.org from November 2005)

From: Reto Bachmann-Gmür <reto@gmuer.ch>
Date: Wed, 23 Nov 2005 15:43:56 +0100
To: Joshua Tauberer <tauberer@for.net>
CC: yzqu@seu.edu.cn, semantic-web at W3C <semantic-web@w3c.org>
Message-ID: <4384802C.9020608@gmuer.ch>
I tried to refine the node based approach I was using, here my current 
draft:

makeLean(Graph g) {
 Set knownNotImplying = a new set of Node[2];
 take a pair of nodes n1,n2 where n2 is a bnode and which has not been taken before till no more pairs are possible removeNodeifImplied(n1, n2, knownNotImplying);
}


booelan removeNodeifImplied(Node n1, Node n2,  Set knownNotImplying) {
 List history1 = a new empty list;
 List history2 = a new empty list;
 Set conditionalImplications = a new set of Node[2];
 if (implies(n1, n2, history1, history2, knownNotImplying, conditionalImplications)) {
  for every conditionalImplication delete all statements containing an implied node
  return true;
 }
 return false;
 
}

boolean implies(Node n1, Node n2, List history1, List history2, Set knownNotImplying, Set conditionalImplications) {
 if (n1 == n2) return true; //== is true if they have the same node id;
 if (knownNotImplying.contains({n1, n2}) return false;
 if (conditionalImplications.contains({n1, n2}) return true;
 if history1 contains n2 return false;
 if there is a position pos in history2 so that history2[pos] == n2 {
  if history1[pos] == n1 conditionalImplications.add({n1, n2}) and return true; 
  if removeNodeifImplied(history1[pos], n1, knownNotImplying) conditionalImplications.add({n1, n2}) and return true;
  otherwise return false; //do not add to knownNotImplying as there may be another path/history leading to these nodes
 }
 Set p1 = the set of statements wich contain n1;
 Set p2 = the set of statements wich contain n2;
 for every statement in p2 with no other anonymous node than n2 check if there is an identical statement in p1, otherwise add {n1, n2} to knownNotImpying an return false;
 history1.add(n1);
 history2.add(n2);
 for every other statement (n2, p, o2) or (o2, p, n2) in p2 check if there is a statement (n1, p, o1) respectively (o1, p, n1) in p1 with the same predicate and for which implies(o1, o2, history1.clone(),history2.clone(), knownNotImplying, conditionalImplications) is true, otherwise add {n1, n2} to knownNotImpying an return false;
 conditionalImplications.add({n1, n2}) and return true;
}


I hope this is correct, and I hope that there are more efficient methods! 

I tried the algoritm with the tricky case of  anonymous-nodes circles:

g:
a r b
b r a
c r d
d r c

makeLean(g)
 removeNodeifImplied(a, b)
  implies(a,b, {}, {})
   implies(b,a,{a}, {b},..)
    implies(a,b, {a, b}, {b, a}, ...)
     found that history1 contains n2 - return false;
    return false;
   return false;
  return false;
 return false;
 
 removeNodeifImplied(a, c)
  implies(a,c, {}, {})
   implies(b,d,{a}, {c},..)
    implies(a,c, {a,b}, {c,d},...)
     found that c(n2) == history2[0] and that history1[0] = a(n1);
     conditionalImplications.add({a, c});
     return true;
    conditionalImplications.add({b, d});
    return true;
   return true;
  return true;
  delete 2 statements;
 return true;
 

Maybe this algorithm could be combined with the subgraph approach. 
Certainly things get faster when (I)FPs are considered at the same time, 
one approach would be to give to all functionally grounded nodes a URI  
before starting, which is removed after completion.


reto


Joshua Tauberer schrieb:

> Reto Bachmann-Gmür wrote:
>
>> How do I check the 2^N subgraphs when making b lean?
>
>
> So just to go back a step (to the beginning, really)-  When checking 
> if, for g=a+b, a entails b, there must be a mapping M from blank nodes 
> only in b (not in a) to nodes such that M 'applied' to b is a subset 
> of a. Let's call those blank nodes variables.  Further, no statement 
> in b can have no variables in it.  (If there was such a statement s in 
> b, M(b) would still contain s.  a and b don't overlap, so s couldn't 
> be in a, so M(b) couldn't be a subset of a.)
>
> From this, we can characterize every b by its variables.  If n is a 
> variable, then every statement mentioning n is in b.  And b contains 
> only those statements in g that mention a variable.  Rather than 
> choosing b, we can choose a set of variables and from that generate b.
>
> So rather than considering all 2^N subgraphs of g where N is the 
> number of statements (each statement can be in or out of the 
> subgraph), one can just consider the 2^N possible choices of variables 
> where N is the number of blank nodes (each blank node can be in or out 
> of the choice of variables).  Then for each of the 2^N permutations of 
> blank nodes, choose the subgraph b that has just the statements that 
> mention a variable.
>
> I think that answers your question, but I'll continue rambling a proof 
> about how MSGs get involved:
>
> We can make a further restriction on b that it can't be split into two 
> parts b1 and b2 where each variable is in either b1 or b2 but not 
> both.  (This is like a MSG where you can't cut off a piece that 
> contains all of the instances of some variables, without taking the 
> whole MSG.)  If b can be split that way, then a entails b just when a 
> entails b1 and b2 separately, so we don't need to check b if we check 
> b1 and b2.
>
> This gives us the result that if b is a MSG (in which case every blank 
> node in b is a variable and does not appear outside of b), then no 
> superset of b needs to be checked.  So the largest chunks that need to 
> be checked are MSGs.
>
> We still (so far) need to look at subgraphs of MSGs, which means 
> taking an MSG and looking at its 2^N subgraphs where N is the number 
> of blank nodes within that MSG.  But we still only want to look at 
> 'unsplittable' subgraphs.  If node x is connected to node z only 
> through node y, and if we take the variables to be x and z but not y, 
> then b can be split into a part containing x and a part containing z, 
> so this was a bad choice of variables.  There might be a way to 
> quickly find the unsplittable parts of a MSG using a connectivity map 
> or something... but that needs more thinking.
>
Received on Wednesday, 23 November 2005 14:44:31 UTC