Re: RDF-Entailment: Remove duplicate anonymous resources - looking for an algorithm

Reto Bachmann-Gmür wrote:
> sound reasonable, still quite insecure about terminology, would this be 
> the same to say "... g rdf-entails g1 and g1 rdf-entails g"?

Yeah, I think so.

> I have been looking at the concept of "rdf molecules" [1] which seems to 
> be very close to MSG.

Looks very similar.  I'll have to read it more carefully later.

> The idea sounds good, it seems to reduce the problem to 
> subgraph-isomorphism (check if b is a subgraph of (g-b)) for which I 
> guess an algorithm similar to the one described by Jeremy Carroll for 
> graph-isomorphism [2] could be used.

 From a quick skim, it looks that that's only applicable for 
(graph-graph) isomorphism rather than entailment (in this case whether 
there is a subgraph isomorphic to another graph).  You could use any 
SPARQL implementation, really.  The blank nodes are the variables.

> However I don't see how your algorithm would successfully be applied to
> g:
> [
>  foaf:knows [ foaf:name "Jo];
>  foaf:knows [ foaf:name "Jo]
> ].

Right, I was wrong to only allow b to be a MSG.  So it's more 
complicated.  Here's a second try: Find all subgraphs b of g such that 
(g-b) entails b, which is to say that there is a mapping M from blank 
nodes *only* in b to nodes in (g-b) such that (g-b) contains M(b).  The 
difference is that if b contains a blank node that is also in (g-b), 
then it's effectively a constant when comparing b to (g-b).

Hopefully that's right.  b doesn't have to be a MSG, but some other 
constraints fall out of that.  b still has to have all of the statements 
about some blank node so that M does some mapping.  And every statement 
in b must contain a blank node only in b.  (A statement can't be in both 
b and (g-b), so M has to do some mapping for each statement so that M(b) 
might possibly be a subset of (g-b).)

Before there were at most N b's to look at (at most one MSG per blank 
node, some sharing MSGs), and now there are possibly 2^N (each b 
contains all of the statements, or not all, of each blank node).

Also the only b's worth considering are connected graphs (every node is 
reachable from every other node), since otherwise b is just the union of 
two smaller graphs which could be considered on their own.  I think that 
this means that all of the b's worth considering are subgraphs of MSGs, 
and further only need to be entailed by the MSG containing b (and not 
all of (g-b)), which narrows down things a lot.

So a new algorithm would be to find all MSGs b, make b lean by checking 
all 2^N relevant subgraphs (except the N here is just the blank nodes in 
b; or something more efficient), and then checking if this new b is 
entailed by (g-b).  If this is right.

- Joshua Tauberer

** Nothing Unreal Exists **

Received on Monday, 21 November 2005 16:40:35 UTC