Re: RDF-Entailment: Remove duplicate anonymous resources - looking for an algorithm

Joshua Tauberer schrieb:

> It might be useful at this point to formalize what the problem is.  
> Reto said earlier a very general version of the problem: find for a 
> graph g "the smallest graph g1, so that g entails g1 and g1 entails g."
>
> I think this might be too general unless we add that g1 is a subset of 
> g.  (Otherwise, consider that I can write a single statement and then 
> define the semantics of that statement outside of RDF to entail any 
> graph.)

sound reasonable, still quite insecure about terminology, would this be 
the same to say "... g rdf-entails g1 and g1 rdf-entails g"?

> So the general problem then is for graph g, find non-empty graphs a 
> and b such that a union b = g and a entails b.  Then, b can be 
> removed.  But splitting a graph into two parts is tricky because of 
> blank nodes -- each blank node can appear in either a or b, but not 
> both.  For every b, for every blank node in b, every statement in g 
> mentioning that node must be in b.  (Someone named something like this 
> a 'minimal self-contained graph', MSG.)

I have been looking at the concept of "rdf molecules" [1] which seems to 
be very close to MSG. It distinguishs between "terminal molecules", 
which are graphs in which every node is grounded (i.e. has an 
unambiguous cross-model identity either because it's a literal value or 
a named resource or because it is "functionally grounded" by an (i)fp) 
and "contextual molecules" which contain anonymous nodes with an 
identity arising from the context of the node, in a lossless 
decomposition of a graph the contextual molecules contain all triples 
containing their ungrounded nodes. In fact I was working on a 
jena-implementation of this "rdf molecules" concept when I come across 
the problem of keeping the contextual nodes lean, a bigger graph could 
then be made lean by

- splitting it into molecules
- replacing contextual molecules by equivalent lean graphs
- removing duplicate molecules
- merge the molecules

Of course, this doesn't solve the problem, but its solution would have 
to be applied to smaller graphs.

> Ignoring any type of inferencing and since a and b don't have any 
> statements in common, a can only entail b if b is a subgraph of a but 
> with one or more nodes in each statement replaced by other blank nodes 
> from g -- b is just a weaker version of some part of a.  And, b must 
> contain blank nodes.  (Okay, so you knew this already.)
>
> Finally, a crude algorithm would be to find all subgraphs b of g such 
> that 1) every statement in b has a blank node, 2) b contains all 
> statements in g mentioning blank nodes in b (so there is at most one b 
> for every blank node), 3) (g-b) entails b, which is to say that there 
> is a mapping M from blank nodes in b to nodes not in b such that (g-b) 
> contains M(b).

The idea sounds good, it seems to reduce the problem to 
subgraph-isomorphism (check if b is a subgraph of (g-b)) for which I 
guess an algorithm similar to the one described by Jeremy Carroll for 
graph-isomorphism [2] could be used.

However I don't see how your algorithm would successfully be applied to
g:
[
  foaf:knows [ foaf:name "Jo];
  foaf:knows [ foaf:name "Jo]
].

To comply with condition 2 b would have to contain all triples of g, or 
am I too tired?

> That's all off the top of my head... Could be wrong.  And as for an 
> efficient algorithm, I have no idea...
>
Thanks for sharing - currently I have no algorithm at all not even an 
inefficient one :-(. The terribly inefficient algorithm I was playing 
with had the bug that it wrongly reduced the graph
_:a foaf:knows _:b
_:b foaf:knows _:a

to

_:a foaf:knows _:a


reto


[1] http://www.ksl.stanford.edu/people/pp/papers/Ding_ISWC_2005.pdf
[2] /www.hpl.hp.com/techreports/2001/HPL-2001-293.html/

Received on Monday, 21 November 2005 15:32:04 UTC