# Re: RDF-Entailment: Remove duplicate anonymous resources - looking for an algorithm

```Joshua Tauberer schrieb:

> It might be useful at this point to formalize what the problem is.
> Reto said earlier a very general version of the problem: find for a
> graph g "the smallest graph g1, so that g entails g1 and g1 entails g."
>
> I think this might be too general unless we add that g1 is a subset of
> g.  (Otherwise, consider that I can write a single statement and then
> define the semantics of that statement outside of RDF to entail any
> graph.)

sound reasonable, still quite insecure about terminology, would this be
the same to say "... g rdf-entails g1 and g1 rdf-entails g"?

> So the general problem then is for graph g, find non-empty graphs a
> and b such that a union b = g and a entails b.  Then, b can be
> removed.  But splitting a graph into two parts is tricky because of
> blank nodes -- each blank node can appear in either a or b, but not
> both.  For every b, for every blank node in b, every statement in g
> mentioning that node must be in b.  (Someone named something like this
> a 'minimal self-contained graph', MSG.)

I have been looking at the concept of "rdf molecules" [1] which seems to
be very close to MSG. It distinguishs between "terminal molecules",
which are graphs in which every node is grounded (i.e. has an
unambiguous cross-model identity either because it's a literal value or
a named resource or because it is "functionally grounded" by an (i)fp)
and "contextual molecules" which contain anonymous nodes with an
identity arising from the context of the node, in a lossless
decomposition of a graph the contextual molecules contain all triples
containing their ungrounded nodes. In fact I was working on a
jena-implementation of this "rdf molecules" concept when I come across
the problem of keeping the contextual nodes lean, a bigger graph could
then be made lean by

- splitting it into molecules
- replacing contextual molecules by equivalent lean graphs
- removing duplicate molecules
- merge the molecules

Of course, this doesn't solve the problem, but its solution would have
to be applied to smaller graphs.

> Ignoring any type of inferencing and since a and b don't have any
> statements in common, a can only entail b if b is a subgraph of a but
> with one or more nodes in each statement replaced by other blank nodes
> from g -- b is just a weaker version of some part of a.  And, b must
> contain blank nodes.  (Okay, so you knew this already.)
>
> Finally, a crude algorithm would be to find all subgraphs b of g such
> that 1) every statement in b has a blank node, 2) b contains all
> statements in g mentioning blank nodes in b (so there is at most one b
> for every blank node), 3) (g-b) entails b, which is to say that there
> is a mapping M from blank nodes in b to nodes not in b such that (g-b)
> contains M(b).

The idea sounds good, it seems to reduce the problem to
subgraph-isomorphism (check if b is a subgraph of (g-b)) for which I
guess an algorithm similar to the one described by Jeremy Carroll for
graph-isomorphism [2] could be used.

However I don't see how your algorithm would successfully be applied to
g:
[
foaf:knows [ foaf:name "Jo];
foaf:knows [ foaf:name "Jo]
].

To comply with condition 2 b would have to contain all triples of g, or
am I too tired?

> That's all off the top of my head... Could be wrong.  And as for an
> efficient algorithm, I have no idea...
>
Thanks for sharing - currently I have no algorithm at all not even an
inefficient one :-(. The terribly inefficient algorithm I was playing
with had the bug that it wrongly reduced the graph
_:a foaf:knows _:b
_:b foaf:knows _:a

to

_:a foaf:knows _:a

reto

[1] http://www.ksl.stanford.edu/people/pp/papers/Ding_ISWC_2005.pdf
[2] /www.hpl.hp.com/techreports/2001/HPL-2001-293.html/
```

Received on Monday, 21 November 2005 15:32:04 UTC