Re: RDF-Entailment: Remove duplicate anonymous resources - looking for an algorithm from Reto Bachmann-Gmür on 2005-11-21 (semantic-web@w3.org from November 2005)

From: Reto Bachmann-Gmür <reto@gmuer.ch>
Date: Mon, 21 Nov 2005 19:56:30 +0100
To: Joshua Tauberer <tauberer@for.net>
CC: Andreas Andreakis <andreas.andreakis@gmx.de>, semantic-web at W3C <semantic-web@w3c.org>
Message-ID: <4382185E.9090605@gmuer.ch>

Joshua Tauberer schrieb:

>> The idea sounds good, it seems to reduce the problem to 
>> subgraph-isomorphism (check if b is a subgraph of (g-b)) for which I 
>> guess an algorithm similar to the one described by Jeremy Carroll for 
>> graph-isomorphism [2] could be used.
>
>
> From a quick skim, it looks that that's only applicable for 
> (graph-graph) isomorphism rather than entailment (in this case whether 
> there is a subgraph isomorphic to another graph).  

I was thinking at the possibility to assign up to four hashes per triple 
in the source graph (one based on both named/literal subject and object, 
two making on the two anonymous and another one with both anonymized), 
reduce the source graph to statements available in the target graph, and 
then refine hashes till the size of the source graph is near the size of 
the target graph.

> You could use any SPARQL implementation, really.  The blank nodes are 
> the variables.

really? I though real-world sparql implementations only support 
tree-style queries, if not there's indeed existing and presumably 
optimized code for subgraph isomorphism.

>> However I don't see how your algorithm would successfully be applied to
>> g:
>> [
>>  foaf:knows [ foaf:name "Jo];
>>  foaf:knows [ foaf:name "Jo]
>> ].
>
>
> Right, I was wrong to only allow b to be a MSG.  So it's more 
> complicated.  Here's a second try: Find all subgraphs b of g such that 
> (g-b) entails b, which is to say that there is a mapping M from blank 
> nodes *only* in b to nodes in (g-b) such that (g-b) contains M(b).  
> The difference is that if b contains a blank node that is also in 
> (g-b), then it's effectively a constant when comparing b to (g-b).
>
> Hopefully that's right.  b doesn't have to be a MSG, but some other 
> constraints fall out of that.  b still has to have all of the 
> statements about some blank node so that M does some mapping.  And 
> every statement in b must contain a blank node only in b.  (A 
> statement can't be in both b and (g-b), so M has to do some mapping 
> for each statement so that M(b) might possibly be a subset of (g-b).)
>
> Before there were at most N b's to look at (at most one MSG per blank 
> node, some sharing MSGs), and now there are possibly 2^N (each b 
> contains all of the statements, or not all, of each blank node).

> Also the only b's worth considering are connected graphs (every node 
> is reachable from every other node), since otherwise b is just the 
> union of two smaller graphs which could be considered on their own.  I 
> think that this means that all of the b's worth considering are 
> subgraphs of MSGs, and further only need to be entailed by the MSG 
> containing b (and not all of (g-b)), which narrows down things a lot.

Realize, that the approach I presented to use molecules to reduce the 
size of the graphs that have to be made lean doesn't work, as it doesn't 
take into account anonymous resources which are "duplicate" of named or 
literal resources. I guess the same problem applies for MSG, as they 
only include statements wit blank nodes, so they are the same as 
"maximum contextual molecules", except MSGs don't treat functionally 
grounded nodes particularly.

> So a new algorithm would be to find all MSGs b, make b lean by 
> checking all 2^N relevant subgraphs (except the N here is just the 
> blank nodes in b; or something more efficient), and then checking if 
> this new b is entailed by (g-b).  If this is right.

OK, so you're checking the lean b against the whole (g-b), we could 
check again the (connected subgraph of g containing b) - g, but clearly 
the connections may not be limited to those through anonymous nodes - 
this may not be a big advantage, as probably most graphs are connected 
(e.g. by the rdf:type properties to a narrow set of classes). How do I 
check the 2^N subgraphs when making b lean?

reto

reto

Received on Monday, 21 November 2005 18:56:51 UTC