Blending RDF graphs (was: Merging RDF graphs)

(Note: I changed the subject of this thread following Pat Hayes' 
suggestion [1])

After some on-list and off-list discussions, the problem is becoming 
clearer. I'll try to state it again, along with some ideas for a 
solution. Any comments, criticism, and suggestions to collaborate are 
highly welcome ;-)

# Blending
## Definition
Given two RDF graphs g1 and g2, blending refers to the pairwise renaming 
of resources (blank nodes or URI nodes) r1 in g1 and r2 in g2 such that 
one resource name (which may be r1, r2, or any other URI) is used 
instead of r1 and r2. The nodes r1 and r2 are said to be blended
(ideas for a better term, anyone?). The graphs resulting from the
renaming are then merged (avoiding accidental blank node
identification, see RDF 1.1 Semantics [2]). The blending function b
maps the input graphs to the set of all possible output graphs.
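
To make this concrete, here is a minimal sketch in Python (using
rdflib). The function names and the representation of a blending
solution as two renaming maps are mine, not part of any proposal;
blank nodes are assumed to have been standardized apart beforehand,
so the triple-set union below is a proper merge:

    from rdflib import Graph

    def rename(g, mapping):
        # Return a copy of g with every node replaced according to
        # `mapping`; nodes not in the mapping are kept as-is.
        out = Graph()
        for s, p, o in g:
            out.add((mapping.get(s, s), mapping.get(p, p),
                     mapping.get(o, o)))
        return out

    def blend(g1, g2, m1, m2):
        # One blending solution: apply the renamings, then merge.
        # Assumes the blank nodes of g1 and g2 are already disjoint.
        return rename(g1, m1) + rename(g2, m2)

For the taxi example quoted below, a solution blending ex1:ride1
with ex2:myRide would pass a map m1 sending the URIRef of ex1:ride1
to that of ex2:myRide, and an empty m2.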

## Discussion
There are many ways to blend two graphs, and no single way is
inherently correct or incorrect.
We require this operation in order to combine two data structures
held by mutually independent agents that have certain expectations
about the blending solution. These expectations divide the set of
possible outcomes at least into acceptable and unacceptable ones,
possibly even into a preference ranking. I'll try to formulate that
as well in the following.


# Node blendability
## Rationale
One way of constraining the result is to restrict which nodes can be 
blended. So far, we have identified three types: variable nodes, 
constant nodes, and unblendable nodes. Each node in an RDF graph has 
exactly one of the three types. A variable node may be blended and its 
name may be changed. A constant node may also be blended, but its name 
is not allowed to change. An unblendable node must not be blended.

One may note that blending g1 and g2 is equivalent to merging them if no 
variable nodes are present in either graph.

It is currently unclear how the constant/variable/unblendable property 
should be expressed. My intuition is that in most practical use
cases, most nodes need to be unblendable (ontology/TBox resources,
properties in general), so it would be more economical to assume
nodes to be unblendable by default. Variable and constant nodes then
have to be annotated as such explicitly.
Maybe it would be an option to make such declarations by URI prefix,
so that a complete vocabulary can be declared unblendable with a
single statement. (Opinions on that?)

## Definition
Given two graphs g1 and g2, a blendability-aware blending function
is a blending function for g1 and g2 that only renames resources
that have been declared as variable with respect to blending. The
declaration is made by specifying a triple "<graph>
bl:containsConstant <resource> ." or "<graph> bl:containsVariable
<resource> ." in an additional graph passed to the
blendability-aware blending function. Such a graph is called a
'blendability graph'.
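
For illustration, a blendability graph for the taxi example quoted
below might look like this (the ex: and bl: prefixes are made up
here; only the two bl: properties are taken from the definition
above). Everything not listed - the taxi: vocabulary,
s:GeoCoordinates, the literals - stays unblendable by default:

    ex:blendability {
        ex1:p1g-data bl:containsVariable ex1:ride1 .
        ex1:p1g-data bl:containsVariable ex1:pickup1 .
        ex1:p1g-data bl:containsConstant ex1:p1 .
        ex2:p2g-data bl:containsVariable ex2:myRide .
        ex2:p2g-data bl:containsVariable ex2:myPickupLocation .
        ex2:p2g-data bl:containsConstant ex2:p2 .
    }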


# Constraints on the blending solution
## Rationale
As explained earlier, there may be multiple parties that have 
expectations for the blending solution. So far, we have talked about the 
originators of the data. There may also be one or more intended 
consumers of the data. Nothing prevents the expectations from being
mutually incompatible (meaning that no graph satisfies them all), or
even from being individually unsatisfiable. It would of course be
beneficial if such cases could be detected automatically, but for
now, let's just note the possibility. Now, if the information the
agents provide, taken as a whole, contains spurious or conflicting
information, or if information is missing (taking their respective
expectations as a definition of what is acceptable), the blending
solution cannot meet the expectations. Nevertheless, it has to be
computed in order to take the next step: to "repair" (better word,
please) one or both graphs and try again.

It would be nice to use constraints for the following goals during the 
calculation of the blending solution:
1. pruning wholly unacceptable solutions
2. ranking unacceptable solutions by badness
3. ranking acceptable-looking, but incomplete solutions ("so far so 
good")

Constraints should allow us to identify an acceptable result, or to
select the best among a number of unacceptable ones, and to report
results to users so that the underlying graphs can be repaired.

However, if the expectations define what an acceptable result is,
then anything else, however close, is unacceptable; and in the
absence of acceptable results we are left with unacceptable ones,
some of which may be completely bogus while others are almost
acceptable. So, the three goals above cannot be cleanly separated,
and the best we can hope for is a reasonable ranking of unacceptable
solutions.

For the conceptualization, we go with a simple SHACL-based approach. 
Other systems (SPARQL, ShEx) are definitely possible as well.

## Definition
The constraints-aware blending function is a blending function for
two RDF graphs g1 and g2 that takes optional graph names s1,...,sN
denoting graphs which contain SHACL shapes constraining the result.
Each blending solution is validated against s1,...,sN; the result of
the function is a mapping of the blending solutions to their
respective SHACL validation results, indexed by shapes graph name
(s1,...,sN). A blending solution is called accepted with respect to
a shapes graph s if validating it against s produces no SHACL
validation results. A blending solution is called accepted if it is
accepted with respect to all specified shapes graphs.
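
As a sketch of how the validation step could work in practice, here
is the per-solution loop in Python using rdflib and pySHACL
(shapes_graphs, a dict from the names s1,...,sN to the parsed shapes
graphs, and both function names are my own):

    from pyshacl import validate

    def validate_solution(solution_graph, shapes_graphs):
        # Validate one blending solution against each shapes graph;
        # returns {shapes graph name: (conforms, results graph)}.
        report = {}
        for name, shapes in shapes_graphs.items():
            conforms, results, _ = validate(solution_graph,
                                            shacl_graph=shapes)
            report[name] = (conforms, results)
        return report

    def is_accepted(report):
        # Accepted iff accepted with respect to all shapes graphs.
        return all(conforms for conforms, _ in report.values())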

## Discussion
The main issue in this step seems to be the pruning and ordering of
unacceptable results. Blending functions may choose to drop blending
solutions that are deemed too bad - but where should they draw the
line between still reporting one unacceptable blending solution and
dropping another?

With respect to one shapes graph, we could define that validation
result graphs are ranked by maximum severity (1 sh:Violation is
worse than 1000 sh:Warnings), and within those severity classes by
the number of validation results at that severity (2 Violations and
1 Warning is worse than 1 Violation and 1000 Warnings).
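
That ordering could be expressed as a sort key over a validation
results graph, e.g. as follows in Python (SEVERITY_RANK and rank_key
are my names; sh:resultSeverity and the three severity classes are
standard SHACL):

    from rdflib.namespace import SH

    # sh:Violation is worst, then sh:Warning, then sh:Info.
    SEVERITY_RANK = {SH.Violation: 2, SH.Warning: 1, SH.Info: 0}

    def rank_key(results_graph):
        # Sort key: (maximum severity, number of results at that
        # severity); lower keys rank better. An empty results graph,
        # i.e. a conforming solution, gets the best possible key.
        severities = [SEVERITY_RANK.get(o, 0)
                      for o in results_graph.objects(
                          None, SH.resultSeverity)]
        if not severities:
            return (-1, 0)
        worst = max(severities)
        return (worst, severities.count(worst))

Sorting the blending solutions by rank_key of their respective
results graphs then yields the ranking.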

I suspect that the validation results should allow us to prune at
least those bad blending solutions whose validation results are
supersets of other solutions' validation results: if solution A has
validation results {R1} and solution B has validation results
{R1, R2}, we can say that solution B is a deterioration of solution
A and should not be reported.
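
A sketch of that dominance-based pruning (how individual validation
results are made comparable across solutions - e.g. as (focus node,
source shape, result path) tuples - is glossed over and assumed to
have happened already):

    def prune_dominated(solutions):
        # `solutions` is a list of (solution, results) pairs, where
        # `results` is a frozenset of comparable validation results.
        # Drop every solution whose result set is a strict superset
        # of another solution's result set.
        kept = []
        for sol, res in solutions:
            if not any(other < res for _, other in solutions):
                kept.append((sol, res))
        return kept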

In addition to constraint-based pruning and ranking, it is of
course possible to perform reasoning-based pruning and check the
blending solutions for consistency under appropriate logics (RDFS,
OWL, ...). However, I am not sure how useful this approach is if no
solution is found that is consistent according to the logic.


# Graph blending algorithms
Let's assume for this section that we are blending two graphs g1 and g2, 
and we know we have variable nodes var(g1) and constant nodes const(g1) 
in g1, and var(g2) and const(g2) in g2.

## Complete
The complete algorithm consists of enumerating all possible
solutions. For each n in var(g1), there are the following choices:
* n is not blended at all
* n is blended with some m in var(g2)
* n is blended with some m in const(g2)
Symmetrically, the same holds for each n in var(g2). A candidate
solution combines one such choice per variable node; removal of
duplicate solutions (distinct renamings may yield the same merged
graph) yields the set of solutions.
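
A sketch of the enumeration for var(g1) in Python, with var1, var2
and const2 being sets of nodes (the symmetric choices for var(g2),
and blended names other than m itself, are left out; the yielded
maps plug into the blend() sketch above):

    from itertools import product

    def renamings_g1(var1, var2, const2):
        # For each n in var(g1): leave n alone, or blend it with
        # some m in var(g2) or const(g2), here simply by renaming
        # n to m. A candidate solution picks one choice per node.
        options = [[(n, n)] + [(n, m) for m in var2 | const2]
                   for n in var1]
        for combo in product(*options):
            yield dict(combo)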

## Heuristic
We may interpret blending as a heuristic search and apply an
A*-like algorithm to find good results, estimating the distance to
the goal by evaluating the constraints and the distance from the
start by the number of node blendings.
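
A generic sketch of that search in Python (all four callables are
placeholders for the components discussed above; a state would be a
partial blending solution and must be hashable; badness must return
a number, e.g. a score derived from the constraint evaluation):

    import heapq
    from itertools import count

    def a_star_blend(start, neighbours, badness, is_goal):
        # f(state) = g + h: g counts the node blendings applied so
        # far, h ('badness') estimates the remaining distance to an
        # acceptable solution.
        tie = count()  # tie-breaker so states are never compared
        frontier = [(badness(start), next(tie), 0, start)]
        seen = set()
        while frontier:
            _, _, g, state = heapq.heappop(frontier)
            if is_goal(state):
                return state
            if state in seen:
                continue
            seen.add(state)
            for nxt in neighbours(state):  # one more node blending
                heapq.heappush(frontier,
                               (g + 1 + badness(nxt), next(tie),
                                g + 1, nxt))
        return None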

## Knowledge-based
Exploiting knowledge about the actual structures in g1 and g2, the 
heuristic approach could be enhanced to apply more appropriate distance 
estimates and select more promising paths first.



Links:
1. https://lists.w3.org/Archives/Public/semantic-web/2017Nov/0067.html
2. https://www.w3.org/TR/rdf11-mt/




On 2017-11-16 19:37, Florian Kleedorfer wrote:
> Hi,
> 
> As you may recall, I've asked this list about messaging/negotiating
> using linked data[1], and it led me and my colleagues to a concept of
> RDF-based agreements[2]. (Thanks again for your input!)
> 
> Now it's time for us to define how the negotiating parties come up
> with the data (set of RDF triples) that they agree on.
> 
> One idea we have is that both participants (P1, P2) specify 'goals',
> each consisting of a SHACL shapes graph and a data graph. Two goals
> can be tested for compatibility by merging the two data graphs and
> then testing if the resulting graph is valid given the shapes defined
> by both goals. The problem we face with this approach is merging the
> two data graphs. Consider the example in this gist:
> 
> https://gist.github.com/fkleedorfer/81b32e3235105a4a4743e9fa76ba9332
> 
> In that example, we would want to merge these two graphs:
> 
> ex1:p1g-data {
>     ex1:ride1 a taxi:Ride .
>     ex1:ride1 taxi:hasDriver ex1:p1 .
>     ex1:ride1 taxi:hasPickupLocation ex1:pickup1 .
> }
> 
> and
> 
> ex2:p2g-data {
>     ex2:myRide a taxi:Ride.
>     ex2:myRide taxi:hasClient ex2:p2 .
>     ex2:myRide taxi:hasPickupLocation ex2:myPickupLocation .
>     ex2:myPickupLocation a s:GeoCoordinates ;
>         s:latitude   "48.213814" ;
>         s:longitude  "16.340870" .
> }
> 
> We want that merge to happen in such a way that the shapes in
> ex1:p1g-shapes and ex2:p2g-shapes can be evaluated, which will require
> merging some of the nodes. (In the example, we'll find out that a
> taxi:hasPickupTime property is missing. One of the participants is
> then to add that property, and the whole structure becomes valid
> according to both goals' shapes.)
> 
> The question is how exactly to achieve that merge. Intuitively, the
> result in this case should be
> 
>    :mergedNode1 a taxi:Ride .
>    :mergedNode1 taxi:hasDriver ex1:p1 . # note: p1 links her own
> identifier to the structure
>    :mergedNode1 taxi:hasPickupLocation :mergedNode2 .
>    :mergedNode1 a taxi:Ride.
>    :mergedNode1 taxi:hasClient ex2:p2 .
>    :mergedNode1 taxi:hasPickupLocation :mergedNode2 .
>    :mergedNode2 a s:GeoCoordinates ;
>         s:latitude   "48.213814" ;
>         s:longitude  "16.340870" .
> 
> and after removal of duplicate triples
> 
>    :mergedNode1 a taxi:Ride .
>    :mergedNode1 taxi:hasDriver ex1:p1 .
>    :mergedNode1 taxi:hasClient ex2:p2 .
>    :mergedNode1 taxi:hasPickupLocation :mergedNode2 .
>    :mergedNode2 a s:GeoCoordinates ;
>         s:latitude   "48.213814" ;
>         s:longitude  "16.340870" .
> 
> (I currently don't care about the URIs of the mergedNodes - just
> something one can mint when doing the merge)
> 
> 
> Do you have any hints on how to perform such a merge reliably, also
> with more complex structures?
> 
> Thank you very much in advance!
> 
> Best regards,
> Florian
> 
> 
> Links:
> 1. Negotiations discussion thread:
> https://lists.w3.org/Archives/Public/semantic-web/2017Jul/0004.html
> 2. Agreements paper: http://ceur-ws.org/Vol-1934/contribution-07.pdf
