Re: Blending RDF graphs

Hi,



Am 2017-11-21 16:02, schrieb Martynas Jusevičius:
> Hi again,
> 
> if you only need to "blend" 2 graphs at a time, with one typed
> resource in each, I guess you can join on type, again even with
> SPARQL. Can there be other commonality than type? What if you have
> multiple resources with the same type?

The individual graphs could be more complex, for example describing a 
goods transport including packaging, content, pickup time/location, and 
delivery time/location.
The setting you are referring to is what we would call matching. We 
assume for the blending of goal graphs that matching has already 
occurred: the two agents have already identified each other and now want 
to describe their transaction.
There aren't millions of graphs, there are only a handful (all the goals 
specified by both agents, in most cases probably one on each side).

> 
> This seems like a very limited use case to me, which does not scale. I
> was thinking more about a triplestore with thousands or millions of
> graphs where any number of rides might match drivers to passengers. In
> that case I don't think joining only on type can work.
> 
> I'm also quite sure (optimized) SPARQL will be faster than any
> rules/inference, and probably also custom algorithms, unless you use
> some specialized storage beyond triples.

I suspect the same, except that I am not (yet) convinced that I want to 
use SPARQL in the browser - which may be one platform we'd like to use.

> 
> On Tue, Nov 21, 2017 at 3:48 PM, Florian Kleedorfer
> <florian.kleedorfer@austria.fm> wrote:
> 
>> Hi Martynas,
>> 
>> I am not sure whether I made the use case seem too narrow here. The taxi
>> case is meant as an example, but the two graphs describing the
>> agent's goals could contain any RDF. The commonality of all use
>> cases is that a goal by itself is always incomplete and needs
>> another agent's mutually compatible goal in order to be completed.
>> 
>> Is my problem really that I have nothing to join on? In the example,
>> there are the TBox resources that are shared (e.g., taxi:Ride) that
>> could be used for joining. I think in the case of the example below,
>> the CONSTRUCT approach is possible, as suggested and shown by Axel
>> Polleres off-list a few days ago as a pattern-based solution to this
>> problem. I would classify it as a knowledge-based 'blending'
>> approach because the target structure as well as the structure of
>> the two graphs we want to 'blend' need to be known beforehand, which
>> allows for substantial optimizations.
>> 
>> A downside of the pattern-based approach, I suspect, is that it can
>> only handle expected cases. Or do you see a way to generate such a
>> SPARQL statement from the data ad-hoc?
>> 
>> As a side-note, I am not so sure if I would refer to SPARQL as
>> 'simple'. It's easy to write a simple SPARQL query, but it requires
>> resources to parse and execute it, and it may be more efficient to
>> reach the same result with a different algorithm, more tailored to
>> the problem.
>> 
>> Am 2017-11-21 14:34, schrieb Martynas Jusevičius:
>> 
>>> If I understand your use case right, your problem is that you have
>>> nothing to join on, i.e. no common nodes, between the 2 rides.
>>> 
>>> If there was something in common, I think it could be solved with
>>> a
>>> simple SPARQL CONSTRUCT. Now what that should be, depends on your
>>> use
>>> case.
>>> 
>>> Maybe geohash of the location?
>>> https://en.wikipedia.org/wiki/Geohash [1]
>>> 
>>> In any case, I don't think it requires some kind of new graph
>>> "blending" algorithm.
>> 
>> I would be very relieved if I could take something off the shelf
>> here.
>> 
>> Best regards,
>> Florian
>> 
>> On Tue, Nov 21, 2017 at 12:40 PM, Florian Kleedorfer
>> <florian.kleedorfer@austria.fm> wrote:
>> 
>> Hi Paul,
>> 
>> I received a similar hint yesterday off-list. I am really grateful
>> for these suggestions, as I had not seen the possibility at all. It
>> seems we should really look into it.
>> 
>> You asked why I had not framed the problem more in terms of
>> description logics. The main reason is: I had not even considered
>> it, really. Here are some possible causes for that:
>> 
>> One thing to say about description logics is that it is one more
>> thing to learn, which makes it that much less likely that a programmer
>> will use description logics over something else that gets the job
>> done. The graph blending approach is simpler in terms of logics but
>> you have to learn SHACL before you can use it - not sure about the
>> net trade-off value.
>> 
>> Also, I am a little bit afraid of OWL for the application I have in
>> mind, as it does not, in my (limited) experience, do well in the
>> presence of messy data. It is easy to formulate inconsistent data,
>> and to infer really unexpected things. The application - or rather,
>> the protocol - is completely open (anyone can say anything..
>> including incomplete or inconsistent things), so I would prefer to
>> remain on a less expressive level as much as possible, ideally doing
>> no reasoning at all.
>> 
>> I've seen SHACL work quite nicely in the browser, but I am not sure
>> about reasoners for OWL or OWL-DL. Because of the reservations
>> stated earlier, I must admit, I haven't checked. Doing the blending
>> in the browser may not be a necessity, but I would not want to close
>> that door just yet.
>> 
>> Finally, I have to admit, for the DL based approach, I'm not really
>> sure I would know even where to start. I'll be thankful for any help
>> here and I'll be happy to cooperate if somone else wants to work on
>> it!
>> 
>> I am also thinking about the implications of the blending approach
>> vs the DL based approach... The goal of blending was to connect two
>> graphs that each describe one half of the solution, like puzzle
>> pieces. The example I had in mind was: A says, "A is the passenger
>> of the taxi ride." B says, "B is the driver of the taxi ride." Join
>> the two structures by blending the "taxi ride" node and you get "A
>> is the passenger of the taxi ride that B is the driver of". It seems
>> to me that one can arrive at this result without special reasoning
>> rules that were designed for exactly this use case. A simple RDFS
>> vocabulary, some SHACL constraints, and the variable/constant
>> definition would be enough. All of these could be different in every
>> interaction, allowing for evolution of 'reasoning' as well as
>> evolution of the use of vocabularies. The DL approach may be very
>> powerful for premeditated scenarios, but I am not sure how well it
>> would work if agents want to interact in ways that were not foreseen
>> by the ontology design.
>> 
>> The approach we have outlined so far is more of a linked data
>> flavored approach than a semantic web one. For many simple cases, it
>> may be sufficient and work just fine. It seems more robust to me
>> (less prone to fail with messy data) than the DL based approach, and
>> it might be a good thing to have as a baseline. However, I can
>> imagine that both approaches will be supported at some point and
>> they can live next to each other: some agents may prefer to express
>> their goals in line with rdf blending, some with DL class
>> intersections, some may use both. In all cases, the net result
>> should be an RDF graph that the partners can agree to.
>> 
>> Best
>> Florian
>> 
>> Am 2017-11-21 03:54, schrieb Paul Tyson:
>> Hi Florian, very interesting problem. See my comments below.
>> 
>> On Mon, 2017-11-20 at 19:09 +0100, Florian Kleedorfer wrote:
>> (Note: I changed the subject of this thread following Pat Hayes'
>> suggestion [1])
>> 
>> After some on-list and off-list discussions, the problem is becoming
>> clearer. I'll try to state it again, along with some ideas for a
>> solution. Any comments, criticism, and suggestions to collaborate
>> are
>> highly welcome ;-)
>> 
>> # Blending:
>> ## Definition
>> Given two RDF graphs g1 and g2, blending refers to the pairwise
>> renaming
>> of resources (blank nodes or URI nodes) r1 in g1 and r2 in g2 such
>> that
>> one resource name (which may be r1, r2, or any other URI) is used
>> instead of r1 and r2. The nodes r1 and r2 are said to be blended
>> (ideas
>> for a better term, anyone?). The resulting graphs are merged
>> (avoiding
>> accidental blank node identification, see RDF 1.1 Semantics[2]). The
>> blending function b maps the input graphs to the set of all possible
>> output graphs.
>> 
>> I wonder why you do not frame the problem more in terms of
>> description
>> logics. "Blending" as you have described it seems to be a much
>> looser
>> (flexible?) way of finding the intersection of 2 classes. If the
>> provider has defined the classes of goods/services they can provide,
>> and
>> the consumer defines the class(es) he wishes to buy, then before
>> they
>> can do business they must find an instance of good or service that
>> falls
>> in the intersection of those classes.
>> 
>> Suppose the provider defines class P, the consumer class C. Then any
>> transaction between them would be in the class defined by
>> intersectionOf(P C). If this class is unsatisfiable (cannot have any
>> members, for instance if P has :price = 20 and C has :price < 15),
>> you
>> want an algorithm to define and rank ways in which the provider's
>> and
>> consumer's class definitions could be modified to make the
>> intersection
>> satisfiable. These "negotiating rules" (or "distance measurements")
>> are
>> the interesting part of the problem.
>> 
>> I do not know if there are implementable solutions to these problems
>> in
>> the description logics literature, but it might be worth a little
>> searching before building the graph-manipulation approach you have
>> outlined here.
>> 
>> Regards,
>> --Paul
>> 
>> ## Discussion
>> There are many ways to blend two graphs, and no way is inherently
>> correct or incorrect.
>> We require this operation to combine two data structures held by
>> mutually independent agents that have certain expectations about the
>> blending solution. These expectations divide the set of possible
>> outcomes at least in acceptable and unacceptable ones, possibly even
>> in
>> a preference ranking. I'll try to formulate that as well in the
>> following.
>> 
>> # Node blendability
>> ## Rationale
>> One way of constraining the result is to restrict which nodes can be
>> blended. So far, we have identified three types: variable nodes,
>> constant nodes, and unblendable nodes. Each node in an RDF graph has
>> exactly one of the three types. A variable node may be blended and
>> its
>> name may be changed. A constant node may also be blended, but its
>> name
>> is not allowed to change. An unblendable node must not be blended.
>> 
>> One may note that blending g1 and g2 is equivalent to merging them
>> if no
>> variable nodes are present in either graph.
>> 
>> It is currently unclear how the constant/variable/unblendable
>> property
>> should be expressed. My intuition is that in most practical use
>> cases,
>> most nodes need to be unblendable (ontology/TBox resources,
>> properties
>> in general), therefore it would be more economical to assume nodes
>> to be
>> unblendable by default. Then, variable/constant nodes have to be
>> annotated as such explicitly.
>> Maybe it would be an option to make definitions by URI prefix, so we
>> can
>> define a complete vocabulary as unblendable with one statement.
>> (opinions on that?)
>> 
>> ## Definition
>> Given two graphs g1 and g2, a blendability-aware blending function
>> is a
>> blending function for g1 and g2 that only replaces resources that
>> have
>> not been declared as constant with respect to blending. The latter
>> is
>> done by specifying a triple "<graph> bl:containsConstant <resource>
>> ."
>> or "<graph> bl:containsVariable <resource> ." in an additional graph
>> passed to the blendability-aware blending function. Such a graph is
>> called a 'blendability graph'.
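To make the default-unblendable rule concrete, here is a small Python sketch. The property names bl:containsConstant and bl:containsVariable are taken from the definition above; the data encoding and function names are my own illustration, not an existing API.

```python
# Sketch: deciding whether two nodes may be blended, given a
# blendability graph as described above. Nodes are unblendable by
# default; bl:containsVariable marks renameable nodes and
# bl:containsConstant marks blendable nodes with a fixed name.

def blendability(blendability_graph, graph_name):
    """Return {node: 'variable' | 'constant'} for one named graph;
    nodes absent from the result are unblendable by default."""
    kinds = {}
    for (g, p, node) in blendability_graph:
        if g != graph_name:
            continue
        if p == "bl:containsVariable":
            kinds[node] = "variable"
        elif p == "bl:containsConstant":
            kinds[node] = "constant"
    return kinds

def may_blend(n1, kind1, n2, kind2):
    """True iff nodes n1 and n2 (with their blendability kinds)
    may be blended."""
    if kind1 is None or kind2 is None:
        return False            # unblendable by default
    if kind1 == "constant" and kind2 == "constant":
        return n1 == n2         # neither name may change
    return True                 # a variable can adopt the other's name

bg = {("g1", "bl:containsVariable", "ex1:ride1"),
      ("g2", "bl:containsVariable", "ex2:myRide"),
      ("g2", "bl:containsConstant", "taxi:Ride")}
k1 = blendability(bg, "g1")
k2 = blendability(bg, "g2")
```

Note how two constant nodes may only blend if they already carry the same name, which matches "its name is not allowed to change".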
>> 
>> # Constraints on the blending solution:
>> ## Rationale
>> As explained earlier, there may be multiple parties that have
>> expectations for the blending solution. So far, we have talked about
>> the
>> originators of the data. There may also be one or more intended
>> consumers of the data. Nothing prevents the expectations from being
>> mutually incompatible (meaning that there exists no graph that
>> satisfies
>> them all) or that they are even unsatisfiable by themselves. It
>> would of
>> course be beneficial if such cases could be detected automatically,
>> but
>> for now, let's just leave it at noting the possibility. Now, if the
>> information the agents provided, when considered in total, has
>> spurious
>> or conflicting information, or if information is missing (taking
>> their
>> respective expectations as a definition of what is acceptable), the
>> blending solution cannot meet the expectations. Nevertheless, it has
>> to
>> be computed, in order to be able to take the next step and "repair"
>> (better word, please) one or both graphs and try again.
>> 
>> It would be nice to use constraints for the following goals during
>> the
>> calculation of the blending solution:
>> 1. pruning wholly unacceptable solutions
>> 2. ranking unacceptable solutions by badness
>> 3. ranking acceptable-looking, but incomplete solutions ("so far so
>> good")
>> 
>> Constraints should allow us to identify an acceptable result, or to
>> select the best result of a number of unacceptable ones, and report
>> them
>> to users so that they can repair them.
>> 
>> However, if the expectations define an acceptable result, anything
>> else,
>> however close, is unacceptable, and in the absence of acceptable
>> results
>> we are left with unacceptable ones, some of which may be completely
>> bogus while others are almost acceptable. So, the three categories
>> above
>> cannot be separated, and the best we can hope for is reasonable
>> ranking
>> of unacceptable solutions.
>> 
>> For the conceptualization, we go with a simple SHACL-based approach.
>> Other systems (SPARQL, ShEx) are definitely possible as well.
>> 
>> ## Definition
>> The constraints-aware blending function is a blending function for
>> two
>> RDF graphs g1 and g2 and takes optional graph names s1,...,sN that
>> denote graphs which contain SHACL shapes constraining the result.
>> Each
>> blending solution is validated using s1,...,sN; the result of the
>> function is a mapping of the blending solutions to their respective
>> SHACL validation results, indexed by shape graph name (s1,...,sN). A
>> blending solution is called accepted with respect to a constraints
>> graph
>> s if it causes no SHACL validation results when validated against s.
>> A
>> blending solution is called accepted if it is accepted with respect
>> to
>> all specified shapes graphs.
>> 
>> ## Discussion
>> The main issue in this step seems to be the pruning and ordering of
>> unacceptable results. Blending functions may choose to drop blending
>> solutions if they are deemed too bad - but where should they draw
>> the
>> line and still report an unacceptable blending solution while
>> dropping
>> another?
>> 
>> With respect to one shapes graph, we could define that validation
>> result graphs are ranked by maximum severity (1 sh:Violation is
>> worse than 1000 sh:Warnings), and within the same maximum severity
>> by the number of results at that severity (2 Violations and 1
>> Warning is worse than 1 Violation and 1000 Warnings).
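This ranking is easy to state as a sort key; a minimal sketch, assuming reports are given as lists of severity strings (the numeric severity weights are my own assumption, not SHACL's):

```python
# Sketch of the ranking described above: reports are compared first
# by their maximum severity, then by how many results carry that
# maximum severity. Larger keys are worse.

SEVERITY = {"sh:Info": 1, "sh:Warning": 2, "sh:Violation": 3}

def badness(report):
    """Rank key for a validation report given as a list of
    severity strings; an empty report is best."""
    if not report:
        return (0, 0)
    worst = max(SEVERITY[s] for s in report)
    return (worst, sum(1 for s in report if SEVERITY[s] == worst))

a = ["sh:Violation"] * 2 + ["sh:Warning"]       # 2 Violations, 1 Warning
b = ["sh:Violation"] + ["sh:Warning"] * 1000    # 1 Violation, 1000 Warnings
```

Here badness(a) compares greater than badness(b), matching the example in the text.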
>> 
>> I'm suspecting that the validation results should allow us to prune
>> at
>> least some bad blending solutions if their validation results are
>> supersets of other solutions' validation results: if solution A has
>> validation results (R1) and solution B has validation result (R1,
>> R2) we
>> can say that solution B is a deterioration of solution A and should
>> not
>> be reported.
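The superset-based pruning can also be sketched directly. I assume here that individual validation results can be identified by some hashable key (e.g. a (focusNode, shape, severity) tuple); that encoding is my assumption, not part of SHACL.

```python
# Sketch of the pruning rule above: a solution whose validation
# results strictly contain another solution's results is a
# deterioration of it and need not be reported.

def prune_dominated(solutions):
    """solutions: {name: frozenset of validation-result keys}.
    Keep only solutions that no other solution strictly dominates."""
    kept = {}
    for name, results in solutions.items():
        dominated = any(
            other < results        # `<` is strict subset on frozensets
            for o, other in solutions.items() if o != name
        )
        if not dominated:
            kept[name] = results
    return kept

# The example from the text: B's results (R1, R2) are a strict
# superset of A's results (R1), so B is dropped.
sols = {"A": frozenset({"R1"}), "B": frozenset({"R1", "R2"})}
pruned = prune_dominated(sols)
```

Two solutions with identical result sets dominate neither, so both survive, which seems to be the desired behavior.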
>> 
>> In addition to constraint-based pruning and ranking, it is of course
>> possible to perform reasoning-based pruning and validate the
>> blending
>> solution according to appropriate reasoning logics (RDFS, OWL, ...).
>> However, I am not sure as to the usefulness of this approach if no
>> solution is found that is consistent according to the logic.
>> 
>> # Graph blending algorithms
>> Let's assume for this section that we are blending two graphs g1 and
>> g2,
>> and we know we have variable nodes var(g1) and constant nodes
>> const(g1)
>> in g1, and var(g2) and const(g2) in g2.
>> 
>> ## Complete
>> The complete algorithm consists of enumerating all possible
>> solutions. For each n in var(g1), there are the following solutions:
>> * 1 solution in which n is not blended at all
>> * 1 solution for each m in var(g2), in which n is blended with m
>> * 1 solution for each m in const(g2), in which n is blended with m
>> Symmetrically, the same holds for each n in var(g2).
>> Removal of duplicate solutions yields the set of solutions.
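Read literally, the bullets above enumerate single-pair solutions; here is a sketch under that reading (solutions blending several pairs at once would be a cross-product of these choices, which I leave open). Names and the set encoding are mine.

```python
# Sketch of the complete enumeration: each candidate solution either
# blends nothing or blends exactly one pair (g1-node, g2-node).
# Enumerating both directions and deduplicating via frozensets
# mirrors "removal of duplicate solutions" in the text.

def enumerate_solutions(var1, const1, var2, const2):
    solutions = {frozenset()}                   # n is not blended at all
    for n in var1:
        for m in var2 | const2:
            solutions.add(frozenset({(n, m)}))  # n blended with m
    for n in var2:                              # symmetric direction
        for m in var1 | const1:
            solutions.add(frozenset({(m, n)}))
    return solutions

sols = enumerate_solutions({"a"}, {"k"}, {"x"}, {"c"})
```

For var(g1)={a}, const(g1)={k}, var(g2)={x}, const(g2)={c}, this yields four distinct solutions: the empty one, a-with-x, a-with-c, and k-with-x (a-with-x appears from both directions and is deduplicated).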
>> 
>> ## Heuristic
>> We may interpret blending as a heuristic search and apply an A*-like
>> algorithm to finding good results, assessing the distance to the
>> goal by
>> evaluating the constraints, and the distance from the start by the
>> number of node blendings.
>> 
>> ## Knowledge-based
>> Exploiting knowledge about the actual structures in g1 and g2, the
>> heuristic approach could be enhanced to apply more appropriate
>> distance
>> estimates and select more promising paths first.
>> 
>> Links:
>> 1. https://lists.w3.org/Archives/Public/semantic-web/2017Nov/0067.html
>> 2. https://www.w3.org/TR/rdf11-mt/
>> 
>> Am 2017-11-16 19:37, schrieb Florian Kleedorfer:
>> Hi,
>> 
>> As you may recall, I've asked this list about
>> messaging/negotiating
>> using linked data[1], and it led me and my colleagues to a concept
>> of
>> RDF-based agreements[2]. (Thanks again for your input!)
>> 
>> Now it's time for us to define how the negotiating parties come up
>> with the data (set of rdf triples) that they agree on.
>> 
>> One idea we have is that both participants (P1, P2) specify 'goals',
>> each consisting of a SHACL shapes graph and a data graph. Two
>> goals
>> can be tested for compatibility by merging the two data graphs and
>> then testing if the resulting graph is valid given the shapes
>> defined
>> by both goals. The problem we face with this approach is merging
>> the
>> two data graphs. Consider the example in this gist:
>> 
>> https://gist.github.com/fkleedorfer/81b32e3235105a4a4743e9fa76ba9332
>> 
>> In that example, we would want to merge these two graphs:
>> 
>> ex1:p1g-data {
>>   ex1:ride1 a taxi:Ride .
>>   ex1:ride1 taxi:hasDriver ex1:p1 .
>>   ex1:ride1 taxi:hasPickupLocation ex1:pickup1 .
>> }
>> 
>> and
>> 
>> ex2:p2g-data {
>>   ex2:myRide a taxi:Ride .
>>   ex2:myRide taxi:hasClient ex2:p2 .
>>   ex2:myRide taxi:hasPickupLocation ex2:myPickupLocation .
>>   ex2:myPickupLocation a s:GeoCoordinates ;
>>     s:latitude "48.213814" ;
>>     s:longitude "16.340870" .
>> }
>> 
>> We want that merge to happen in such a way that the shapes in
>> ex1:p1g-shapes and ex2:p2g-shapes can be evaluated, which will
>> require
>> to merge some of the nodes. (In the example, we'll find out that a
>> taxi:hasPickupTime property is missing. One of the participants is
>> then to add that property, and the whole structure becomes valid
>> according to both goals' shapes.)
>> 
>> The question is how exactly to achieve that merge. Intuitively,
>> the
>> result in this case should be
>> 
>> :mergedNode1 a taxi:Ride .
>> # note: p1 links her own identifier to the structure
>> :mergedNode1 taxi:hasDriver ex1:p1 .
>> :mergedNode1 taxi:hasPickupLocation :mergedNode2 .
>> :mergedNode1 a taxi:Ride .
>> :mergedNode1 taxi:hasClient ex2:p2 .
>> :mergedNode1 taxi:hasPickupLocation :mergedNode2 .
>> :mergedNode2 a s:GeoCoordinates ;
>>   s:latitude "48.213814" ;
>>   s:longitude "16.340870" .
>> 
>> and after removal of duplicate triples
>> 
>> :mergedNode1 a taxi:Ride .
>> :mergedNode1 taxi:hasDriver ex1:p1 .
>> :mergedNode1 taxi:hasClient ex2:p2 .
>> :mergedNode1 taxi:hasPickupLocation :mergedNode2 .
>> :mergedNode2 a s:GeoCoordinates ;
>>   s:latitude "48.213814" ;
>>   s:longitude "16.340870" .
>> 
>> (I currently don't care about the URIs of the mergedNodes - just
>> something one can mint when doing the merge)
>> 
>> Do you have any hints on how to perform such a merge reliably,
>> also
>> with more complex structures?
>> 
>> Thank you very much in advance!
>> 
>> Best regards,
>> Florian
>> 
>> Links:
>> 1. Negotiations discussion thread:
>> https://lists.w3.org/Archives/Public/semantic-web/2017Jul/0004.html
>> 2. Agreements paper: http://ceur-ws.org/Vol-1934/contribution-07.pdf
> 
> Links:
> ------
> [1] https://en.wikipedia.org/wiki/Geohash
> [2] https://lists.w3.org/Archives/Public/semantic-web/2017Nov/0067.html
> [3] https://www.w3.org/TR/rdf11-mt/
> [4] 
> https://gist.github.com/fkleedorfer/81b32e3235105a4a4743e9fa76ba9332
> [5] https://lists.w3.org/Archives/Public/semantic-web/2017Jul/0004.html
> [6] http://ceur-ws.org/Vol-1934/contribution-07.pdf

Received on Tuesday, 21 November 2017 15:49:13 UTC