- From: Enrico Franconi <franconi@inf.unibz.it>
- Date: Tue, 20 Sep 2005 00:55:09 +0200
- To: RDF Data Access Working Group <public-rdf-dawg@w3.org>, Pat Hayes <phayes@ihmc.us>
Hi. (This message summarises long discussions and studies we did in the last ten days with Sergio Tessaris, PFPS, Bijan.) In this long message I will argue: (a) how can we introduce the notion of entailment in SPARQL, and how should we change and/or understand the current definitions in order to have an entailment-based core (as opposed to subgraph matching); here we don't argue whether this is useful and how this is going to be used. This is the basis for the real definitions given in (b) and (c). (b) we show how it is possible to get the current understood and implemented behaviour of SPARQL once we have entailment as the basis of SPARQL. This gives a semantics to what we call "told-bnode redundant" answers, which behaves *exactly* like the current subgraph matching semantics. (c) we finally show how we can get pure entailment based answers as a simple extension of the case (b), so that we get correct RDF entailed answers, and we obtain a system which may scale up to OWL-DL querying. Overall, we hope that this message highlights the fact that it is possible to adopt an entailment based semantics for SPARQL that (i) maintains the properties of graph matching of the current understanding of SPARQL, and (ii) allows for future interoperability with OWL based languages. ------------------------------------------------------------------------ (a) ON CLOSURE VS ENTAILMENT [issue <http://www.w3.org/2001/sw/ DataAccess/issues#rdfSemantics>] * Definition of Matching for a basic graph pattern from <http://www.w3.org/TR/rdf-sparql-query/#BasicGraphPatternMatching>: Definition: Graph Matching A basic graph pattern [...] matches if S(GP) is an RDF graph and is a subgraph of G. This definition, as Bijan and Pat argued, assumes that G is already a closure of an original graph G0, and, as explicated in the mailing list, calculating G is "somebody else['s] business" <http://lists.w3.org/Archives/Public/public-rdf-dawg/2005JulSep/0380>. But what closure? There are several possibilities. First case (completion): One closure is the completion of G under the rdf or rdfs entailment rules from RDF Semantics, or just G itself (for simple entailment, which is what all the current implementations are looking at now). Because of the entailment lemmas for simple, rdf, and rdfs entailment in RDF Semantics, matching by means of the usual subgraph isomorphism (with special care of bnodes) works properly in the cases of simple, rdf, or rdfs entailment, but it does not work for any logic without the minimal model property (such as OWL-Lite/DL). In fact, for other entailments there may not be such a set of entailment rules leading to a unique model representative (see, e.g., the worker example in OWL-Lite <http://www.w3.org/2001/sw/DataAccess/issues#owlDisjunction>), so this approach is only guaranteed to work for simple, rdf, and rdfs entailment. As well, the closure of any graph under the rdf and rdfs entailment rules is always infinite (since there is the infinite rdf-vocabulary rdf:_1, rdf:_2, etc), so the closure cannot be directly computed; as a possible solution to this problem, we may easily assume that this part of the vocabulary is finite. Second case (deductive closure): Another closure is the deductive closure, G = graph-merge{ GPi | G0 entails GPi }. In principle, this one will always work, in that matching will then pick up all entailments, for whatever definition of entailment one wants. This basically emulates the idea of constructing the set of all deductions, and each solution S(GP) should be element of that set. Note that, even in the case of simple entailment, the deductive closure G is not just the completion of G0 as defined in RDF-MT. Consider the following example: G0 = { :a :P :a . }. The simple completion of G0 is just G0. OTOH, the simple deductive closure of G0 includes also all the infinitely many triples of the kind: { :a :P _:x . _:x :P :a } { :a :P _:x . _:x :P _:y . _:y :P :a } { :a :P _:x . _:x :P _:y . _:y :P _:z . _:z :P :a } ... Third case (entailment): It is easy to show that the definition of matching making use of the deductive closure is equivalent to the one which makes directly use of entailment (for a wide range of entailments including OWL-DL entailment): Definition: Entailment Matching A basic graph pattern GP matches on graph G with solution S if S(GP) is an RDF graph and is entailed by G. Note that there is no need to compute a closure here, and so no infinite graphs are involved in the definition. It is up to the appropriate definition of entailment to define when a GP is entailed by G, given their RDF linearisation. OUR CONCLUSIONS Both the definition of graph matching with deductive closure and the definition of graph matching based directly on entailment are obviously equivalent, and both correctly work with with RDF and OWL entailment (cfg., the "worker example" <http://www.w3.org/2001/sw/DataAccess/issues#owlDisjunction>). However, we strongly believe that the direct use of entailment in the definition provides a more intuitive and direct approach, and it has the advantage that it uses the original graph without any transformation. (Referring to <http://lists.w3.org/Archives/Public/www-archive/ 2005Sep/0009>, the query and its proof doesn't capture the main idea behind the "worker example". In fact, the graph in the message doesn't implies that :Andrea is an instance of :EMPLOYEE, so the tuple (:Paul :Andrea :Caroline) cannot be in the answer of the given query. Most likely, the theorem prover didn't consider the certain answers (those true in every model) but possible answers (as those true in at least one model). The only variable in the SELECT should be ?X, then the query returns :Paul because of reasoning by case on all the possible models of the graph.) ------------------------------------------------------------------------ ON REDUNDANCY OF TOLD BNODES IN ANSWERS [issue <http://www.w3.org/2001/sw/DataAccess/issues#rdfSemantics>] """ Should queries of non-lean and lean graphs that entail each other give the same answers? """ The answer to this question should be *yes*. See use case 1, "Publishing on the Web", in <http://lists.w3.org/Archives/Public/public-rdf-dawg/2005JulSep/0430>). This is also relevant, as noted by PFPS, to enable interoperability between different interoperating implementations of RDF. Please note that in the case of the relational model (e.g., SQL) this problem does not arise, since there is no equivalent to the bnodes in the relational data model, and redundancy never appears in the data and the answer sets of arbitrary relational algebra queries are always non redundant. Also note that the notion of SQL queries is perfectly formalised as entailment (after the seminal works by Ray Reiter in the late 70ies), and as a matter of fact when a notion similar to bnodes sneaked in the relational model in the form of null values, the logic based definition of query answering proved to be extremely useful to characterise the various notions of incompleteness implied by the null values. So, the peculiarity and interestingness of SPARQL lies in the fact that it allows bnodes in the answers. On the oher hand, in SPARQL there are cases where some controlled form of redundancy is desirable. According to the use case 2 "Building a Graph" (in <http://lists.w3.org/Archives/Public/public-rdf-dawg/2005JulSep/0430>) it may be necessary to get in the solution set also the redundant bnodes as they appear "redundantly" in the original non-lean graph. We call this "told bnodes redundancy". The current general understanding and the current implementations of SPARQL capture exactly this kind of redundancy. This behaviour can be seen as querying the syntactic form of the an RDF graph, and it is an important part of the motivations of SPARQL. We want to formalise this understanding first. This can be achieved by transforming all the bnodes in the original graph G into fresh skolem constants before evaluating any basic pattern query. The skolem constants appearing in the resulting solution set are transformed back into their original bnode names before starting the evaluation process by the SPARQL algebra. We observe that in RDF/S the only bnodes that will ever appear in the solution may appear only in correspondence of the bnodes in the original graph. We can show that the only bnodes that will appear in the answer will be the ones coming through the skolemisation process. Therefore, the definition of Pattern Solution <http://www.w3.org/TR/rdf-sparql-query/#PatternSolutions> should be changed to disallow the use of bnodes in the substitution: A pattern solution is a substitution function from a subset of the set of variables to the set of the IRIs and RDF Literals. In other words, we say that a solution S is a substitution of the variables in a query with only IRIs and literals, and it is in the solution set of a query GP to a graph G iff there exists a S' such that SK(G) entails S'(GP), and S = UNSK(S'(GP)), where SK is a skolemisation operation of the bnodes, and UNSK is its inverse (giving back the bnode names to the skolem constants). This guarantees that the solution set is always finite, and the only redundancy will be from the told bnodes. Note that it can be shown that this changed basic pattern evaluation procedure corresponds to the current SPARQL algorithms. This means that no change should be made to the current systems if they want to implement the semantics allowing for redundancy of told bnodes. Note also that in order not to change the original told names of those bnodes, it is required that the subsequent SPARQL algebra always preserves the bnode names. OUR CONCLUSIONS This is the semantics of the current SPARQL implementations, that allows for told-bnodes redundancy in the solution set. We also introduce (in the next section) the non-redundant semantics, that is defined by "minimising" the told-bnode redundant solution set. ------------------------------------------------------------------------ (c) ON REDUNDANCY AND MINIMALITY OF ANSWERS WITH BNODES [issue <http://www.w3.org/2001/ sw/DataAccess/issues#rdfSemantics>] """ Should queries of non-lean and lean graphs that entail each other give the same answers? """ As said before, the answer to this question should be *yes*. See use case 1, "Publishing on the Web", in <http://lists.w3.org/Archives/Public/public-rdf-dawg/2005JulSep/0430>). We present here how this can be achieved by requiring a form of minimality of the solution set at the level of basic query patterns evaluation after the computation of the told-bnode redundant solution set as shown before. There are two problems here: 1/ Pairwise redundancy between two independent solutions, as in G1 = { ex:a ex:b ex:c . _:a ex:b _:c . } GP = { ?a ex:b ?c . } The closure matching solutions are (and the entailment solutions include) S1 = { {?a/ex:a, ?c/ex:c} , {?a/ _:a, ?c/ _:c} } The second solution is redundant, in that it is simple-entailed by the first. 2/ Information present in the graph that is lost when solutions are considered independently, as in G2 = { ex:a ex:b ex:c . _:a ex:b _:c . ex:a' ex:b _:c . } GP = { ?a ex:b ?c . } The closure matching solutions are S2 = { {?a/ex:a, ?c/ex:c} , {?a/ _:a, ?c/ _:c} , {?a/ex:a', ?c/ _:c} } There is information present in this set of solutions that is lost if the blank nodes in the solution are considered to be scoped to each different substitution, and the solution reduced to S2' = { {?a/ex:a, ?c/ex:c}, {?a/ex:a', ?c/ _:c} } In order to provide a non-redundant semantics, we introduce the minimisation of the solution set. Given a set AS of solutions of a basic graph pattern GP (remember that a solution of a GP is an assignment to the n variables in GP), its FO translation FO(AS) is the FO formula formed by a conjunction of atoms, each one of them corresponding to a solution in AS encoded as n arguments of a special fixed predicate (say, S); the bnodes are globally existentially quantified (so the coreferences are preserved). For example: GP: { ?x ?y } AS: { <_:a, :b>, <:c, _:d>, <:e, _:d>, ... } FO(AS): \exists _:a, _:d, ... S(_:a, :b) \and S(:c, _:d) \and S(:e, _:d) \and ... The minimal solution set, is the solution set which is minimal w.r.t. the order induced by FO entailment among all possible subsets of AS. It can be proved that such minimal solution set exists and it is unique. The problem is NP-hard in the dimension of cycles with bnodes in AS. Comment: if bnodes are not appearing in the solution set, the process of minimisation is not required, since each solution is unique and the set is automatically non-redundant. Bnodes may appear in the solution set *only* if they appear in the original graph (in fact, bnodes in the solution set come only through the process of skolemisation). So, if the original grah is lean, we get a correct (i.e., sound and complete) non redundant solution set; if the original graph is not lean, we get a correct (i.e., sound and complete) finitely redundant solution set (whit the only redundancy of the told redundant nodes in G). Note that in order to keep the coreference among bnodes across different solutions also in the final graph obtained by the CONSTRUCT operation, it is required that the RDF-merge operation changes only the name of the bnodes in the CONSTRUCT template but not the bnode names from the solutions. The minimisation of the solution set basic query pattern provides the starting point for the evaluation of the SPARQL complex query. In fact, it can be observed that each composite SPARQL query is grounded on the evaluation of basic patterns. It can be shown that each SPARQL operator preserves the minimality of the solution set, but the SELECT and UNION operators (just like in SQL). For those two operators, we require the adoption of an optional keyword "DISTINCT" that forces the minimisation of the operator application. OUR CONCLUSIONS Minimisation on solutions sets of basic graph patterns should always be performed in the case one want a generic pure entailment based query answering mechanism, and one does not want the told-bnode redundancy; minimisation of SELECT and UNION operators is optional by means of the keyword DISTINCT. ------------------------------------------------------------------------
Received on Monday, 19 September 2005 22:55:50 UTC