Adoption of entailment in SPARQL

Hi.
(This message summarises long discussions and studies we did in the
  last ten days with Sergio Tessaris, PFPS, Bijan.)
In this long message I will argue:

(a) how can we introduce the notion of entailment in SPARQL, and how
     should we change and/or understand the current definitions in
     order to have an entailment-based core (as opposed to subgraph
     matching); here we don't argue whether this is useful and how this
     is going to be used. This is the basis for the real definitions
     given in (b) and (c).

(b) we show how it is possible to get the current understood and
     implemented behaviour of SPARQL once we have entailment as the
     basis of SPARQL. This gives a semantics to what we call
     "told-bnode redundant" answers, which behaves *exactly* like the
     current subgraph matching semantics.

(c) we finally show how we can get pure entailment based answers as a
     simple extension of the case (b), so that we get correct RDF
     entailed answers, and we obtain a system which may scale up to
     OWL-DL querying.

Overall, we hope that this message highlights the fact that it is
possible to adopt an entailment based semantics for SPARQL that (i)
maintains the properties of graph matching of the current
understanding of SPARQL, and (ii) allows for future interoperability
with OWL based languages.

------------------------------------------------------------------------

(a) ON CLOSURE VS ENTAILMENT

[issue <http://www.w3.org/2001/sw/ DataAccess/issues#rdfSemantics>]

* Definition of Matching for a basic graph pattern from
   <http://www.w3.org/TR/rdf-sparql-query/#BasicGraphPatternMatching>:

Definition: Graph Matching
	A basic graph pattern [...] matches if S(GP) is an RDF graph
	and is a subgraph  of G.

This definition, as Bijan and Pat argued, assumes that G is already
a closure of an original graph G0, and, as explicated in the mailing
list, calculating G is "somebody else['s] business"
<http://lists.w3.org/Archives/Public/public-rdf-dawg/2005JulSep/0380>.

But what closure?  There are several possibilities.

First case (completion):

One closure is the completion of G under the rdf or rdfs entailment
rules from RDF Semantics, or just G itself (for simple entailment,
which is what all the current implementations are looking at now).
Because of the entailment lemmas for simple, rdf, and rdfs entailment
in RDF Semantics, matching by means of the usual subgraph isomorphism
(with special care of bnodes) works properly in the cases of simple,
rdf, or rdfs entailment, but it does not work for any logic without
the minimal model property (such as OWL-Lite/DL).

In fact, for other entailments there may not be such a set of
entailment rules leading to a unique model representative (see, e.g.,
the worker example in OWL-Lite
<http://www.w3.org/2001/sw/DataAccess/issues#owlDisjunction>), so this
approach is only guaranteed to work for simple, rdf, and rdfs
entailment.  As well, the closure of any graph under the rdf and rdfs
entailment rules is always infinite (since there is the infinite
rdf-vocabulary rdf:_1, rdf:_2, etc), so the closure cannot be directly
computed; as a possible solution to this problem, we may easily assume
that this part of the vocabulary is finite.

Second case (deductive closure):

Another closure is the deductive closure,
	G = graph-merge{ GPi | G0 entails GPi }.
In principle, this one will always work, in that matching will then
pick up all entailments, for whatever definition of entailment one
wants. This basically emulates the idea of constructing the set of all
deductions, and each solution S(GP) should be element of that set.

Note that, even in the case of simple entailment, the deductive
closure G is not just the completion of G0 as defined in RDF-MT.
Consider the following example:
	G0 = { :a :P :a . }.
The simple completion of G0 is just G0. OTOH, the simple deductive
closure of G0 includes also all the infinitely many triples of the
kind:
	{  :a :P _:x . _:x :P  :a }
	{  :a :P _:x . _:x :P _:y . _:y :P  :a }
	{  :a :P _:x . _:x :P _:y . _:y :P _:z . _:z :P  :a }
	...

Third case (entailment):

It is easy to show that the definition of matching making use of the
deductive closure is equivalent to the one which makes directly use of
entailment (for a wide range of entailments including OWL-DL
entailment):

Definition: Entailment Matching
	A basic graph pattern GP matches on graph G with solution S if
	S(GP) is an RDF graph and is entailed by G.

Note that there is no need to compute a closure here, and so no
infinite graphs are involved in the definition. It is up to the
appropriate definition of entailment to define when a GP is entailed
by G, given their RDF linearisation.

OUR CONCLUSIONS

Both the definition of graph matching with deductive closure and the
definition of graph matching based directly on entailment are
obviously equivalent, and both correctly work with with RDF and OWL
entailment (cfg., the "worker example"
<http://www.w3.org/2001/sw/DataAccess/issues#owlDisjunction>).
However, we strongly believe that the direct use of entailment in the
definition provides a more intuitive and direct approach, and it has
the advantage that it uses the original graph without any
transformation.

(Referring to <http://lists.w3.org/Archives/Public/www-archive/
2005Sep/0009>, the query and its proof doesn't capture the main idea
behind the "worker example". In fact, the graph in the message
doesn't implies that :Andrea is an instance of :EMPLOYEE, so the
tuple (:Paul :Andrea :Caroline) cannot be in the answer of the given
query. Most likely, the theorem prover didn't consider the certain
answers (those true in every model) but possible answers (as those
true in at least one model). The only variable in the SELECT should
be ?X, then the query returns :Paul because of reasoning by case on
all the possible models of the graph.)

------------------------------------------------------------------------

ON REDUNDANCY OF TOLD BNODES IN ANSWERS
[issue <http://www.w3.org/2001/sw/DataAccess/issues#rdfSemantics>]

"""
Should queries of non-lean and lean graphs that entail each other
give the same answers?
"""

The answer to this question should be *yes*. See use case 1,
"Publishing on the Web", in
<http://lists.w3.org/Archives/Public/public-rdf-dawg/2005JulSep/0430>).
This is also relevant, as noted by PFPS, to enable interoperability
between different interoperating implementations of RDF.

Please note that in the case of the relational model (e.g., SQL) this
problem does not arise, since there is no equivalent to the bnodes in
the relational data model, and redundancy never appears in the data
and the answer sets of arbitrary relational algebra queries are always
non redundant. Also note that the notion of SQL queries is perfectly
formalised as entailment (after the seminal works by Ray Reiter in the
late 70ies), and as a matter of fact when a notion similar to bnodes
sneaked in the relational model in the form of null values, the logic
based definition of query answering proved to be extremely useful to
characterise the various notions of incompleteness implied by the null
values. So, the peculiarity and interestingness of SPARQL lies in the
fact that it allows bnodes in the answers.

On the oher hand, in SPARQL there are cases where some controlled form
of redundancy is desirable.  According to the use case 2 "Building a
Graph" (in
<http://lists.w3.org/Archives/Public/public-rdf-dawg/2005JulSep/0430>)
it may be necessary to get in the solution set also the redundant
bnodes as they appear "redundantly" in the original non-lean graph. We
call this "told bnodes redundancy". The current general understanding
and the current implementations of SPARQL capture exactly this kind of
redundancy. This behaviour can be seen as querying the syntactic form
of the an RDF graph, and it is an important part of the motivations of
SPARQL. We want to formalise this understanding first.

This can be achieved by transforming all the bnodes in the original
graph G into fresh skolem constants before evaluating any basic
pattern query. The skolem constants appearing in the resulting
solution set are transformed back into their original bnode names
before starting the evaluation process by the SPARQL algebra. We
observe that in RDF/S the only bnodes that will ever appear in the
solution may appear only in correspondence of the bnodes in the
original graph. We can show that the only bnodes that will appear in
the answer will be the ones coming through the skolemisation
process. Therefore, the definition of Pattern Solution
<http://www.w3.org/TR/rdf-sparql-query/#PatternSolutions> should be
changed to disallow the use of bnodes in the substitution:

         A pattern solution is a substitution function from a subset of
         the set of variables to the set of the IRIs and RDF Literals.

In other words, we say that a solution S is a substitution of the
variables in a query with only IRIs and literals, and it is in the
solution set of a query GP to a graph G iff there exists a S' such
that
SK(G)  entails  S'(GP),       and       S = UNSK(S'(GP)),
where SK is a skolemisation operation of the bnodes, and UNSK is its
inverse (giving back the bnode names to the skolem constants).

This guarantees that the solution set is always finite, and the only
redundancy will be from the told bnodes.
Note that it can be shown that this changed basic pattern evaluation
procedure corresponds to the current SPARQL algorithms. This means
that no change should be made to the current systems if they want to
implement the semantics allowing for redundancy of told bnodes.

Note also that in order not to change the original told names of those
bnodes, it is required that the subsequent SPARQL algebra always
preserves the bnode names.

OUR CONCLUSIONS

This is the semantics of the current SPARQL implementations, that
allows for told-bnodes redundancy in the solution set. We also
introduce (in the next section) the non-redundant semantics, that is
defined by "minimising" the told-bnode redundant solution set.

------------------------------------------------------------------------

(c) ON REDUNDANCY AND MINIMALITY OF ANSWERS WITH BNODES

[issue <http://www.w3.org/2001/ sw/DataAccess/issues#rdfSemantics>]

"""
Should queries of non-lean and lean graphs that entail each other
give the same answers?
"""

As said before, the answer to this question should be *yes*. See use
case 1, "Publishing on the Web", in
<http://lists.w3.org/Archives/Public/public-rdf-dawg/2005JulSep/0430>).
We present here how this can be achieved by requiring a form of
minimality of the solution set at the level of basic query patterns
evaluation after the computation of the told-bnode redundant solution
set as shown before.

There are two problems here:
1/ Pairwise redundancy between two independent solutions, as in
	G1 = { ex:a ex:b ex:c .
	        _:a ex:b  _:c . }
         GP = {   ?a ex:b   ?c . }
     The closure matching solutions are (and the entailment solutions
     include)
	S1 = { {?a/ex:a, ?c/ex:c} ,
	       {?a/ _:a, ?c/ _:c} }
     The second solution is redundant, in that it is simple-entailed by
     the first.
2/ Information present in the graph that is lost when solutions are
    considered independently, as in
	G2 = { ex:a  ex:b ex:c .
	        _:a  ex:b  _:c .
	       ex:a' ex:b  _:c . }
         GP = {   ?a  ex:b   ?c . }
     The closure matching solutions are
	S2 = { {?a/ex:a,  ?c/ex:c} ,
	       {?a/ _:a,  ?c/ _:c} ,
	       {?a/ex:a', ?c/ _:c} }
     There is information present in this set of solutions that is lost
     if the blank nodes in the solution are considered to be scoped to
     each different substitution, and the solution reduced to
	S2' = { {?a/ex:a,  ?c/ex:c},
		{?a/ex:a', ?c/ _:c} }

In order to provide a non-redundant semantics, we introduce the
minimisation of the solution set.
	Given a set AS of solutions of a basic graph pattern GP
	(remember that a solution of a GP is an assignment to the n
	variables in GP), its FO translation FO(AS) is the FO formula
	formed by a conjunction of atoms, each one of them
	corresponding to a solution in AS encoded as n arguments of a
	special fixed predicate (say, S); the bnodes are globally
	existentially quantified (so the coreferences are preserved).

For example:
GP: { ?x ?y }
AS: { <_:a, :b>, <:c, _:d>, <:e, _:d>, ... }
FO(AS): \exists _:a, _:d, ...
             S(_:a, :b) \and S(:c, _:d) \and S(:e, _:d) \and ...

The minimal solution set, is the solution set which is minimal w.r.t.
the order induced by FO entailment among all possible subsets of AS.
It can be proved that such minimal solution set exists and it is
unique. The problem is NP-hard in the dimension of cycles with bnodes
in AS.

Comment: if bnodes are not appearing in the solution set, the process
of minimisation is not required, since each solution is unique and the
set is automatically non-redundant. Bnodes may appear in the solution
set *only* if they appear in the original graph (in fact, bnodes in
the solution set come only through the process of skolemisation).  So,
if the original grah is lean, we get a correct (i.e., sound and
complete) non redundant solution set; if the original graph is not
lean, we get a correct (i.e., sound and complete) finitely redundant
solution set (whit the only redundancy of the told redundant nodes in
G).

Note that in order to keep the coreference among bnodes across
different solutions also in the final graph obtained by the CONSTRUCT
operation, it is required that the RDF-merge operation changes only
the name of the bnodes in the CONSTRUCT template but not the bnode
names from the solutions.

The minimisation of the solution set basic query pattern provides the
starting point for the evaluation of the SPARQL complex query. In
fact, it can be observed that each composite SPARQL query is grounded
on the evaluation of basic patterns. It can be shown that each SPARQL
operator preserves the minimality of the solution set, but the SELECT
and UNION operators (just like in SQL). For those two operators, we
require the adoption of an optional keyword "DISTINCT" that forces
the minimisation of the operator application.

OUR CONCLUSIONS

Minimisation on solutions sets of basic graph patterns should always
be performed in the case one want a generic pure entailment based
query answering mechanism, and one does not want the told-bnode
redundancy; minimisation of SELECT and UNION operators is optional by
means of the keyword DISTINCT.

------------------------------------------------------------------------

Received on Monday, 19 September 2005 22:55:50 UTC