thoughts from Tuesday telecon from Pat Hayes on 2005-09-22 (public-rdf-dawg@w3.org from July to September 2005)

From: Pat Hayes <phayes@ihmc.us>
Date: Thu, 22 Sep 2005 17:03:08 -0500
To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Cc: Dan Connolly <connolly@w3.org>, Ian Horrocks <horrocks@cs.man.ac.uk>, Enrico Franconi <franconi@inf.unibz.it>, Peter F. Patel-Schneider <pfps@research.bell-labs.com>, Bijan Parsia <bparsia@isr.umd.edu>
Message-Id: <p06200706bf5629d75576@[10.100.0.9]>
Here is my attempt to summarize the main points that were agreed on, 
more or less, in the telecon discussion on Tuesday. I may not have 
got them all, so if others' memories differ, please correct me. Also, 
this does not claim to be a transcript, and it does not cover all the 
issues that got raised, and some of it reflects thinking Ive done 
since the telecon, including a suggested rewording of the 
'skolemization' description of bnode handling.

Enrico, there is a direct question for you at @@@ below.

-------

1. We can distinguish between the basic matching operation between a 
simple query pattern and the KB, and the various filtering and 
selections that may be done to the results of that basic operation, 
which amount to treating the results as a table (or tables) and 
manipulating it (them) in various ways, essentially algebraic. Most 
of the discussion concerns that basic operation and how it should be 
described. It is agreed that all the subsequent filtering or 
value-checking operations on the result tables are NOT best regarded 
as entailment.

2. Right now the basic operation is described as follows. A 
substitution B is a mapping from variables of a pattern Q to URIrefs, 
literals or bnodes, and an answer is a binding with B(Q) a subgraph 
of KB. There seems to be a strongly and widely held opinion that it 
would be less confusing, or at least helpful to many folk, to 
describe this in terms of simple entailment rather than as matching 
to a subgraph: an answer is a binding B such that B(Q) is simply 
entailed by KB. The two are nearly equivalent: see below for more 
discussion.
Moreover, many feel that this use of entailment language at the heart 
of the specification would provide a better (possibly, an essential) 
segway between SPARQL and future extensions of SPARQL to other 
languages, by replacing simple or RDF entailment with more complex 
notions of entailment, particularly OWL entailment.

2a. The question arose: if SPARQL can be used (perhaps in the future, 
but let us look ahead) with various notions of entailment, should a 
query be able, or be required, to specify the kind of entailment 
intended? Although this was discussed only briefly, there was a 
consensus that it would be acceptable for the entailment in use to be 
specified as part of the 'service', identified by the URI currently 
used to name the target graph. This means that the query language and 
protocol would not need to be concerned directly with specifying 
which entailment is supposed to be used. Note however that this does 
go beyond the idea of this URI identifying a virtual RDF graph, since 
for example OWL entailment cannot be naturally reduced to RDF 
entailment from an OWL closure, even an infinite virtual closure. The 
spec should therefore be written so that querying against a URI which 
identifies an OWL-querying service can provide bindings B such that 
B(Q) is OWL-entailed by the KB, not (as currently) bindings B such 
that B(Q) is simply entailed by a virtual RDF graph.

The important theoretical point here is that reducing X-entailment to 
simple entailment by an X-closure has proven to be a useful technique 
for RDF and RDFS, but is much less naturally extendable to languages 
like OWL which have a nontrivial disjunction or union operator. (We 
might argue about whether it is always *possible* to gloss 
X-entailment as simple entailment from an X-closure, but it seems 
clear that even if it can be done, it will be far clearer and more 
natural not to bother, but instead simply refer to bindings which 
achieve OWL entailment.) We are indebted to Enrico for making this 
point vividly clear with the 'little-house' (aka 'worker') example.

(I would suggest, and this is purely my own personal view, that we 
can adopt a compromise here, in which SPARQL in its current release 
will refer to simple entailment; the issue is pointed out; the actual 
spec. refers to virtual graphs identified by URIs, and refers to the 
RDF and RDFS closure lemmas; and the possibility of using URIs to 
identify services which offer other kinds of entailment is pointed 
out as a future extension path. This allows conforming 
implementations which support only simple matching and perform no 
inferences, and it allows more advanced implementations certainly up 
to RDFS and perhaps OWL lite, and has a very simple and 
straightforward extension path to the use of SPARQL with more 
sophisticated OWL engines simply by rewriting part of the text of the 
spec, without requiring changes to the actual SPARQL language or 
protocol.)

3. As Enrico's document points out, in order to correctly describe 
the current SPARQL answer extraction process as entailment, even 
simple entailment, we need to be a little more explicit about the 
treatment of bindings to bnodes. Currently we allow answer bindings 
of variables to bnodes, in order to accommodate two rather different 
cases. This may have given rise to some of the reported confusion, 
because we do not explicitly distinguish the two cases; and in one of 
them, our treatment of bnode bindings is puzzling.

The two cases can be described in terms of the scope of the bnodes in 
an answer binding. (This way of phrasing it was not agreed during the 
telecon, but bear with me.)

3a.  One case, call this the remote or 'out-of-scope' case, is where 
a bnode in an answer is simply taken to be an existential assertion 
in the answer. The particular identifier used is scoped only to the 
answer itself, and has no meaning in any subsequent queries, and need 
not be the same bnode ID as was used in the graph. Queries cannot be 
used to elicit further information about things that are denoted by 
such bnode bindings, and there is no notion of the 'scope' of a 
session. This treats the query instance B(Q) as a separate graph from 
the KB. This is the natural way to think of query traffic between a 
client and a remote server across an open network, where answers to 
queries are treated as self-contained items of information.

3b. In the other case, call this the huddle, or 'in-scope', case, an 
answer binding to a bnode is supposed to provide a bnode identifier 
which can be used in subsequent queries, and still identifies the 
same graph node. This case is appropriate when the query is seen as 
part of a 'session', something like a private conversation between 
client and server, with bnodeIDs being used throughout a session in 
both answers and subsequent queries, and identifying the same entity 
throughout. This case is appropriate, for example, when a client and 
server are in close proximity within an application or system, even 
if only for a short period. One can think of this logically as the 
entire session taking place within the scope of the existential 
variables in the KB, i.e. these scopes extend across the whole series 
of queries and answers within a single session, so that the query 
instance is part of the same graph as the KB. In this case, then, a 
bnode identifier in an answer is playing a very similar role as a 
URIreference (at least during the session), in that they are both 
genuine names which are taken to be identifiers during the session 
itself; and it is important to allow such bindings because there may 
be further facts, which can be elicited by subsequent queries, which 
allow the bnode to be distinguished from a URIref in the same answer. 
(Example below.)

3c. The wording in the current spec is, I believe, the result of a 
compromise which attempted to allow both these uses without actually 
mentioning them explicitly, by being sufficiently permissive. 
Unfortunately, without explanation, the result is confusing and hard 
to interpret coherently in semantic terms.

4. The first, remote, case is naturally described in terms of simple 
entailment. Think of the variables in a pattern as bnodes, scoped to 
the query, then a simple query pattern *is* an RDF graph, and the 
query succeeds just when the query graph is simply entailed by the 
KB. The pattern variables can then be seen just a syntactic device 
for identifying which aspects of the instantiation should be included 
in the result tables.

5. The second, huddle, case, cannot be so directly described in terms 
of entailment, but needs some twisting.

Enrico provides one way to say it, in terms of skolemization, i.e. in 
terms of entailment by an alternative graph obtained by 
systematically naming each bnode with a new URIreference, then 
replacing those URIreferences with the original bnodes before 
returning the answer bindings.

Other ways of describing it are possible. They are all equivalent, so 
the choice between them is editorial/expositional.

We could describe this as saying that the pattern has an instance 
which would be simply entailed by the graph when all the bnodes in 
the graph are treated as being URIrefs. This says essentially the 
same thing, but avoids the awkwardness of having to 
reverse-substitute for the skolem constants; but it requires us to 
explain what it means to 'treat' a bnode as a name.

Another alternative way, which avoids talk of names and skolemization 
altogether, would be to say that the answer in this case is a binding 
B such that KB simply entails the union (not the merge) of KB and 
B(Q). Then the two cases are:

remote: KB simply entails B(Q)
huddle: KB simply entails (KB union B(Q))

Note that in the absence of bnodes, these would be identical.
This also illustrates the way that a series of queries in a single 
session can be thought of as 'accumulating' within the scope of the 
original graph, since union is associative, so querying with Q1 and 
then with Q2 is just like querying with (Q1 union Q2). On this way of 
describing it, we can think of each query as part of a larger query, 
which is built up in stages during the session.

@@@Enrico, is that an acceptable re-drafting of your 'skolemization' 
way of describing it? I think it will be easier for many people to 
follow, and avoids us having to define a whole lot of syntactic 
skolemization machinery.@@@)

6. The difference between the two is illustrated by the following example:

G1: {ex:a ex:p ex:c .}

G2: {ex:a ex:p ex:c .
ex:a ex:p _:y .}

G3: {ex:a ex:p ex:c .
ex:a ex:p _:y .
_:y ex:q ex:d .}

Each is a subgraph of the next; G1 and G3 are lean, G2 is not. G1 and 
G2 simply entail each other and are entailed by, but do not entail, 
G3.

With SPARQL as currently defined, the query pattern

[ ex:a ex:p ?v ]

will yield the answers:

G1:  ex:c
G2:  ex:c,  _:y
G3:  ex:c,  _:y

Now, if we understand the query in the remote, out-of-scope case, it 
can be argued that the latter two are redundant, since the second 
binding to the bnode adds no useful information; and, further, that 
it is inappropriate that two logically equivalent graphs (G1 and G2) 
should give different answer bindings. But for the huddle, in-scope 
case, it is important that the second binding for ?var is provided in 
the G3 case, since it allows for a subsequent query pattern

[_:y ex:q ?w ]

to succeed with the binding ?w = ex:d, distinguishing _:y from ex:c 
(which is not returned as an answer here), and producing a genuine 
alternative pair of answers. In order to extract this data from the 
graph in the out-of-scope case, one would need to use a more complex 
query pattern for the second query, incorporating the first query and 
identifying the co-reference explicitly:

[ex:a ex:p ?v
?v ex:q ?w ]

and as noted, we could think of the entire huddle session as building 
up this query in parts, the use of the bnode 'name' from the first 
part-query result in the second part-query providing the link between 
the pieces of the query expressed here by the use of the same query 
variable ?v

7. Another, related, question is whether or not logically equivalent 
graphs should give identical answer bindings. We did not give this 
issue as much telecon attention as it needed, mia culpa.

Peter P-S, if I understand his point, claims that implementations 
should be free to store graphs in RDF-equivalent forms (in 
particular, to convert graphs to their 'lean' subgraphs) without 
affecting interoperability, so that it should not be possible for a 
SPARQL query to distinguish G1 and G2.

On the telecon we discussed Enrico's suggestion that SPARQL answers 
should have redundancies removed, so as to achieve this. It was noted 
that this could be done either at the client or server side of the 
transaction, and it was agreed that it should be a client-specifiable 
option, if included, rather than mandatory (although this does not 
seem to fully meet Peter's objection. Do I have this right??). Enrico 
noted however that the operation cannot be done simply as a 
client-side post-SPARQL 'tidying-up' operation, since the reduction 
of the answer set after the bindings are first identified has a 
knock-on effect on subsequent SPARQL processing and filtering. (There 
is an example in the IRC log, which I cannot access at present.) 
Thus, this option must be somehow incorporated into the SPARQL 
definition, and SPARQL-defined syntax must be provided to allow users 
to specify whether they want redundancies to be removed or not, if we 
indeed choose to allow this as a user option.

Another possible option is to simply require that SPARQL treats all 
KBs as being lean, but this would seem to impose too high a burden on 
KB maintainers (and would not satisfy Peter's requirement that 
implementations be free to choose lean or fat storage (?))

7a. However, there seem to be still some issues to discuss. It is 
clearly possible for a query to distinguish G1 and G3, which are not 
logically equivalent and both of which are lean; but if we apply 
Enrico's redundancy removal to the answer sets obtained from the 
first query, it will fail to distinguish G1 from G3. One can argue 
that this is appropriate, or at least tolerable, for the 'remote' 
case of querying, since the bnode binding in the answer provides no 
useful extra information, and the distinction between the two graphs 
is revealed by a more elaborate query. However, it can also be argued 
that it is seems silly to require SPARQL applications to go to 
considerable computational trouble to deliberately hide information 
which could be genuinely useful, and which is at worst merely 
redundant. And it is clearly not appropriate behavior for the 
'huddle' case.

If instead of removing redundacies in the answer set, we simply 
required the KB to 'act lean', then the results would be different:

G1:  ex:c
G2:  ex:c
G3:  ex:c,  _:y

satisfying Peter's indistinguishability criterion, and allowing the 
distinction to be visible. But this adds a burden to implementations 
of servers, which may be unacceptable for query services which 
operate with very large data sets.

As this example shows, requiring an 'effectively lean' KB - which 
must be done by the server - and removing redundancies in the answer 
set - which can be done by the client -  are not equivalent. This 
point was not clear to me at the time of the telecon, but I do not 
know if it was clear to everyone else. As these point did not emerge 
(as far as I recall) in the telecon discussions, it probably needs 
some more discussion.

Another possible option might be to treat a huddle 'session' as 
similar to a single remote query, and have the option be that 'final' 
answer tables be purged of redundancy, so that this purging would be 
performed after each query match for the remote case, but only at the 
end of a 'session' for the huddle case. Although the details would 
need to be checked, this might respond sufficiently to Peter's 
requirements while allowing 'redundant' bnodes to be used as handles 
for subsequent queries during a huddle session. We could then 
describe the huddle option simply as a way to build up a real query 
in stages, and require real query results to be irredundant.

(My own view is that none of this redundancy-removal is necessary, 
but I recognize that this does not seem to be the majority view.)


----

Sorry if this raises more questions than it answers.

Pat



-- 
---------------------------------------------------------------------
IHMC		(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32502			(850)291 0667    cell
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Thursday, 22 September 2005 22:03:22 UTC