- From: Pat Hayes <phayes@ihmc.us>
- Date: Thu, 22 Sep 2005 17:03:08 -0500
- To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
- Cc: Dan Connolly <connolly@w3.org>, Ian Horrocks <horrocks@cs.man.ac.uk>, Enrico Franconi <franconi@inf.unibz.it>, Peter F. Patel-Schneider <pfps@research.bell-labs.com>, Bijan Parsia <bparsia@isr.umd.edu>
Here is my attempt to summarize the main points that were agreed on,
more or less, in the telecon discussion on Tuesday. I may not have
got them all, so if others' memories differ, please correct me. Also,
this does not claim to be a transcript, and it does not cover all the
issues that got raised, and some of it reflects thinking Ive done
since the telecon, including a suggested rewording of the
'skolemization' description of bnode handling.
Enrico, there is a direct question for you at @@@ below.
-------
1. We can distinguish between the basic matching operation between a
simple query pattern and the KB, and the various filtering and
selections that may be done to the results of that basic operation,
which amount to treating the results as a table (or tables) and
manipulating it (them) in various ways, essentially algebraic. Most
of the discussion concerns that basic operation and how it should be
described. It is agreed that all the subsequent filtering or
value-checking operations on the result tables are NOT best regarded
as entailment.
2. Right now the basic operation is described as follows. A
substitution B is a mapping from variables of a pattern Q to URIrefs,
literals or bnodes, and an answer is a binding with B(Q) a subgraph
of KB. There seems to be a strongly and widely held opinion that it
would be less confusing, or at least helpful to many folk, to
describe this in terms of simple entailment rather than as matching
to a subgraph: an answer is a binding B such that B(Q) is simply
entailed by KB. The two are nearly equivalent: see below for more
discussion.
Moreover, many feel that this use of entailment language at the heart
of the specification would provide a better (possibly, an essential)
segway between SPARQL and future extensions of SPARQL to other
languages, by replacing simple or RDF entailment with more complex
notions of entailment, particularly OWL entailment.
2a. The question arose: if SPARQL can be used (perhaps in the future,
but let us look ahead) with various notions of entailment, should a
query be able, or be required, to specify the kind of entailment
intended? Although this was discussed only briefly, there was a
consensus that it would be acceptable for the entailment in use to be
specified as part of the 'service', identified by the URI currently
used to name the target graph. This means that the query language and
protocol would not need to be concerned directly with specifying
which entailment is supposed to be used. Note however that this does
go beyond the idea of this URI identifying a virtual RDF graph, since
for example OWL entailment cannot be naturally reduced to RDF
entailment from an OWL closure, even an infinite virtual closure. The
spec should therefore be written so that querying against a URI which
identifies an OWL-querying service can provide bindings B such that
B(Q) is OWL-entailed by the KB, not (as currently) bindings B such
that B(Q) is simply entailed by a virtual RDF graph.
The important theoretical point here is that reducing X-entailment to
simple entailment by an X-closure has proven to be a useful technique
for RDF and RDFS, but is much less naturally extendable to languages
like OWL which have a nontrivial disjunction or union operator. (We
might argue about whether it is always *possible* to gloss
X-entailment as simple entailment from an X-closure, but it seems
clear that even if it can be done, it will be far clearer and more
natural not to bother, but instead simply refer to bindings which
achieve OWL entailment.) We are indebted to Enrico for making this
point vividly clear with the 'little-house' (aka 'worker') example.
(I would suggest, and this is purely my own personal view, that we
can adopt a compromise here, in which SPARQL in its current release
will refer to simple entailment; the issue is pointed out; the actual
spec. refers to virtual graphs identified by URIs, and refers to the
RDF and RDFS closure lemmas; and the possibility of using URIs to
identify services which offer other kinds of entailment is pointed
out as a future extension path. This allows conforming
implementations which support only simple matching and perform no
inferences, and it allows more advanced implementations certainly up
to RDFS and perhaps OWL lite, and has a very simple and
straightforward extension path to the use of SPARQL with more
sophisticated OWL engines simply by rewriting part of the text of the
spec, without requiring changes to the actual SPARQL language or
protocol.)
3. As Enrico's document points out, in order to correctly describe
the current SPARQL answer extraction process as entailment, even
simple entailment, we need to be a little more explicit about the
treatment of bindings to bnodes. Currently we allow answer bindings
of variables to bnodes, in order to accommodate two rather different
cases. This may have given rise to some of the reported confusion,
because we do not explicitly distinguish the two cases; and in one of
them, our treatment of bnode bindings is puzzling.
The two cases can be described in terms of the scope of the bnodes in
an answer binding. (This way of phrasing it was not agreed during the
telecon, but bear with me.)
3a. One case, call this the remote or 'out-of-scope' case, is where
a bnode in an answer is simply taken to be an existential assertion
in the answer. The particular identifier used is scoped only to the
answer itself, and has no meaning in any subsequent queries, and need
not be the same bnode ID as was used in the graph. Queries cannot be
used to elicit further information about things that are denoted by
such bnode bindings, and there is no notion of the 'scope' of a
session. This treats the query instance B(Q) as a separate graph from
the KB. This is the natural way to think of query traffic between a
client and a remote server across an open network, where answers to
queries are treated as self-contained items of information.
3b. In the other case, call this the huddle, or 'in-scope', case, an
answer binding to a bnode is supposed to provide a bnode identifier
which can be used in subsequent queries, and still identifies the
same graph node. This case is appropriate when the query is seen as
part of a 'session', something like a private conversation between
client and server, with bnodeIDs being used throughout a session in
both answers and subsequent queries, and identifying the same entity
throughout. This case is appropriate, for example, when a client and
server are in close proximity within an application or system, even
if only for a short period. One can think of this logically as the
entire session taking place within the scope of the existential
variables in the KB, i.e. these scopes extend across the whole series
of queries and answers within a single session, so that the query
instance is part of the same graph as the KB. In this case, then, a
bnode identifier in an answer is playing a very similar role as a
URIreference (at least during the session), in that they are both
genuine names which are taken to be identifiers during the session
itself; and it is important to allow such bindings because there may
be further facts, which can be elicited by subsequent queries, which
allow the bnode to be distinguished from a URIref in the same answer.
(Example below.)
3c. The wording in the current spec is, I believe, the result of a
compromise which attempted to allow both these uses without actually
mentioning them explicitly, by being sufficiently permissive.
Unfortunately, without explanation, the result is confusing and hard
to interpret coherently in semantic terms.
4. The first, remote, case is naturally described in terms of simple
entailment. Think of the variables in a pattern as bnodes, scoped to
the query, then a simple query pattern *is* an RDF graph, and the
query succeeds just when the query graph is simply entailed by the
KB. The pattern variables can then be seen just a syntactic device
for identifying which aspects of the instantiation should be included
in the result tables.
5. The second, huddle, case, cannot be so directly described in terms
of entailment, but needs some twisting.
Enrico provides one way to say it, in terms of skolemization, i.e. in
terms of entailment by an alternative graph obtained by
systematically naming each bnode with a new URIreference, then
replacing those URIreferences with the original bnodes before
returning the answer bindings.
Other ways of describing it are possible. They are all equivalent, so
the choice between them is editorial/expositional.
We could describe this as saying that the pattern has an instance
which would be simply entailed by the graph when all the bnodes in
the graph are treated as being URIrefs. This says essentially the
same thing, but avoids the awkwardness of having to
reverse-substitute for the skolem constants; but it requires us to
explain what it means to 'treat' a bnode as a name.
Another alternative way, which avoids talk of names and skolemization
altogether, would be to say that the answer in this case is a binding
B such that KB simply entails the union (not the merge) of KB and
B(Q). Then the two cases are:
remote: KB simply entails B(Q)
huddle: KB simply entails (KB union B(Q))
Note that in the absence of bnodes, these would be identical.
This also illustrates the way that a series of queries in a single
session can be thought of as 'accumulating' within the scope of the
original graph, since union is associative, so querying with Q1 and
then with Q2 is just like querying with (Q1 union Q2). On this way of
describing it, we can think of each query as part of a larger query,
which is built up in stages during the session.
@@@Enrico, is that an acceptable re-drafting of your 'skolemization'
way of describing it? I think it will be easier for many people to
follow, and avoids us having to define a whole lot of syntactic
skolemization machinery.@@@)
6. The difference between the two is illustrated by the following example:
G1: {ex:a ex:p ex:c .}
G2: {ex:a ex:p ex:c .
ex:a ex:p _:y .}
G3: {ex:a ex:p ex:c .
ex:a ex:p _:y .
_:y ex:q ex:d .}
Each is a subgraph of the next; G1 and G3 are lean, G2 is not. G1 and
G2 simply entail each other and are entailed by, but do not entail,
G3.
With SPARQL as currently defined, the query pattern
[ ex:a ex:p ?v ]
will yield the answers:
G1: ex:c
G2: ex:c, _:y
G3: ex:c, _:y
Now, if we understand the query in the remote, out-of-scope case, it
can be argued that the latter two are redundant, since the second
binding to the bnode adds no useful information; and, further, that
it is inappropriate that two logically equivalent graphs (G1 and G2)
should give different answer bindings. But for the huddle, in-scope
case, it is important that the second binding for ?var is provided in
the G3 case, since it allows for a subsequent query pattern
[_:y ex:q ?w ]
to succeed with the binding ?w = ex:d, distinguishing _:y from ex:c
(which is not returned as an answer here), and producing a genuine
alternative pair of answers. In order to extract this data from the
graph in the out-of-scope case, one would need to use a more complex
query pattern for the second query, incorporating the first query and
identifying the co-reference explicitly:
[ex:a ex:p ?v
?v ex:q ?w ]
and as noted, we could think of the entire huddle session as building
up this query in parts, the use of the bnode 'name' from the first
part-query result in the second part-query providing the link between
the pieces of the query expressed here by the use of the same query
variable ?v
7. Another, related, question is whether or not logically equivalent
graphs should give identical answer bindings. We did not give this
issue as much telecon attention as it needed, mia culpa.
Peter P-S, if I understand his point, claims that implementations
should be free to store graphs in RDF-equivalent forms (in
particular, to convert graphs to their 'lean' subgraphs) without
affecting interoperability, so that it should not be possible for a
SPARQL query to distinguish G1 and G2.
On the telecon we discussed Enrico's suggestion that SPARQL answers
should have redundancies removed, so as to achieve this. It was noted
that this could be done either at the client or server side of the
transaction, and it was agreed that it should be a client-specifiable
option, if included, rather than mandatory (although this does not
seem to fully meet Peter's objection. Do I have this right??). Enrico
noted however that the operation cannot be done simply as a
client-side post-SPARQL 'tidying-up' operation, since the reduction
of the answer set after the bindings are first identified has a
knock-on effect on subsequent SPARQL processing and filtering. (There
is an example in the IRC log, which I cannot access at present.)
Thus, this option must be somehow incorporated into the SPARQL
definition, and SPARQL-defined syntax must be provided to allow users
to specify whether they want redundancies to be removed or not, if we
indeed choose to allow this as a user option.
Another possible option is to simply require that SPARQL treats all
KBs as being lean, but this would seem to impose too high a burden on
KB maintainers (and would not satisfy Peter's requirement that
implementations be free to choose lean or fat storage (?))
7a. However, there seem to be still some issues to discuss. It is
clearly possible for a query to distinguish G1 and G3, which are not
logically equivalent and both of which are lean; but if we apply
Enrico's redundancy removal to the answer sets obtained from the
first query, it will fail to distinguish G1 from G3. One can argue
that this is appropriate, or at least tolerable, for the 'remote'
case of querying, since the bnode binding in the answer provides no
useful extra information, and the distinction between the two graphs
is revealed by a more elaborate query. However, it can also be argued
that it is seems silly to require SPARQL applications to go to
considerable computational trouble to deliberately hide information
which could be genuinely useful, and which is at worst merely
redundant. And it is clearly not appropriate behavior for the
'huddle' case.
If instead of removing redundacies in the answer set, we simply
required the KB to 'act lean', then the results would be different:
G1: ex:c
G2: ex:c
G3: ex:c, _:y
satisfying Peter's indistinguishability criterion, and allowing the
distinction to be visible. But this adds a burden to implementations
of servers, which may be unacceptable for query services which
operate with very large data sets.
As this example shows, requiring an 'effectively lean' KB - which
must be done by the server - and removing redundancies in the answer
set - which can be done by the client - are not equivalent. This
point was not clear to me at the time of the telecon, but I do not
know if it was clear to everyone else. As these point did not emerge
(as far as I recall) in the telecon discussions, it probably needs
some more discussion.
Another possible option might be to treat a huddle 'session' as
similar to a single remote query, and have the option be that 'final'
answer tables be purged of redundancy, so that this purging would be
performed after each query match for the remote case, but only at the
end of a 'session' for the huddle case. Although the details would
need to be checked, this might respond sufficiently to Peter's
requirements while allowing 'redundant' bnodes to be used as handles
for subsequent queries during a huddle session. We could then
describe the huddle option simply as a way to build up a real query
in stages, and require real query results to be irredundant.
(My own view is that none of this redundancy-removal is necessary,
but I recognize that this does not seem to be the majority view.)
----
Sorry if this raises more questions than it answers.
Pat
--
---------------------------------------------------------------------
IHMC (850)434 8903 or (650)494 3973 home
40 South Alcaniz St. (850)202 4416 office
Pensacola (850)202 4440 fax
FL 32502 (850)291 0667 cell
phayesAT-SIGNihmc.us http://www.ihmc.us/users/phayes
Received on Thursday, 22 September 2005 22:03:22 UTC