- From: Pat Hayes <phayes@ihmc.us>
- Date: Thu, 22 Sep 2005 17:03:08 -0500
- To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
- Cc: Dan Connolly <connolly@w3.org>, Ian Horrocks <horrocks@cs.man.ac.uk>, Enrico Franconi <franconi@inf.unibz.it>, Peter F. Patel-Schneider <pfps@research.bell-labs.com>, Bijan Parsia <bparsia@isr.umd.edu>
Here is my attempt to summarize the main points that were agreed on, more or less, in the telecon discussion on Tuesday. I may not have got them all, so if others' memories differ, please correct me. Also, this does not claim to be a transcript, and it does not cover all the issues that got raised, and some of it reflects thinking Ive done since the telecon, including a suggested rewording of the 'skolemization' description of bnode handling. Enrico, there is a direct question for you at @@@ below. ------- 1. We can distinguish between the basic matching operation between a simple query pattern and the KB, and the various filtering and selections that may be done to the results of that basic operation, which amount to treating the results as a table (or tables) and manipulating it (them) in various ways, essentially algebraic. Most of the discussion concerns that basic operation and how it should be described. It is agreed that all the subsequent filtering or value-checking operations on the result tables are NOT best regarded as entailment. 2. Right now the basic operation is described as follows. A substitution B is a mapping from variables of a pattern Q to URIrefs, literals or bnodes, and an answer is a binding with B(Q) a subgraph of KB. There seems to be a strongly and widely held opinion that it would be less confusing, or at least helpful to many folk, to describe this in terms of simple entailment rather than as matching to a subgraph: an answer is a binding B such that B(Q) is simply entailed by KB. The two are nearly equivalent: see below for more discussion. Moreover, many feel that this use of entailment language at the heart of the specification would provide a better (possibly, an essential) segway between SPARQL and future extensions of SPARQL to other languages, by replacing simple or RDF entailment with more complex notions of entailment, particularly OWL entailment. 2a. The question arose: if SPARQL can be used (perhaps in the future, but let us look ahead) with various notions of entailment, should a query be able, or be required, to specify the kind of entailment intended? Although this was discussed only briefly, there was a consensus that it would be acceptable for the entailment in use to be specified as part of the 'service', identified by the URI currently used to name the target graph. This means that the query language and protocol would not need to be concerned directly with specifying which entailment is supposed to be used. Note however that this does go beyond the idea of this URI identifying a virtual RDF graph, since for example OWL entailment cannot be naturally reduced to RDF entailment from an OWL closure, even an infinite virtual closure. The spec should therefore be written so that querying against a URI which identifies an OWL-querying service can provide bindings B such that B(Q) is OWL-entailed by the KB, not (as currently) bindings B such that B(Q) is simply entailed by a virtual RDF graph. The important theoretical point here is that reducing X-entailment to simple entailment by an X-closure has proven to be a useful technique for RDF and RDFS, but is much less naturally extendable to languages like OWL which have a nontrivial disjunction or union operator. (We might argue about whether it is always *possible* to gloss X-entailment as simple entailment from an X-closure, but it seems clear that even if it can be done, it will be far clearer and more natural not to bother, but instead simply refer to bindings which achieve OWL entailment.) We are indebted to Enrico for making this point vividly clear with the 'little-house' (aka 'worker') example. (I would suggest, and this is purely my own personal view, that we can adopt a compromise here, in which SPARQL in its current release will refer to simple entailment; the issue is pointed out; the actual spec. refers to virtual graphs identified by URIs, and refers to the RDF and RDFS closure lemmas; and the possibility of using URIs to identify services which offer other kinds of entailment is pointed out as a future extension path. This allows conforming implementations which support only simple matching and perform no inferences, and it allows more advanced implementations certainly up to RDFS and perhaps OWL lite, and has a very simple and straightforward extension path to the use of SPARQL with more sophisticated OWL engines simply by rewriting part of the text of the spec, without requiring changes to the actual SPARQL language or protocol.) 3. As Enrico's document points out, in order to correctly describe the current SPARQL answer extraction process as entailment, even simple entailment, we need to be a little more explicit about the treatment of bindings to bnodes. Currently we allow answer bindings of variables to bnodes, in order to accommodate two rather different cases. This may have given rise to some of the reported confusion, because we do not explicitly distinguish the two cases; and in one of them, our treatment of bnode bindings is puzzling. The two cases can be described in terms of the scope of the bnodes in an answer binding. (This way of phrasing it was not agreed during the telecon, but bear with me.) 3a. One case, call this the remote or 'out-of-scope' case, is where a bnode in an answer is simply taken to be an existential assertion in the answer. The particular identifier used is scoped only to the answer itself, and has no meaning in any subsequent queries, and need not be the same bnode ID as was used in the graph. Queries cannot be used to elicit further information about things that are denoted by such bnode bindings, and there is no notion of the 'scope' of a session. This treats the query instance B(Q) as a separate graph from the KB. This is the natural way to think of query traffic between a client and a remote server across an open network, where answers to queries are treated as self-contained items of information. 3b. In the other case, call this the huddle, or 'in-scope', case, an answer binding to a bnode is supposed to provide a bnode identifier which can be used in subsequent queries, and still identifies the same graph node. This case is appropriate when the query is seen as part of a 'session', something like a private conversation between client and server, with bnodeIDs being used throughout a session in both answers and subsequent queries, and identifying the same entity throughout. This case is appropriate, for example, when a client and server are in close proximity within an application or system, even if only for a short period. One can think of this logically as the entire session taking place within the scope of the existential variables in the KB, i.e. these scopes extend across the whole series of queries and answers within a single session, so that the query instance is part of the same graph as the KB. In this case, then, a bnode identifier in an answer is playing a very similar role as a URIreference (at least during the session), in that they are both genuine names which are taken to be identifiers during the session itself; and it is important to allow such bindings because there may be further facts, which can be elicited by subsequent queries, which allow the bnode to be distinguished from a URIref in the same answer. (Example below.) 3c. The wording in the current spec is, I believe, the result of a compromise which attempted to allow both these uses without actually mentioning them explicitly, by being sufficiently permissive. Unfortunately, without explanation, the result is confusing and hard to interpret coherently in semantic terms. 4. The first, remote, case is naturally described in terms of simple entailment. Think of the variables in a pattern as bnodes, scoped to the query, then a simple query pattern *is* an RDF graph, and the query succeeds just when the query graph is simply entailed by the KB. The pattern variables can then be seen just a syntactic device for identifying which aspects of the instantiation should be included in the result tables. 5. The second, huddle, case, cannot be so directly described in terms of entailment, but needs some twisting. Enrico provides one way to say it, in terms of skolemization, i.e. in terms of entailment by an alternative graph obtained by systematically naming each bnode with a new URIreference, then replacing those URIreferences with the original bnodes before returning the answer bindings. Other ways of describing it are possible. They are all equivalent, so the choice between them is editorial/expositional. We could describe this as saying that the pattern has an instance which would be simply entailed by the graph when all the bnodes in the graph are treated as being URIrefs. This says essentially the same thing, but avoids the awkwardness of having to reverse-substitute for the skolem constants; but it requires us to explain what it means to 'treat' a bnode as a name. Another alternative way, which avoids talk of names and skolemization altogether, would be to say that the answer in this case is a binding B such that KB simply entails the union (not the merge) of KB and B(Q). Then the two cases are: remote: KB simply entails B(Q) huddle: KB simply entails (KB union B(Q)) Note that in the absence of bnodes, these would be identical. This also illustrates the way that a series of queries in a single session can be thought of as 'accumulating' within the scope of the original graph, since union is associative, so querying with Q1 and then with Q2 is just like querying with (Q1 union Q2). On this way of describing it, we can think of each query as part of a larger query, which is built up in stages during the session. @@@Enrico, is that an acceptable re-drafting of your 'skolemization' way of describing it? I think it will be easier for many people to follow, and avoids us having to define a whole lot of syntactic skolemization machinery.@@@) 6. The difference between the two is illustrated by the following example: G1: {ex:a ex:p ex:c .} G2: {ex:a ex:p ex:c . ex:a ex:p _:y .} G3: {ex:a ex:p ex:c . ex:a ex:p _:y . _:y ex:q ex:d .} Each is a subgraph of the next; G1 and G3 are lean, G2 is not. G1 and G2 simply entail each other and are entailed by, but do not entail, G3. With SPARQL as currently defined, the query pattern [ ex:a ex:p ?v ] will yield the answers: G1: ex:c G2: ex:c, _:y G3: ex:c, _:y Now, if we understand the query in the remote, out-of-scope case, it can be argued that the latter two are redundant, since the second binding to the bnode adds no useful information; and, further, that it is inappropriate that two logically equivalent graphs (G1 and G2) should give different answer bindings. But for the huddle, in-scope case, it is important that the second binding for ?var is provided in the G3 case, since it allows for a subsequent query pattern [_:y ex:q ?w ] to succeed with the binding ?w = ex:d, distinguishing _:y from ex:c (which is not returned as an answer here), and producing a genuine alternative pair of answers. In order to extract this data from the graph in the out-of-scope case, one would need to use a more complex query pattern for the second query, incorporating the first query and identifying the co-reference explicitly: [ex:a ex:p ?v ?v ex:q ?w ] and as noted, we could think of the entire huddle session as building up this query in parts, the use of the bnode 'name' from the first part-query result in the second part-query providing the link between the pieces of the query expressed here by the use of the same query variable ?v 7. Another, related, question is whether or not logically equivalent graphs should give identical answer bindings. We did not give this issue as much telecon attention as it needed, mia culpa. Peter P-S, if I understand his point, claims that implementations should be free to store graphs in RDF-equivalent forms (in particular, to convert graphs to their 'lean' subgraphs) without affecting interoperability, so that it should not be possible for a SPARQL query to distinguish G1 and G2. On the telecon we discussed Enrico's suggestion that SPARQL answers should have redundancies removed, so as to achieve this. It was noted that this could be done either at the client or server side of the transaction, and it was agreed that it should be a client-specifiable option, if included, rather than mandatory (although this does not seem to fully meet Peter's objection. Do I have this right??). Enrico noted however that the operation cannot be done simply as a client-side post-SPARQL 'tidying-up' operation, since the reduction of the answer set after the bindings are first identified has a knock-on effect on subsequent SPARQL processing and filtering. (There is an example in the IRC log, which I cannot access at present.) Thus, this option must be somehow incorporated into the SPARQL definition, and SPARQL-defined syntax must be provided to allow users to specify whether they want redundancies to be removed or not, if we indeed choose to allow this as a user option. Another possible option is to simply require that SPARQL treats all KBs as being lean, but this would seem to impose too high a burden on KB maintainers (and would not satisfy Peter's requirement that implementations be free to choose lean or fat storage (?)) 7a. However, there seem to be still some issues to discuss. It is clearly possible for a query to distinguish G1 and G3, which are not logically equivalent and both of which are lean; but if we apply Enrico's redundancy removal to the answer sets obtained from the first query, it will fail to distinguish G1 from G3. One can argue that this is appropriate, or at least tolerable, for the 'remote' case of querying, since the bnode binding in the answer provides no useful extra information, and the distinction between the two graphs is revealed by a more elaborate query. However, it can also be argued that it is seems silly to require SPARQL applications to go to considerable computational trouble to deliberately hide information which could be genuinely useful, and which is at worst merely redundant. And it is clearly not appropriate behavior for the 'huddle' case. If instead of removing redundacies in the answer set, we simply required the KB to 'act lean', then the results would be different: G1: ex:c G2: ex:c G3: ex:c, _:y satisfying Peter's indistinguishability criterion, and allowing the distinction to be visible. But this adds a burden to implementations of servers, which may be unacceptable for query services which operate with very large data sets. As this example shows, requiring an 'effectively lean' KB - which must be done by the server - and removing redundancies in the answer set - which can be done by the client - are not equivalent. This point was not clear to me at the time of the telecon, but I do not know if it was clear to everyone else. As these point did not emerge (as far as I recall) in the telecon discussions, it probably needs some more discussion. Another possible option might be to treat a huddle 'session' as similar to a single remote query, and have the option be that 'final' answer tables be purged of redundancy, so that this purging would be performed after each query match for the remote case, but only at the end of a 'session' for the huddle case. Although the details would need to be checked, this might respond sufficiently to Peter's requirements while allowing 'redundant' bnodes to be used as handles for subsequent queries during a huddle session. We could then describe the huddle option simply as a way to build up a real query in stages, and require real query results to be irredundant. (My own view is that none of this redundancy-removal is necessary, but I recognize that this does not seem to be the majority view.) ---- Sorry if this raises more questions than it answers. Pat -- --------------------------------------------------------------------- IHMC (850)434 8903 or (650)494 3973 home 40 South Alcaniz St. (850)202 4416 office Pensacola (850)202 4440 fax FL 32502 (850)291 0667 cell phayesAT-SIGNihmc.us http://www.ihmc.us/users/phayes
Received on Thursday, 22 September 2005 22:03:22 UTC