- From: Pat Hayes <phayes@ihmc.us>
- Date: Tue, 1 Aug 2006 14:33:23 -0700
- To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
- Cc: Bijan Parsia <bparsia@cs.man.ac.uk>, Eric Prud'hommeaux <eric@w3.org>, Enrico Franconi <franconi@inf.unibz.it>
There have been rather strong claims made in recent emails about how bnodes in query patterns should (must?) be handled, in some cases accompanied by inappropriate condescending ad-hominem comments. I will say no more about the latter but focus on the former, as many of the technical claims being made do not seem to me to be justified on either semantic or pragmatic grounds; and because this issue has significant consequences for the design of SPARQL. First, a brief review of the issue as I understand it. (1) There is an obvious similarity, noted by several commentators, between a query pattern containing a bnode and a similar pattern with a variable in place of the bnode. Let me refer to these as the 'bnode pattern' and the 'variable pattern'. For simple (RDF/S) entailment there is very little to choose between them, since an instance of the variable pattern is found in the reference graph just when the bnode pattern is simply entailed by the (RDF/S closure of) the reference graph: so one can, in these cases, think of a bnode in a pattern as simply being a 'blank variable' to which a binding is not required to be delivered in the answer (but is of course required to exist.) And from there it is a short step to the idea of simply banning bnodes from query patterns altogether, with a concomitant simplification to the spec., particularly to the semantics. (<Aside> there are pragmatic arguments in favor of allowing them, but these have been disputed: but in any case, if bnodes and variables are essentially equivalent, this debate can be conducted entirely on pragmatic grounds, without getting muddled with semantic issues.</Aside>) (2) There is an objection to this comfortable picture, which is that for more expressive languages (OWL-DL), it is possible for a KB to entail an existential statement without there being any term in the KB for a variable to bind to; and in such a case, the natural way to account for this situation would be for the variable-pattern query to fail, but the corresponding bnode-pattern query to succeed. This would therefore require that the two kinds of pattern be given fundamentally different semantics: intuitively, one might say that a bnode means 'some thing exists', but a query variable means 'some named thing exists', where the presence of the name is what makes the answer binding possible; and a bnodeID does not count as a 'name' in the required sense. (3) This idea is given terminological flesh by the distinction, introduced in http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0192 between 'distinguished' and 'undistinguished' variables, and the observation that SPARQL currently uses 'semi-distinguished' variables (which can bind to both "names" and bnodeIDs). In http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0018 this distinction is justified on the grounds that bnodes are not "names", so that it would not be appropriate to supply a bnodeID as an answer binding to a query variable; or at any rate, as is well-known to the cognoscenti, this is simply not the way that things are done: http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0031. (4) None of the reasons cited in the previous paragraph seem to me to be convincing objections to the obvious reply to the objection in (2), which is that bnodeIDs *can* be given as bindings to query variables; and since this is so, any engine which can determine the entailment of a pattern with a bnode, could deliver such a binding as an answer to the similar query pattern with a variable in place of the bnode. That is, if ASK PATTERN[_:x] (with _:x scoped to the pattern) is entailed, then SELECT ?x PATTERN[?x] should succeed, possibly with ?x bound to some suitable bnodeID generated by the querying engine. Adopting this as a design principle for querying languages in the RDF family would, I suggest, bring some overall regularity and simplicity to the spectrum of SPARQL versions (with different values for the entailment and scoping set parameters.) The fact that bnodes exist in the language means that purely existential answers can indeed be given by the same variable-binding mechanism as is used to deliver name bindings. In other words, there should be no difference between 'a thing exists such that...' and 'a named thing exists such that...', since the "name" can itself be a bnodeID, if necessary one generated for the sole purpose of being a binding for the query variable. Let me call this the 'existential name principle', as it amounts to the claim that the RDF language family makes no distinction between existence and named existence; since if something exists, the presence of bnodes means that we can, in effect, validly generate a name for it. In the rest of this message I will defend this idea against the objections raised in the above messages, and try to distinguish this issue from others with which it might be confused. (5) One objection to this idea, cited earlier, is that it embodies a logical confusion: bnodes are existentially bound *variables*, and are therefore not logical names. Now, while it is of course true that bnodes and URIrefs/literals (which I will collectively call 'names' here) have subtly different truth conditions, they are not so different as this contrast suggests. First, calling one a 'name' and the other a 'variable' is merely a matter of surface terminology. Many logical syntaxes make a lexical distinction but this is not necessary, and many others, such as ISO Common Logic, do not: a 'variable' then is simply a name that is bound by a quantifier. Second, more to the point, they have essentially the same semantics: they both in effect assert that something exists, and use a lexical token to refer to it. (It is worth noting that there are logical syntactic conventions which do not use lexical tokens *of any kind* to show existence, such as Peirce's existential graphs.) The only difference is that the scope of a name token might extend beyond the local syntactic context, allowing other assertions to be made about the thing denoted by the name; whereas an existential 'variable' is a lexical token only within the context defined by the scope of the quantifier. In natural deduction logical systems, logical names and existentials are almost interconvertible, using the rules EI and EG: PHI[name] ----------------EG (exists (x)PHI[x]) (exists (x)PHI[x]) --------------------EI PHI[name*] when name* does not occur free in the derivation, ie is a 'new' name. The need to invoke a 'new' name in EI illustrates the point about scoping: the 'newness' is a guarantee that no larger scope can 'capture' this name. But once this matter is dealt with, there really is no logical difference at all between a name and an existentially bound 'variable'. (Another way to illustrate this is to consider alternative textbook variations of logic which dispense with names altogether, and only have quantified variables. Logical 'names' then are simply existential variables which are bound at the top lexical level; a typical database is considered to be a huge existential sentence.) Bnodes are in fact even more similar to RDF names, i.e. to URIreferences, than is usual in a logical notation; since bnodes, like URIrefs, have global scope. This might seem an odd claim, but it is both true and important. The basic RDF graph syntax model, on which the RDF semantics is defined, has *no notion at all* of syntactic scope. The existential interpretation of bnodes is given relative to an RDF graph, but an RDF graph is simply any set of RDF triples. The set of all RDF triples on the Web is an RDF graph, as is the set of all syntactically possible RDF triples, as is any singleton set containing a single RDF triple, or anything in between. There in nothing in the abstract graph syntax which allows one to indicate the edge or extent of an RDF graph: no bnode binders or quantifiers or scoping syntax devices. All scoping of bnode IDs - not bnodes themselves, but their identifiers in a concrete interchange syntax such as SPARQL or RDF/XML - is presumed to be done by external conventions arising from the lexical/syntactic/scoping rules of whatever surface syntax form is being used to convey the abstract syntax, and it is these external conventions which determine which sets of triples are being considered as 'graphs' in any particular RDF application. This was a deliberate design decision, but perhaps one that was not adequately explained. What it amounts to is a decision to trade issues of lexical syntax (name scope, name binders, the free/bound distinction, etc.) for a purely mathematical 'abstract syntax' which treats all items uniformly. Thus, the only meaningful question about bnodes in RDF graphs is whether they are the same or not. If two graphs contain the same bnodes, then the implied 'scope' of that bnode is the union of the graphs. Thus, questions of lexical or syntactic scope which usually arise when talking about existential 'variables' in a conventional syntax are replaced, in the RDF graph syntax, by simple questions of identity between bnodes, allowing the entire graph syntax to be described using elementary set theory. The resulting picture of the RDF graph syntax has both URIrefs and bnodes being treated almost identically: both have global scope, both are referring terms, and their role in the semantics is almost identical; but bnodesIDs are usually treated as having document scope, while URIrefs are lexical items with global scope in all Web applications. (6) Another objection seems to be that to allow variables to bind to bnodes somehow introduces great additional complexity to the query-answering process. It seems to me that this must be wrong, because one can define the variable-binding option in terms of the other. Suppose that a query engine is capable of generating answers to ASK queries containing bnodes, then it can deliver appropriate answers to SELECT queries by the following pseudocode, where (gensymBnode) generates a new bnodeID and P[a/b] means the pattern P with a replaced by b: if ( ASK P[(gensymBnode)/?x] ) then return (bind ?x (gensymBnode)) Of course, this needs to be written more carefully to account for multiple variables, but you get the idea. And as written this does require doing an ASK for each pattern of variables in the query; but this is at worst a small multiple linear extra cost, and need only be done in cases where a real SELECT query on the variable has failed; and I am sure a more efficient method can be defined in practice, since the derivation of the inferences supporting the existential query will in any case involve generating suitable 'names'. (I can kind of see that in the presence of elaborate equality reasoning, problems might arise in determining whether or not multiple bindings are really the same. But even here, the possibility of providing a mere existential name - a bnode - does seem to provide an opportunity for computational savings with a possible loss of completeness but without sacrificing correctness.) (7) Finally, there seems to be a misapprehension that because any graph entails infinitely many other bnode-rich graphs which have it as an instance, that to allow bnodes as bindings will somehow open an infinite floodgate of redundant answers. And this would be true, if the spec were written in a very naive way so that entailment alone were a sufficient criterion for correctness of an answer binding. But this, of course, is why the SPARQL specs are written in the way that they are, to restrict answer bindings to a particular scoping set of possible bindings - in the case of normative SPARQL, the URIs and Bnodes which actually occur in the scoping graph. The separation between the scoping graph and the scoping set was written into the spec to allow for alternative strategies for handling OWL. Bijan notes that the spec as written therefore allows 'OWL-DL-SPARQL' - which, to emphasize, is not itself defined by the spec - to restrict itself to bindiings to an arbitrarily small set of legal answer bindings, and therefore to restrict itself so as to disallow bindings to bnodes. This is technically correct, but wholly against the design intention of those definitions. It was never the intent of that construction to restrict the set of legal answer bindings: it was, rather, to allow the scoping set to contain *extra* bnodeIDs, precisely in order to overcome the objection raised in the second paragraph above, i.e. to allow the query engine to invent new bnodes when they are required in order to provide an answer binding to a query which has a purely existential answer, for which no bnode which actually occurs in the scoping graph is an appropriate answer binding. (8) Although the restricted interpretation (where a scoping set for OWL-DL queries must contain no bnodes) does not violate any actual technical requirement of the current SPARQL draft, I would argue strongly against adopting it as a standard, to the point that I would argue for pre-emptively rendering it illegal by suitable wording to the current spec. With the restricted interpretation, for example, we would have the absurd situation that answer bindings could be derived from a graph using RDF entailment which are not legal using OWL-DL entailment, even when the graph itself is legal OWL-DL/RDF. (9) A final quick word on the claim of prior art implicit in the wording of some of the earlier messages. The entire subject of RDF querying is too new for there to be a substantial body of accepted techniques, accepted ideas, etc.., ; so any appeal, explicit or implicit, to the 'way things are done', or to a body of expert knowledge not vouchsafed to us all but weighty in its implications, should be treated as a symptom of blowhardiness; or at the very least in order to be taken seriously needs to be backed up by actual citations and/or arguments. Comments, ripostes, objections, etc. welcome. Pat Hayes -- --------------------------------------------------------------------- IHMC (850)434 8903 or (650)494 3973 home 40 South Alcaniz St. (850)202 4416 office Pensacola (850)202 4440 fax FL 32502 (850)291 0667 cell phayesAT-SIGNihmc.us http://www.ihmc.us/users/phayes
Received on Tuesday, 1 August 2006 21:33:37 UTC