Re: Final text for Basic Graph Patterns from Pat Hayes on 2006-01-17 (public-rdf-dawg@w3.org from January to March 2006)

From: Pat Hayes <phayes@ihmc.us>
Date: Tue, 17 Jan 2006 10:42:45 -0600
To: Enrico Franconi <franconi@inf.unibz.it>
Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-Id: <p06230902bff1763e9b7f@[10.100.0.23]>
>On 15 Jan 2006, at 19:01, Seaborne, Andy wrote:
>
>>Enrico Franconi wrote:
>>
>>>The new proposal of Pat 
>>><http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JanMar/0061.html> 
>>>does not work for any approach where bnodes 
>>>are implicit, and this happens not only for 
>>>OWL-Lite entailment, as we already pointed out 
>>>in 
>>><http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JanMar/0064.html>, 
>>>but also for RDF entailment. For example, 
>>>given the graph
>>>     :john :age "25"^^xsd:decimal .
>>>the query
>>>     ASK { _:b rdf:type rdf:XMLLiteral }
>>>should clearly return YES if RDF entailment is considered.

I agree, and see below.

>>>However,
>>>according to the latest proposal from Pat the 
>>>answer to the query would be NO regardless of 
>>>the type of entailment considered. The reason 
>>>is that Pat considers bnodes in the query 
>>>restricted to be bound only to URIs and bnodes 
>>>explicitly appearing in the graph.

This restriction is to terms occurring in the 
scoping graph, not in the original graph. For the 
SPARQL case of simple entailment, the scoping 
graph is defined to be equivalent to the dataset 
graph, but for  RDF or RDFS entailment, the 
appropriate adjustment would be to define the 
scoping graph G' to be the RDF or RDFS closure of 
the data set graph G. The only general semantic 
condition is that the scoping graph must be 
entailment-equivalent to, i.e. entail and be 
entailed by, the dataset graph, using whatever is 
the appropriate notion of entailment. The only 
purpose of including this graph in the 
definition, bear in mind, is to be a 'context' 
which restricts and organizes the answer bnode 
bindings, and to minimize redundancy. But in any 
case, if an adjustment to the definition of 
scoping is required, it can be made 
appropriately, perhaps by relaxing the condition 
that bnodes bound to variables must actually 
occur in G'. I don't think this strict condition 
is going to work for OWL in any case, certainly 
not for OWL/RDF. Such adjustments are purely 
pragmatic, and do not reflect any underlying 
semantic change.

The main point regarding bnodes in answer 
patterns is an operational one. If an engine is 
capable of determining that an existential 
pattern is entailed, then it is surely capable of 
supplying a bnode binding to a variable in the 
place of the existential. To not do so would be 
gratuitously misleading as a query answer, since 
to supply such a binding is an almost cost-free 
operation. It seems to me that this principle 
should apply to queries relative to any language 
which uses bnodes/existentials. To refuse to 
supply an existential name as a binding, and 
report 'no answer', when an existential is 
entailed, is at least as misleading as supplying 
an existential when there is a real name in the 
database. This seems to me to be so misleading, 
in fact, that we should forbid it.

The suggested 'principle' is that if an query 
pattern Q containing a bnode succeeds, then the 
similar query pattern with a new query variable 
in place of the bnode must also succeed. The 
justification being that the engine can generate 
a new bnodeID to substitute for the variable, 
under the exact same circumstances that it can 
establish that the existential is entailed. (The 
work needed to establish entailment, in the 
'awkward' cases, so much exceeds the work 
required to generate a new bnodeID that the 
latter seems hardly worth arguing about.) SPARQL 
satisfies this trivially, but I suggest that we 
impose it as a requirement on all extensions.

>>This hinges on the different definitions of 'pattern solution'.
>>The scoping
>>graph proposal maps variables and bNodes; the ordered-merge proposal maps
>>just variables.  Each has consequences.
>
>The main consequence of Pat's latest proposal, 
>as pointed out by FUB, is that SPARQL can never 
>be extended - not even to RDF entailment - as 
>you can see from the example above.

"Never be" is a very strong claim, and in this 
case false. The notion of 'scoping graph' used in 
the proposal can be used as a parameter to adjust 
the definition to suit. The scoping graph 
provides the scope for the answer bindings. For 
more elaborate forms of entailment which allow 
entailments of existentials, this scope should be 
allowed to include terms which represent the 
extra objects whose existence is entailed.

>Moreover, if you want to map both variables and 
>bnodes, you better disallow bnodes in the query 
>and consider only variables in the query: there 
>wouldn't be any noticeable difference.

The fact that bnodes and variables are almost 
identical in function, so that bnodes in query 
patterns are 'blank variables', seems to me to be 
a feature rather than a bug. This corresponds 
exactly to the basic semantical picture in which 
a query is a sentence on the RHS of a sequent, 
i.e. a goal to be proved: a sentence being 
entailed rather than a sentence being asserted.

>>If bNodes are bound by pattern solutions, then the algebra works out as per
>>the current document.
>
>And bnodes would behave *exactly* as variables, 
>and therefore when considering the extension of 
>SPARQL with just RDF entailment, the semantics 
>would not be compatible, and the behaviour would 
>be wrong.

All, and I really do mean ALL, of these debates 
are concerned with syntactic devices whose only 
purpose is to restrict the answer set in order to 
avoid redundancy. This is a significant matter in 
query language design, but it has nothing to do 
with semantics. The only genuinely semantic 
constraint is that answer bindings should make 
the query pattern into something that is 
appropriately entailed by the dataset graph. All 
the rest of this entire discussion has been 
concerned with tweaking this basic, clear, 
semantic idea so as to restrict the answer set by 
eliminating irrelevant answers which are 
entailed, but which add no new information, i.e. 
are redundant.

My own view is that the costs and risks of trying 
to build appropriate such restrictions into a 
single definition of 'answer', which is supposed 
to survive unchanged through all possible 
entailment modes, are too high to justify this 
building-in, and that what we should do in 
stating the general extendable SPARQL conditions 
is, as suggested in 
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JanMar/0016, 
to separate out the logically necessary 
requirement on answers  which are universally 
agreed, clearly semantically motivated, and have 
a simple 'entail' parameter, so can be used 
without any change all the way from RDF to 
higher-order logic  from the conditions on 
answer sets which are there only to remove or 
minimize redundancy: and that we should document 
and explain this distinction between the 
different roles and purposes of the different 
parts of the definition, as a guide to future 
work. Who knows, some enterprising person might 
invent a clever hash-coding algorithm which can 
completely eliminate redundancy in answer sets 
in, say, log-n time in all practical cases, and 
put it in the public domain. If so, great, IMO: 
why should we define such a possibility out of 
consideration? (Which we do with the current 
definitions.) The point being that redundancy 
elimination is itself easy to state 
'semantically' as well, but (we all recognize) 
this easy definition defines a condition which is 
too onerous a computational burden to impose on 
developers and users right now. But this is not 
an implementational/pragmatic issue, not a 
semantic one: the semantic definition of 'less 
redundant' is about as clear and simple as it can 
get.  So, let us not impose it or try to build it 
into the definitions, but instead recommend 
(using 'should/should not' language) that it be 
approached as closely as reasonably, and leave it 
to implementers to make incremental improvements 
in how close they can get. The thing we are 
straining to get defined here is really just a 
specification of a particular implementational 
strategy for keeping redundancy to a minimum, and 
has nothing to do with semantics.

The absolute worst that can happen, under this 
proposal, is that two conforming engines will 
produce answer sets one of which will have some 
redundancies in it relative to the other. The 
work involved in checking whether two answer sets 
are equivalent under this condition is not vastly 
greater than that involved in checking whether 
they are the same under the current definition, 
since we do not specify a canonical ordering. 
None of the redundant answers will be wrong or 
semantically misleading: all the bnodes will be 
properly scoped.  This will also completely meet 
Peter's 'interoperability' wishes, if only in the 
ideal case, since entailment-minimal answer sets 
for logically equivalent datasets will always be 
graph-equivalent, for any notion of entailment. 
Users who are able and willing to check for 
redundancies in answer sets will be able to to do 
so, and those who do will get to the same answer 
set from any conforming engine. (Note, with our 
current definitions, in any of their 
incarnations, this is false. Completely 
redundancy-free sets of answer bindings will not 
in general be conforming answer sets, even when 
the dataset is lean.) In practice, any serious 
SPARQL query service will not generate 
gratuitously redundant answers, even if it could 
'legally' do so, since there are no 
Web-evolutionary pressures which lead in that 
direction: such pathological behavior would be to 
both the server's and the client's disadvantage.

The only objection to this idea that I have heard 
seems to be based on the feeling that there 
should be a single defined, official, unique set 
of answers. That idea makes sense for traditional 
data querying, but traditionally this kind of 
redundancy didn't come up. When, as here, both 
queries and datasets can legally have internal 
redundancy and so be logically equivalent to 
different queries and graphs, it seems silly to 
be sweating blood over how to ensure that answer 
sets can have just the 'right' amount of 
redundancy, and getting this defined so precisely 
that it gives a unique answer set for any graph, 
query and form of entailment. Whatever we do, we 
are almost bound to get it slightly wrong   I 
have no confidence that any of our suggested 
definitions will go on working appropriately for 
OWL  and we don't, I submit, need to get it 
exactly right. Worse, in advanced cases, we don't 
even know what really counts as 'exactly right', 
because the appropriate answer isn't determined 
by theoretical or semantic considerations, but by 
pragmatic ones. RDFS tautologies are certainly 
RDFS-entailed by the empty graph, for sure. So, 
should queries on the empty graph produce 
tautological bindings when used with RDFS 
entailment? Maybe, maybe not: I can see arguments 
both ways. If we leave the definition of 
'redundant' open enough to allow conforming 
engines to make the choice, then I see that as a 
small success for SPARQL, rather than a failure. 
(BTW, this is an argument for the 'scoping graph' 
idea, as it provides an extra adjustment 
parameter.)

So, I suggest that as a compromise that we define 
SPARQL answers using the 'bnode-binding' 
modification to the Baden-Baden definitions for 
the SPARQL case of simple entailment, but also 
state the more general, purely entailment-based, 
definitions as conditions which must be satisfied 
by any SPARQL extension to other entailment 
modes, and perhaps also add some informative 
remarks about appropriate strategies for keeping 
redundancy to a minimum. Seems to me that this 
keeps just the right amount of flexibility open 
for SPARQL-N (N>1) compliance, while keeping our 
current design deterministic and constrained, as 
Bijan wants it to be.

>>If bNodes are not bound at all, as in the ordered-merge proposal, then there
>>is a change to the algebra.
>
>Unless you forbid the use of bnodes in queries, 
>which wouldn't be at all a loss, since the 
>behaviour you want for bnodes is the behaviour 
>we have for variables.
>
>>(...)
>>
>>The algebra works if a bNode in occurring in two or more BGPs in a graph
>>pattern is bound by the pattern solution [*].
>
>Exactly: you expect bnodes to behave like 
>variables, which is semantically wrong for any 
>type of entailment which is not just the reading 
>of the syntax (like simple entailment).

I disagree that it is semantically wrong: on the 
contrary, thinking strictly about semantics, I 
cannot see that any other design makes semantic 
sense. Semantically, a query variable is a 
request for information about things that exist, 
and a bnode asserts existence. If we allow bnodes 
as bindings, then this gives information about 
the existence of a thing. This is exactly the 
same information provided by a YES to an 
existential query. Syntactic restrictions on 
bindings to terms in a graph are not by any 
stretch of the imagination properly called 
semantic restrictions.

The great merit of this bnode/variable scoping 
idea, it seems to me, is its simplicity, which 
includes its semantic simplicity. Semantically, 
bnodes - existentials - are the logically clear 
idea. Variables have no actual logical semantics 
at all, unless they are thought of as "labelled 
existentials". So the best way to think about it 
logically is that variables are being treated as 
bnodes (with trackable bindings), rather then 
vice versa.

Pat


>
>>This is a conservative approach and, like any extension to the level of
>>entailment, more solutions might become 
>>possible when relaxed.  It would allow
>>all entailments for queries consisting of a 
>>single BGP.  It would leave open the 
>>possibility of the rdfD1 use case above as an 
>>extension.
>
>This is acceptable only if bnodes are forbidden in queries.
>
>cheers
>--e.


-- 
---------------------------------------------------------------------
IHMC		(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32502			(850)291 0667    cell
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Tuesday, 17 January 2006 16:43:04 UTC