Re: SPARQL semantics: open issues for basic query patterns from Sergio Tessaris on 2006-01-05 (public-rdf-dawg@w3.org from January to March 2006)

From: Sergio Tessaris <tessaris@inf.unibz.it>
Date: Thu, 5 Jan 2006 12:19:56 +0100
To: Pat Hayes <phayes@ihmc.us>
Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>, Dan Connolly <connolly@w3.org>, Enrico Franconi <franconi@inf.unibz.it>
Message-Id: <1E630A46-6AC8-46C5-BF34-BB29592410B2@inf.unibz.it>
Pat, I don't think that redundancy is the only beast lurking in the  
dark corners of the definitions. There is a general problem of what a  
bnode stands for in a pattern solution and which is its relation with  
other bnodes in the answer set.

In fact, bnodes in answer sets are really useful only to discover co- 
references among different pattern solutions (or even within the same  
one). This means that unless a bnode name is used in the "same way"  
across different entailment checks, the user cannot trust bnodes in  
an answer set.

By checking the entailment in the naive way, the bnode names are  
scoped to a single solution pattern, and not across an answer set.

Consider the following dataset:
  _:a :r _:a
  _:a :p _:b
  _:b :url <http://example.org>

and the query:
  SELECT * { ?x ?p ?y }

Under the G |= S(P) semantics, the server is allowed to return an  
answer set like:

?x    ?p   ?y
----------------
_:b12 :r   _:b13
_:b26 :p   _:b27
_:b40 :url <http://example.org>

but even the unsettling

?x    ?p   ?y
----------------
_:b12 :r   _:b13
_:b26 :p   _:b13
_:b12 :url <http://example.org>

where the co-references among the bnode names are completely  
different from the ones in the original graph and definitely  
misleading. Note that restricting the bnode names to the active set  
doesn't solve the problem.

I'd like to remember that here, by allowing bnodes in the pattern  
solutions, we're walking on thin ice. Consider the troubles that NULL  
values are giving in the well established database setting. If we  
want to go down the G |= S(P) slope, I think we should adopt the NULL  
value, otherwise we're getting into serious troubles w.r.t.  
interoperability.

The more involved 'G |= S(G + P)' semantics (Pat, allow me to avoid  
the 'M' ;) - when simple entailment is adopted - is actually  
mimicking the intuitive graph matching algorithm, leaving at the same  
time an open door to different entailments.

--sergio

On 3 Jan 2006, at 18:32, Pat Hayes wrote:

> Guys, a thought occurs to me regarding the 'basic' definitions and  
> how to simplify them.
>
> Now we have got rid of told bnodes, the only reason for introducing  
> any complexity into the 'naive definition' of answer binding B, viz
>
> G (simply) entails B(Q)
>
> is to get rid of the inevitable redundancy that this produces when  
> it is read as an 'iff' definition of answer binding, arising from  
> the fact that if B is an answer then any C which is like B but  
> inserts a different bnode instead of some binding in B, is also an  
> answer, by this naive criterion. So, if ?x/foo is an answer then we  
> get infinitely many answers of the form ?x/_:1, ?x/_:2, etc.. So,  
> we put in this condition that only bnodes from G may occur, which  
> itself kind of smells bad because it requires answer-binding bnodes  
> to be meaningful outside their scope: but then we still get a large  
> finite amount of redundancy, as noted by Sergio http://lists.w3.org/ 
> Archives/Public/public-rdf-dawg/2006JanMar/0001.html, so we have to  
> tweak the definition with the (G entails (G union...)) trick, and  
> then we get all embroiled in exactly how to specify the union,  
> which is subtle and delicate, and so on. Sigh.
>
> All of this definitional twisting arises because we feel a need to  
> somehow get rid of silly redundancies in the answers. So, here's an  
> alternative proposal. We don't get rid of them by tweaking the  
> definition: we leave the definition naive, think of it as a  
> necessary condition on answer bindings, and we add a remark about  
> answer sets, that servers are not obliged to deliver 'redundant'  
> answers. (Not that they are required to be non-redundant, but that  
> they are free to omit redundancy.) And we say explicitly that if A  
> and B are two answers such that A(Q) is an instance of B(Q) (in the  
> traditional RDF sense of instance, ie mapping bnodes to terms) then  
> B is redundant. So if there are answers ?x/foo and ?x/_:75, then  
> the second is redundant: and if there are answers ?x/_:22 and ?x/_: 
> 47, then one of them is redundant provided that the other is given  
> as an answer.
>
> This completely cuts through the definitional maze, since we can  
> keep the direct, naive, unqualified definition of 'answer binding',  
> and smile happily at the observation that there are infinitely many  
> answers: yes, there are, but all but finitely many of them are  
> redundant. It also means that we can say with a straight face that  
> bnodes in answers are completely scoped to the answer, and have no  
> connection to bnodeIDs in the graph. It also means that redundancy  
> of answers can be locally checked just by looking at the answer  
> bindings themselves, not by checking the entire graph topology (as  
> with the "G entails (G union...)" trick.) It also, BTW, allows  
> future extensions of SPARQL which use non-simple entailment to also  
> use stronger notions of 'redundant', e.g. to decide that, say, RDFS  
> tautologies are also redundant, and to omit those bindings from  
> answer sets. Or not: but the point is that the flexibility is  
> there, without needing to mess around with issues like whether or  
> not this is 'real' RDFS entailment. We have a way to separate the  
> bounds on answers into lower bounds  which are always some kind of  
> entailment  and upper bounds, which are some kind of redundancy- 
> avoidance criterion. Within those bounds, servers must deliver all  
> the answers.
>
> Now, this does ALLOW a dumb server to be highly redundant in the  
> answers it gives, like in Sergio's
>
> _:a :p _:b
> _:b :url <http://example.org>
>
> example in http://lists.w3.org/Archives/Public/public-rdf-dawg/ 
> 2006JanMar/0001.html. But OK, so it does: redundant answers are not  
> actually wrong, after all, only redundant: and we have already  
> considered the idea of requiring non-redundancy, and rejected it as  
> too high a standard. I suggest that it might be good to allow a  
> very simple-minded conforming SPARQL server to be messily  
> redundant, if such a thing could be implemented very easily. Normal  
> commercial pressures will ensure that smarter SPARQL engines will  
> get written, but we don't need to mandate that dumb ones should not  
> be conformant.
>
> If y'all like this idea, I can do a run through http://www.w3.org/ 
> 2001/sw/DataAccess/rq23/ to tweak the wordings to fit it. It  
> should, if anything, get simpler.
>
> Pat
>
>
> --
> ---------------------------------------------------------------------
> IHMC		(850)434 8903 or (650)494 3973   home
> 40 South Alcaniz St.	(850)202 4416   office
> Pensacola			(850)202 4440   fax
> FL 32502			(850)291 0667    cell
> phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
>
Received on Thursday, 5 January 2006 11:26:57 UTC