Re: bnodes as answer bindings from Pat Hayes on 2006-08-08 (public-rdf-dawg@w3.org from July to September 2006)

From: Pat Hayes <phayes@ihmc.us>
Date: Tue, 8 Aug 2006 06:54:37 -0700
To: Bijan Parsia <bparsia@cs.man.ac.uk>
Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-Id: <p06230920c0fdb7abba4d@[192.168.1.6]>
>On Aug 7, 2006, at 8:40 PM, Pat Hayes wrote:
>
>>>Slight emendation:
>>>
>>>On Aug 7, 2006, at 5:22 PM, Bijan Parsia wrote:
>>>[snip]
>>>
>>>>"""The answer set of a query is the largest set of query answers 
>>>>that are entailed by the answer KB such that no answer in the set 
>>>>is entailed by any other answer in the set."""
>>>>
>>>>Non-redundancy.
>>>[snip]
>>>
>>>DQL distinguishes between the answer set and the response set:
>>>
>>>"""Response Set
>>>While there are no global requirements on a response set other 
>>>than that all its members are correct answers, it is recommended 
>>>that servers ensure that answer bundles do not contain duplicate 
>>>or redundant answers, i.e., answers which are subsumed by other 
>>>answers.  One answer subsumes another if its bindings are a 
>>>superset of the bindings in the other answer.  Servers that are 
>>>able to guarantee that their response sets contain no duplicate 
>>>answers can be called non-repeating.  Servers that are able to 
>>>guarantee that their response sets contain no duplicate or 
>>>redundant answers can be called terse.  Servers that are able to 
>>>guarantee that their response sets will be correctly terminated 
>>>with 'none' can be called complete."""
>>>
>>>OWLQL (<http://ksl-web.stanford.edu/KSL_Abstracts/KSL-03-14.html>) 
>>>has a more elaborate discussion.
>>>
>>>I think I prefer the way that SPARQL does it, if DISTINCT gets 
>>>fixed. I certainly don't want to have the granularity of 
>>>redundancy placed at the server level.
>>
>>I still think this is the best stance for the standard to adopt.
>
>It's silly. We can easily do this on a query by query level and let 
>servers do the best they can and communicate when they can't do 
>better.
>
>>I can see a perfectly good utility for servers which run fast but 
>>do not *guarantee* non-redundancy.
>
>So they should fault if the user requests it. Which I think they are 
>by saying "DISTINCT".

OK, I see now what your point is. I agree, though see below.

>>I'm quite sanguine with this because the economic pressures on 
>>servers and customers seem to converge on eliminating redundancy 
>>where practicable: there is no motivation for anyone to 
>>gratuitously introduce redundancy for no reason,
>
>And yet they do. Plus it's not always gratuitous, but yet not 
>desired. E.g., aggregation, or just multiple people entering data 
>over time.

True, it happens. But the question is, whose responsibility is it to 
fix snarky data? I don't see why a query engine should take on the 
responsibility to do this. I see a SPARQL engine as basically a 
broker between someone who has information stored and someone else 
who needs information. It's not the broker's job, necessarily, to fix 
up the data perfectly. Although I agree with your point above that 
if the user asks for a form of perfection, then the engine should 
deliver it or fail.

>>only to save the considerable work involved in checking for 
>>non-redundancy when an absolute guarantee of nonredundancy is not 
>>required.
>
>Then don't include the DISTINCT keyword and all conforming servers 
>will behave as you like.
>
>>I would vastly prefer to use such a server than one that times out 
>>trying to establish a minimal answer set, particularly when we 
>>might be talking about answer sets with high orders of magnitude.
>
>Pat, I guess what you don't understand is

Indeed, I did *not* understand that this was your position. OK, we 
are much closer than it seemed.

>that, as I've said several times now

Sorry I missed that.

>, it's perfectly reasonable to allow for (reasonable) redundancy 
>with a plain select clause (this is how SQL works, I believe) but 
>have a non-redundant answer set *WHEN THE USER REQUESTS IT*.

I agree.

>And the natural way for a user to request it is with DISTINCT. But 
>then we have to define what a non-redundant answer *is*.

True.

>And in the standard conforming reading of the RDF Semantics, it's 
>going to involve some work and cannot involve treating BNodes as 
>denoting terms. At least, it would not be sensible to do so.

I take your point, but let me respond to it. IF we want to allow for 
the possibility of 'told bnodes', i.e. if we want to allow bnodes to be 
delivered as answer bindings and then subsequently re-used in later 
queries, with the intention of asking for more information about the 
thing asserted to exist; then what appears to be redundancy in 
answers might not really be redundant, because there may be answers 
ex:a and _:x given which are not redundant because the KB contains 
information about _:x which was not mentioned in the query but which 
can be used to distinguish it from ex:a. (What this amounts to, 
formally, is allowing bnode scopes to extend across multiple answer 
documents, by the way, hence bnodes act even more like true names: in 
effect, the entire sequence of transactions takes place inside the 
scope of the existential quantifier.) I am not sure what we should do 
about DISTINCT in these circumstances. It seems to be asking too much 
to require that the semantics actually establishes (not (= a b)) for 
every pair of bindings.

Maybe it is simply too complicated to be both nonredundant and also 
play told-bnode games, and we should simply ignore this matter and 
apply DISTINCT to the actual answer bindings, regardless of what 
other information is available in the KB (?)
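
A minimal sketch of that last option, assuming each answer is a Python dict from variable names to term strings and that a bnode label such as _:x is compared as an opaque string. The representation and the name distinct_bindings are hypothetical, chosen only for illustration; they are not taken from SPARQL or from any library. Two answers are treated as duplicates only when their bindings are identical term for term, so ex:a and _:x both survive whatever else the KB says about _:x.

```python
def distinct_bindings(answers):
    """Drop answers whose bindings are identical term for term.

    Purely syntactic: bnode labels are compared as plain strings and the
    KB is never consulted. First-seen order is preserved.
    """
    seen = set()
    result = []
    for answer in answers:
        # A frozenset of (variable, term) pairs is hashable and ignores
        # the order in which the variables happen to be listed.
        key = frozenset(answer.items())
        if key not in seen:
            seen.add(key)
            result.append(answer)
    return result


# Hypothetical answer bindings for a single variable ?who.
answers = [
    {"who": "ex:a"},
    {"who": "_:x"},   # kept: a different term, even if it might denote ex:a
    {"who": "ex:a"},  # dropped: exact repeat of the first answer
]
unique = distinct_bindings(answers)
```

This makes no attempt to establish (not (= a b)) for any pair of bindings; it only removes literal repeats.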

>And you can get your desired behavior from *EVERY* SPARQL server by 
>not including the DISTINCT. Why do you want to go shopping around 
>for servers etc.

Honestly, I did not appreciate that this was the position you were 
arguing for. I thought you wanted to impose DISTINCT on every SELECT 
query in every conforming engine, in effect: and it was that position 
which I was opposed to. But it seems we have been violently agreeing 
about this for some time. Sorry about the misunderstanding.

>>>If I can't compute a non-redundant answer because I've run out of 
>>>resources, I should timeout/fault with out of memory, whatever. If 
>>>I have an incomplete minimizer, I should be able to verify that 
>>>my answer set is minimal, or fault.
>>
>>*You* should, yes. That is, the user has the option of computing a 
>>minimal answer set if it is absolutely required.
>
>Obviously, I was speaking as a server.

I misunderstood you. Again, I apologize.

>Frankly, I think this sort of behavior is exactly the sort of thing 
>that should be standardized and in the query language. I sincerely 
>doubt most users have the sophistication to get it right, and I 
>don't see why they should have to.
>
>*Who* was going on about putting burdens on the implementers instead 
>of the users? This is a clear case of a ridiculous burden on the 
>users. And there's a nice optout for the implementers: Don't support 
>DISTINCT and advertise that.

I agree, that is exactly the kind of 'free market' option I would 
like to allow.  Purely as a political point, it would be nice if an 
engine could do this and still count as conforming with the spec.

Pat
-- 
---------------------------------------------------------------------
IHMC		(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32502			(850)291 0667    cell
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Tuesday, 8 August 2006 13:54:59 UTC