W3C home > Mailing lists > Public > public-rdf-dawg@w3.org > July to September 2006

Re: ISSUE: DISTINCT is underspecified

From: Pat Hayes <phayes@ihmc.us>
Date: Thu, 17 Aug 2006 00:34:48 -0700
Message-Id: <p06230943c109bc3aac66@[]>
To: Bijan Parsia <bparsia@cs.man.ac.uk>
Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>

>On Aug 16, 2006, at 5:25 AM, Pat Hayes wrote:
>>>So consider the following answer set:
>>>	?x
>>>	"01"^^<xsd:integer>
>>>	"1"^^<xsd:integer>
>>>This is nonredundant by 
>>><http://www.w3.org/TR/rdf-concepts/#section-Literal-Equality> but 
>>>redundant by 
>>>So, should a DISTINCT answer set contain one answer or two?
>>Two, since the RDF conventions win over the Xpath conventions. I 
>>agree its dumb, but we should be RDF consistent.
>That's one choice we could make. In the telecon today, there seemed 
>to be some pull for dealing with this as one under D-entailment and 
>two when not. That seems sensible to me.

Yes, but see below for a counter-argument.

>BTW, it's unclear to me if people are expecting such terms to be 
>"normalized" under D-entailment. If not, these will be the two 
>answers without distinct (i.e, instead of :
>	?x
>	1
>	1
>	?x
>	"1"^^<xsd:integer>
>	"1"^^<xsd:integer>
>So, do you agree with the tentative consensus to deal with this case 
>as being redundant under D-entailment and not when not?

I agree it seems sensible, but I can also see arguments in the other 
direction. What would you say to the example using owl:sameAs in my 
earlier email? The cases seem similar, in that we have two lexically 
distinct expressions which can both bind to a query variable, and are 
known to be coreferential. Should we give two answers or one in cases 
like this? I think it might be worth deciding this on general 
grounds, and I would suggest that (contrary to the above tentative 
consensus) the wisest choice would be to return all the bindings, 
even though this may be considered semantically redundant. In the 
case of literals, as here, it seems kind of silly since presumably 
everyone, including the querying agent, knows that this is a 
redundancy: but the principle would still hold, that one ought to 
deliver all the bindings on the grounds that (1) the querying agent 
might not know they are redundant, and (2) in cases where it does 
not, only being given one of the answers may not convey the required 
information (and will require an arbitrary choice in some cases), and 
(3) in cases where it does, the agent can (in principle) detect the 
redundancy, and so not be misled by it.

>Is *all* value testing sensitive to D-entailment?

For which D, is the real question that matters. We have to consider 
how to handle possibly unknown datatypes.

>Section 3 seems unclear on this point.
>Take this example from section 3.2:
>"""PREFIX  dc:  <http://purl.org/dc/elements/1.1/>
>PREFIX  ns:  <http://example.org/ns#>
>SELECT  ?title ?price
>WHERE   { ?x ns:price ?price .
>           FILTER (?price < 30) .
>           ?x dc:title ?title . }"""
>"""@prefix dc:   <http://purl.org/dc/elements/1.1/> .
>@prefix :     <http://example.org/book/> .
>@prefix ns:   <http://example.org/ns#> .
>:book1  dc:title  "SPARQL Tutorial" .
>:book1  ns:price  42 .
>:book2  dc:title  "The Semantic Web" .
>:book2  ns:price  23 ."""
>The answer in the text, without indicating that D-Entailment is in 
>place and before the first mention of D-Entailment, is:
>	?title				?price
>	"The Semantic Web"	23
>I presume this would be the same if the object of the last triple 
>were "023"^^xsd:integer. The comparison operators operate on the 
>value space of the literal, AFAICT.

Well, that really has never been clear to me. It seems to me that 
some of them do and some of them don't.

>Do we make comparisons sensitive to the entailment relations across 
>the board? That is, that the algebra (conceptually) is insensitive 
>to the value space if the underlying semantics are (hmm this is 
>tricky). But then there are two issues, 1) most FILTERS are going to 
>fail a lot more often than we expect, and 2) there are issues about 
>things like :x rdf:type rdfs:Literal, or :x rdf:type 
>xsd:positiveInteger. :y rdf:type xsd:negativeInteger. :x > :y? 
>(i.e., aspects of the underlying semantics which people have argued 
>that they want the operators to be insensitive to)

Right. It is VERY tricky to give any principled account of how these 
operators interact with a referential semantics. I suspect that the 
only way to do it fully generally would be to think of them as 
applying to pairs of <lexical form, value>, or perhaps to the lexical 
form as a default, but in some cases being able to apply an 
interpretation to it and so appear 'semantic'; but the key point is 
that the interpretation can only be done using information which is 
locally available, and hence rather weakly.

>>>What principle should guide? If we look to SQL (and curse ANSI et 
>>>al for not having free, hyperlinked versions of their standards on 
>>>line), we see that DISTINCT vs. ALL is set vs. bag (e.g., 
>>>Similarly, DQL and OWLQL define various levels of refinement based 
>>>on answer subsumption.
>>>I would also like to be a very strong push in for a strong 
>>>anti-redundancy reading. I think 1) and 2) should have only one 
>>>answer (if DISTINCT). The principle is that no DISTINCT set of 
>>>answers should contain redundancy.
>>We have to be sure that we know exactly what redundancy means.
>Or agree on a notion.

Right, thats what I meant.

>There are different notions, just like there are different notions 
>of equality. My upfront principle is that we should clearly identify 
>all the useful and justifiable forms of non-redundancy and decide 
>which to support.

Hard to disagree with that :-)

>>A problem is that the answer set considered in isolation might seem 
>>to be redundant, but does not seem redundant when more information 
>>is added, information which is in the graph. (Similarly for 
>>leanness of a graph: a lean graph can have non-lean subgraphs.)
>These aren't exactly "similar", yes?

They seem very similar to me, but perhaps we are thinking about them 
differently. The similarity I seem to detect is based on thinking of 
an answer as fully realized in the 'answer graph', ie the instance of 
the query pattern corresponding to the answer binding.

>  In the second case, if you *remove information* from the lean 
>graph, you get non-lean subgraphs. I.e., subgraph operations do not 
>preserve leanness. Similarly, I don't see why algebraic operations 
>on answer sets should be held to preserve non-redundancy. In fact, 
>we know they don't and yet expect to remove introduced redundancy. 
>	:bijan :loves :mochi.
>	:bijan :hates :mochi.
>And these two queries:
>	SELECT DISTINCT ?x ?p ?y {?x ?p ?y}
>	SELECT DISTINCT ?x ?y {?x ?p ?y}
>The second is the projection of the first.
>The obvious answer sets are:
>	?x		?p		?y
>	:bijan	:loves	:mochi
>	:bijan	:hates	:mochi
>(Neither answer subsumes the other.)
>	?x		?y
>	:bijan	:mochi
>But wait! If we consider the graph, there should be *two* :bijan 
>:mochis! It tells us something about the graph!
>I don't think "telling us something about the graph" can be sustained.

I don't think that Ive used this phrase, or anyway not with the 
meaning that you are giving to it here. I agree, simple repetition in 
answers is 'bad' redundancy.

>>So in the above case, for example, if the graph had contained
>>_:x :p :mochi
>>_:x :q :fred
>>Bijan :p :mochi
>>with the pattern query ?x :p ?y, then it might actually be quite 
>>useful to be told that there are two answers, <_:x :mochi> and 
>><Bijan :mochi> when one has explicitly requested DISTINCT answers, 
>>since this would tell you that there was something known about _:x, 
>>whoever he is, to at least distinguish him from the other answers, 
>>i.e. that the apparent redundancy was something of an illusion.
>How does this not hold for :bijan above?

Because :bijan and _:x might be two different people - at the least, 
different information about them is recorded in the graph - but 
:bijan and :bijan are definitely the same guy, and anything that 
graph says about one it also says about the, er, other.

>Redundancy in answers is redundancy *in answers*, not with regard to 
>the original graph.

NO. That is exactly the point. If we allow told bnodes to be used 
across queries, to extract further information about the entity these 
bnodes denote, then we cannot at the same time think of issues like 
redundancy in a purely answer-specific scope. There is a fundamental 
clash here. Either answers are entirely self-contained, and all 
bnodeIDs in them are scoped locally to the answer document (in which 
case, your position would make sense, I agree) or else told bnodes 
are possible. We have been careful to allow told bnodes in the 
current design, and I understand that you approve of this. But then 
we have to be careful when we define redundancy to keep the bnode 
scopes consistent.

>This is why I think you are treating BNodes as names with this 
>approach. The bnode is "lexically" distinct from :bijan, so it 
>prevents subsumption of the answer.

Its not that simple. If the bnode and :bijan were redundant in the 
original graph, then by all means remove this in the answer when 
DISTINCT is specified. The point is not that they are superficially 
different in lexical form, but that different information is recorded 
'attached' to the two terms.

>But that is not obviously in with RDF semantics. If the principle is 
>"redundant from the point of view of information in the graph", then 
>we should have two :bijan :mochi's above.

Not if we state redundancy carefully. Try the draft definition in my 
earlier email.

>It's hard to see that anyone endorses *that*.


>>If told bnodes are useable,
>Which we have explicitly punted on in SPARQL.

Well, we have been careful to not forbid this. If we did forbid it, 
the definitions could be made simpler. But I think this would be a 
mistake, as there seem to be many applications which will use this if 
we make it legal. Or if we don't, for that matter.

>>this is almost a direct hint to ask some more questions about _:x; 
>>and even if they are not, the information may be useful.
>I would like to see some use case in which they are useful.

See the Mary example in my other email. Im sure you can invent more 
along the same lines. If we assume that bnodes will be used to record 
information about non-identified entities in the same way that URIs 
are used to record information about identified ones, these cases 
will come up immediately.
>  I think they are misleading, myself. Plus, I can always get back 
>that information in a non-misleading way by removing the projection.

You can if you know *how* to remove it, ie what extra question to ask 
as part of the query. But the querying agent may not be able to know 
this without first asking simpler questions. If I havn't even seen 
_:X mentioned, I don't know to ask for more information about it.

>>>This is akin to a lean graph, and is likely similarly 
>>>computationally expensive.
>>Worst case, but I don't think it will be too bad in almost all 
>>practical cases.
>Oh, sorry, didn't mean to suggest otherwise. I'm not afraid of 
>leaning operations.
>>>  (Note that source graph leanness is not sufficient, as 3) shows). 
>>>Thus, I think this is more characterisitc of the semantics of RDF.
>>I am not sure where you get this intuition about the RDF semantics 
>>deciding this one way or the other.
>Well, historically, it's because the "redundancy of answer set is 
>tied to redunancy in the queried graph" reading never occurred to 
>me. Upon examination, it is being inconsistently applied, so I think 
>what's really going on is some feeling of "nameness" about BNodes.

I think you have made a mistaken diagnosis here. The issue really is 
the *scope* of the bnodeIDs (or the boundaries of the RDF graphs, if 
you like.) You seem to be (?) assuming that this must be determined 
by the answer document, but it need not be. If we allow told bnodes, 
then there will be cases where is *must not* be. Those are the cases 
where the issue of redundancy between bnodes and URIrefs cannot be 
determined properly by lookiing at the answer document alone.

>>It doesn't seem to me to be an essentially *semantic* issue at all.
>Well, I think it's based on a misreading of BNodes, as I said above. 
>But let's put that aside for a moment.
>>>I would encourage also text that made the decision parallel that 
>>>of what I've seen of SQL ALL vs. DISTICT, to wit, that ALL is a 
>>>*computational* computational compromised and not intended to 
>>>correspond to the "math" of the situation.
>>>For many purposes, of course, that's just fine. Redundancy for 
>>>time is a sensible tradeoff. And I applaud have predicable, 
>>>"minimal" redundancy, that is, no more than what is in the graph. 
>>>That's computationally and implementationally straightfoward.
>>>However, I think we should *not* encourage a "semantic" reading of 
>>>that redundancy, where in people interpret the redundancy as a 
>>>*significant* part of the data.
>>Well... Im not sure what exactly you mean. We shouldn't encourage 
>>people to think that the mere presence of a bnode means that there 
>>is a separate 'thing' there, i.e. to treat bnodes as identifiers.
>Actually, we can't assume that separate identifiers mean distinct things, yes?

Yes. That is, we can't. RDF doesn't have the UNA built in.

>I've actually been wondering if RDF has not just the finite model 
>property, but the single object model property. Except for 
>XMLLiterals, can't ever name be mapped to a single entity?

Yes. To itself, in fact. See the Herbrand lemma in the RDFMT doc.

>And every relation to a reflexive entity?

Reflexive? I guess it could be, though that would be gratuitous.

>Isn't that a model, always? (Except for XMLLiterals.) I mean, given 
>the absence of equality or negation in any form, what can a theory 
>do to distinguish entities?

It can't.

>(Databases, etc., get around this by having the UNA and various forms of CWA.)
>>But there can be cases where the apparent redundancy in an answer 
>>*is* in a sense a significant part of the data, like the above.
>Then you must explain why the two :bijan :mochis aren't redundant. 
>Or embrace their non-redundancy.

No, two :bijans are always redundant. This isn't one of the cases I 
was talking about.

>>I can see good arguments both ways, but on balance I think I prefer 
>>the idea of redundancy being determined by the graph rather than by 
>>the answer document.
>Obviously I come down very strongly on the other side.
>>This only matters seriously when we allow told bnodes:
>This isn't true as I've argued above.

My point is that if we don't have told bnodes, if they are declared 
illegal, then I agree with you about redundancy being determined by 
examining the answer in isolation. This is really all about what is 
the appropriate scope for bnodes in answers.

>>and in that case, we have basically agreed that the scope of these 
>>bnodes is the scoping graph rather than an answer document, so this 
>>seems the right context to use to determine redundancy.
>I don't understand why algebraic operations can't introduce 
>redundancy on their own. They clearly can. So I don't see how 
>redunancy in general can be determined by the redundancy of the 
>>>In other words, we're not supporting editors that care about the 
>>>specific assertional structure of a graph.
>>Hmm, why not? Isnt this exactly what a query engine *should* be 
>>able to discover?
>By "assertional" I mean "told". I think query answers should be 
>largely indifferent to equivalent expressions.
>>Maybe Im not following what exactly you mean.
>Just that we shouldn't focus too much on querying the syntax, I 
>mean, in principle. I realize that's what most implementations *do*, 
>but it's a strange principle not clearly marked in the text.

Well, this is an aside to our main discussion here, but I think that 
it would be quite acceptable to have an RDF query standard which was 
defined *entirely* syntactically, and simply treated an RDF graph as 
a triple store, and used essentially algebraic operations to troll 
through it for patterns that match and satisfy superficial conditions 
(which might include semantic conditions if those can be computed 
locally, eg typed literal values). This was basically the design we 
had originally, about which so many protests were received wanting a 
more 'semantic' account. The argument for this position would be 
based on SWeb pragmatics and I doubt if I could convince you to agree 
with it, but it would certainly enable us to get to closure a lot 
quicker, and I suspect might be of more actual use to Sweb deployment 
(though admittedly not to science) than anything we are going to come 
up with from the current line of discussion. But OK, let me stop 
howling about this, as it is a distraction from the current thread.

>  And I think we should make the semantics available. (Now, of 
>course, we're disagreeing on what the semantics require. Let me 
>weaken my principle to say that it should help people understand the 
>semantics of the graph.)

OK, Im quite happy with that reading. But I still think that its 
important to not suppress answers which can be used to extract 
*semantically* distinct information entailed by the graph. I guess my 
point is that it is the semantics of the *graph*, not of the 
*answers*, that likely matter most to a querying agent.



IHMC		(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32502			(850)291 0667    cell
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Thursday, 17 August 2006 07:35:06 UTC

This archive was generated by hypermail 2.3.1 : Wednesday, 7 January 2015 15:00:51 UTC