Re: ISSUE: DISTINCT is underspecified from Pat Hayes on 2006-08-16 (public-rdf-dawg@w3.org from July to September 2006)

From: Pat Hayes <phayes@ihmc.us>
Date: Tue, 15 Aug 2006 21:25:14 -0700
To: Bijan Parsia <bparsia@cs.man.ac.uk>
Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-Id: <p06230940c1081f69eb63@[192.168.1.6]>
>Here is the text:
>
>"""10.1.3 DISTINCT
>The solution sequence can be modified by adding the DISTINCT keyword 
>which ensures that every combination of variable bindings (i.e. each 
>solution) in the sequence is unique."""
>
>and the definition:
>
>"""Definition: Distinct Solution Sequence
>
>A Distinct Solution Sequence is a solution sequence in which no two 
>solutions are the same."""
>
>=====================
>
>Unfortunately, the example doesn't give any hint about the tricky cases.
>
>Why is this underspecified, or rather nonspecified? Because "the 
>same" depends on your notion of sameness (same for unique).
>
>=====================
>
>LITERALS:
>
>Consider literals: there are often at least two equality operator 
>you could use (i.e., the type specific one or the general term 
>comparision).
>
>So consider the following answer set:
>
>	?x
>	"01"^^<xsd:integer>
>	"1"^^<xsd:integer>
>
>This is nonredundant by 
><http://www.w3.org/TR/rdf-concepts/#section-Literal-Equality> but 
>redundant by 
><http://www.w3.org/TR/xpath-functions/#func-numeric-equal>.
>
>So, should a DISTINCT answer set contain one answer or two?

Two, since the RDF conventions win over the Xpath conventions. I 
agree its dumb, but we should be RDF consistent.

>  Which answer should it include?
>
>I tend to favor a type specific comparison where specified, and only 
>falling back on rdf-term when lacking, but I can see a more general 
>case. Data values are weird.

Amen to that. They get weirder as time goes by, cf. the XML schema revision.

>
>=====================
>
>BNODES:
>
>BNodes are much harder, overall. Consider the following answer set:
>
>1)	?x		?y
>	_:x		:mochi
>	:Bijan	:mochi
>
>One (distinct) answer, or two? Clearly, it's not quite right to say 
>that :Bijan = _:x, although the second answer does subsume the 
>first. How about:
>
>2)	?x		?y
>	_:x		:mochi
>	_:y		:mochi
>
>(note that this might *not* correspond to redundancy in the data! Consider:
>
>3)	SELECT DISTINCT ?x ?y
>		{?x ?p ?y}
>
>against:
>
>	_:x :loves :mochi.
>	_:y :hates :mochi.
>)
>
>What principle should guide? If we look to SQL (and curse ANSI et al 
>for not having free, hyperlinked versions of their standards on 
>line), we see that DISTINCT vs. ALL is set vs. bag (e.g., 
><http://66.102.9.104/search?q=cache:H8E-i-rh4N4J:www.csee.umbc.edu/~pmundur/courses/CMSC661-02/lecture7.ppt+SQL+semantics+distinct&hl=en&ct=clnk&cd=9&client=safari>)
>
>Similarly, DQL and OWLQL define various levels of refinement based 
>on answer subsumption.
>
>I would like to dismiss out of hand any argument that relies on "the 
>same" or "unique" in the current text to to exclude the 1) having 
>only one answer. We're deciding here, not interpreting.
>
>I would also like to be a very strong push in for a strong 
>anti-redundancy reading. I think 1) and 2) should have only one 
>answer (if DISTINCT). The principle is that no DISTINCT set of 
>answers should contain redundancy.

We have to be sure that we know exactly what redundancy means. A 
problem is that the answer set considered in isolation might seem to 
be redundant, but does not seem redundant when more information is 
added, information which is in the graph. (Similarly for leanness of 
a graph: a lean graph can have non-lean subgraphs.) So in the above 
case, for example, if the graph had contained

_:x :p :mochi
_:x :q :fred
Bijan :p :mochi

with the pattern query ?x :p ?y, then it might actually be quite 
useful to be told that there are two answers, <_:x :mochi> and <Bijan 
:mochi> when one has explicitly requested DISTINCT answers, since 
this would tell you that there was something known about _:x, whoever 
he is, to at least distinguish him from the other answers, i.e. that 
the apparent redundancy was something of an illusion. If told bnodes 
are useable, this is almost a direct hint to ask some more questions 
about _:x; and even if they are not, the information may be useful.

>This is akin to a lean graph, and is likely similarly 
>computationally expensive.

Worst case, but I don't think it will be too bad in almost all practical cases.

>  (Note that source graph leanness is not sufficient, as 3) shows). 
>Thus, I think this is more characterisitc of the semantics of RDF.

I am not sure where you get this intuition about the RDF semantics 
deciding this one way or the other. It doesn't seem to me to be an 
essentially *semantic* issue at all.

>I would encourage also text that made the decision parallel that of 
>what I've seen of SQL ALL vs. DISTICT, to wit, that ALL is a 
>*computational* computational compromised and not intended to 
>correspond to the "math" of the situation.

Agreed.

>For many purposes, of course, that's just fine. Redundancy for time 
>is a sensible tradeoff. And I applaud have predicable, "minimal" 
>redundancy, that is, no more than what is in the graph. That's 
>computationally and implementationally straightfoward.

Agreed.

>However, I think we should *not* encourage a "semantic" reading of 
>that redundancy, where in people interpret the redundancy as a 
>*significant* part of the data.

Well... Im not sure what exactly you mean. We shouldn't encourage 
people to think that the mere presence of a bnode means that there is 
a separate 'thing' there, i.e. to treat bnodes as identifiers. But 
there can be cases where the apparent redundancy in an answer *is* in 
a sense a significant part of the data, like the above.

I can see good arguments both ways, but on balance I think I prefer 
the idea of redundancy being determined by the graph rather than by 
the answer document. This only matters seriously when we allow told 
bnodes: and in that case, we have basically agreed that the scope of 
these bnodes is the scoping graph rather than an answer document, so 
this seems the right context to use to determine redundancy.

>In other words, we're not supporting editors that care about the 
>specific assertional structure of a graph.

Hmm, why not? Isnt this exactly what a query engine *should* be able 
to discover? Maybe Im not following what exactly you mean.

Pat

>I think this is simultaneously most faithful to the RDF Semantics 
>document, while supporting considerable flexibilty.
>
>These are not formal definitions or proposed text, but I think they 
>lay out the major issues.
>
>Cheers,
>Bijan.


-- 
---------------------------------------------------------------------
IHMC		(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32502			(850)291 0667    cell
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Wednesday, 16 August 2006 04:25:30 UTC