Re: ISSUE: DISTINCT is underspecified from Seaborne, Andy on 2006-08-14 (public-rdf-dawg@w3.org from July to September 2006)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Mon, 14 Aug 2006 09:51:35 +0100
To: Bijan Parsia <bparsia@cs.man.ac.uk>
CC: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-ID: <44E03997.1040609@hp.com>
We seem to have lost some text at some time in the past: it used to say:

"""
Definition: Distinct Solution Sequence

A Distinct Solution Sequence has no two solutions the same.

For a solution sequence S = ( S1, S2, . . . , Sn), then write set(S) for the
set of solution sequences in S.

      distinct(S) = (Si | Si != Sj for all i != j) and set(distinct(S)) = set(S)
"""

There is a layering with

   * modifiers
   * algebra
   * BGP matching

so DISTINCT is not directly referring to the matching but to the solutions.

So it's that "!=" :: I think it would be better to use language here and not
"!=" because it might imply a specific relationship to "!=" in filters.
"not the same" should mean "not the same when doing graph pattern matching"

D-entailment is not required of all systems.

So, if D-entailment were done in BGP matching, it should include that; if
D-entailment were not done, it should not include that.

Data:
:x :p 1 .
:y :p "01"^^xsd:integer .

Query:
SELECT DISTINCT ?v { ?a :p ?v }

should have one solution if

{ :x :p ?v . :y :p ?v . }

matches else it should have two.


Bijan Parsia wrote:
 >
 > Here is the text:
 >
 > """10.1.3 DISTINCT
 > The solution sequence can be modified by adding the DISTINCT keyword
 > which ensures that every combination of variable bindings (i.e. each
 > solution) in the sequence is unique."""
 >
 > and the definition:
 >
 > """Definition: Distinct Solution Sequence
 >
 > A Distinct Solution Sequence is a solution sequence in which no two
 > solutions are the same."""
 >
 > =====================
 >
 > Unfortunately, the example doesn't give any hint about the tricky cases.
 >
 > Why is this underspecified, or rather nonspecified? Because "the same"
 > depends on your notion of sameness (same for unique).
 >
 > =====================
 >
 > LITERALS:
 >
 > Consider literals: there are often at least two equality operator you
 > could use (i.e., the type specific one or the general term comparision).
 >
 > So consider the following answer set:
 >
 >     ?x
 >     "01"^^<xsd:integer>
 >     "1"^^<xsd:integer>
 >
 > This is nonredundant by
 > <http://www.w3.org/TR/rdf-concepts/#section-Literal-Equality> but
 > redundant by <http://www.w3.org/TR/xpath-functions/#func-numeric-equal>.
 >
 > So, should a DISTINCT answer set contain one answer or two? Which answer
 > should it include?
 >
 > I tend to favor a type specific comparison where specified, and only
 > falling back on rdf-term when lacking, but I can see a more general
 > case. Data values are weird.
 >
 > =====================
 >
 > BNODES:
 >
 > BNodes are much harder, overall. Consider the following answer set:
 >
 > 1)    ?x        ?y
 >     _:x        :mochi
 >     :Bijan    :mochi
 >
 > One (distinct) answer, or two?

Can't tell - it depends on the data and isn't a characteristic of the result 
set alone.

Placing the burden on the calculation of redundancy that requires inspecting 
the whole dataset is too much of a burden as we have discussed before.

 > Clearly, it's not quite right to say that
 > :Bijan = _:x, although the second answer does subsume the first. How about:
 >
 > 2)    ?x        ?y
 >     _:x        :mochi
 >     _:y        :mochi
 >
 > (note that this might *not* correspond to redundancy in the data! Consider:
 >
 > 3)    SELECT DISTINCT ?x ?y
 >         {?x ?p ?y}
 >
 > against:
 >
 >     _:x :loves :mochi.
 >     _:y :hates :mochi.
 > )
 >
 > What principle should guide? If we look to SQL (and curse ANSI et al for
 > not having free, hyperlinked versions of their standards on line), we
 > see that DISTINCT vs. ALL is set vs. bag (e.g.,
 > 
<http://66.102.9.104/search?q=cache:H8E-i-rh4N4J:www.csee.umbc.edu/~pmundur/courses/CMSC661-02/lecture7.ppt+SQL+semantics+distinct&hl=en&ct=clnk&cd=9&client=safari>) 

 >
 >
 > Similarly, DQL and OWLQL define various levels of refinement based on
 > answer subsumption.
 >
 > I would like to dismiss out of hand any argument that relies on "the
 > same" or "unique" in the current text to to exclude the 1) having only
 > one answer. We're deciding here, not interpreting.
 >
 > I would also like to be a very strong push in for a strong
 > anti-redundancy reading. I think 1) and 2) should have only one answer
 > (if DISTINCT). The principle is that no DISTINCT set of answers should
 > contain redundancy. This is akin to a lean graph, and is likely
 > similarly computationally expensive. (Note that source graph leanness is
 > not sufficient, as 3) shows). Thus, I think this is more characterisitc
 > of the semantics of RDF. I would encourage also text that made the
 > decision parallel that of what I've seen of SQL ALL vs. DISTICT, to wit,
 > that ALL is a *computational* computational compromised and not intended
 > to correspond to the "math" of the situation. For many purposes, of
 > course, that's just fine. Redundancy for time is a sensible tradeoff.
 > And I applaud have predicable, "minimal" redundancy, that is, no more
 > than what is in the graph. That's computationally and implementationally
 > straightfoward. However, I think we should *not* encourage a "semantic"
 > reading of that redundancy, where in people interpret the redundancy as
 > a *significant* part of the data.
 >
 > In other words, we're not supporting editors that care about the
 > specific assertional structure of a graph.

The structural access is an important use case.  Supporting editors wanting to 
access the structural and redundant nature of the graph is reasonable.  It is 
also one that people expect to work.

 >
 > I think this is simultaneously most faithful to the RDF Semantics
 > document, while supporting considerable flexibilty.
 >
 > These are not formal definitions or proposed text, but I think they lay
 > out the major issues.
 >
 > Cheers,
 > Bijan.
 >

	Andy
Received on Monday, 14 August 2006 08:52:05 UTC