- From: Bijan Parsia <bparsia@cs.man.ac.uk>
- Date: Sat, 12 Aug 2006 04:10:21 +0100
- To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Here is the text:
"""10.1.3 DISTINCT
The solution sequence can be modified by adding the DISTINCT keyword
which ensures that every combination of variable bindings (i.e. each
solution) in the sequence is unique."""
and the definition:
"""Definition: Distinct Solution Sequence
A Distinct Solution Sequence is a solution sequence in which no two
solutions are the same."""
=====================
Unfortunately, the example doesn't give any hint about the tricky cases.
Why is this underspecified, or rather nonspecified? Because "the
same" depends on your notion of sameness (same for unique).
=====================
LITERALS:
Consider literals: there are often at least two equality operator you
could use (i.e., the type specific one or the general term comparision).
So consider the following answer set:
?x
"01"^^<xsd:integer>
"1"^^<xsd:integer>
This is nonredundant by <http://www.w3.org/TR/rdf-concepts/#section-
Literal-Equality> but redundant by <http://www.w3.org/TR/xpath-
functions/#func-numeric-equal>.
So, should a DISTINCT answer set contain one answer or two? Which
answer should it include?
I tend to favor a type specific comparison where specified, and only
falling back on rdf-term when lacking, but I can see a more general
case. Data values are weird.
=====================
BNODES:
BNodes are much harder, overall. Consider the following answer set:
1) ?x ?y
_:x :mochi
:Bijan :mochi
One (distinct) answer, or two? Clearly, it's not quite right to say
that :Bijan = _:x, although the second answer does subsume the first.
How about:
2) ?x ?y
_:x :mochi
_:y :mochi
(note that this might *not* correspond to redundancy in the data!
Consider:
3) SELECT DISTINCT ?x ?y
{?x ?p ?y}
against:
_:x :loves :mochi.
_:y :hates :mochi.
)
What principle should guide? If we look to SQL (and curse ANSI et al
for not having free, hyperlinked versions of their standards on
line), we see that DISTINCT vs. ALL is set vs. bag (e.g., <http://
66.102.9.104/search?q=cache:H8E-i-rh4N4J:www.csee.umbc.edu/~pmundur/
courses/CMSC661-02/lecture7.ppt+SQL+semantics
+distinct&hl=en&ct=clnk&cd=9&client=safari>)
Similarly, DQL and OWLQL define various levels of refinement based on
answer subsumption.
I would like to dismiss out of hand any argument that relies on "the
same" or "unique" in the current text to to exclude the 1) having
only one answer. We're deciding here, not interpreting.
I would also like to be a very strong push in for a strong anti-
redundancy reading. I think 1) and 2) should have only one answer (if
DISTINCT). The principle is that no DISTINCT set of answers should
contain redundancy. This is akin to a lean graph, and is likely
similarly computationally expensive. (Note that source graph leanness
is not sufficient, as 3) shows). Thus, I think this is more
characterisitc of the semantics of RDF. I would encourage also text
that made the decision parallel that of what I've seen of SQL ALL vs.
DISTICT, to wit, that ALL is a *computational* computational
compromised and not intended to correspond to the "math" of the
situation. For many purposes, of course, that's just fine. Redundancy
for time is a sensible tradeoff. And I applaud have predicable,
"minimal" redundancy, that is, no more than what is in the graph.
That's computationally and implementationally straightfoward.
However, I think we should *not* encourage a "semantic" reading of
that redundancy, where in people interpret the redundancy as a
*significant* part of the data.
In other words, we're not supporting editors that care about the
specific assertional structure of a graph.
I think this is simultaneously most faithful to the RDF Semantics
document, while supporting considerable flexibilty.
These are not formal definitions or proposed text, but I think they
lay out the major issues.
Cheers,
Bijan.
Received on Saturday, 12 August 2006 03:10:35 UTC