Re: ISSUE: DISTINCT is underspecified

I have just found and read (or reread):
	http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0008.html

There, Fred anticipated much of this debate, made arguments in  
favor of both interpretations, and suggested having distinct keywords.  
I can certainly live with that, as I've stated before, but I wanted  
to point something out:

"""Note that adding the part number to the SELECT list will
not necessarily save the query, since the combination of
part number, quantity and price is still not a guaranteed
unique key for line items.  The user is relying on distinct
blank nodes to represent distinct line items.

Of course, from the point of view of "RDF Semantics"
that would be a redundant graph, for example, one that
asserts "There exists a line item whose part is XYZ,
quantity is 1 and price is 10.99" and asserts again
"There exists a line item whose part is XYZ, quantity is 1 and
price is 10.99".  Thus one could say that this is a misuse
of RDF.  This may be technically true, but I wonder if insisting
on this point will really serve the users.  If you read the RDF
Primer, the application design above makes sense. You have a line
item; you don't want to bother creating an IRI for each line
item; so you make a blank node for each line item.
"RDF Semantics", on the other hand, is a dense document
with talk about hypothetical universes that are interpretations
of a graph.  This is not the kind of material that will make
its way into seminars, courses, how-to books, etc."""

I believe the RDF Primer did a disservice by encouraging this  
misunderstanding. I think we should encourage people to create IRIs  
in these circumstances. Even if we allow for these distinctions in  
answer sets, we cannot enforce that for RDF graphs in general. Given  
the prevalence of "use RDF for representing data" and the existence  
of "CONSTRUCT" it would be reasonable for a user to think that  
CONSTRUCT and SELECT will bear certain relations to each other. But  
if a tool, somewhere, decides to lean the graph (which is  
semantically safe from an RDF point of view) it will violate the  
user's modeling expectations. They are making unfounded assumptions,  
of course, but that's cold comfort.
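
To make the concern concrete, here is a minimal Python sketch (not taken
from any RDF or SPARQL tool). It models Fred's line-item data as a set of
triples, with blank nodes written as strings starting with "_:". The
function names, the ex: property names and the leaning shortcut are all
assumptions for illustration: real leaning means finding a proper
subgraph that the graph maps onto, whereas this sketch only merges blank
nodes that appear solely as subjects and have identical ground
property/value sets.

```python
from collections import defaultdict


def is_bnode(term):
    """Blank nodes are modeled as strings with a '_:' prefix (assumption)."""
    return isinstance(term, str) and term.startswith("_:")


def lean_simple(graph):
    """Merge blank nodes that carry identical (predicate, object) sets.

    Simplifying assumption: blank nodes occur only as subjects and all
    objects are ground terms. General leaning is harder than this.
    """
    descriptions = defaultdict(set)
    for s, p, o in graph:
        if is_bnode(s):
            descriptions[s].add((p, o))

    representative = {}  # description -> first blank node seen with it
    rename = {}          # blank node -> the node that replaces it
    for node in sorted(descriptions):
        key = frozenset(descriptions[node])
        rename[node] = representative.setdefault(key, node)

    return {(rename.get(s, s), p, o) for s, p, o in graph}


def count_line_items(graph):
    """Count line items the way the application author expects to."""
    return len({s for s, p, o in graph if p == "ex:part"})


# Two line items that happen to share part, quantity and price.
graph = {
    ("_:item1", "ex:part", "XYZ"),
    ("_:item1", "ex:quantity", 1),
    ("_:item1", "ex:price", "10.99"),
    ("_:item2", "ex:part", "XYZ"),
    ("_:item2", "ex:quantity", 1),
    ("_:item2", "ex:price", "10.99"),
}

leaned = lean_simple(graph)
print(count_line_items(graph))   # 2
print(count_line_items(leaned))  # 1
```

The leaned graph says the same thing as the original under RDF
semantics, yet the application that counted blank nodes now sees one
line item instead of two.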

This is why I think what Fred called constructive semantics is  
potentially seriously misleading, even if it is the generally better  
choice (for RDF and for SPARQL; note that Fred's point is that people  
*modeled* things a certain way with certain expectations; also note  
that this isn't the lean graph case, so it points to a third family of  
meanings for DISTINCT).

My general position is that we are doing an RDF query language, so  
we need to at least make the semantics of RDF available and  
transparent to the user (I acknowledge that Pat has an argument that,  
if you scope the bnodes to the entire SPARQL expression, his  
DISTINCT is consistent with the semantics of RDF; I still think it's  
less *transparent*, but I shall address that in another post). So I  
support including the existential reading. I'm becoming more and more  
convinced that making that the only reading would be, in the long  
run, beneficial to users, along the lines of the strictness of XML  
parsers with regard to well-formedness. However, I'm still undecided  
on that point, so am still amenable to having multiple readings, esp.  
wrt DISTINCT.
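
As a rough illustration of how two readings of DISTINCT can disagree,
here is a small Python sketch over solution rows. Everything in it is an
assumption made for illustration: the function names, the row layout, and
above all the shortcut used for the existential reading, which simply
treats every blank node as an anonymous "something exists" placeholder
and assumes each blank node occurs in only one row. How these two
functions line up with the positions taken in this thread is an
interpretation, not something settled here.

```python
def is_bnode(term):
    """Blank nodes are modeled as strings with a '_:' prefix (assumption)."""
    return isinstance(term, str) and term.startswith("_:")


def distinct_by_label(rows):
    """Keep rows that differ as tuples of terms; blank node labels count."""
    seen, out = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out


def distinct_existential(rows):
    """Rough approximation: rows differing only in blank nodes collapse.

    Assumes each blank node occurs in a single row, so it can be treated
    as an anonymous placeholder when comparing rows.
    """
    seen, out = set(), []
    for row in rows:
        key = tuple("_:" if is_bnode(t) else t for t in row)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


# Hypothetical result of selecting ?item ?part ?quantity ?price.
rows = [
    ("_:item1", "XYZ", 1, "10.99"),
    ("_:item2", "XYZ", 1, "10.99"),
]

print(len(distinct_by_label(rows)))     # 2
print(len(distinct_existential(rows)))  # 1
```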

Cheers,
Bijan.

Received on Saturday, 19 August 2006 16:50:12 UTC