Re: ISSUE: DISTINCT is underspecified

On Aug 16, 2006, at 5:25 AM, Pat Hayes wrote:
[snip]
>> So consider the following answer set:
>>
>> 	?x
>> 	"01"^^<xsd:integer>
>> 	"1"^^<xsd:integer>
>>
>> This is nonredundant by <http://www.w3.org/TR/rdf-concepts/ 
>> #section-Literal-Equality> but redundant by <http://www.w3.org/TR/ 
>> xpath-functions/#func-numeric-equal>.
>>
>> So, should a DISTINCT answer set contain one answer or two?
>
> Two, since the RDF conventions win over the Xpath conventions. I  
> agree its dumb, but we should be RDF consistent.

That's one choice we could make. In the telecon today, there seemed  
to be some pull for dealing with this as one under D-entailment and  
two when not. That seems sensible to me.

BTW, it's unclear to me if people are expecting such terms to be  
"normalized" under D-entailment. If not, these will be the two  
answers without distinct (i.e, instead of :
	?x
	1
	1
or
	?x
	"1"^^<xsd:integer>
	"1"^^<xsd:integer>

So, do you agree with the tentative consensus to deal with this case  
as being redundant under D-entailment and not when not?

Is *all* value testing sensitive to D-entailment? Section 3 seems  
unclear on this point.

Take this example from section 3.2:

"""PREFIX  dc:  <http://purl.org/dc/elements/1.1/>
PREFIX  ns:  <http://example.org/ns#>
SELECT  ?title ?price
WHERE   { ?x ns:price ?price .
           FILTER (?price < 30) .
           ?x dc:title ?title . }"""

Against:

"""@prefix dc:   <http://purl.org/dc/elements/1.1/> .
@prefix :     <http://example.org/book/> .
@prefix ns:   <http://example.org/ns#> .

:book1  dc:title  "SPARQL Tutorial" .
:book1  ns:price  42 .
:book2  dc:title  "The Semantic Web" .
:book2  ns:price  23 ."""

The answer in the text, without indicating that D-Entailment is in  
place and before the first mention of D-Entailment, is:

	?title				?price
	"The Semantic Web"	23

I presume this would be the same if the object of the last triple  
were "023"^^xsd:integer. The comparison operators operate on the  
value space of the literal, AFAICT.

Do we make comparisons sensitive to the entailment relations across  
the board? That is, that the algebra (conceptually) is insensitive to  
the value space if the underlying semantics are (hmm this is tricky).  
But then there are two issues, 1) most FILTERS are going to fail a  
lot more often than we expect, and 2) there are issues about things  
like :x rdf:type rdfs:Literal, or :x rdf:type xsd:positiveInteger. :y  
rdf:type xsd:negativeInteger. :x > :y? (i.e., aspects of the  
underlying semantics which people have argued that they want the  
operators to be insensitive to)

[snip]
>> What principle should guide? If we look to SQL (and curse ANSI et  
>> al for not having free, hyperlinked versions of their standards on  
>> line), we see that DISTINCT vs. ALL is set vs. bag (e.g., <http:// 
>> 66.102.9.104/search?q=cache:H8E-i-rh4N4J:www.csee.umbc.edu/ 
>> ~pmundur/courses/CMSC661-02/lecture7.ppt+SQL+semantics 
>> +distinct&hl=en&ct=clnk&cd=9&client=safari>)
>>
>> Similarly, DQL and OWLQL define various levels of refinement based  
>> on answer subsumption.
[snip]
>> I would also like to be a very strong push in for a strong anti- 
>> redundancy reading. I think 1) and 2) should have only one answer  
>> (if DISTINCT). The principle is that no DISTINCT set of answers  
>> should contain redundancy.
>
> We have to be sure that we know exactly what redundancy means.

Or agree on a notion. There are different notions, just like there  
are different notions of equality. My upfront principle is that we  
should clearly identify all the useful and justifiable forms of non- 
redundancy and decide which to support.

> A problem is that the answer set considered in isolation might seem  
> to be redundant, but does not seem redundant when more information  
> is added, information which is in the graph. (Similarly for  
> leanness of a graph: a lean graph can have non-lean subgraphs.)

These aren't exactly "similar", yes? In the second case, if you  
*remove information* from the lean graph, you get non-lean subgraphs.  
I.e., subgraph operations do not preserve leanness. Similarly, I  
don't see why algebraic operations on answer sets should be held to  
preserve non-redundancy. In fact, we know they don't and yet expect  
to remove introduced redundancy. Consider:

	:bijan :loves :mochi.
	:bijan :hates :mochi.

And these two queries:
	SELECT DISTINCT ?x ?p ?y {?x ?p ?y}
	SELECT DISTINCT ?x ?y {?x ?p ?y}

The second is the projection of the first.

The obvious answer sets are:
	?x		?p		?y
	:bijan	:loves	:mochi
	:bijan	:hates	:mochi

(Neither answer subsumes the other.)
and
	?x		?y
	:bijan	:mochi

But wait! If we consider the graph, there should be  
*two* :bijan :mochis! It tells us something about the graph!

I don't think "telling us something about the graph" can be sustained.

> So in the above case, for example, if the graph had contained
>
> _:x :p :mochi
> _:x :q :fred
> Bijan :p :mochi
>
> with the pattern query ?x :p ?y, then it might actually be quite  
> useful to be told that there are two answers, <_:x :mochi> and  
> <Bijan :mochi> when one has explicitly requested DISTINCT answers,  
> since this would tell you that there was something known about _:x,  
> whoever he is, to at least distinguish him from the other answers,  
> i.e. that the apparent redundancy was something of an illusion.

How does this not hold for :bijan above? Redundancy in answers is  
redundancy *in answers*, not with regard to the original graph.

This is why I think you are treating BNodes as names with this  
approach. The bnode is "lexically" distinct from :bijan, so it  
prevents subsumption of the answer. But that is not obviously in with  
RDF semantics. If the principle is "redundant from the point of view  
of information in the graph", then we should have two :bijan :mochi's  
above. It's hard to see that anyone endorses *that*.

> If told bnodes are useable,

Which we have explicitly punted on in SPARQL.

> this is almost a direct hint to ask some more questions about _:x;  
> and even if they are not, the information may be useful.

I would like to see some use case in which they are useful. I think  
they are misleading, myself. Plus, I can always get back that  
information in a non-misleading way by removing the projection.

>> This is akin to a lean graph, and is likely similarly  
>> computationally expensive.
>
> Worst case, but I don't think it will be too bad in almost all  
> practical cases.

Oh, sorry, didn't mean to suggest otherwise. I'm not afraid of  
leaning operations.

>>  (Note that source graph leanness is not sufficient, as 3) shows).  
>> Thus, I think this is more characterisitc of the semantics of RDF.
>
> I am not sure where you get this intuition about the RDF semantics  
> deciding this one way or the other.

Well, historically, it's because the "redundancy of answer set is  
tied to redunancy in the queried graph" reading never occurred to me.  
Upon examination, it is being inconsistently applied, so I think  
what's really going on is some feeling of "nameness" about BNodes.

> It doesn't seem to me to be an essentially *semantic* issue at all.

Well, I think it's based on a misreading of BNodes, as I said above.  
But let's put that aside for a moment.

>> I would encourage also text that made the decision parallel that  
>> of what I've seen of SQL ALL vs. DISTICT, to wit, that ALL is a  
>> *computational* computational compromised and not intended to  
>> correspond to the "math" of the situation.
>
> Agreed.
>
>> For many purposes, of course, that's just fine. Redundancy for  
>> time is a sensible tradeoff. And I applaud have predicable,  
>> "minimal" redundancy, that is, no more than what is in the graph.  
>> That's computationally and implementationally straightfoward.
>
> Agreed.
>
>> However, I think we should *not* encourage a "semantic" reading of  
>> that redundancy, where in people interpret the redundancy as a  
>> *significant* part of the data.
>
> Well... Im not sure what exactly you mean. We shouldn't encourage  
> people to think that the mere presence of a bnode means that there  
> is a separate 'thing' there, i.e. to treat bnodes as identifiers.

Actually, we can't assume that separate identifiers mean distinct  
things, yes? I've actually been wondering if RDF has not just the  
finite model property, but the single object model property. Except  
for XMLLiterals, can't ever name be mapped to a single entity? And  
every relation to a reflexive entity? Isn't that a model, always?  
(Except for XMLLiterals.) I mean, given the absence of equality or  
negation in any form, what can a theory do to distinguish entities?  
(Databases, etc., get around this by having the UNA and various forms  
of CWA.)

> But there can be cases where the apparent redundancy in an answer  
> *is* in a sense a significant part of the data, like the above.

Then you must explain why the two :bijan :mochis aren't redundant. Or  
embrace their non-redundancy.

> I can see good arguments both ways, but on balance I think I prefer  
> the idea of redundancy being determined by the graph rather than by  
> the answer document.

Obviously I come down very strongly on the other side.

> This only matters seriously when we allow told bnodes:

This isn't true as I've argued above.

> and in that case, we have basically agreed that the scope of these  
> bnodes is the scoping graph rather than an answer document, so this  
> seems the right context to use to determine redundancy.

I don't understand why algebraic operations can't introduce  
redundancy on their own. They clearly can. So I don't see how  
redunancy in general can be determined by the redundancy of the graph.

>> In other words, we're not supporting editors that care about the  
>> specific assertional structure of a graph.
>
> Hmm, why not? Isnt this exactly what a query engine *should* be  
> able to discover?

By "assertional" I mean "told". I think query answers should be  
largely indifferent to equivalent expressions.

> Maybe Im not following what exactly you mean.

Just that we shouldn't focus too much on querying the syntax, I mean,  
in principle. I realize that's what most implementations *do*, but  
it's a strange principle not clearly marked in the text. And I think  
we should make the semantics available. (Now, of course, we're  
disagreeing on what the semantics require. Let me weaken my principle  
to say that it should help people understand the semantics of the  
graph.)

Cheers,
Bijan.

Received on Wednesday, 16 August 2006 08:49:46 UTC