Re: ISSUE: DISTINCT is underspecified from Bijan Parsia on 2006-08-14 (public-rdf-dawg@w3.org from July to September 2006)

From: Bijan Parsia <bparsia@cs.man.ac.uk>
Date: Mon, 14 Aug 2006 17:55:39 +0100
To: andy.seaborne@hp.com
Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-Id: <D20F6DA8-B201-4C9D-9DDF-D48A61386518@cs.man.ac.uk>
On Aug 14, 2006, at 4:32 PM, Seaborne, Andy wrote:

> Bijan Parsia wrote:
>> On Aug 14, 2006, at 9:51 AM, Seaborne, Andy wrote:
>>> We seem to have lost some text at some time in the past: it used  
>>> to  say:
>>>
>>> """
>>> Definition: Distinct Solution Sequence
>>>
>>> A Distinct Solution Sequence has no two solutions the same.
>>>
>>> For a solution sequence S = ( S1, S2, . . . , Sn), then write set  
>>> (S) for the
>>> set of solution sequences in S.
>>>
>>>      distinct(S) = (Si | Si != Sj for all i != j) and set 
>>> (distinct (S)) = set(S)
>>> """
>> That doesn't help (though it is nicer) until "!=" is defined.
>
> Quite - I noted that below.

Ah, yes. I didn't make the connection (being confused with something  
isn't exactly the same as being undefined). But good, we're on the  
same page on this point at least.

> And the defn still has a bug in it :-(
>
>>> There is a layering with
>>>
>>>   * modifiers
>>>   * algebra
>>>   * BGP matching
>>>
>>> so DISTINCT is not directly referring to the matching but to the   
>>> solutions.
>> Er...I don't understand what you mean here. I only think of  
>> DISTINCT  as referring to the end solutions, that is, what is  
>> ultimately  reported back from the evaluation of a query. This may  
>> require work  at various stages of the processing, I suppose, but  
>> I'd imagine that  that would be merely optimization.
>
> My expectation is that DISTINCT is related to the effects of  
> pattern as per my  earlier example, not that it is an additional  
> semantic condition on the result set.

So, this is a point of disagreement. DISTINCTing the results  
certainly has computational burdens, but not the ones you've pointed  
out (and even those can be reduced in a number of ways, I think, but  
that's a separate point).

Has the group formally decided on the interpretation?

>>> So it's that "!=" :: I think it would be better to use language   
>>> here and not
>>> "!=" because it might imply a specific relationship to "!=" in   
>>> filters.
>>> "not the same" should mean "not the same when doing graph  
>>> pattern  matching"
>> I don't understand this, though I agree for the need of a  
>> specific  definition instead of relying on undefined symbols or  
>> words.
>>> D-entailment is not required of all systems.
>> Then I think we need a mechanism to indicate when this is required  
>> or  not. If D-entailment is not done, does that mean all tests  
>> involving  numeric entities fail?
>
> That mechanism is the service description, surely?

There is no service description that I'm aware of :) I have grown  
more and more convinced that it is better to have these things in the  
query language and let the server fault as not handled (we can have  
redundant information in the service description so you don't have to  
experiment).

> In fact, in general, the exact characteristics of a service  
> endpoint will be in the service description including entailment  
> regime.

I think it is reasonable to let people determine on a query by query  
basis what sort of entailment they want from a server that is capable.

At the very least we must make it crystal clear what people can  
expect and how they can ground their expectations.

[snip]
>>> Bijan Parsia wrote:
>> [snip]
>>>> BNODES:
>>>>
>>>> BNodes are much harder, overall. Consider the following answer set:
>>>>
>>>> 1)    ?x        ?y
>>>>     _:x        :mochi
>>>>     :Bijan    :mochi
>>>>
>>>> One (distinct) answer, or two?
>>> Can't tell - it depends on the data and isn't a characteristic  
>>> of  the result set alone.
>> This is what I don't understand. It seems clearly a characteristic  
>> of  the result set alone.
>
> The {?x ?p ?y} has a projection as well as DISTINCT,

Yes, which I think it is the right approach. I didn't realize that we  
disagreed on that.

> so let's consider:
>
> SELECT DISTINCT ?x ?y { ?x :p ?y }
>
> Data 1:
> _:x    :p :m .
> _:x    :q :r .
> :Bijan :p :m .
>
> Results 1:
> x/_:x and ?x/:Bijan
>
> Presence of a blank node in the results, indicates something in the  
> data.

Shouldn't the results be:

<?x/_:x, ?y/:m>, <?x:bijan, ?y/:m>

Still think it's redundant. We actually don't know that there is  
anything distinct in the data. For all we know, :Bijan :q :r.

There are other *assertions* in the data, but I don't see that that  
should affect the answers.

Consider the following projection:

SELECT DISTINCT ?x {?x :p ?y}

:Bijan :p :m.
:Bijan :p :r.

Wouldn't we be horribly surprised to get:
<?x/:Bijan><?x/:Bijan>?

Yet each Bijan indicates that there *some* distinctions in the data.

> Data 2:
> _:x    :p :m .
> :Bijan :p :m .
>
> Results 2:
> x/_:x and ?x/:Bijan
>
> Redundancy in the results because of redundancy in the graph  
> queried seems consistent.

If it's redundancy in the results, then I don't see that it's a  
supportable reading of DISTINCT, given RDF semantics. Why this and  
not the two :Bijan's above?

(Two Bijan's are better than one!)

>> Consider a Constructed graph from that result set:
>> _:x :loves :mochi.
>> :Bijan :loves :mochi.
>> (Template ?x loves ?y)
>> This is clearly redundant. We can tell by the results alone.
>
> Isn't it a matter of leaning the result after template application?

So? My point is that CONSTRUCT allows us to detect BNode redundancy.  
So why is it different in SELECT DISTINCT?

>>> Placing the burden on the calculation of redundancy that  
>>> requires  inspecting the whole dataset is too much of a burden as  
>>> we have  discussed before.
>> We've discussed it in the context of the default answers (i.e.,  
>> of  non-DISTINCT). I don't recall discussing it in the context of   
>> DISTINCT. A pointer to that discussion would be helpful, thanks.
>
> The burden is the same - I don't see that it is lessened by  
> specifying DISTINCT.

Even if your understanding of results redundancy prevails, it's still  
lessoned. It's not required for every, or most queries.  
Implementations can choose to not support DISTINCT (or if we want  
more fine grained versions, we could have DISTINCT and  
REALLYREALLYDISTINCT :)).

>   But then I don't see DISTINCT as modifying the entailment regime.

I don't see how you can't, actually. Well, maybe not the entailmetn  
regime per se, but definitely the semantics of *something* in the  
picture.

> The SPARQL design I prefer has entailment in BGP matching but after  
> that it is algebraic.

I agree. But minimizing *can't* be done in the matching, because of  
the fact that algebraic operations *introduce* redundancy.

So, the very notion of redundancy is in dispute.

>   Sometime, hopefully soon, more research will show how to extend  
> entailment regimes but that isn't ripe for this version of SPARQL -  
> or be a different approach to more than just conjunctive matching.   
> And I don't believe in leaving design open if that is based on too  
> much speculation on such future designs to the limitation of the  
> current spec.  Give the research some space.

I don't believe I'm introducing a research agenda. I also believe we  
need to comport with the RDF Semantics as written, or do something  
special. I'm aware that this conflicts with common practice, indeed,  
I am *well* aware of this. But I don't think the group, as it stands,  
gets to pick and choose those constraints.

>> If you want to be permissive, why not take the attitude that the  
>> spec  has to D-entailment?
>
> [I don't follow: "permissive" ... "has to [have] D-entailment"]

That [have] is incorrect. The spec currently permits servers to  
support D-entailment, why are we so hot to forbid anti-redundancy?

> We should be open to the fact that some systems will have no D- 
> entailment, some will have some forms of D-entailment.   
> Implementations will range from the small to the large.

I agree, but we should be clearer about what their doing, when. At  
the very least, it's underspecified in the text, especially for RDF  
and RDFS.

> Implementing value-based FILTERs, but not requiring D-entailment in  
> BGPs is a compromise and one that has been shown to be workable.

It's not shown to be the only workable one, of course. And we have  
this constraint of the other documents. At the very very very least,  
I expect a fairly detailed explanation in the document. Plus, by your  
account above of redundancy, I would have thought that having data like:
	:b rdf:type rdfs:Literal.

Would mean that:
	isLiteral(:b)
Is true. It's a literal in the data.

Or how about:
	_:x rdf:type rdf:XMLLiteral

Which is plain jane RDF, no D-Entailment?
Is isLiteral(_:x) true?

>> Personally, I think we cannot avoid dealing with multiple sorts  
>> of  entailment, even in the RDF case, even aside from RDFS.
>
> [And this should be captured via service descriptions.]

We don't have them. Without *some* mechanism in place, I think we are  
underspecified. I think it should be in the query, so I can know what  
a query *means*, by analyzing the query itself. But either way, we  
need some text indicating this state of affairs clearly, or  
rectifying it.

[snip]
>>>   Supporting editors wanting to access the structural and  
>>> redundant  nature of the graph is reasonable.
>> Surely that's a pretty small market, I would think.
>
> In terms of effect on take up, I'd argue that input systems for  
> data, editors et al., are important.

Well, all I meant is that that the implementors of editors etc. are  
clearly a small subset of those who either use such systems or who  
query data. This is uncontroversial, yes?

> [[Arguing by market size can be tricky anyway:

I wasn't really, per se. I'm just pointing out that it isn't a  
central use. Actually, I'd love to see a pointer to an editor that  
uses a query language as its primary interaction with RDF. I don't  
know of any. At all.

Portals are different, but I think well served by a semantic reading.

> http://lists.w3.org/Archives/Public/public-owl-dev/2006JulSep/ 
> 0027.html
>
> """ Bijan:
> On the other, I don't think
> they [Cerebra, Racer, KAON2] have smashing runaway sales.
> """

I don't see what this has to do with anything, actually. I'm  
certainly not being inconsistent.

> although that may be a different Bijan because it comes from  
> @isr.umd.edu while this message comes from @cs.man.ac.uk.

No. It's just an effect of my not having resubscribed.

> :-)
> ]]
>
>>> It is also one that people expect to work.
>> But if they expect wrongly? The giving *semantic weight* to   
>> structural redundancy pretty clearly, I would argue, violates the   
>> semantics of RDF. And is inconsistent, since we do not respect  
>> URI  redundancy (why not?). Editors are a *very* specialized use  
>> case and  a rather dangerous one to generalize from (portals are  
>> different, I  think).
>
> I prefer to take on board that user expectation because I see it as  
> already being out there in existing toolkits and systems.

This is not an unreasonable position. But it *is* contrary to the  
specification of RDF. I'm open to changing RDF :)

> I don't think implementing SPARQL should require more work  
> (implementation effect, computational requirements) than pattern  
> matching.

Whereas I think that if we are going to continue to endorse the  
existing recommendations, that we must take them into account. We're  
in a normative, as well as descriptive, business here. I would be  
more sanguine about introducing a new style of reasoning (graph  
matching style) if it were an *additional* mode, rather than the  
prime mode.

> I want to facilitate as widespread take-up in toolkits from small  
> to large.
[snip]

Me too. But 1) I don't think the implementational challenges are as  
great as you make out, and two you go beyond that by insisting on  
your definition of DISTINCT, IMHO, since that assigns a semantic  
significance to BNodes not supported by their semantics.

And that's misleading. Misleading in a way that is in accordance with  
the dominant understanding of RDF (I had an interesting time  
explaining this to Kendall today :)), but misleading none the less.

If we are going to ditch RDF, I think we should 1) make it clear that  
that's what we are doing, and 2) make sure the wide enough community  
knows it.

Cheers,
Bijan.
Received on Monday, 14 August 2006 16:56:44 UTC