Re: ISSUE: DISTINCT is underspecified from Seaborne, Andy on 2006-08-14 (public-rdf-dawg@w3.org from July to September 2006)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Mon, 14 Aug 2006 16:32:01 +0100
To: Bijan Parsia <bparsia@cs.man.ac.uk>
CC: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-ID: <44E09771.8060305@hp.com>
Bijan Parsia wrote:
> On Aug 14, 2006, at 9:51 AM, Seaborne, Andy wrote:
> 
>> We seem to have lost some text at some time in the past: it used to  
>> say:
>>
>> """
>> Definition: Distinct Solution Sequence
>>
>> A Distinct Solution Sequence has no two solutions the same.
>>
>> For a solution sequence S = ( S1, S2, . . . , Sn), then write set 
>> (S) for the
>> set of solution sequences in S.
>>
>>      distinct(S) = (Si | Si != Sj for all i != j) and set(distinct 
>> (S)) = set(S)
>> """
> 
> That doesn't help (though it is nicer) until "!=" is defined.

Quite - I noted that below.

And the defn still has a bug in it :-(

> 
>> There is a layering with
>>
>>   * modifiers
>>   * algebra
>>   * BGP matching
>>
>> so DISTINCT is not directly referring to the matching but to the  
>> solutions.
> 
> Er...I don't understand what you mean here. I only think of DISTINCT  
> as referring to the end solutions, that is, what is ultimately  
> reported back from the evaluation of a query. This may require work  
> at various stages of the processing, I suppose, but I'd imagine that  
> that would be merely optimization.

My expectation is that DISTINCT is related to the effects of pattern as per my 
  earlier example, not that it is an additional semantic condition on the 
result set.

> 
>> So it's that "!=" :: I think it would be better to use language  
>> here and not
>> "!=" because it might imply a specific relationship to "!=" in  
>> filters.
>> "not the same" should mean "not the same when doing graph pattern  
>> matching"
> 
> I don't understand this, though I agree for the need of a specific  
> definition instead of relying on undefined symbols or words.
> 
>> D-entailment is not required of all systems.
> 
> Then I think we need a mechanism to indicate when this is required or  
> not. If D-entailment is not done, does that mean all tests involving  
> numeric entities fail?

That mechanism is the service description, surely?

In fact, in general, the exact characteristics of a service endpoint will be 
in the service description including entailment regime.

> 
>> So, if D-entailment were done in BGP matching, it should include  
>> that; if
>> D-entailment were not done, it should not include that.
>>
>> Data:
>> :x :p 1 .
>> :y :p "01"^^xsd:integer .
>>
>> Query:
>> SELECT DISTINCT ?v { ?a :p ?v }
>>
>> should have one solution if
>>
>> { :x :p ?v . :y :p ?v . }
>>
>> matches else it should have two.
> 
> Hmm. Again, I would have done it by analysis of the results. Need to  
> think more about it. This is not an unreasonable approach but it  
> seems to lead to counterintuitive results.
> 
>> Bijan Parsia wrote:
> [snip]
>>> BNODES:
>>>
>>> BNodes are much harder, overall. Consider the following answer set:
>>>
>>> 1)    ?x        ?y
>>>     _:x        :mochi
>>>     :Bijan    :mochi
>>>
>>> One (distinct) answer, or two?
>> Can't tell - it depends on the data and isn't a characteristic of  
>> the result set alone.
> 
> This is what I don't understand. It seems clearly a characteristic of  
> the result set alone.

The {?x ?p ?y} has a projection as well as DISTINCT, so let's consider:

SELECT DISTINCT ?x ?y { ?x :p ?y }

Data 1:
_:x    :p :m .
_:x    :q :r .
:Bijan :p :m .

Results 1:
x/_:x and ?x/:Bijan

Presence of a blank node in the results, indicates something in the data.

Data 2:
_:x    :p :m .
:Bijan :p :m .

Results 2:
x/_:x and ?x/:Bijan

Redundancy in the results because of redundancy in the graph queried seems 
consistent.


> 
> Consider a Constructed graph from that result set:
> 
> _:x :loves :mochi.
> :Bijan :loves :mochi.
> 
> (Template ?x loves ?y)
> 
> This is clearly redundant. We can tell by the results alone.

Isn't it a matter of leaning the result after template application?

> 
>> Placing the burden on the calculation of redundancy that requires  
>> inspecting the whole dataset is too much of a burden as we have  
>> discussed before.
> 
> We've discussed it in the context of the default answers (i.e., of  
> non-DISTINCT). I don't recall discussing it in the context of  
> DISTINCT. A pointer to that discussion would be helpful, thanks.

The burden is the same - I don't see that it is lessened by specifying 
DISTINCT.  But then I don't see DISTINCT as modifying the entailment regime. 
The SPARQL design I prefer has entailment in BGP matching but after that it is 
algebraic.  Sometime, hopefully soon, more research will show how to extend 
entailment regimes but that isn't ripe for this version of SPARQL - or be a 
different approach to more than just conjunctive matching.  And I don't 
believe in leaving design open if that is based on too much speculation on 
such future designs to the limitation of the current spec.  Give the research 
some space.

> 
> If you want to be permissive, why not take the attitude that the spec  
> has to D-entailment?

[I don't follow: "permissive" ... "has to [have] D-entailment"]

We should be open to the fact that some systems will have no D-entailment, 
some will have some forms of D-entailment.  Implementations will range from 
the small to the large.

Implementing value-based FILTERs, but not requiring D-entailment in BGPs is a 
compromise and one that has been shown to be workable.

> 
> Personally, I think we cannot avoid dealing with multiple sorts of  
> entailment, even in the RDF case, even aside from RDFS.

[And this should be captured via service descriptions.]

> 
> [snip]
>>> I would also like to be a very strong push in for a strong
>>> anti-redundancy reading. I think 1) and 2) should have only one  
>> answer
>>> (if DISTINCT). The principle is that no DISTINCT set of answers  
>> should
>>> contain redundancy. This is akin to a lean graph, and is likely
>>> similarly computationally expensive. (Note that source graph  
>> leanness is
>>> not sufficient, as 3) shows). Thus, I think this is more  
>> characterisitc
>>> of the semantics of RDF. I would encourage also text that made the
>>> decision parallel that of what I've seen of SQL ALL vs. DISTICT,  
>> to wit,
>>> that ALL is a *computational* computational compromised and not  
>> intended
>>> to correspond to the "math" of the situation. For many purposes, of
>>> course, that's just fine. Redundancy for time is a sensible  
>> tradeoff.
>>> And I applaud have predicable, "minimal" redundancy, that is, no  
>> more
>>> than what is in the graph. That's computationally and  
>> implementationally
>>> straightfoward. However, I think we should *not* encourage a  
>> "semantic"
>>> reading of that redundancy, where in people interpret the  
>> redundancy as
>>> a *significant* part of the data.
>>>
>>> In other words, we're not supporting editors that care about the
>>> specific assertional structure of a graph.
>> The structural access is an important use case.
> 
> For *DISTINCT* queries? I'd be surprised. However, we have to balance  
> that use case against others, and against consistency with exisiting  
> specifications yes?
> 
>>   Supporting editors wanting to access the structural and redundant  
>> nature of the graph is reasonable.
> 
> Surely that's a pretty small market, I would think.

In terms of effect on take up, I'd argue that input systems for data, editors 
et al., are important.


[[Arguing by market size can be tricky anyway:

http://lists.w3.org/Archives/Public/public-owl-dev/2006JulSep/0027.html

""" Bijan:
On the other, I don't think
they [Cerebra, Racer, KAON2] have smashing runaway sales.
"""
although that may be a different Bijan because it comes from @isr.umd.edu 
while this message comes from @cs.man.ac.uk.

:-)
]]

> 
>> It is also one that people expect to work.
> 
> But if they expect wrongly? The giving *semantic weight* to  
> structural redundancy pretty clearly, I would argue, violates the  
> semantics of RDF. And is inconsistent, since we do not respect URI  
> redundancy (why not?). Editors are a *very* specialized use case and  
> a rather dangerous one to generalize from (portals are different, I  
> think).

I prefer to take on board that user expectation because I see it as already 
being out there in existing toolkits and systems.  I don't think implementing 
SPARQL should require more work (implementation effect, computational 
requirements) than pattern matching.  I want to facilitate as widespread 
take-up in toolkits from small to large.

> 
> I think it's very important that the query language not give  
> *misleading* answers.
 >  Thus, I think we should have a non-redundant
> mode in some shape or form (we could have multiple semantics, for  
> example, as I proposed back in the day), or we should challenge the  
> current RDF semantics *explicitly*. Obviously, this is not in our  
> charter, so we have to at least kick it up a level.
> 
> I think, from a deployment and practice point of view, that the  
> existential reading of BNodes is *wrong*. That is, the RDF working  
> group made the *wrong* choice in formalizing them that way. But it  
> *is* the choice made, and there are some interesting aspects of it.  
> But I don't think we get to eat our cake and say that we're toasting  
> marshmallows.
> 
> Cheers,
> Bijan.
Received on Monday, 14 August 2006 15:32:33 UTC