Re: ISSUE: DISTINCT is underspecified from Bijan Parsia on 2006-08-14 (public-rdf-dawg@w3.org from July to September 2006)

From: Bijan Parsia <bparsia@cs.man.ac.uk>
Date: Mon, 14 Aug 2006 13:21:35 +0100
To: andy.seaborne@hp.com
Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-Id: <28477B59-7AB9-41ED-AD13-9ECF2A7AB274@cs.man.ac.uk>

On Aug 14, 2006, at 9:51 AM, Seaborne, Andy wrote:

> We seem to have lost some text at some time in the past: it used to  
> say:
>
> """
> Definition: Distinct Solution Sequence
>
> A Distinct Solution Sequence has no two solutions the same.
>
> For a solution sequence S = ( S1, S2, . . . , Sn), then write set(S) for the
> set of solution sequences in S.
>
>      distinct(S) = (Si | Si != Sj for all i != j) and set(distinct(S)) = set(S)
> """

That doesn't help (though it is nicer) until "!=" is defined.
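
To make the dependence on "!=" concrete, here is a minimal sketch (not taken from any spec text) in which distinct(S) takes the sameness test as a parameter. The names `distinct`, `same_syntactic` and `same_ignoring_case` are illustrative assumptions, and the case-insensitive test is deliberately artificial: it only shows that the same sequence yields different answers under different notions of "the same".

```python
from typing import Callable, Dict, List

# A solution maps variable names to terms; terms are plain strings here.
Solution = Dict[str, str]


def distinct(solutions: List[Solution],
             same: Callable[[Solution, Solution], bool]) -> List[Solution]:
    """Keep the first solution of every group that `same` treats as equal."""
    kept: List[Solution] = []
    for candidate in solutions:
        if not any(same(candidate, k) for k in kept):
            kept.append(candidate)
    return kept


def same_syntactic(a: Solution, b: Solution) -> bool:
    # Term-for-term identity of the bindings.
    return a == b


def same_ignoring_case(a: Solution, b: Solution) -> bool:
    # Artificial, looser notion of sameness, used only for contrast.
    return ({k: v.lower() for k, v in a.items()}
            == {k: v.lower() for k, v in b.items()})


S = [{"v": "abc"}, {"v": "ABC"}, {"v": "abc"}]
by_identity = distinct(S, same_syntactic)
by_loose_test = distinct(S, same_ignoring_case)
```

Under identity the sequence keeps two solutions; under the looser test it keeps one. The definition of distinct(S) is the same in both cases, so it says nothing until the comparison is pinned down.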

> There is a layering with
>
>   * modifiers
>   * algebra
>   * BGP matching
>
> so DISTINCT is not directly referring to the matching but to the  
> solutions.

Er...I don't understand what you mean here. I only think of DISTINCT  
as referring to the end solutions, that is, what is ultimately  
reported back from the evaluation of a query. This may require work  
at various stages of the processing, I suppose, but I'd imagine that  
that would be merely optimization.

> So it's that "!=" :: I think it would be better to use language here and not
> "!=" because it might imply a specific relationship to "!=" in filters.
> "not the same" should mean "not the same when doing graph pattern matching"

I don't understand this, though I agree on the need for a specific  
definition instead of relying on undefined symbols or words.

> D-entailment is not required of all systems.

Then I think we need a mechanism to indicate when this is required or  
not. If D-entailment is not done, does that mean all tests involving  
numeric entities fail?

> So, if D-entailment were done in BGP matching, it should include that; if
> D-entailment were not done, it should not include that.
>
> Data:
> :x :p 1 .
> :y :p "01"^^xsd:integer .
>
> Query:
> SELECT DISTINCT ?v { ?a :p ?v }
>
> should have one solution if
>
> { :x :p ?v . :y :p ?v . }
>
> matches else it should have two.

Hmm. Again, I would have done it by analysis of the results. Need to  
think more about it. This is not an unreasonable approach but it  
seems to lead to counterintuitive results.
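
The two outcomes for the quoted data can be sketched as follows. This is only an illustration under stated assumptions: literals are modelled as (lexical form, datatype) pairs, `term_key` stands in for matching without D-entailment, and `value_key` stands in for a matcher that knows the xsd:integer value space. All three helper names are hypothetical.

```python
from typing import Callable, Hashable, List, NamedTuple

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


class Literal(NamedTuple):
    lexical: str
    datatype: str


def term_key(lit: Literal) -> Hashable:
    # No D-entailment: compare lexical form and datatype exactly as written.
    return lit


def value_key(lit: Literal) -> Hashable:
    # Datatype-aware: compare xsd:integer literals by their integer value.
    if lit.datatype == XSD_INTEGER:
        return (int(lit.lexical), lit.datatype)
    return lit


def distinct_by(values: List[Literal],
                key: Callable[[Literal], Hashable]) -> List[Literal]:
    """Keep the first value for each distinct key, preserving order."""
    seen = set()
    kept = []
    for v in values:
        k = key(v)
        if k not in seen:
            seen.add(k)
            kept.append(v)
    return kept


# Bindings of ?v for:  :x :p 1 .   :y :p "01"^^xsd:integer .
bindings = [Literal("1", XSD_INTEGER), Literal("01", XSD_INTEGER)]

without_d = distinct_by(bindings, term_key)
with_d = distinct_by(bindings, value_key)
```

Here `without_d` keeps both literals and `with_d` keeps only the first, which matches the "two solutions, else one" reading in the quoted text.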

> Bijan Parsia wrote:
[snip]
> > BNODES:
> >
> > BNodes are much harder, overall. Consider the following answer set:
> >
> > 1)    ?x        ?y
> >     _:x        :mochi
> >     :Bijan    :mochi
> >
> > One (distinct) answer, or two?
>
> Can't tell - it depends on the data and isn't a characteristic of  
> the result set alone.

This is what I don't understand. It seems clearly a characteristic of  
the result set alone.

Consider a Constructed graph from that result set:

_:x :loves :mochi.
:Bijan :loves :mochi.

(Template: ?x :loves ?y)

This is clearly redundant. We can tell by the results alone.
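
A rough sketch of that check, working on the result rows alone: a row is treated as redundant when some other row is an instance of it, that is, when consistently replacing its blank nodes with terms yields the other row. The function names are hypothetical, and the sketch assumes blank nodes are not shared across rows; a full leanness check would have to handle shared blank nodes and is more expensive.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, ...]


def is_bnode(term: str) -> bool:
    return term.startswith("_:")


def subsumed_by(row: Row, other: Row) -> bool:
    """True if `other` is an instance of `row`: each blank node in `row`
    can be replaced, consistently, by the term at the same position."""
    if len(row) != len(other):
        return False
    mapping: Dict[str, str] = {}
    for a, b in zip(row, other):
        if is_bnode(a):
            if mapping.setdefault(a, b) != b:
                return False
        elif a != b:
            return False
    return True


def lean_rows(rows: List[Row]) -> List[Row]:
    """Drop rows that another row makes redundant. When two rows subsume
    each other (e.g. duplicates), the earlier one is kept."""
    kept = []
    for i, row in enumerate(rows):
        redundant = any(
            j != i
            and subsumed_by(row, other)
            and (not subsumed_by(other, row) or j < i)
            for j, other in enumerate(rows)
        )
        if not redundant:
            kept.append(row)
    return kept


answers = [("_:x", ":mochi"), (":Bijan", ":mochi")]
reduced = lean_rows(answers)
```

For the answer set in 1), `reduced` contains only the `:Bijan` row, and no access to the source data was needed to get there.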

> Placing the burden on the calculation of redundancy that requires  
> inspecting the whole dataset is too much of a burden as we have  
> discussed before.

We've discussed it in the context of the default answers (i.e., of  
non-DISTINCT). I don't recall discussing it in the context of  
DISTINCT. A pointer to that discussion would be helpful, thanks.

If you want to be permissive, why not take the attitude that the spec  
has to D-entailment?

Personally, I think we cannot avoid dealing with multiple sorts of  
entailment, even in the RDF case, even aside from RDFS.

[snip]
> > I would also like to put in a very strong push for a strong
> > anti-redundancy reading. I think 1) and 2) should have only one answer
> > (if DISTINCT). The principle is that no DISTINCT set of answers should
> > contain redundancy. This is akin to a lean graph, and is likely
> > similarly computationally expensive. (Note that source graph leanness is
> > not sufficient, as 3) shows). Thus, I think this is more characteristic
> > of the semantics of RDF. I would also encourage text that made the
> > decision parallel what I've seen of SQL ALL vs. DISTINCT, to wit,
> > that ALL is a *computational* compromise and not intended
> > to correspond to the "math" of the situation. For many purposes, of
> > course, that's just fine. Redundancy for time is a sensible tradeoff.
> > And I applaud having predictable, "minimal" redundancy, that is, no more
> > than what is in the graph. That's computationally and implementationally
> > straightforward. However, I think we should *not* encourage a "semantic"
> > reading of that redundancy, wherein people interpret the redundancy as
> > a *significant* part of the data.
> >
> > In other words, we're not supporting editors that care about the
> > specific assertional structure of a graph.
>
> The structural access is an important use case.

For *DISTINCT* queries? I'd be surprised. However, we have to balance  
that use case against others, and against consistency with existing  
specifications, yes?

>   Supporting editors wanting to access the structural and redundant  
> nature of the graph is reasonable.

Surely that's a pretty small market, I would think.

> It is also one that people expect to work.

But what if they expect wrongly? Giving *semantic weight* to  
structural redundancy pretty clearly, I would argue, violates the  
semantics of RDF. And it is inconsistent, since we do not respect URI  
redundancy (why not?). Editors are a *very* specialized use case and  
a rather dangerous one to generalize from (portals are different, I  
think).

I think it's very important that the query language not give  
*misleading* answers. Thus, I think we should have a non-redundant  
mode in some shape or form (we could have multiple semantics, for  
example, as I proposed back in the day), or we should challenge the  
current RDF semantics *explicitly*. Obviously, this is not in our  
charter, so we have to at least kick it up a level.

I think, from a deployment and practice point of view, that the  
existential reading of BNodes is *wrong*. That is, the RDF working  
group made the *wrong* choice in formalizing them that way. But it  
*is* the choice made, and there are some interesting aspects of it.  
But I don't think we get to eat our cake and say that we're toasting  
marshmallows.

Cheers,
Bijan.
Received on Monday, 14 August 2006 12:21:46 UTC