RE: DISTINCT (was: Re: Queries over multiple graphs) from Seaborne, Andy on 2004-09-29 (public-rdf-dawg@w3.org from July to September 2004)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Wed, 29 Sep 2004 14:11:23 +0100
To: "Steve Harris" <S.W.Harris@ecs.soton.ac.uk>, "RDF Data Access Working Group" <public-rdf-dawg@w3.org>
Message-ID: <8D5B24B83C6A2E4B9E7EE5FA82627DC93967F9@sdcexcea01.emea.cpqcorp.net>
-------- Original Message --------
> From: Steve Harris <>
> Date: 29 September 2004 13:49
> 
> On Wed, Sep 29, 2004 at 01:05:29PM +0100, Andy Seaborne wrote:
> > > On Tue, Sep 28, 2004 at 06:19:29 +0100, Andy Seaborne wrote:
> > > > I prefer to have explicit DISTINCT.  I don't see having SELECT
> > > > returning duplicate rows contradicting RDF's set of statements if
> > > > the app writer only wants some of the variables.
> > > > 
> > > > If there is no DISTINCT, then there is one result for
> > > > every way the query can be matched.  Because SELECT can remove
> > > > variables, it is possible the application can't tell two
> > > > solutions (table rows, results) apart - but it can if there is
> > > > "SELECT *" or SELECT with all the variables. "SELECT DISTINCT"
> > > > means no two results are the same even when there are fewer
> > > > variables.  Hence "SELECT DISTINCT *" is a no-op.
> > > 
> > > Doesn't that assume that every statement in the system is unique at
> > > the triple level? That is not necessarily the case.
> > 
> > In the sense that an RDF graph is a set of statements, every statement
> > is unique. When querying "SELECT *" there will be one unique solution
> > for each way the query can match.  Hence each row is different in some
> > way.
> 
> Agreed, but the differences may not be apparent at the (s,p,o) triple
> level.
> 
> > If I understand 3Store correctly, it is as much a collection of graphs
> > to query - and does not present a concept of the RDF model of the
> > whole collection.
> 
> That is correct, if you disallow duplicate triples.
> 
> >              It's more like having an implicit "SOURCE *" around each
> > query pattern.
> 
> Possibly, depending on what semantics we agree for a store containing
> a set of graphs. I would prefer DAWG to not require a particular
> behaviour. In my RDQL implementation it's irrelevant as the implicit
> DISTINCT makes them appear the same. IIUC you would like to require that
> queries that don't use SOURCE will treat their entire contents as a
> single RDF graph with unique statements. That seems overly prescriptive
> to me. What's the motivation?

The motivation is to make the entire contents appear as an RDF graph.
Querying the target, an app can't see which it is - a single graph or
something that is made of named containers.

For better or worse, making it an RDF graph means no duplicate
statements.  It can be implemented in the merge or by ensuring the right
number of results are returned.

The effect of a query over all graphs can be achieved this way round by
    SELECT ?x ?y ?z WHERE SOURCE ?src (?x ?y ?z)

but I can't see how the reverse can be done: how to make an "RDF graph"
view work when there is either no DISTINCT or an implicit DISTINCT.

In an RDF graph view, SELECT ?x ?y WHERE (?x ?y ?z) would naturally have
duplicate rows, so an automatic implicit DISTINCT, which copes with the
"SELECT ?x ?y ?z WHERE (?x ?y ?z)" cases, means that this can't be done.
The DISTINCTness can't be controlled.
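
As a rough illustration of the projection point, here is a minimal
Python sketch. It is only a toy model: the graph contents, the select()
helper and its parameters are made up for this example and are not part
of any RDQL or DAWG design. The graph is a set of triples, the only
pattern is (?x ?y ?z), and DISTINCT is an explicit switch.

```python
# Hypothetical data: an RDF graph modelled as a Python set of (s, p, o) tuples.
graph = {
    (":a", ":b", ":c"),
    (":a", ":b", ":d"),
}

def select(graph, variables, distinct=False):
    """Match the pattern (?x ?y ?z) and project each match onto `variables`."""
    rows = []
    for s, p, o in sorted(graph):
        binding = {"x": s, "y": p, "z": o}
        rows.append(tuple(binding[v] for v in variables))
    if distinct:
        # Drop duplicate rows while keeping first-seen order.
        rows = list(dict.fromkeys(rows))
    return rows

all_vars = select(graph, ["x", "y", "z"])                      # one row per triple
projected = select(graph, ["x", "y"])                          # duplicates appear
projected_distinct = select(graph, ["x", "y"], distinct=True)  # duplicates removed
```

With all three variables selected every row already differs, so the
distinct switch changes nothing; once ?z is projected away the two
matches become indistinguishable unless DISTINCT is requested.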

This is one difference between a collection of graphs (a sort of store
of quads where the 4th slot is the graph id) and an aggregation store
where the view of the whole collection is an RDF graph.

Looks like test case query-1 was more of a test than I thought.  I
expected that one result would be agreed.

When we get into returning RDF graphs, I think this will need to be
decided although we could make different decisions in different places.

-----------------------------

---- Graph <u1>
:a :b :c

---- Graph <u2>
:a :b :c

-----------------------------

---- Query 1
SELECT *
FROM <u1>, <u2>
WHERE (?x ?y ?z)

---- Result:

?x = :a , ?y = :b , ?z = :c

-----------------------------
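
To make the two readings of this test case concrete, here is a small
Python sketch. It is an assumption-laden toy, not how 3Store or any
other system works: the store layout and the merged_view() and
quad_view() helpers are invented for illustration. One view merges the
graphs into a single set of statements; the other keeps the graph id as
a fourth slot.

```python
# The two graphs from the test case, keyed by graph name (toy representation).
store = {
    "u1": {(":a", ":b", ":c")},
    "u2": {(":a", ":b", ":c")},
}

def merged_view(store):
    """Aggregation view: the union of all graphs, which is itself a set."""
    merged = set()
    for triples in store.values():
        merged |= triples
    return merged

def quad_view(store):
    """Collection view: one (graph, s, p, o) quad per stored statement."""
    return sorted(
        (g, s, p, o) for g, triples in store.items() for (s, p, o) in triples
    )

# SELECT * WHERE (?x ?y ?z) over the merged graph.
merged_rows = sorted(merged_view(store))

# SELECT ?x ?y ?z WHERE SOURCE ?src (?x ?y ?z), with ?src projected away.
quad_rows = [(s, p, o) for (_g, s, p, o) in quad_view(store)]
```

In this sketch merged_rows holds the single result shown above, while
quad_rows holds the same row twice, once per source graph.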

> 
> > > > ?x = ...    // and no ?y
> > > > ?y = ...    // and no ?x
> > > 
> > > What about ?x=NULL, ?y=NULL and ?x=... , ?y=..., would those also
> > > be valid solutions? I think I'm not following this part. Possibly a
> > > more concrete example would help.
> > 
> > Yes - an example would help - and I think I got the example wrong.
> > OPTIONAL is "greedy" in that if it can bind it does.  No unbound is
> > generated if an OPTIONAL can match (A [B] is A+B if B matches else A).
> > I'll try to do an example in a separate mail thread.
> 
> Right, OK, I think that's what was confusing me.
> 
> - Steve
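
The "A [B] is A+B if B matches else A" rule quoted above can be sketched
in Python roughly as follows. This is only one possible reading of the
rule; the optional() helper, the match_mbox() function and the sample
data are hypothetical and chosen purely for illustration.

```python
def optional(solutions, extend):
    """A [B]: extend each solution of A with B where B matches, else keep A."""
    results = []
    for solution in solutions:
        extensions = extend(solution)
        if extensions:
            results.extend(extensions)  # B matched: only the A+B rows
        else:
            results.append(solution)    # B did not match: A on its own
    return results

# Hypothetical data: names are the mandatory part (A), mailboxes the optional part (B).
names = [{"x": ":p1", "name": "Alice"}, {"x": ":p2", "name": "Bob"}]
mboxes = {":p1": ["alice@example.org"]}

def match_mbox(solution):
    """Return every extension of `solution` with a matching mailbox."""
    return [dict(solution, mbox=m) for m in mboxes.get(solution["x"], [])]

rows = optional(names, match_mbox)
```

Under this reading the first solution is returned only in its extended
form, with no extra row leaving the optional variable unbound, and the
second solution is returned unchanged because the optional part cannot
match.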
Received on Wednesday, 29 September 2004 13:11:56 UTC