General problem with cardinality of results in CR 6 April 2006 from Fred Zemke on 2006-06-08 (public-rdf-dawg@w3.org from April to June 2006)

From: Fred Zemke <fred.zemke@oracle.com>
Date: Thu, 08 Jun 2006 23:57:40 +0000
To: public-rdf-dawg@w3.org
Message-ID: <4488B936.10705@oracle.com>

The specification is imprecise about the cardinality of solutions
in several places.  I have heard it argued that all that matters is
the set of solutions (ie, duplicates can be dropped as an
implementation dependent or defined detail).  The problem with that
position is that the number of duplicates may be semantically
meaninful to the user.

This issue will become acute when we add aggregates, which has been
deferred to a future release.  When a user performs a count or sum,
the user expects precise semantics about the number of duplicates
to be counted or summed.  If we fail to specify those semantics in
this edition, then we will encourage differing implementations, which
then will have entrenched positions on cardinality issues when we try
to add the aggregates.  Therefore it is important to resolve the
cardinality issues at this stage.

Even in advance of adding aggregates, the user may be interested
in computing them on his own using the results of a SPARQL query.
To assure portable and interoperable results, we need to define
the number of duplicates precisely.

I see at least the following issues on cardinality:

a) the number of solutions to the empty pattern,
SELECT ?A WHERE {}.  I can see arguments for 0 solutions,
1 solution (the one that makes ?A undefined), n solutions where
n is the cardinality of the scoping set (one for each possible
binding of ?A) or n+1 (one for each possible binding, plus one in
which ?A is not bound).

b) the number of solutions to a UNION, for example,
SELECT ?A WHERE { { ?A ?A ?A } UNION { ?A ?A ?A } }

c) the number of solutions when a triple pattern includes a blank
node. For example SELECT ?A WHERE { ?A n:v _:B } on the graph
G = { (n:a, n:v, "1"), (n:a, n:v, "2") }.
Does this have one solution or two? 

Argument in favor of one solution:
there is only one mapping S from the set of variables to the set of
RDF terms such that S( ?A n:v _:B ) is a triple that can be merged
into G to produce a new graph that is simply entailed by G.  This is
using the definition of basic graph pattern E-matching in section
2.5.1 "General framework".

Argument in favor of two solutions: Section 2.5.2 "SPARQL basic
graph pattern matching" last paragraph says that under simple
entailment, pattern matching can be done by mapping both variables
and SPARQL blank nodes to RDF terms, testing to see if the result of
the mapping is a subgraph of G.  Under this formulation, there are
two solutions, one that maps _:B to "1" and the other that maps _:B
to "2".  This paragraph goes on to say that solutions are formed
by restricting such mappings to just the set of variables.  What
is unclear is whether the act of restricting involves discarding
duplicates.  Note that the fact that there is a DISTINCT modifier
shows that one can not presume that duplicates are discarded.

I am posting separate, more detailed comments on specific sections
with cardinality issues.

Fred

Received on Friday, 9 June 2006 05:40:17 UTC