Concrete vs. existential semantics from Fred Zemke on 2006-07-06 (public-rdf-dawg@w3.org from July to September 2006)

From: Fred Zemke <fred.zemke@oracle.com>
Date: Thu, 06 Jul 2006 12:56:23 -0700
To: public-rdf-dawg@w3.org
Message-ID: <44AD6AE7.7050006@oracle.com>

I have been advocating for strict definitions of the number of
rows returned by queries.  As I understand it, Andy Seaborne
has advocated an opposite view, that SPARQL should not define
precisely how many duplicates are returned by a query.
For example, in
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0005.html
"In general, it isn't possible to conclude anything about numbers
of things in RDF.  It is in OWL."
I have also heard the opinion that it does not matter whether
duplicates are eliminated from a UNION or not; I don't have
a name or message to cite for that opinion.  More generally,
I think there is an opinion that all SPARQL cares about is
that the result sequence, after eliminating duplicates, is correct.
Thus the result of a SELECT is not precisely defined;
only SELECT DISTINCT is.

In this message I want to start a discussion on this.
As an initial foray, I will frame the question in terms
of "concrete" vs. "existential" semantics.

I grant that it is difficult to impossible to be sure that
two seemingly-different IRIs refer to distinct things. 
I also grant that it is difficult to impossible to be sure that
two seemingly distinct blank nodes, conceived of as existentials,
are in fact distinct. However, I wonder whether it is a
good idea to base our semantics exclusively on these "existential"
insights.

I think the naive view is that two things are distinct if they
look distinct.  Two IRIs that are spelled differently are
different.  Two blank nodes with different node identity
are different.  (Blank node identifiers are proxies for node
identity; two blank nodes with different identifiers are
different).

I think that in many instances, the users will want this kind of
concrete interpretation of an RDF graph.  Further, I believe
that when one is working with a concrete interpretation,
duplicates may carry semantic meaning and it is important to
define precisely how many duplicates are returned. I especially
believe this is true when there are financial figures involved.

For example, imagine a purchase order encoded in RDF.
Each purchase order has an IRI.  Various facts about the PO
are assembled using verbs: bill-to, ship-to, and the line items.
Since bill-to, ship-to and line items are all compound objects,
they may be represented by blank nodes, which in turn connect
via various verbs to literals or IRIs.  Let's look at the line items
in particular.  A line item consists of a part number (an IRI),
a quantity (an xsd:integer), and a unit price (an xsd:decimal).
The user wants to find the total price of a particular PO.
The query looks something like this:

SELECT ?quantity ?price
WHERE { some:IRI po:po _:lineitem .
        _:lineitem po:quantity ?quantity .
        _:lineitem po:price ?price . }

Since SPARQL has no aggregates or expressions in its SELECT
list, the user intends to simply fetch all rows, multiply
?quantity * ?price and take the sum himself.

Now it can happen in a PO that the quantity and price of
two line items are identical.  However, suppressing such
duplicates would be fatal to this application.
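
To make the arithmetic concrete, here is a minimal Python sketch of the client-side summation. The rows are made-up sample data standing in for the (?quantity, ?price) solutions of the query above; the variable and function names are hypothetical and not part of any SPARQL draft.

```python
from decimal import Decimal

# Hypothetical solutions for (?quantity, ?price). Two distinct line items
# happen to have the same quantity and the same unit price.
rows = [
    (1, Decimal("10.99")),
    (1, Decimal("10.99")),
    (3, Decimal("2.50")),
]

def total(solutions):
    """Sum quantity * price over the given solutions."""
    return sum((Decimal(q) * p for q, p in solutions), Decimal("0"))

total_all = total(rows)            # every solution kept
total_distinct = total(set(rows))  # duplicate solutions suppressed

print(total_all)       # total when both identical line items are counted
print(total_distinct)  # total when one of them is silently dropped
```

With this sample data the two totals differ by exactly the value of the suppressed line item, which is the failure the application cannot tolerate.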

Note that adding the part number to the SELECT list will
not necessarily save the query, since the combination of
part number, quantity and price is still not a guaranteed
unique key for line items.  The user is relying on distinct
blank nodes to represent distinct line items.

Of course, from the point of view of "RDF Semantics"
that would be a redundant graph, for example, one that
asserts "There exists a line item whose part is XYZ,
quantity is 1 and price is 10.99" and asserts again
"There exists a line item whose part is XYZ, quantity is 1 and
price is 10.99".  Thus one could say that this is a misuse
of RDF.  This may be technically true, but I wonder if insisting
on this point will really serve the users.  If you read the RDF
Primer, the application design above makes sense. You have a line
item; you don't want to bother creating an IRI for each line
item; so you make a blank node for each line item.
"RDF Semantics", on the other hand, is a dense document
with talk about hypothetical universes that are interpretations
of a graph.  This is not the kind of material that will make
its way into seminars, courses, how-to books, etc.

The early days of relational databases encountered the same
problem.  The theorists said a relational table is a set,
therefore it can have no duplicates, therefore it is up to
the user to insert some additional piece of information to
distinguish two otherwise-identical line items, to provide a
unique key.  Sounds great in theory; however, the vendors
discovered that they had to accommodate the naive view that
each row has its own identity and is distinct, without requiring
a unique key.

A slightly different response is that RDF and SPARQL are not
targeted at such applications.  However, the introduction to
"OWL Web Ontology Language Guide" poses this scenario:
"consider actually assigning a software agent the task of
making a coherent set of travel arrangements."  If eventually
RDF databases and SPARQL queries are part of such a software
agent, then it will be necessary to make concrete assurances
about the total price of a travel plan.  In addition, the vision
is that the dataset will be aggregated from many sites, which
means that there will not be a central authority to impose
strict existential semantics.

My suggestion is that we consider some syntactic way to
support both a "concrete" interpretation and an "existential"
interpretation.

My tentative initial solution is a three-way switch:
SELECT DISTINCT, SELECT ALL and SELECT LAX.  
SELECT DISTINCT promises to remove duplicates,
SELECT ALL promises to deliver all duplicates,
and SELECT LAX makes no promises either way.  (Anyone have
a better keyword for this choice?)
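
As a rough sketch of what the three-way switch would promise, the Python below models each mode over a list of solutions. The function name `select` and the mode strings are hypothetical. The LAX branch shows only one of many permitted behaviours (it collapses adjacent duplicates and nothing else); the point is that a caller may rely on the set of distinct solutions but not on their multiplicity.

```python
def select(solutions, mode):
    """Model the proposed duplicate-handling modes over a list of solutions."""
    if mode == "ALL":
        # Promise: every solution is delivered, duplicates included.
        return list(solutions)
    if mode == "DISTINCT":
        # Promise: each distinct solution is delivered exactly once.
        return list(dict.fromkeys(solutions))
    if mode == "LAX":
        # No promise either way. This particular stand-in drops a duplicate
        # only when it is adjacent to its twin; another implementation could
        # return anything between the DISTINCT and the ALL result.
        out = []
        for s in solutions:
            if not out or out[-1] != s:
                out.append(s)
        return out
    raise ValueError(f"unknown mode: {mode}")

# Made-up sample solutions for (?quantity, ?price).
rows = [(1, "10.99"), (3, "2.50"), (1, "10.99"), (1, "10.99")]

all_rows = select(rows, "ALL")
distinct_rows = select(rows, "DISTINCT")
lax_rows = select(rows, "LAX")

print(len(all_rows), len(distinct_rows), len(lax_rows))
```

Under this model only SELECT ALL is safe for the purchase-order total, while SELECT LAX leaves an implementation free to do whatever is cheapest.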

I don't believe this is the complete solution to the issue.
The reason is that the issue of duplicates becomes more
complicated when using OWL entailment.  OWL permits the
deduction that two seemingly distinct IRIs or blank nodes
are in fact equal.  For example, if the reasoner can deduce
that some:IRI1 = some:IRI2, what should the reasoner return
for SELECT ALL?  Does it return both even though it knows they
are equal?  If not, how does the user frame a query to ask
for all synonyms of some:IRI1?  What should the
reasoner return for SELECT DISTINCT?  Does it pick one of the
two arbitrarily?
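
The open questions can be made concrete with a small Python sketch. It assumes, purely for illustration, that the reasoner's equality deductions are available as a list of pairs, and it groups terms into equivalence classes with a simple union-find. All names and data are hypothetical. The sketch does not answer which behaviour is right; it only shows the candidate results side by side.

```python
def find(parent, x):
    """Return the representative of x's equivalence class."""
    parent.setdefault(x, x)
    while parent[x] != x:
        x = parent[x]
    return x

def union(parent, a, b):
    """Record that a and b denote the same thing."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[ra] = rb

# Hypothetical bindings for one variable, and one deduced equality.
bindings = ["some:IRI1", "some:IRI2", "some:IRI3"]
deduced_equal = [("some:IRI1", "some:IRI2")]

parent = {}
for a, b in deduced_equal:
    union(parent, a, b)

# Candidate reading 1: return every spelling, even those known to be equal.
every_spelling = list(bindings)

# Candidate reading 2: return one arbitrarily chosen term per class.
one_per_class = sorted({find(parent, b) for b in bindings})

# The "all synonyms of some:IRI1" question needs reading 1 plus a filter.
target = find(parent, "some:IRI1")
synonyms = [b for b in bindings if find(parent, b) == target]

print(every_spelling, one_per_class, synonyms)
```

Note that reading 2 makes the synonym query impossible to express without some extra mechanism, which is the tension raised above.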

Fred
Received on Thursday, 6 July 2006 19:56:37 UTC