Re: Concrete vs. existential semantics from Pat Hayes on 2006-07-06 (public-rdf-dawg@w3.org from July to September 2006)

From: Pat Hayes <phayes@ihmc.us>
Date: Thu, 6 Jul 2006 16:50:12 -0500
To: Fred Zemke <fred.zemke@oracle.com>
Cc: public-rdf-dawg@w3.org
Message-Id: <p06230904c0d3246af758@[10.100.0.28]>
>I have been advocating for strict definitions of the number of
>rows returned by queries.  As I understand it, Andy Seaborne
>has advocated an opposite view, that SPARQL should not define
>precisely how many duplicates are returned by a query.
>For example, in
>http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0005.html
>"In general, it isn't possible to conclude anything about numbers
>of things in RDF.  It is in OWL."
>I have also heard the opinion that it does not matter whether
>duplicates are eliminated from a UNION or not; I don't have
>a name or message to cite for that opinion.  More generally,
>I think there is an opinion that all SPARQL cares about is
>that the result sequence, after eliminating duplicates, is correct.
>Thus the result of a SELECT is not precisely defined;
>only SELECT DISTINCT is.
>
>In this message I want to start a discussion on this.
>As an initial foray, I will frame the question in terms
>of "concrete" vs. "existential" semantics.
>
>I grant that it is difficult to impossible to be sure that
>two seemingly-different IRIs refer to distinct things. I also grant 
>that it is difficult to impossible to be sure that
>two seemingly distinct blank nodes, conceived of as existentials,
>are known to be distinct. However, I wonder whether it is a
>good idea to base our semantics exclusively on these "existential"
>insights.
>
>I think the naive view is that two things are distinct if they
>look distinct.  Two IRIs that are spelled differently are
>different.

The IRIs are distinct, sure: but what they refer to might not be.

>  Two blank nodes with different node identity
>are different.

If the identifiers are in different document scopes, that is a safe 
presumption; and if they are different nodeIDs in the same scope, 
then they identify different *nodes*, yes. But whether or not those 
different nodes *refer to* the same resource, is an open question. 
They might or they might not.

It isn't clear whether you are talking about distinctness of the 
names or of the things the names refer to. Which do you mean?

>  (Blank node identifiers are proxies for node
>identity; two blank nodes with different identifiers are
>different).
>
>I think that in many instances, the users will want this kind of
>concrete interpretation of an RDF graph.

What do you mean by a concrete interpretation here? You seem to be 
talking, above, not about the interpretations, but about the syntactic 
structure of the graphs themselves.

Do you mean, users will want to be able to apply what is often called 
a unique name assumption (that distinct names refer to distinct 
things)? I agree that many applications do make this assumption, but 
for other kinds of application (eg text scraping) it is fatal. RDF 
does not, and should not, make this assumption.

>  Further, I believe
>that when one is working with a concrete interpretation,
>duplicates may carry semantic meaning

Not any sanctioned by the RDF/S or SPARQL specs.

>and it is important to
>define precisely how many duplicates are returned. I especially
>believe this is true when there are financial figures involved.
>
>For example, imagine a purchase order encoded in RDF.
>Each purchase order has an IRI.  Various facts about the PO
>are assembled using verbs: bill-to, ship-to, and the line items.
>Since bill-to, ship-to and line items are all compound objects,
>they may be represented by blank nodes

What makes you conclude that 'compound objects' may be represented by 
blank nodes? There is a mistake lurking here, that blank nodes are a 
kind of data structure. This is a bad RDF design. You should have 
IRIs for the line items if you want them to be distinguishable 
reliably, and want to be able to refer to them.

>, which in turn connect
>via various verbs to literals or IRIs.  Let's look at the line items
>in particular.  A line item consists of a part number (an IRI),
>a quantity (an xsd:integer), and a unit price (an xsd:decimal).
>The user wants to find the total price of a particular PO.
>The query looks something like this:
>
>SELECT ?quantity ?price
>WHERE { some:IRI po:po _:lineitem .
>        _:lineitem po:quantity ?quantity .
>        _:lineitem po:price ?price . }
>
>Since SPARQL has no aggregates or expressions in its SELECT
>list, the user intends to simply fetch all rows, multiply
>?quantity * ?price and take the sum himself.
>
>Now it can happen in a PO that the quantity and price of
>two line items are identical.  However, suppressing such
>duplicates would be fatal to this application.

The fatal mistake happened earlier in the design, when you treated 
line items as mere existential appendages to POs. SPARQL can't be 
expected to rescue a badly designed RDF application.

>Note that adding the part number to the SELECT list will
>not necessarily save the query, since the combination of
>part number, quantity and price is still not a guaranteed
>unique key for line items.  The user is relying on distinct
>blank nodes to represent distinct line items.

And they should not have done.
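
To make the arithmetic concrete, here is a minimal Python sketch, not a SPARQL implementation. The line item IRIs (ex:po1-line1 and so on) and the figures are hypothetical, invented for illustration. It compares the total computed from the projected (quantity, price) rows with and without duplicate elimination, and then from rows that carry an IRI for each line item.

```python
from decimal import Decimal

# Hypothetical line items: (line item IRI, quantity, unit price).
LINE_ITEMS = [
    ("ex:po1-line1", 1, Decimal("10.99")),
    ("ex:po1-line2", 1, Decimal("10.99")),  # same quantity and price as line 1
    ("ex:po1-line3", 3, Decimal("2.50")),
]


def total(rows):
    """Sum quantity * price, taking the last two fields of each row."""
    return sum((row[-2] * row[-1] for row in rows), Decimal("0"))


# Projection onto (quantity, price) only, as in the quoted query.
projected = [(q, p) for _, q, p in LINE_ITEMS]

total_all = total(projected)            # every solution kept
total_distinct = total(set(projected))  # duplicate rows eliminated
total_by_iri = total(set(LINE_ITEMS))   # distinct rows that include the IRI
```

Under these assumptions the de-duplicated projection loses one 10.99 line, while the rows that include the line item IRI survive duplicate elimination intact, because the IRI itself keeps them apart.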

>Of course, from the point of view of "RDF Semantics"

I would appreciate not seeing the scare quotes. (I wonder, why do 
people seem to think that semantics are fictional or optional? If I 
were to use scare quotes when referring to "XML syntax" with similar 
implied disdain, would this be an argument for allowing unbalanced 
parentheses in an XML application?)

>that would be a redundant graph, for example, one that
>asserts "There exists a line item whose part is XYZ,
>quantity is 1 and price is 10.99" and asserts again
>"There exists a line item whose part is XYZ, quantity is 1 and
>price is 10.99".  Thus one could say that this is a misuse
>of RDF.

Right, exactly. It would be, and IMO it is a misuse that we should 
actively discourage.

>  This may be technically true,

No, it is simply TRUE. That is what the specs say.

>but I wonder if insisting
>on this point will really serve the users.

Yes, in the long run it will, because it will force them to write RDF 
applications that actually work correctly according to the RDF specs, 
with actual RDF engines, instead of poorly constructed RDF which will 
produce accounting errors.

>  If you read the RDF
>Primer, the application design above makes sense.

The primer is, well, a primer. If ALL you have read is the primer, 
then you shouldn't be implementing applications that work with real 
money.

>You have a line
>item; you don't want to bother creating an IRI for each line
>item; so you make a blank node for each line item.
>"RDF Semantics", on the other hand, is a dense document
>with talk about hypothetical universes that are interpretations
>of a graph.  This is not the kind of material that will make
>its way into seminars, courses, how-to books, etc.

Well, actually it already has. In fact, ironically, a colleague of 
yours recently told me how useful the RDF semantics had been for 
Oracle's own RDF development work.

>The early days of relational databases encountered the same
>problem.  The theorists said a relational table is a set,
>therefore it can have no duplicates, therefore it is up to
>the user to insert some additional piece of information to
>distinguish two otherwise-identical line items, to provide a
>unique key.  Sounds great in theory; however, the vendors
>discovered that they had to accommodate the naive view that
>each row has its own identity and is distinct, without requiring
>a unique key.
>
>A slightly different response is that RDF and SPARQL are not
>targeted at such applications.  However, the introduction to
>"OWL web ontology language guide" poses this scenario:
>"consider actually assigning a software agent the task of
>making a coherent set of travel arrangements."  If eventually
>RDF databases and SPARQL queries are part of such a software
>agent, then it will be necessary to make concrete assurances
>about the total price of a travel plan.  In addition, the vision
>is that the dataset will be aggregated from many sites, which
>means that there will not be a central authority to impose
>strict existential semantics.

You have this exactly backwards. The existential semantics is the 
'un-strict' case, where you are not authorized to make risky 
inferences precisely because there is no central authority to impose 
a unique name assumption, to warrant you against the risk.

Maybe your database treats distinct IRIs as referring to distinct 
entities, but someone else's RDF might use ex:phayes and ex:PatHayes 
to both refer to me, and know enough about owl:sameAs to be able to 
handle this. And here am I trying to use RDF from both sources: now 
what do I do about uniqueness of names? Can I infer that [ex:phayes 
and ex:PatHayes] are two people because they would have been if *you* 
had used the IRIs? Or (since IRIs have global scope) would your 
database just have been flat wrong if it had used both of these 
names? With blank nodes the situation is worse, since there are so 
many ways to infer that distinct bnodes co-refer that it is hard to 
count them. Maybe I know that one of your RDF properties is inverse 
functional...
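
The difference between counting names and counting referents can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function count_referents and the name ex:fzemke are hypothetical, and the merge step stands in for whatever owl:sameAs conclusions a consumer has already accepted. It performs no OWL reasoning. Without a unique name assumption the result is only an upper bound on the number of things referred to, since further equalities may turn up later.

```python
def count_referents(names, same_as):
    """Count groups of names after merging pairs asserted to co-refer.

    names   -- iterable of IRI strings
    same_as -- iterable of (a, b) pairs, both drawn from names
    """
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in same_as:
        parent[find(a)] = find(b)

    return len({find(n) for n in names})


names = ["ex:phayes", "ex:PatHayes", "ex:fzemke"]

# Treating every distinct spelling as a distinct thing.
count_by_name = count_referents(names, [])

# After accepting that ex:phayes and ex:PatHayes co-refer.
count_merged = count_referents(names, [("ex:phayes", "ex:PatHayes")])
```

With these inputs the count by spelling is three and the merged count is two, which is the gap between distinct names and distinct referents.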

>My suggestion is that we consider some syntactic way to
>support both a "concrete" interpretation and an "existential"
>interpretation.

My suggestion is that we stick to the specs, and that SPARQL should 
respect the RDF semantics as published, and not sanction 
'alternative' semantics. Having alternative semantics for a global 
interchange notation is like being slightly pregnant.

>My tentative initial solution is a three-way switch:
>SELECT DISTINCT, SELECT ALL and SELECT LAX.  SELECT DISTINCT 
>promises to remove duplicates,
>SELECT ALL promises to deliver all duplicates,
>and SELECT LAX makes no promises either way.  (Anyone have
>a better keyword for this choice?)
>
>I don't believe this is the complete solution to the issue.
>The reason is that the issue of duplicates becomes more
>complicated when using OWL entailment.  OWL permits the
>deduction that two seemingly distinct IRIs or blank nodes
>are in fact equal.

No, it allows the deduction that *what two distinct IRIs refer to* 
are equal. You are confusing use and mention.

>  For example, if the reasoner can deduce
>that some:IRI1 = some:IRI2, what should the reasoner return
>for SELECT ALL?

What reasoner? A query answering service is not the same thing as a 
reasoner. I suggest we keep the two categories of engine distinct, 
precisely to avoid getting embroiled in tar-pit issues like this. If 
query answering is supposed to perform arbitrary OWL reasoning, we 
are going to have some very long waits for answers.

>  Does it return both even though it knows they
>are equal?  If not, how does the user frame a query to ask
>for all synonyms of some:IRI1?

Select ?x where (?x owl:sameAs the:IRI)

(assuming OWL rather than RDF entailment for the query answer 
definition, of course.)

>  What should the
>reasoner return for SELECT DISTINCT?

First we have to decide what that means. Do you mean distinct 
bindings, or distinct referents?

Pat


>  Does it pick one of the
>two arbitrarily?
>
>Fred


-- 
---------------------------------------------------------------------
IHMC		(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32502			(850)291 0667    cell
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Thursday, 6 July 2006 21:50:24 UTC