Re: Concrete vs. existential semantics

Wow, this time I mostly agree with Pat!
--e.

On 6 Jul 2006, at 23:50, Pat Hayes wrote:

>
>> I have been advocating for strict definitions of the number of
>> rows returned by queries.  As I understand it, Andy Seaborne
>> has advocated an opposite view, that SPARQL should not define
>> precisely how many duplicates are returned by a query.
>> For example, in
>> http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0005.html
>> "In general, it isn't possible to conclude anything about numbers
>> of things in RDF.  It is in OWL."
>> I have also heard the opinion that it does not matter whether
>> duplicates are eliminated from a UNION or not; I don't have
>> a name or message to cite for that opinion.  More generally,
>> I think there is an opinion that all SPARQL cares about is
>> that the result sequence, after eliminating duplicates, is correct.
>> Thus the result of a SELECT is not precisely defined;
>> only SELECT DISTINCT is.
>>
>> In this message I want to start a discussion on this.
>> As an initial foray, I will frame the question in terms
>> of "concrete" vs. "existential" semantics.
>>
>> I grant that it is difficult to impossible to be sure that
>> two seemingly-different IRIs refer to distinct things. I also  
>> grant that it is difficult to impossible to be sure that
>> two seemingly distinct blank nodes, conceived of as existentials,
>> are known to be distinct. However, I wonder whether it is a
>> good idea to base our semantics exclusively on these "existential"
>> insights.
>>
>> I think the naive view is that two things are distinct if they
>> look distinct.  Two IRIs that are spelled differently are
>> different.
>
> The IRIs are distinct, sure: but what they refer to might not be.
>
>>  Two blank nodes with different node identity
>> are different.
>
> If the identifiers are in different document scopes, that is a safe  
> presumption; and if they are different nodeIDs in the same scope,  
> then they identify different *nodes*, yes. But whether or not those  
> different nodes *refer to* the same resource, is an open question.  
> They might or they might not.
>
> It isn't clear whether you are talking about distinctness of the  
> names or of the things the names refer to. Which do you mean?
>
>>  (Blank node identifiers are proxies for node
>> identity; two blank nodes with different identifiers are
>> different).
>>
>> I think that in many instances, the users will want this kind of
>> concrete interpretation of an RDF graph.
>
> What do you mean by a concrete interpretation here? You seem to be  
> talking, above, not about the interpretations, but about the
> syntactic structure of the graphs themselves.
>
> Do you mean that users will want to be able to apply what is often
> called a unique name assumption (that distinct names refer to
> distinct things)? I agree that many applications do make this
> assumption, but for other kinds of application (e.g. text scraping)
> it is fatal. RDF does not, and should not, make this assumption.
>
>>  Further, I believe
>> that when one is working with a concrete interpretation,
>> duplicates may carry semantic meaning
>
> Not any sanctioned by the RDF/S or SPARQL specs.
>
>> and it is important to
>> define precisely how many duplicates are returned. I especially
>> believe this is true when there are financial figures involved.
>>
>> For example, imagine a purchase order encoded in RDF.
>> Each purchase order has an IRI.  Various facts about the PO
>> are assembled using verbs: bill-to, ship-to, and the line items.
>> Since bill-to, ship-to and line items are all compound objects,
>> they may be represented by blank nodes
>
> What makes you conclude that 'compound objects' may be represented  
> by blank nodes? There is a mistake lurking here, that blank nodes  
> are a kind of data structure. This is a bad RDF design. You should  
> have IRIs for the line items if you want them to be distinguishable  
> reliably, and want to be able to refer to them.
>
>> , which in turn connect
>> via various verbs to literals or IRIs.  Let's look at the line items
>> in particular.  A line item consists of a part number (an IRI),
>> a quantity (an xsd:integer), and a unit price (an xsd:decimal).
>> The user wants to find the total price of a particular PO.
>> The query looks something like this:
>>
>> SELECT ?quantity ?price
>> WHERE { some:IRI po:po _:lineitem .
>>         _:lineitem po:quantity ?quantity .
>>         _:lineitem po:price ?price . }
>>
>> Since SPARQL has no aggregates or expressions in its SELECT
>> list, the user intends to simply fetch all rows, multiply
>> ?quantity * ?price and take the sum himself.
>>
>> Now it can happen in a PO that the quantity and price of
>> two line items are identical.  However, suppressing such
>> duplicates would be fatal to this application.
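
To make the arithmetic concrete, here is a minimal Python sketch of the client-side sum described above. The rows and their values are hypothetical (they are not taken from any real PO); the only point is that the total changes if identical (quantity, price) rows are collapsed before the client sees them.

```python
from decimal import Decimal

# Hypothetical result rows (quantity, unit price) for one purchase order.
# Two line items happen to share the same quantity and price.
rows = [
    (1, Decimal("10.99")),
    (1, Decimal("10.99")),
    (3, Decimal("2.50")),
]


def total(result_rows):
    """Sum quantity * price over whatever rows the client received."""
    return sum((q * p for q, p in result_rows), Decimal("0"))


total_all = total(rows)            # every solution kept
total_distinct = total(set(rows))  # identical rows collapsed to one

print(total_all, total_distinct)
```

With these made-up figures the two totals differ by exactly one line item, which is the accounting error at issue.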
>
> The fatal mistake happened earlier in the design, when you treated  
> line items as mere existential appendages to POs. SPARQL can't be  
> expected to rescue a badly designed RDF application.
>
>> Note that adding the part number to the SELECT list will
>> not necessarily save the query, since the combination of
>> part number, quantity and price is still not a guaranteed
>> unique key for line items.  The user is relying on distinct
>> blank nodes to represent distinct line items.
>
> And they should not have done.
>
>> Of course, from the point of view of "RDF Semantics"
>
> I would appreciate not seeing the scare quotes. (I wonder, why do  
> people seem to think that semantics are fictional or optional? If I  
> were to use scare quotes when referring to "XML syntax" with  
> similar implied disdain, would this be an argument for allowing  
> unbalanced parentheses in an XML application?)
>
>> that would be a redundant graph, for example, one that
>> asserts "There exists a line item whose part is XYZ,
>> quantity is 1 and price is 10.99" and asserts again
>> "There exists a line item whose part is XYZ, quantity is 1 and
>> price is 10.99".  Thus one could say that this is a misuse
>> of RDF.
>
> Right, exactly. It would be, and IMO it is a misuse that we should  
> actively discourage.
>
>>  This may be technically true,
>
> No, it is simply TRUE. That is what the specs say.
>
>> but I wonder if insisting
>> on this point will really serve the users.
>
> Yes, in the long run it will, because it will force them to write  
> RDF applications that actually work correctly according to the RDF  
> specs, with actual RDF engines, instead of poorly constructed RDF  
> which will produce accounting errors.
>
>>  If you read the RDF
>> Primer, the application design above makes sense.
>
> The primer is, well, a primer. If ALL you have read is the primer,  
> then you shouldn't be implementing applications that work with real  
> money.
>
>> You have a line
>> item; you don't want to bother creating an IRI for each line
>> item; so you make a blank node for each line item.
>> "RDF Semantics", on the other hand, is a dense document
>> with talk about hypothetical universes that are interpretations
>> of a graph.  This is not the kind of material that will make
>> its way into seminars, courses, how-to books, etc.
>
> Well, actually it already has. In fact, ironically, a colleague of  
> yours recently told me how useful the RDF semantics had been for  
> Oracle's own RDF development work.
>
>> The early days of relational databases encountered the same
>> problem.  The theorists said a relational table is a set,
>> therefore it can have no duplicates, therefore it is up to
>> the user to insert some additional piece of information to
>> distinguish two otherwise-identical line items, to provide a
>> unique key.  Sounds great in theory; however, the vendors
>> discovered that they had to accommodate the naive view that
>> each row has its own identity and is distinct, without requiring
>> a unique key.
>>
>> A slightly different response is that RDF and SPARQL are not
>> targeted at such applications.  However, the introduction to
>> "OWL web ontology language guide" poses this scenario:
>> "consider actually assigning a software agent the task of
>> making a coherent set of travel arrangements."  If eventually
>> RDF databases and SPARQL queries are part of such a software
>> agent, then it will be necessary to make concrete assurances
>> about the total price of a travel plan.  In addition, the vision
>> is that the dataset will be aggregated from many sites, which
>> means that there will not be a central authority to impose
>> strict existential semantics.
>
> You have this exactly backwards. The existential semantics is the  
> 'un-strict' case, where you are not authorized to make risky  
> inferences precisely because there is no central authority to  
> impose a unique name assumption, to warrant you against the risk.
>
> Maybe your database treats distinct IRIs as referring to distinct  
> entities, but someone else's RDF might use ex:phayes and  
> ex:PatHayes to both refer to me, and know enough about owl:sameAs  
> to be able to handle this. And here am I trying to use RDF from  
> both sources: now what do I do about uniqueness of names? Can I  
> infer that [ex:phayes and ex:PatHayes] are two people because they  
> would have been if *you* had used the IRIs? Or (since IRIs have  
> global scope) would your database just have been flat wrong if it  
> had used both of these names? With blank nodes the situation is  
> worse, since there are so many ways to infer that distinct bnodes  
> co-refer that it is hard to count them. Maybe I know that one of  
> your RDF properties is inverse functional...
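
A rough Python sketch of why counting names is not the same as counting things. The name ex:fred and the helper count_known_groups are hypothetical, added only for illustration; the single co-reference pair stands in for whatever an owl:sameAs-aware source might assert. The result is only an upper bound on the number of referents, since nothing rules out further co-reference.

```python
def count_known_groups(names, same_as):
    """Count groups of names after merging pairs asserted to co-refer."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the group representative (path halving).
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in same_as:
        parent[find(a)] = find(b)
    return len({find(n) for n in names})


names = ["ex:phayes", "ex:PatHayes", "ex:fred"]

# Under a unique name assumption, every distinct name is a distinct thing.
under_una = len(set(names))

# Without it, one co-reference assertion already changes the count.
with_same_as = count_known_groups(names, [("ex:phayes", "ex:PatHayes")])

print(under_una, with_same_as)
```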
>
>> My suggestion is that we consider some syntactic way to
>> support both a "concrete" interpretation and an "existential"
>> interpretation.
>
> My suggestion is that we stick to the specs, and that SPARQL should  
> respect the RDF semantics as published, and not sanction  
> 'alternative' semantics. Having alternative semantics for a global  
> interchange notation is like being slightly pregnant.
>
>> My tentative initial solution is a three-way switch:
>> SELECT DISTINCT, SELECT ALL and SELECT LAX.  SELECT DISTINCT  
>> promises to remove duplicates,
>> SELECT ALL promises to deliver all duplicates,
>> and SELECT LAX makes no promises either way.  (Anyone have
>> a better keyword for this choice?)
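
One possible reading of the three promises, sketched in Python as a checker rather than an engine. The function conforms and the sample rows are hypothetical. The LAX rule used here (each solution appears at least once and no more often than in the full multiset) is an assumption about what "no promises either way" would mean in practice. Row order is ignored.

```python
from collections import Counter


def conforms(result, solutions, mode):
    """Check whether `result` keeps the promise of `mode` for `solutions`."""
    got, want = Counter(result), Counter(solutions)
    if set(got) != set(want):
        return False  # a solution is missing, or a spurious row was added
    if mode == "DISTINCT":
        return all(n == 1 for n in got.values())
    if mode == "ALL":
        return got == want
    if mode == "LAX":
        # Assumed rule: anywhere between one copy and all copies is fine.
        return all(1 <= got[r] <= want[r] for r in want)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical solutions: two identical rows plus one other row.
solutions = [("1", "10.99"), ("1", "10.99"), ("3", "2.50")]
deduplicated = list(set(solutions))

print(conforms(solutions, solutions, "ALL"))        # full multiset
print(conforms(deduplicated, solutions, "ALL"))     # duplicates dropped
print(conforms(deduplicated, solutions, "LAX"))     # allowed under LAX
```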
>>
>> I don't believe this is the complete solution to the issue.
>> The reason is that the issue of duplicates becomes more
>> complicated when using OWL entailment.  OWL permits the
>> deduction that two seemingly distinct IRIs or blank nodes
>> are in fact equal.
>
> No, it allows the deduction that *what two distinct IRIs refer to*  
> are equal. You are confusing use and mention.
>
>>  For example, if the reasoner can deduce
>> that some:IRI1 = some:IRI2, what should the reasoner return
>> for SELECT ALL?
>
> What reasoner? A query answering service is not the same thing as a  
> reasoner. I suggest we keep the two categories of engine distinct,  
> precisely to avoid getting embroiled in tar-pit issues like this.  
> If query answering is supposed to perform arbitrary OWL reasoning,  
> we are going to have some very long waits for answers.
>
>>  Does it return both even though it knows they
>> are equal?  If not, how does the user frame a query to ask
>> for all synonyms of some:IRI1?
>
> SELECT ?x WHERE { ?x owl:sameAs the:IRI }
>
> (assuming OWL rather than RDF entailment for the query answer  
> definition, of course.)
>
>>  What should the
>> reasoner return for SELECT DISTINCT?
>
> First we have to decide what that means. Do you mean distinct  
> bindings, or distinct referents?
>
> Pat
>
>
>>  Does it pick one of the
>> two arbitrarily?
>>
>> Fred
>
>
> -- 
> ---------------------------------------------------------------------
> IHMC		(850)434 8903 or (650)494 3973   home
> 40 South Alcaniz St.	(850)202 4416   office
> Pensacola			(850)202 4440   fax
> FL 32502			(850)291 0667    cell
> phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
>
>

Received on Thursday, 6 July 2006 22:07:35 UTC