- From: Enrico Franconi <franconi@inf.unibz.it>
- Date: Fri, 7 Jul 2006 00:06:53 +0200
- To: Pat Hayes <phayes@ihmc.us>
- Cc: Fred Zemke <fred.zemke@oracle.com>, public-rdf-dawg@w3.org
- Message-Id: <08D60634-ECCA-4ED1-A319-595A1F268B02@inf.unibz.it>
Wow, this time I mostly agree with Pat!

--e.

On 6 Jul 2006, at 23:50, Pat Hayes wrote:

>
>> I have been advocating for strict definitions of the number of
>> rows returned by queries. As I understand it, Andy Seaborne
>> has advocated an opposite view, that SPARQL should not define
>> precisely how many duplicates are returned by a query.
>> For example, in
>> http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0005.html
>> "In general, it isn't possible to conclude anything about numbers
>> of things in RDF. It is in OWL."
>> I have also heard the opinion that it does not matter whether
>> duplicates are eliminated from a UNION or not; I don't have
>> a name or message to cite for that opinion. More generally,
>> I think there is an opinion that all SPARQL cares about is
>> that the result sequence, after eliminating duplicates, is correct.
>> Thus the result of a SELECT is not precisely defined;
>> only SELECT DISTINCT is.
>>
>> In this message I want to start a discussion on this.
>> As an initial foray, I will frame the question in terms
>> of "concrete" vs. "existential" semantics.
>>
>> I grant that it is difficult to impossible to be sure that
>> two seemingly-different IRIs refer to distinct things. I also
>> grant that it is difficult to impossible to be sure that
>> two seemingly distinct blank nodes, conceived of as existentials,
>> are known to be distinct. However, I wonder whether it is a
>> good idea to base our semantics exclusively on these "existential"
>> insights.
>>
>> I think the naive view is that two things are distinct if they
>> look distinct. Two IRIs that are spelled differently are
>> different.
>
> The IRIs are distinct, sure: but what they refer to might not be.
>
>> Two blank nodes with different node identity
>> are different.
>
> If the identifiers are in different document scopes, that is a safe
> presumption; and if they are different nodeIDs in the same scope,
> then they identify different *nodes*, yes.
> But whether or not those
> different nodes *refer to* the same resource is an open question.
> They might or they might not.
>
> It isn't clear whether you are talking about distinctness of the
> names or of the things the names refer to. Which do you mean?
>
>> (Blank node identifiers are proxies for node
>> identity; two blank nodes with different identifiers are
>> different).
>>
>> I think that in many instances, the users will want this kind of
>> concrete interpretation of an RDF graph.
>
> What do you mean by a concrete interpretation here? You seem to be
> talking, above, not about the interpretations, but about the
> syntactic structure of the graphs themselves.
>
> Do you mean, users will want to be able to apply what is often
> called a unique name assumption (that distinct names refer to
> distinct things)? I agree that many applications do make this
> assumption, but for other kinds of application (e.g. text scraping)
> it is fatal. RDF does not, and should not, make this assumption.
>
>> Further, I believe
>> that when one is working with a concrete interpretation,
>> duplicates may carry semantic meaning
>
> Not any sanctioned by the RDF/S or SPARQL specs.
>
>> and it is important to
>> define precisely how many duplicates are returned. I especially
>> believe this is true when there are financial figures involved.
>>
>> For example, imagine a purchase order encoded in RDF.
>> Each purchase order has an IRI. Various facts about the PO
>> are assembled using verbs: bill-to, ship-to, and the line items.
>> Since bill-to, ship-to and line items are all compound objects,
>> they may be represented by blank nodes
>
> What makes you conclude that 'compound objects' may be represented
> by blank nodes? There is a mistake lurking here, that blank nodes
> are a kind of data structure. This is a bad RDF design. You should
> have IRIs for the line items if you want them to be distinguishable
> reliably, and want to be able to refer to them.
>
>> , which in turn connect
>> via various verbs to literals or IRIs. Let's look at the line items
>> in particular. A line item consists of a part number (an IRI),
>> a quantity (an xsd:integer), and a unit price (an xsd:decimal).
>> The user wants to find the total price of a particular PO.
>> The query looks something like this:
>>
>> SELECT ?quantity ?price
>> WHERE { some:IRI po:po _:lineitem .
>>         _:lineitem po:quantity ?quantity .
>>         _:lineitem po:price ?price . }
>>
>> Since SPARQL has no aggregates or expressions in its SELECT
>> list, the user intends to simply fetch all rows, multiply
>> ?quantity * ?price and take the sum himself.
>>
>> Now it can happen in a PO that the quantity and price of
>> two line items are identical. However, suppressing such
>> duplicates would be fatal to this application.
>
> The fatal mistake happened earlier in the design, when you treated
> line items as mere existential appendages to POs. SPARQL can't be
> expected to rescue a badly designed RDF application.
>
>> Note that adding the part number to the SELECT list will
>> not necessarily save the query, since the combination of
>> part number, quantity and price is still not a guaranteed
>> unique key for line items. The user is relying on distinct
>> blank nodes to represent distinct line items.
>
> And they should not have done.
>
>> Of course, from the point of view of "RDF Semantics"
>
> I would appreciate not seeing the scare quotes. (I wonder, why do
> people seem to think that semantics are fictional or optional? If I
> were to use scare quotes when referring to "XML syntax" with
> similar implied disdain, would this be an argument for allowing
> unbalanced parentheses in an XML application?)
>
>> that would be a redundant graph, for example, one that
>> asserts "There exists a line item whose part is XYZ,
>> quantity is 1 and price is 10.99" and asserts again
>> "There exists a line item whose part is XYZ, quantity is 1 and
>> price is 10.99".
>> Thus one could say that this is a misuse
>> of RDF.
>
> Right, exactly. It would be, and IMO it is a misuse that we should
> actively discourage.
>
>> This may be technically true,
>
> No, it is simply TRUE. That is what the specs say.
>
>> but I wonder if insisting
>> on this point will really serve the users.
>
> Yes, in the long run it will, because it will force them to write
> RDF applications that actually work correctly according to the RDF
> specs, with actual RDF engines, instead of poorly constructed RDF
> which will produce accounting errors.
>
>> If you read the RDF
>> Primer, the application design above makes sense.
>
> The primer is, well, a primer. If ALL you have read is the primer,
> then you shouldn't be implementing applications that work with real
> money.
>
>> You have a line
>> item; you don't want to bother creating an IRI for each line
>> item; so you make a blank node for each line item.
>> "RDF Semantics", on the other hand, is a dense document
>> with talk about hypothetical universes that are interpretations
>> of a graph. This is not the kind of material that will make
>> its way into seminars, courses, how-to books, etc.
>
> Well, actually it already has. In fact, ironically, a colleague of
> yours recently told me how useful the RDF semantics had been for
> Oracle's own RDF development work.
>
>> The early days of relational databases encountered the same
>> problem. The theorists said a relational table is a set,
>> therefore it can have no duplicates, therefore it is up to
>> the user to insert some additional piece of information to
>> distinguish two otherwise-identical line items, to provide a
>> unique key. Sounds great in theory; however, the vendors
>> discovered that they had to accommodate the naive view that
>> each row has its own identity and is distinct, without requiring
>> a unique key.
>>
>> A slightly different response is that RDF and SPARQL are not
>> targeted at such applications.
>> However, the introduction to
>> "OWL web ontology language guide" poses this scenario:
>> "consider actually assigning a software agent the task of
>> making a coherent set of travel arrangements." If eventually
>> RDF databases and SPARQL queries are part of such a software
>> agent, then it will be necessary to make concrete assurances
>> about the total price of a travel plan. In addition, the vision
>> is that the dataset will be aggregated from many sites, which
>> means that there will not be a central authority to impose
>> strict existential semantics.
>
> You have this exactly backwards. The existential semantics is the
> 'un-strict' case, where you are not authorized to make risky
> inferences precisely because there is no central authority to
> impose a unique name assumption, to warrant you against the risk.
>
> Maybe your database treats distinct IRIs as referring to distinct
> entities, but someone else's RDF might use ex:phayes and
> ex:PatHayes to both refer to me, and know enough about owl:sameAs
> to be able to handle this. And here am I trying to use RDF from
> both sources: now what do I do about uniqueness of names? Can I
> infer that [ex:phayes and ex:PatHayes] are two people because they
> would have been if *you* had used the IRIs? Or (since IRIs have
> global scope) would your database just have been flat wrong if it
> had used both of these names? With blank nodes the situation is
> worse, since there are so many ways to infer that distinct bnodes
> co-refer that it is hard to count them. Maybe I know that one of
> your RDF properties is inverse functional...
>
>> My suggestion is that we consider some syntactic way to
>> support both a "concrete" interpretation and an "existential"
>> interpretation.
>
> My suggestion is that we stick to the specs, and that SPARQL should
> respect the RDF semantics as published, and not sanction
> 'alternative' semantics.
> Having alternative semantics for a global
> interchange notation is like being slightly pregnant.
>
>> My tentative initial solution is a three-way switch:
>> SELECT DISTINCT, SELECT ALL and SELECT LAX. SELECT DISTINCT
>> promises to remove duplicates,
>> SELECT ALL promises to deliver all duplicates,
>> and SELECT LAX makes no promises either way. (Anyone have
>> a better keyword for this choice?)
>>
>> I don't believe this is the complete solution to the issue.
>> The reason is that the issue of duplicates becomes more
>> complicated when using OWL entailment. OWL permits the
>> deduction that two seemingly distinct IRIs or blank nodes
>> are in fact equal.
>
> No, it allows the deduction that *what two distinct IRIs refer to*
> are equal. You are confusing use and mention.
>
>> For example, if the reasoner can deduce
>> that some:IRI1 = some:IRI2, what should the reasoner return
>> for SELECT ALL?
>
> What reasoner? A query answering service is not the same thing as a
> reasoner. I suggest we keep the two categories of engine distinct,
> precisely to avoid getting embroiled in tar-pit issues like this.
> If query answering is supposed to perform arbitrary OWL reasoning,
> we are going to have some very long waits for answers.
>
>> Does it return both even though it knows they
>> are equal? If not, how does the user frame a query to ask
>> for all synonyms of some:IRI1?
>
> SELECT ?x WHERE { ?x owl:sameAs the:IRI }
>
> (assuming OWL rather than RDF entailment for the query answer
> definition, of course.)
>
>> What should the
>> reasoner return for SELECT DISTINCT?
>
> First we have to decide what that means. Do you mean distinct
> bindings, or distinct referents?
>
> Pat
>
>
>> Does it pick one of the
>> two arbitrarily?
>>
>> Fred
>
>
> --
> ---------------------------------------------------------------------
> IHMC                   (850)434 8903 or (650)494 3973   home
> 40 South Alcaniz St.   (850)202 4416   office
> Pensacola              (850)202 4440   fax
> FL 32502               (850)291 0667   cell
> phayesAT-SIGNihmc.us   http://www.ihmc.us/users/phayes
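
The purchase-order disagreement in the quoted exchange comes down to whether query solutions are treated as a bag or as a set. Here is a minimal Python sketch of that arithmetic. It is not taken from the thread: the line items, prices, and example.org identifiers are hypothetical illustration data, and plain Python collections stand in for whatever a SPARQL engine would return.

```python
from decimal import Decimal

# Hypothetical (quantity, unit price) solutions for one purchase order.
# The first two line items agree on every selected variable.
solutions = [
    (1, Decimal("10.99")),
    (1, Decimal("10.99")),
    (2, Decimal("5.00")),
]


def total(rows):
    """Sum quantity * price over an iterable of (quantity, price) rows."""
    return sum((qty * price for qty, price in rows), Decimal("0"))


# Bag reading (all duplicates delivered): every line item is counted.
bag_total = total(solutions)

# Set reading (duplicates eliminated, as with SELECT DISTINCT):
# the two identical rows collapse into one, so the total comes out low.
set_total = total(set(solutions))

# The design Pat recommends: each line item gets its own identifier.
# These example.org IRIs are made up for illustration.
keyed = {
    ("http://example.org/po/1#item1", 1, Decimal("10.99")),
    ("http://example.org/po/1#item2", 1, Decimal("10.99")),
    ("http://example.org/po/1#item3", 2, Decimal("5.00")),
}
# The rows are distinct because the identifiers differ, so eliminating
# duplicates before projecting to (quantity, price) loses nothing.
keyed_total = total((qty, price) for _, qty, price in keyed)

print(bag_total)    # 31.98
print(set_total)    # 20.99
print(keyed_total)  # 31.98
```

With an identifier per line item the total no longer depends on how the engine treats duplicates, which is the point of Pat's objection to relying on blank-node identity.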
Received on Thursday, 6 July 2006 22:07:35 UTC