- From: Seaborne, Andy <andy.seaborne@hp.com>
- Date: Mon, 10 Jul 2006 13:15:33 +0100
- To: Enrico Franconi <franconi@inf.unibz.it>, Pat Hayes <phayes@ihmc.us>, Fred Zemke <fred.zemke@oracle.com>, public-rdf-dawg@w3.org
We have to start somewhere and, for me, that means building on the existing RDF standard.

SPARQL is a web language. Merging data is an important aspect, so IRIs that look different may well refer to the same object. (If IRIs are issued by a single, controlled application environment, the unique name assumption is a reasonable choice, but that isn't the case on the web.)

- - - - - -

This does not seem to be anything about blank nodes in queries - this query has the same issues where there is a blank node in the data for line items:

  SELECT ?quantity ?price
  WHERE {
    some:IRI po:po ?lineItem .
    ?lineItem po:quantity ?quantity ;
              po:price ?price .
  }

The question is whether the number of results is based on the number of (?lineItem, ?quantity, ?price) or the number of (?quantity, ?price).

Rewording: would

  SELECT ?x { ?x ?p ?o }

count nodes in the graph or triples? Or not count at all? An RDF graph is a set of triples (by definition).

In the case where there is a unique, finite logical closure (under some set of rules, not necessarily fixed to RDFS etc.), we could define SPARQL more restrictively to have a solution sequence that returns the ways in which the graph pattern matches and respects duplicates; blank nodes in BGP matching behave like named variables (and named variables have existential characteristics, because they have to be bound for a BGP match anyway). (The 'finite' is because simple entailment leads to trivially infinite graphs - the lean form is finite.)

Triple counting would give you the effect of (assuming line items have IRIs or some property that identifies them):

  SELECT ?cost := ?quantity*?price    # Not mentioning lineItem
  WHERE {
    some:IRI po:po ?lineItem .
    ?lineItem po:quantity ?quantity ;
              po:price ?price .
  }

but it does not extend to, for example, OWL disjunction, where there is no defined closure and there is no syntactic (virtual) graph to query. Where there is not a unique logical closure, I look to others to provide the semantics.
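
The counting question can be made concrete with a small sketch. It is plain Python standing in for a query engine, not SPARQL and not how ARQ or any other implementation works. The names `po1`, `item1` to `item3`, `match_line_items`, and the sample figures are all assumptions made up for illustration. It contrasts counting one solution per pattern match (per ?lineItem) with counting distinct (?quantity, ?price) pairs.

```python
from decimal import Decimal

# Hypothetical graph: one purchase order with three line items, two of
# which have the same quantity and price. "item1".."item3" stand in for
# node identity (blank nodes or IRIs); none of this is from the thread.
triples = {
    ("po1", "po:po", "item1"),
    ("po1", "po:po", "item2"),
    ("po1", "po:po", "item3"),
    ("item1", "po:quantity", 1), ("item1", "po:price", Decimal("10.99")),
    ("item2", "po:quantity", 1), ("item2", "po:price", Decimal("10.99")),
    ("item3", "po:quantity", 2), ("item3", "po:price", Decimal("5.00")),
}


def match_line_items(graph, po):
    """Return one (lineItem, quantity, price) tuple per pattern match."""
    solutions = []
    for s, p, item in sorted(graph, key=str):
        if s != po or p != "po:po":
            continue
        quantities = [o for (s2, p2, o) in graph
                      if s2 == item and p2 == "po:quantity"]
        prices = [o for (s2, p2, o) in graph
                  if s2 == item and p2 == "po:price"]
        for q in quantities:
            for price in prices:
                solutions.append((item, q, price))
    return solutions


solutions = match_line_items(triples, "po1")

# Projection that keeps one row per match (duplicates respected).
projected = [(q, price) for (_item, q, price) in solutions]

# Projection with duplicates removed, as SELECT DISTINCT would do.
distinct = set(projected)

total_all = sum(q * price for q, price in projected)
total_distinct = sum(q * price for q, price in distinct)

print(len(projected), total_all)        # rows and total, duplicates kept
print(len(distinct), total_distinct)    # rows and total, duplicates removed
```

With this sample data the two projections give different row counts and different totals, which is why the application cares whether the result is defined per match or per distinct binding of the selected variables.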

It seems natural to have the semantics in the unique logical closure case be a restriction of a more general set of semantics.

For what it's worth, ARQ does treat blank nodes in queries as system-named variables. This is to support the use case of navigating the graph in order to edit it from the application. And restriction to purely lean graphs is not workable.

	Andy

Enrico Franconi wrote:
> Wow, this time I mostly agree with Pat!
> --e.
>
> On 6 Jul 2006, at 23:50, Pat Hayes wrote:
>
>>> I have been advocating for strict definitions of the number of
>>> rows returned by queries. As I understand it, Andy Seaborne
>>> has advocated an opposite view, that SPARQL should not define
>>> precisely how many duplicates are returned by a query.
>>> For example, in
>>> http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0005.html
>>> "In general, it isn't possible to conclude anything about numbers
>>> of things in RDF. It is in OWL."
>>> I have also heard the opinion that it does not matter whether
>>> duplicates are eliminated from a UNION or not; I don't have
>>> a name or message to cite for that opinion. More generally,
>>> I think there is an opinion that all SPARQL cares about is
>>> that the result sequence, after eliminating duplicates, is correct.
>>> Thus the result of a SELECT is not precisely defined;
>>> only SELECT DISTINCT is.
>>>
>>> In this message I want to start a discussion on this.
>>> As an initial foray, I will frame the question in terms
>>> of "concrete" vs. "existential" semantics.
>>>
>>> I grant that it is difficult to impossible to be sure that
>>> two seemingly-different IRIs refer to distinct things. I also
>>> grant that it is difficult to impossible to be sure that
>>> two seemingly distinct blank nodes, conceived of as existentials,
>>> are known to be distinct. However, I wonder whether it is a
>>> good idea to base our semantics exclusively on these "existential"
>>> insights.
>>>
>>> I think the naive view is that two things are distinct if they
>>> look distinct. Two IRIs that are spelled differently are
>>> different.
>> The IRIs are distinct, sure: but what they refer to might not be.
>>
>>> Two blank nodes with different node identity
>>> are different.
>> If the identifiers are in different document scopes, that is a safe
>> presumption; and if they are different nodeIDs in the same scope,
>> then they identify different *nodes*, yes. But whether or not those
>> different nodes *refer to* the same resource is an open question.
>> They might or they might not.
>>
>> It isn't clear whether you are talking about distinctness of the
>> names or of the things the names refer to. Which do you mean?
>>
>>> (Blank node identifiers are proxies for node
>>> identity; two blank nodes with different identifiers are
>>> different).
>>>
>>> I think that in many instances, the users will want this kind of
>>> concrete interpretation of an RDF graph.
>> What do you mean by a concrete interpretation here? You seem to be
>> talking, above, not about the interpretations, but about the
>> syntactic structure of the graphs themselves.
>>
>> Do you mean, users will want to be able to apply what is often
>> called a unique name assumption (that distinct names refer to
>> distinct things)? I agree that many applications do make this
>> assumption, but for other kinds of application (e.g. text scraping)
>> it is fatal. RDF does not, and should not, make this assumption.
>>
>>> Further, I believe
>>> that when one is working with a concrete interpretation,
>>> duplicates may carry semantic meaning
>> Not any sanctioned by the RDF/S or SPARQL specs.
>>
>>> and it is important to
>>> define precisely how many duplicates are returned. I especially
>>> believe this is true when there are financial figures involved.
>>>
>>> For example, imagine a purchase order encoded in RDF.
>>> Each purchase order has an IRI.
>>> Various facts about the PO
>>> are assembled using verbs: bill-to, ship-to, and the line items.
>>> Since bill-to, ship-to and line items are all compound objects,
>>> they may be represented by blank nodes
>> What makes you conclude that 'compound objects' may be represented
>> by blank nodes? There is a mistake lurking here, that blank nodes
>> are a kind of data structure. This is a bad RDF design. You should
>> have IRIs for the line items if you want them to be distinguishable
>> reliably, and want to be able to refer to them.
>>
>>> , which in turn connect
>>> via various verbs to literals or IRIs. Let's look at the line items
>>> in particular. A line item consists of a part number (an IRI),
>>> a quantity (an xsd:integer), and a unit price (an xsd:decimal).
>>> The user wants to find the total price of a particular PO.
>>> The query looks something like this:
>>>
>>> SELECT ?quantity ?price
>>> WHERE { some:IRI po:po _:lineitem .
>>>         _:lineitem po:quantity ?quantity .
>>>         _:lineitem po:price ?price . }
>>>
>>> Since SPARQL has no aggregates or expressions in its SELECT
>>> list, the user intends to simply fetch all rows, multiply
>>> ?quantity * ?price and take the sum himself.
>>>
>>> Now it can happen in a PO that the quantity and price of
>>> two line items are identical. However, suppressing such
>>> duplicates would be fatal to this application.
>> The fatal mistake happened earlier in the design, when you treated
>> line items as mere existential appendages to POs. SPARQL can't be
>> expected to rescue a badly designed RDF application.
>>
>>> Note that adding the part number to the SELECT list will
>>> not necessarily save the query, since the combination of
>>> part number, quantity and price is still not a guaranteed
>>> unique key for line items. The user is relying on distinct
>>> blank nodes to represent distinct line items.
>> And they should not have done.
>>
>>> Of course, from the point of view of "RDF Semantics"
>> I would appreciate not seeing the scare quotes. (I wonder, why do
>> people seem to think that semantics are fictional or optional? If I
>> were to use scare quotes when referring to "XML syntax" with
>> similar implied disdain, would this be an argument for allowing
>> unbalanced parentheses in an XML application?)
>>
>>> that would be a redundant graph, for example, one that
>>> asserts "There exists a line item whose part is XYZ,
>>> quantity is 1 and price is 10.99" and asserts again
>>> "There exists a line item whose part is XYZ, quantity is 1 and
>>> price is 10.99". Thus one could say that this is a misuse
>>> of RDF.
>> Right, exactly. It would be, and IMO it is a misuse that we should
>> actively discourage.
>>
>>> This may be technically true,
>> No, it is simply TRUE. That is what the specs say.
>>
>>> but I wonder if insisting
>>> on this point will really serve the users.
>> Yes, in the long run it will, because it will force them to write
>> RDF applications that actually work correctly according to the RDF
>> specs, with actual RDF engines, instead of poorly constructed RDF
>> which will produce accounting errors.
>>
>>> If you read the RDF
>>> Primer, the application design above makes sense.
>> The primer is, well, a primer. If ALL you have read is the primer,
>> then you shouldn't be implementing applications that work with real
>> money.
>>
>>> You have a line
>>> item; you don't want to bother creating an IRI for each line
>>> item; so you make a blank node for each line item.
>>> "RDF Semantics", on the other hand, is a dense document
>>> with talk about hypothetical universes that are interpretations
>>> of a graph. This is not the kind of material that will make
>>> its way into seminars, courses, how-to books, etc.
>> Well, actually it already has.
>> In fact, ironically, a colleague of
>> yours recently told me how useful the RDF semantics had been for
>> Oracle's own RDF development work.
>>
>>> The early days of relational databases encountered the same
>>> problem. The theorists said a relational table is a set,
>>> therefore it can have no duplicates, therefore it is up to
>>> the user to insert some additional piece of information to
>>> distinguish two otherwise-identical line items, to provide a
>>> unique key. Sounds great in theory; however, the vendors
>>> discovered that they had to accommodate the naive view that
>>> each row has its own identity and is distinct, without requiring
>>> a unique key.
>>>
>>> A slightly different response is that RDF and SPARQL are not
>>> targeted at such applications. However, the introduction to
>>> "OWL Web Ontology Language Guide" poses this scenario:
>>> "consider actually assigning a software agent the task of
>>> making a coherent set of travel arrangements." If eventually
>>> RDF databases and SPARQL queries are part of such a software
>>> agent, then it will be necessary to make concrete assurances
>>> about the total price of a travel plan. In addition, the vision
>>> is that the dataset will be aggregated from many sites, which
>>> means that there will not be a central authority to impose
>>> strict existential semantics.
>> You have this exactly backwards. The existential semantics is the
>> 'un-strict' case, where you are not authorized to make risky
>> inferences precisely because there is no central authority to
>> impose a unique name assumption, to warrant you against the risk.
>>
>> Maybe your database treats distinct IRIs as referring to distinct
>> entities, but someone else's RDF might use ex:phayes and
>> ex:PatHayes to both refer to me, and know enough about owl:sameAs
>> to be able to handle this. And here am I trying to use RDF from
>> both sources: now what do I do about uniqueness of names?
>> Can I
>> infer that [ex:phayes and ex:PatHayes] are two people because they
>> would have been if *you* had used the IRIs? Or (since IRIs have
>> global scope) would your database just have been flat wrong if it
>> had used both of these names? With blank nodes the situation is
>> worse, since there are so many ways to infer that distinct bnodes
>> co-refer that it is hard to count them. Maybe I know that one of
>> your RDF properties is inverse functional...
>>
>>> My suggestion is that we consider some syntactic way to
>>> support both a "concrete" interpretation and an "existential"
>>> interpretation.
>> My suggestion is that we stick to the specs, and that SPARQL should
>> respect the RDF semantics as published, and not sanction
>> 'alternative' semantics. Having alternative semantics for a global
>> interchange notation is like being slightly pregnant.
>>
>>> My tentative initial solution is a three-way switch:
>>> SELECT DISTINCT, SELECT ALL and SELECT LAX. SELECT DISTINCT
>>> promises to remove duplicates,
>>> SELECT ALL promises to deliver all duplicates,
>>> and SELECT LAX makes no promises either way. (Anyone have
>>> a better keyword for this choice?)
>>>
>>> I don't believe this is the complete solution to the issue.
>>> The reason is that the issue of duplicates becomes more
>>> complicated when using OWL entailment. OWL permits the
>>> deduction that two seemingly distinct IRIs or blank nodes
>>> are in fact equal.
>> No, it allows the deduction that *what two distinct IRIs refer to*
>> are equal. You are confusing use and mention.
>>
>>> For example, if the reasoner can deduce
>>> that some:IRI1 = some:IRI2, what should the reasoner return
>>> for SELECT ALL?
>> What reasoner? A query answering service is not the same thing as a
>> reasoner. I suggest we keep the two categories of engine distinct,
>> precisely to avoid getting embroiled in tar-pit issues like this.
>> If query answering is supposed to perform arbitrary OWL reasoning,
>> we are going to have some very long waits for answers.
>>
>>> Does it return both even though it knows they
>>> are equal? If not, how does the user frame a query to ask
>>> for all synonyms of some:IRI1?
>> SELECT ?x WHERE { ?x owl:sameAs the:IRI }
>>
>> (assuming OWL rather than RDF entailment for the query answer
>> definition, of course.)
>>
>>> What should the
>>> reasoner return for SELECT DISTINCT?
>> First we have to decide what that means. Do you mean distinct
>> bindings, or distinct referents?
>>
>> Pat
>>
>>
>>> Does it pick one of the
>>> two arbitrarily?
>>>
>>> Fred
>>
>> --
>> ---------------------------------------------------------------------
>> IHMC               (850)434 8903 or (650)494 3973   home
>> 40 South Alcaniz St.            (850)202 4416   office
>> Pensacola                       (850)202 4440   fax
>> FL 32502                        (850)291 0667   cell
>> phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
>>
>>
>
Received on Monday, 10 July 2006 12:15:50 UTC