- From: Fred Zemke <fred.zemke@oracle.com>
- Date: Thu, 06 Jul 2006 12:56:23 -0700
- To: public-rdf-dawg@w3.org
I have been advocating for strict definitions of the number of rows returned by queries. As I understand it, Andy Seaborne has advocated an opposite view, that SPARQL should not define precisely how many duplicates are returned by a query. For example, in http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0005.html "In general, it isn't possible to conclude anything about numbers of things in RDF. It is in OWL." I have also heard the opinion that it does not matter whether duplicates are eliminated from a UNION or not; I don't have a name or message to cite for that opinion. More generally, I think there is an opinion that all SPARQL cares about is that the result sequence, after eliminating duplicates, is correct. Thus the result of a SELECT is not precisely defined; only SELECT DISTINCT is.

In this message I want to start a discussion on this. As an initial foray, I will frame the question in terms of "concrete" vs. "existential" semantics.

I grant that it is difficult to impossible to be sure that two seemingly-different IRIs refer to distinct things. I also grant that it is difficult to impossible to be sure that two seemingly distinct blank nodes, conceived of as existentials, are known to be distinct. However, I wonder whether it is a good idea to base our semantics exclusively on these "existential" insights.

I think the naive view is that two things are distinct if they look distinct. Two IRIs that are spelled differently are different. Two blank nodes with different node identity are different. (Blank node identifiers are proxies for node identity; two blank nodes with different identifiers are different). I think that in many instances, the users will want this kind of concrete interpretation of an RDF graph. Further, I believe that when one is working with a concrete interpretation, duplicates may carry semantic meaning and it is important to define precisely how many duplicates are returned.
I especially believe this is true when there are financial figures involved.

For example, imagine a purchase order encoded in RDF. Each purchase order has an IRI. Various facts about the PO are assembled using verbs: bill-to, ship-to, and the line items. Since bill-to, ship-to and line items are all compound objects, they may be represented by blank nodes, which in turn connect via various verbs to literals or IRIs. Let's look at the line items in particular. A line item consists of a part number (an IRI), a quantity (an xsd:integer), and a unit price (an xsd:decimal). The user wants to find the total price of a particular PO. The query looks something like this:

    SELECT ?quantity ?price
    WHERE some:IRI po:po _:lineitem .
          _:lineitem po:quantity ?quantity .
          _:lineitem po:price ?price .

Since SPARQL has no aggregates or expressions in its SELECT list, the user intends to simply fetch all rows, multiply ?quantity * ?price and take the sum himself.

Now it can happen in a PO that the quantity and price of two line items are identical. However, suppressing such duplicates would be fatal to this application. Note that adding the part number to the SELECT list will not necessarily save the query, since the combination of part number, quantity and price is still not a guaranteed unique key for line items. The user is relying on distinct blank nodes to represent distinct line items.

Of course, from the point of view of "RDF Semantics" that would be a redundant graph, for example, one that asserts "There exists a line item whose part is XYZ, quantity is 1 and price is 10.99" and asserts again "There exists a line item whose part is XYZ, quantity is 1 and price is 10.99". Thus one could say that this is a misuse of RDF. This may be technically true, but I wonder if insisting on this point will really serve the users. If you read the RDF Primer, the application design above makes sense.
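To make the arithmetic concrete, here is a minimal Python sketch of the client-side summation described above. The rows, their values, and the `total` helper are hypothetical illustrations and not part of any SPARQL implementation; the sketch only assumes that the query returns (?quantity, ?price) pairs and that the application does the multiplication and sum itself.

```python
from decimal import Decimal

# Hypothetical (?quantity, ?price) rows for one purchase order.
# Two distinct line items happen to share the same quantity and unit price.
rows = [
    (1, Decimal("10.99")),
    (1, Decimal("10.99")),
    (3, Decimal("2.50")),
]


def total(result_rows):
    """Sum quantity * price over whatever rows the query returned."""
    return sum((q * p for q, p in result_rows), Decimal("0"))


# Every duplicate preserved: one row per line item.
total_all = total(rows)

# Duplicates eliminated, as SELECT DISTINCT would do.
total_distinct = total(set(rows))

print(total_all)       # 29.48
print(total_distinct)  # 18.49
```

If the number of duplicates is left undefined, either figure could come back from a conforming implementation, and the application has no way to tell which one it received.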
You have a line item; you don't want to bother creating an IRI for each line item; so you make a blank node for each line item. "RDF Semantics", on the other hand, is a dense document with talk about hypothetical universes that are interpretations of a graph. This is not the kind of material that will make its way into seminars, courses, how-to books, etc.

The early days of relational databases encountered the same problem. The theorists said a relational table is a set, therefore it can have no duplicates, therefore it is up to the user to insert some additional piece of information to distinguish two otherwise-identical line items, to provide a unique key. Sounds great in theory; however, the vendors discovered that they had to accommodate the naive view that each row has its own identity and is distinct, without requiring a unique key.

A slightly different response is that RDF and SPARQL are not targeted at such applications. However, the introduction to "OWL Web Ontology Language Guide" poses this scenario: "consider actually assigning a software agent the task of making a coherent set of travel arrangements." If eventually RDF databases and SPARQL queries are part of such a software agent, then it will be necessary to make concrete assurances about the total price of a travel plan. In addition, the vision is that the dataset will be aggregated from many sites, which means that there will not be a central authority to impose strict existential semantics.

My suggestion is that we consider some syntactic way to support both a "concrete" interpretation and an "existential" interpretation. My tentative initial solution is a three-way switch: SELECT DISTINCT, SELECT ALL and SELECT LAX. SELECT DISTINCT promises to remove duplicates, SELECT ALL promises to deliver all duplicates, and SELECT LAX makes no promises either way. (Anyone have a better keyword for this choice?) I don't believe this is the complete solution to the issue.
The reason is that the issue of duplicates becomes more complicated when using OWL entailment. OWL permits the deduction that two seemingly distinct IRIs or blank nodes are in fact equal. For example, if the reasoner can deduce that some:IRI1 = some:IRI2, what should the reasoner return for SELECT ALL? Does it return both even though it knows they are equal? If not, how does the user frame a query to ask for all synonyms of some:IRI1? What should the reasoner return for SELECT DISTINCT? Does it pick one of the two arbitrarily?

Fred
Received on Thursday, 6 July 2006 19:56:37 UTC