- From: <jos.deroo@agfa.com>
- Date: Tue, 10 Jan 2006 01:30:12 +0100
- To: phayes@ihmc.us
- Cc: bparsia@isr.umd.edu, Enrico Franconi <franconi@inf.unibz.it>, RDF Data Access Working Group <public-rdf-dawg@w3.org>, public-rdf-dawg-request@w3.org, Sergio Tessaris <tessaris@inf.unibz.it>
- Message-ID: <OFB750C370.BB62C02C-ONC12570F2.00023D61-C12570F2.0002BBEE@agfa.com>
OK Pat, is fine for me (thinking about select sss where www as a rule {www} => {sss} then the bnodes in www can be treated as universals in the scope of that rule, so is fine) -- Jos De Roo, AGFA http://www.agfa.com/w3c/jdroo/ Pat Hayes <phayes@ihmc.us> Sent by: public-rdf-dawg-request@w3.org 10/01/2006 00:27 To: RDF Data Access Working Group <public-rdf-dawg@w3.org> cc: bparsia@isr.umd.edu, Enrico Franconi <franconi@inf.unibz.it>, Sergio Tessaris <tessaris@inf.unibz.it>, (bcc: Jos De Roo/AMDUS/AGFA) Subject: bnodeification (was: Re: SPARQL semantics: open issues for basic query patterns) I'm separating this into two threads, (this) one concerned with tweaking the definitions to handle bnodes properly, the other to do with wording for an 'entailment' parameter, based on Enrico's draft. I'm pretty sure these issues are orthogonal, provided that the definitions refer centrally to 'entails', which can either be the parameter or can be understood as 'simply entails' if we decide not to have an entailment parameter. So I will say 'entails' here without commitment one way or the other. So, I learned last week that bnodes in answers are scoped to the answer SET, in effect, rather than to individual answers: to the table rather than to lines in the table. Sorry it took me so long to grok that. This is great, because this, plus our decision to not have 'told bnodes', means that we can tie the definitions even closer to entailment. The trick for handling told bnodes was to put (a copy of) the graph into the consequent along with the answer instance, forcing them to 'share' bnodes properly. The trick for making the bnodes in the answers behave properly is to put ALL the answers instances into the consequent, which works similarly. So, all the definitions apply to sets of answer bindings rather than individual bindings, call this a binding set. We need to make one rather awkward definition, and its awkwardness arises from the fact that we allow bnodes in query patterns and we want those bnodes to be scoped to the query pattern, whereas we want the bnodes in the answer bindings to be scoped to the entire answer set. Once that is done, the other definitions are as clear as day. Define the ordered merge of an RDF graph G and a SPARQL graph pattern P to be a pattern containing all the triples in G and M(P), where M is a 1:1 mapping on bnodes which replaces any bnode in P which also occurs in G by a new bnode not in G. If there are no such 'shared' bnodes then M is the identity mapping and the merge is simply the union of all the triples in G and P. (This is Enrico's "RDF-merge", but uses the fact that is defined on patterns to be an opportunity to be technically a new definition. I think its slightly misleading to call it RDF-merge since technically its not quite the same definition as RDF merge.) This definition should be understood to apply even when P contains no variables: in that case, this definition is essentially <a href="http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#section-graph-equality ">equivalent</a> to that of RDF graph <a href="http://www.w3.org/TR/rdf-mt/#graphdefs">merge</a>. Given a binding set S for a query Q, the answer graph S(Q) is, intuitively, the RDF graph formed by merging together one instance of the query Q under each binding B in S, while maintaining the identity of any bnodes which are introduced by the bindings themselves. More precisely, consider the result A<sub>N</sub> of iterating the following process over the bindings B<sub>1</sub> ... B<sub>N</sub> in S, starting with A<sub>0</sub> as the empty graph: A<sub>n</sub> := B<sub>n</sub>(A<sub>n-1</sub> ordered-merge Q) and define S(Q) to be A<sub>N</sub>. It is easy to see that the choice of ordering of S will affect only the particular bnodeIDs which occur in the final graph, so that the particular ordering can be chosen arbitrarily. Since all the A<sub>n</sub> are RDF graphs which contain no variables, applying the binding mapping after the merge has the result required, of treating bnodes in the query pattern as scoped to the pattern; but all bnodes in answer bindings share a single scope, since once introduced into the answer graph they retain their identity through subsequent ordered merges. (If the query contains no bnodes, then there is a much simpler way to define S(Q), as simply being the union (not merge) of all the query instances, i.e. the RDF graph {B(T): B in S and T in Q}. Another way to define S(Q) is to imagine that every bnode in Q is replaced by a distinct query variable (which cannot be SELECTed, however) and then use this definition. There are cases in which this definition is slightly tighter than the previous one, eliminating some redundancy. Also this doesn't need the ordered-merge idea.) OK, now we have that over and done with, it all gets much easier: A binding set S is correct when G entails S(Q) A binding set S subsumes (we could say 'entails', but it might be confusing(?)) another binding set T when S(Q) entails T(Q). The full answer set is the maximal possible correct set, i.e. all possible entailed instances. (This always exists and is unique, but it is infinite: it's purely a theoretical construct.) An answer set is any correct set which subsumes the full answer set. An answer set is redundant if it is subsumed by a proper subset of itself; otherwise it is lean. OK, that is all. SPARQL servers MUST give only correct binding sets. If the number of answers is not restricted in the query then they MUST give an answer set. They SHOULD NOT introduce redundancies in answer sets. This allows redundant answers: see below for more on this. This is not the same as what you get by defining the analogous thing for a single answer and then combining those into sets, precisely because of the kind of example that Sergio used, where a bnode is shared: _:a :r _:a _:a :p _:b _:b :url <http://example.org> and the query: SELECT * { ?x ?p ?y } Sergio's possible answer for a naive entailment definition: ?x ?p ?y ---------------- _:b12 :r _:b13 _:b26 :p _:b27 _:b40 :url <http://example.org> is now not an answer set, because it doesn't subsume this: ?x ?p ?y ---------------- _:b12 :r _:b12 In fact, the only 'safe' way to generate an answer set (using simple entailment) is to send back every matching binding you find, with the bnode relationships between them all intact. You don't have to use the same bnodeIDs, though. So this isn't an answer set either: ?x ?p ?y ---------------- _:b12 :r _:b12 _:b26 :p _:b27 _:b27 :url <http://example.org> because it doesn't subsume ?x ?p ?y ---------------- _:b12 :r _:b12 _:b12 :p _:b27 Because the bnode relationships have to be preserved in the answer set, this handles Bijan's FOAF example appropriately, eg. _:a foaf:nick "Bijan" _:a :email "badguy@moxie.com" _:a :spouse "Elsie Badgett" _:a :whatever :this _:b :whatever :notThis SELECT * {?x foaf:nick "Bijan" . ?x ?p ?a .} an answer set must contain all the appropriate matches with a shared bnode, something like ?x ?p ?a ---------- _:b12 foaf:nick "Bijan" _:b12 :email "badguy@moxie.com" _:b12 :spouse "Elsie Badgett" _:b12 :whatever :this Answer sets can contain redundancies, but they cannot have misleading weakenings of the kind shown by Sergio recently, where an IRI binding is just ignored and replaced by a bnode, or two occurrences of a single bnode are arbitrarily separated. For example, with the Bijan dataset and the simple query SELECT * {?x foaf:nick ?y} it is OK for the answer set to contain ?x ?y ----------- .... _:b3 _:b4 .... but its not OK for it to *not* contain _:b3 "Bijan" I know many folk don't like allowing redundancy in answer sets. Its easy to tweak the definitions to eliminate it, in fact all we need to say is that an answer set is required to be lean, i.e. to not be subsumed by a proper subset. That is, answer sets can be 'non-lean', just like graphs, but if they are then they always have a lean subset which is also an answer set. However, requiring leanness is likely to be very onerous for servers to check, and in many cases it seems too strong, because it requires servers to give lean answer sets even when working with non-lean graphs, and 'lean-izing' a graph is hard in general; and even if you have a lean graph and you generate answers one at a time, by careful matching, you might still get some redundancy in the answer set. We can adapt the trick used by Enrico to force the bindings to use bnodes 'reasonably', which is to require that only bnodes in the graph can be used as answer bindings, and also include the graph itself in the consequent, thereby forcing the answer instances to not use graph bnodes in capriciously redundant ways. This is easy to state, but I worry that while this works for simple entailment, it may well not be appropriate for other forms of entailment, and it may cause problems when we want to move smoothly between for example OWL entailment considered as applying to OWL abstract syntax and OWL/RDF entailment between graphs, for example, suppose the conclusion graph uses RDF lists as an OWL syntax encoding, constructed using bnodes: do *these* bnodes have to be bnodes from the dataset? If we state this as a basic SPARQL condition at the RDF graph level, we have decided this issue, perhaps prematurely. It might be better to keep the basic definitions simple, as above, and keep the issue of redundancy-elimination slightly separate. (Bijan, suppose we said just that servers SHOULD minimize redundancy as far as practicable, or SHOULD NOT introduce unnecessary or artificial redundancy in answer sets, could you live with that? I'm pretty sure that any implemented SPARQL engine will give redundant answers only 'by accident', rather than set out to deliberately sprinkle bnodes everywhere for no good reason. But read on.) Here's how Enrico-style would look, for the record: A binding set S is regular when (1) it is correct and (2) G entails S(G' ordered-merge Q), and (3) when any bnodeIDs introduced by S occur somewhere in G', where G' is <a href="http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#section-graph-equality">equivalent</a> to G. (G' instead of G allows the bnodeIDs used in the answer to be distinct from the graphIDs, which is better if we aren't intending to support told bnodes, I suggest.) Note, that application of S in (1) is of the entire binding set, so it uses the iterative definition from earlier, in effect generating a lot of 'copies' of G, one for each binding in S: but since the result is a set, this doesn't matter. Or, we could re-define the iterative construction for S(Q) but starting from G' instead of the empty graph, which links the two ideas together nicely. Or, we could just define a single answer as a B where G entails B(G order-merge Q), as Enrico's prose does now, and then say that S is the set of all such answers: but that requires a little work to show that it is an answer set.) A regular answer set is a 'reasonably computably minimal' answer set. Its about as non-redundant as you can get while generating answers one at a time, i.e. without needing to check answers against all the other answers. ---- As Dan C noted, all of this might well be identical in operation to his older suggestion: my point is only to offer it as a definitional strategy to keep the definitions as simple and clear as possible, and as purely linked to entailment as possible. What I like about the first idea here is that its directly linked to entailment and in some sense 'mathematically obvious', so I think it is likely to be very robust; and I also think it is easy to understand. We can then treat the second device, of including G' and restricting to bnodes in G' (Sergio was right about not restricting non-bnodes), as a way to eliminate excess redundancy, which we could offer the world as a Good Way to conform to the SHOULD part of the basic definition, but not actually require it. Comments? I will provide answers for any test case anyone has. Pat PS. BTW, in case anyone is misled by my writing style, I see all this as a tweak/increment to the ideas that Enrico and Sergio have provided, more than as a replacement. The definition of ordered-merge is exactly RDF-merge: I suggest the change in name only for reasons of exposition, not to imply spurious claims of authorship or originality. This is all going to go out under Eric and Andy's names anyway. -- --------------------------------------------------------------------- IHMC (850)434 8903 or (650)494 3973 home 40 South Alcaniz St. (850)202 4416 office Pensacola (850)202 4440 fax FL 32502 (850)291 0667 cell phayesAT-SIGNihmc.us http://www.ihmc.us/users/phayes
Received on Tuesday, 10 January 2006 00:31:16 UTC