- From: Bijan Parsia <bparsia@cs.man.ac.uk>
- Date: Wed, 16 Aug 2006 12:57:40 +0100
- To: RDF Data Access Working Group <public-rdf-dawg@w3.org>

(As I understand them.) This is so y'all down have to plow through the debate. In general, I presume that an answer set is not correct as an answer to a DISTINCT query if it is redundant with respect to the notion of redundancy the particular DISTINCT key answers to. ================== BNODE REDUNDANCY: 1) An answer set is redundant (with respect to BNodes) iff the original graph is not lean. That is, algebraic operations never introduce BNode redundancy, thus no minimization after operations needs to attend to BNodes. This is roughly the position that I understand Andy and Pat to hold. There is a different formulation: 1') An answer set is redundant iff an answer doesn't tell us something distinct about the graph. They may want to restrict this to BNodes only, but I'm hard pressed to see why. On this reading, actually, there is no difference between lean and non-lean. If we have a graph redundancy then the "redundant" answer tell us something about the graph. ====== 2) An answer set is redundant (with respect to BNode) iff some answer entails another answer 3) An answer set is redundant (with respect to Bnodes and coreference) iff some "co-referenced" set of answers entails another "co-referenced" set of answers. So there are two possibilities: We treat each answer independently and look for subsumptions (one answer subsumes another iff the first entails the second; tricky issues with empty bindings to consider) pairwise. Or, we have to consider *sets* of answers, because coreference allows us to see *in the answer set* that an answer isn't redundant. The first is obviously easier to compute, but the answers *might* be a bit hard to interpret. The latter is more difficult to compute (an specify) but the answers might be more "pleasing". Obviously, the notion of "co-referenced" set of answers (a horrid term) must be defined clearly, but in this message I want to just give some examples and use cases to give folks the gist. ===================================== Lean1 :bijan :loves :mochi. :bijan :eats :mochi. _:x :hates :mochi. _:x :scorns :mochi. i) SELECT DISTINCT ?x ?p ?y {?x ?p ?y} ?x ?p ?y :bijan :loves :mochi :bijan :eats :mochi. _:x3 :hates :mochi. _:x3 :scorns :mochi. This is the same answer for 1, 2 and 3. The second variable is unique in every row, thus each answer is different. ii) SELECT DISTINCT ?x ?y {?x ?p ?y} a. ?x ?y :bijan :mochi :bijan :mochi. _:x3 :mochi. _:x3 :mochi. This is what we get if we take definition 1' as holding uniformly. There are distinct reasons for each line. b. ?x ?y :bijan :mochi _:x3 :mochi. _:x3 :mochi. This is what we get if take definition 1. There are distinct reasons for each BNode (but the coreference is significant). If you change _:x in the last line of Lean1 to _:y, you get the less distasteful: c. ?x ?y :bijan :mochi _:x3 :mochi. _:x4 :mochi. So is ii.b redundant or not? It's clearly *lexically* redundant. If we allow that why not ii.a? Won't this be horribly confusing? Remember we have no *access* to the *reason* an answer exists, though we could always ask a *different* query (like i). d. ?x ?y :bijan :mochi This is the answer I expect from 2 and 3. ========= Lean2 :bijan :loves :zoe. :zoe :eats :mochi. _:x :hates _:y. _:y :scorns :mochi. iii) SELECT DISTINCT ?x ?p ?y {?x ?p ?y} (Same as i, but my numbering is getting confusing enough :)) ?x ?p ?y :bijan :loves :zoe :zoe :eats :mochi. _:x3 :hates _:x4. _:x4 :scorns :mochi. Same for all for the same reasons as before. iv) SELECT DISTINCT ?x ?y {?x ?p ?y} a. ?x ?y :bijan :zoe :zoe :mochi. _:x3 _:x4. _:x4 :mochi. I think on 1 and 1' this is the correct answer. b. ?x ?y :bijan :zoe :zoe :mochi. I believe on 2 or 3, this is the correct answer. Consider pairwise subsumption. Line 1 of iv.a simple entails line 3. Line 2 entails line 4. Thus, lines 3 and 4 are redundant. If we take the coreference into account, lines 3 and 4 *together* are subsumed by lines 1 and 2. So, I'm introducing a notion of entailment here to define subsumption, in some sense. Let's turn an answer set into a graph using the following template: Each row gets a fresh bnode. Each column header gets a uri by a base concated with #variable + the variable name Each value is itself. Each row is translated as follows: rowurr columnuri value. So, iv.a' _:row1 :variable?x :bijan. _:row1 :variable?y :zoe. _:row2 :variable?x :zoe. _:row2 :variable?y :mochi. _:row3 :variable?x _:x3. _:row3 :variable?y _:x4. _:row4 :variable?x _:x4. _:row4 :variable?y :mochi. _:row1 :variable?x :bijan. _:row1 :variable?y :zoe. _:row2 :variable?x :zoe _:row2 :variable?y :mochi. and iv.b' _:row1 :variable?x :zoe. _:row1 :variable?y :zoe. _:row2 :variable?x :zoe. _:row2 :variable?y :mochi. Is iv.a' lean? No. Is iv.b'? Yes. They are also simple equivalent in that each simply entails the other. iv.a' entails iv.b' because iv.b' is a subgraph of iv.a'. (Subgraph lemma). Obvious since the first 4 lines in the same. iv.b' entails iv.a' because iv.b' is an instance of iv.a' (Instance lemma) with the following substituion: _:row1 for _:row3 _:row2 for _:row4 :zoe for _:x4 :bijan for _:x3 . Which is what we'd expect. I hope I've made clear the various positions and provided enough test cases to help people understand the various distinctions. Questions, corrections, and clarifications welcome. I don't believe this fully discharges my action items, but it may handle """*ACTION:* bijan to show that the "strong" version of DISTNCT doesn't interfere with intermittent algebraic operations [recorded in http://www.w3.org/2006/08/15-dawg-minutes.html#action01]""" I mean, it's not a proof, but I can't see how any of my definitions, partial though they may be, would ever require touching intermittent operations. A counterexample would be helpful! (Or an example where the postprocessing approach has unexpected results.) Blanks have to be taken into account, but I don't know that they are on any of the above views thus far. Cheers, Bijan.

Received on Wednesday, 16 August 2006 11:57:58 UTC