- From: Pat Hayes <phayes@ihmc.us>
- Date: Wed, 16 Aug 2006 22:00:00 -0700
- To: Bijan Parsia <bparsia@cs.man.ac.uk>
- Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>
>(As I understand them.) > >This is so y'all down have to plow through the >debate. In general, I presume that an answer set >is not correct as an answer to a DISTINCT query >if it is redundant with respect to the notion of >redundancy the particular DISTINCT key answers >to. >================== > >BNODE REDUNDANCY: > >1) An answer set is redundant (with respect to >BNodes) iff the original graph is not lean. > >That is, algebraic operations never introduce >BNode redundancy, thus no minimization after >operations needs to attend to BNodes. > >This is roughly the position that I understand Andy and Pat to hold. Another way to put it is, an answer set is not redundant if a 'larger' query (more triples in the query pattern) exists which would produce an answer set which would not be redundant, from the same KB. [Later: a better way to put it is below] >There is a different formulation: > >1') An answer set is redundant iff an answer >doesn't tell us something distinct about the >graph. > >They may want to restrict this to BNodes only, >but I'm hard pressed to see why. ? Does applying this to non-bnodes have any effect? A related question below. >On this reading, actually, there is no difference between lean and non-lean. Er...not sure I agree with this. >If we have a graph redundancy then the >"redundant" answer tell us something about the >graph. Well, that isn't what I was arguing for. The key point that worries me is that a nonredundant, lean, graph can still give apparently redundant answers, in effect because the query is being answered with respect to a non-lean subgraph (the triples that match with the query pattern). So, in this case, should we say that the answer is redundant or not? I would argue not (unless we forbid told bnodes, and maybe even then); but we should discuss the pros and cons. >====== >2) An answer set is redundant (with respect to >BNode) iff some answer entails another answer > >3) An answer set is redundant (with respect to >Bnodes and coreference) iff some "co-referenced" >set of answers entails another "co-referenced" >set of answers. Can you explain "co-referenced" here? (There is no way to establish coreference without owl:sameAs or something equivalent, AFAIK.) > So there are two possibilities: We treat >each answer independently and look for >subsumptions (one answer subsumes another iff >the first entails the second; tricky issues with >empty bindings to consider) pairwise. Or, we >have to consider *sets* of answers, because >coreference allows us to see *in the answer set* >that an answer isn't redundant. The first is >obviously easier to compute, but the answers >*might* be a bit hard to interpret. The latter >is more difficult to compute (an specify) but >the answers might be more "pleasing". > >Obviously, the notion of "co-referenced" set of >answers (a horrid term) must be defined clearly, >but in this message I want to just give some >examples and use cases to give folks the gist. > >===================================== > >Lean1 :bijan :loves :mochi. > :bijan :eats :mochi. > _:x :hates :mochi. > _:x :scorns :mochi. > > >i) SELECT DISTINCT ?x ?p ?y {?x ?p ?y} > > ?x ?p ?y > :bijan :loves :mochi > :bijan :eats :mochi. > _:x3 :hates :mochi. > _:x3 :scorns :mochi. > >This is the same answer for 1, 2 and 3. The >second variable is unique in every row, thus >each answer is different. Agreed. > >ii) SELECT DISTINCT ?x ?y {?x ?p ?y} > >a. ?x ?y > :bijan :mochi > :bijan :mochi. > _:x3 :mochi. > _:x3 :mochi. > >This is what we get if we take definition 1' as >holding uniformly. There are distinct reasons >for each line. I would be, well, not happy, but maybe complacent with this, in fact, even though I agree it seems kind of silly. But I would be happier with the answer ?x ?y :bijan :mochi _:x :mochi with the following justification: the DISTINCT requests that duplicates be removed from the answer set, and ii.a contains two duplications, which is gratuitous redundancy that serves no useful purpose. The point I was urging was that it would not, in this case, be appropriate to reduce this to a single answer on the grounds that :bijan is an instance of _:x, since the KB has enough information in it to distinguish :bijan from _:x, and this could be detected by a subsequent query using _:x as a told bnode. Nice example, because it suggests a way to phrase the criterion I have been struggling to articulate. Say that two terms a and b are 'distinguishable' (wrt the graph) when there exists a pair of queries which differ only in that one contains a where the other contains b, and give different answer sets. Then the criterion is that *distinguishable answer bindings are never counted as mutually redundant*. Given this, any other redundancy can be eliminated by a DISTINCT. Thus, if the original graph is equivalent to the instance _:x/aa of itself, so is redundant, then this redundancy can be eliminated in any DISTINCT answer set; but apparent bnode/URI redundancies which do not in fact arise from any non-leanness in the graph itself cannot be eliminated. This allows all simple duplications to be eliminated, of course, by definition. (I think to be correct this should be applied to vectors of answer bindings, in fact, rather than pairwise to single bindings.) I am sure that in the worst case this is a bear to compute, but then almost any kind of bnode redundancy minimality is destructively hard in the worst case because one can code up the subgraph problem in it. > >b. ?x ?y > :bijan :mochi > _:x3 :mochi. > _:x3 :mochi. > >This is what we get if take definition 1. There >are distinct reasons for each BNode (but the >coreference is significant). If you change _:x >in the last line of Lean1 to _:y, you get the >less distasteful: > >c. ?x ?y > :bijan :mochi > _:x3 :mochi. > _:x4 :mochi. > >So is ii.b redundant or not? It's clearly >*lexically* redundant. If we allow that why not >ii.a? The duplicated _:x3 is redundant, yes. I agree that c. is right, given the change to _:x. >Won't this be horribly confusing? Yes. But the answer you give in d. might be dangerously *misleading*, which IMO is a lot worse. >Remember we have no *access* to the *reason* an >answer exists, though we could always ask a >*different* query (like i). > >d. ?x ?y > :bijan :mochi > >This is the answer I expect from 2 and 3. The trouble with this is that it withholds potentially important information from the querying agent. Consider the following example to make the point: :Mary :met :Bill :Bill :nationality USA :Mary :met _:x7 _:x7 :nationality IRAQ SELECT DISTINCT ?x [:Mary :met ?x] If you are an government agency trying to keep track of who talked to whom, it would be less than helpful to be told about Bill, but not *anything at all* about the nameless Iraqi who Mary has been in communication with, just because a semantic theory says that {:Mary, _:x7} is technically redundant. DISTINCT in this case was likely intended to mean, distinct *people*, and there is enough information here to enable a human reader to know that _:x7, whoever it is, is not Bill, even though an RDF engine might be too dumb to figure this out. Getting the answer set {Bill, _:x7} in this case tells you that there are two individuals about whom something detectable is recorded, and if we have told bnodes available then it allows a subsequent query to ask about the nationality of _:x7. (They might be the same, of course, but then so might :Bill and :Joe.) But if the answer set simply does not contain _:x7 at all, then there is no way that the querier is able to ask for any further information about this other person who is known to exist. I would guess that the effect of eliminating this much 'redundancy' would simply be a huge pressure to not use bnodes in RDF data, since their presence would be liable to introduce fatal mistakes in query answering; and this in turn would make RDF into little more than a weak database technology. I really do think that cases like this are very important. In another, parallel, universe I have been trying to help persuade some agencies to become more interested in SWeb technology, and examples like this are central to their concerns, which (not surprisingly) have a lot to do with being able to handle incomplete and partial information. > >========= >Lean2 :bijan :loves :zoe. > :zoe :eats :mochi. > _:x :hates _:y. > _:y :scorns :mochi. > >iii) SELECT DISTINCT ?x ?p ?y {?x ?p ?y} > >(Same as i, but my numbering is getting confusing enough :)) > > > ?x ?p ?y > :bijan :loves :zoe > :zoe :eats :mochi. > _:x3 :hates _:x4. > _:x4 :scorns :mochi. > >Same for all for the same reasons as before. > >iv) SELECT DISTINCT ?x ?y {?x ?p ?y} > >a. ?x ?y > :bijan :zoe > :zoe :mochi. > _:x3 _:x4. > _:x4 :mochi. > >I think on 1 and 1' this is the correct answer. > >b. ?x ?y > :bijan :zoe > :zoe :mochi. > >I believe on 2 or 3, this is the correct answer. >Consider pairwise subsumption. Line 1 of iv.a >simple entails line 3. Line 2 entails line 4. >Thus, lines 3 and 4 are redundant. That is exactly why using simple pairwise subsumption on the answer bindings, in isolation from the source graph (technically, the scoping graph) isn't an appropriate mechanism to use to detect 'real' redundancy. Or at any rate, its not if we are going to allow told bnodes (which we currently do). Because the scope of told bnodes isn't a single answer document: it has to include the entire session (ie all the answer documents and queries in the session), which amounts to it being the scoping graph itself. The entailments that determine subsumption are then 'inside' this graph (inside a single existential scope) rather than between separate graphs (different existential scopes). >If we take the coreference into account, lines 3 >and 4 *together* are subsumed by lines 1 and 2. > >So, I'm introducing a notion of entailment here >to define subsumption, in some sense. No quarrel with that, but you have to be careful about how your bnodeIDs are scoped. Told bnodes are *the same* between queries, so you have to include enough of the 'context' of the scoping graph when you test the subsumptions. The only general way I can see of doing this is to include it all, as we did in the definition of E-entailment. >Let's turn an answer set into a graph using the following template: > >Each row gets a fresh bnode. >Each column header gets a uri by a base concated >with #variable + the variable name >Each value is itself. Im not sure what this translation is supposed to be representing or what its significance is. > >Each row is translated as follows: > > rowurr columnuri value. > >So, > >iv.a' > _:row1 :variable?x :bijan. > _:row1 :variable?y :zoe. > _:row2 :variable?x :zoe. > _:row2 :variable?y :mochi. > _:row3 :variable?x _:x3. > _:row3 :variable?y _:x4. > _:row4 :variable?x _:x4. > _:row4 :variable?y :mochi. > > _:row1 :variable?x :bijan. > _:row1 :variable?y :zoe. > _:row2 :variable?x :zoe > _:row2 :variable?y :mochi. > > >and > >iv.b' > _:row1 :variable?x :zoe. > _:row1 :variable?y :zoe. > _:row2 :variable?x :zoe. > _:row2 :variable?y :mochi. > >Is iv.a' lean? No. Is iv.b'? Yes. They are also >simple equivalent in that each simply entails >the other. > > iv.a' entails iv.b' because iv.b' is a >subgraph of iv.a'. (Subgraph lemma). Obvious >since the first 4 lines in the same. > > iv.b' entails iv.a' because iv.b' is an >instance of iv.a' (Instance lemma) with the >following substituion: > > _:row1 for _:row3 > _:row2 for _:row4 > :zoe for _:x4 > :bijan for _:x3 . > >Which is what we'd expect. > >I hope I've made clear the various positions and >provided enough test cases to help people >understand the various distinctions. > >Questions, corrections, and clarifications welcome. A question to clarify intuitions, although it is not directly applicable to SPARQL. Consider the following: :Bill :p :a :Joe :p :a :Bill owl:sameAs :Joe SELECT DISTINCT ?x [?x :p :a] If we are using OWL-DL entailment, should the answer set have two bindings or one? An OWL reasoner should be able to figure out that there is only one entity here in all models of the data, so surely being told two names for this one guy is a semantic redundancy, no? (Why not?) And yet I think that any useful query regime should give both answer bindings, simply because the URIs are distinct. The querying agent might not know that they are coreferential, so to be told only one of the names might withhold valuable information. Is this case fundamentally different from the above examples of coreference/redundancy between a bnode and a URI? Why? Pat -- --------------------------------------------------------------------- IHMC (850)434 8903 or (650)494 3973 home 40 South Alcaniz St. (850)202 4416 office Pensacola (850)202 4440 fax FL 32502 (850)291 0667 cell phayesAT-SIGNihmc.us http://www.ihmc.us/users/phayes
Received on Thursday, 17 August 2006 05:00:26 UTC