- From: Pat Hayes <phayes@ihmc.us>
- Date: Wed, 16 Aug 2006 22:00:00 -0700
- To: Bijan Parsia <bparsia@cs.man.ac.uk>
- Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>
>(As I understand them.)
>
>This is so y'all down have to plow through the
>debate. In general, I presume that an answer set
>is not correct as an answer to a DISTINCT query
>if it is redundant with respect to the notion of
>redundancy the particular DISTINCT key answers
>to.
>==================
>
>BNODE REDUNDANCY:
>
>1) An answer set is redundant (with respect to
>BNodes) iff the original graph is not lean.
>
>That is, algebraic operations never introduce
>BNode redundancy, thus no minimization after
>operations needs to attend to BNodes.
>
>This is roughly the position that I understand Andy and Pat to hold.
Another way to put it is, an answer set is not
redundant if a 'larger' query (more triples in
the query pattern) exists which would produce an
answer set which would not be redundant, from the
same KB. [Later: a better way to put it is below]
>There is a different formulation:
>
>1') An answer set is redundant iff an answer
>doesn't tell us something distinct about the
>graph.
>
>They may want to restrict this to BNodes only,
>but I'm hard pressed to see why.
? Does applying this to non-bnodes have any effect? A related question below.
>On this reading, actually, there is no difference between lean and non-lean.
Er...not sure I agree with this.
>If we have a graph redundancy then the
>"redundant" answer tell us something about the
>graph.
Well, that isn't what I was arguing for. The key
point that worries me is that a nonredundant,
lean, graph can still give apparently redundant
answers, in effect because the query is being
answered with respect to a non-lean subgraph (the
triples that match with the query pattern). So,
in this case, should we say that the answer is
redundant or not? I would argue not (unless we
forbid told bnodes, and maybe even then); but we
should discuss the pros and cons.
>======
>2) An answer set is redundant (with respect to
>BNode) iff some answer entails another answer
>
>3) An answer set is redundant (with respect to
>Bnodes and coreference) iff some "co-referenced"
>set of answers entails another "co-referenced"
>set of answers.
Can you explain "co-referenced" here? (There is
no way to establish coreference without
owl:sameAs or something equivalent, AFAIK.)
> So there are two possibilities: We treat
>each answer independently and look for
>subsumptions (one answer subsumes another iff
>the first entails the second; tricky issues with
>empty bindings to consider) pairwise. Or, we
>have to consider *sets* of answers, because
>coreference allows us to see *in the answer set*
>that an answer isn't redundant. The first is
>obviously easier to compute, but the answers
>*might* be a bit hard to interpret. The latter
>is more difficult to compute (an specify) but
>the answers might be more "pleasing".
>
>Obviously, the notion of "co-referenced" set of
>answers (a horrid term) must be defined clearly,
>but in this message I want to just give some
>examples and use cases to give folks the gist.
>
>=====================================
>
>Lean1 :bijan :loves :mochi.
> :bijan :eats :mochi.
> _:x :hates :mochi.
> _:x :scorns :mochi.
>
>
>i) SELECT DISTINCT ?x ?p ?y {?x ?p ?y}
>
> ?x ?p ?y
> :bijan :loves :mochi
> :bijan :eats :mochi.
> _:x3 :hates :mochi.
> _:x3 :scorns :mochi.
>
>This is the same answer for 1, 2 and 3. The
>second variable is unique in every row, thus
>each answer is different.
Agreed.
>
>ii) SELECT DISTINCT ?x ?y {?x ?p ?y}
>
>a. ?x ?y
> :bijan :mochi
> :bijan :mochi.
> _:x3 :mochi.
> _:x3 :mochi.
>
>This is what we get if we take definition 1' as
>holding uniformly. There are distinct reasons
>for each line.
I would be, well, not happy, but maybe complacent
with this, in fact, even though I agree it seems
kind of silly. But I would be happier with the
answer
?x ?y
:bijan :mochi
_:x :mochi
with the following justification: the DISTINCT
requests that duplicates be removed from the
answer set, and ii.a contains two duplications,
which is gratuitous redundancy that serves no
useful purpose. The point I was urging was that
it would not, in this case, be appropriate to
reduce this to a single answer on the grounds
that :bijan is an instance of _:x, since the KB
has enough information in it to distinguish
:bijan from _:x, and this could be detected by a
subsequent query using _:x as a told bnode.
Nice example, because it suggests a way to phrase
the criterion I have been struggling to
articulate. Say that two terms a and b are
'distinguishable' (wrt the graph) when there
exists a pair of queries which differ only in
that one contains a where the other contains b,
and give different answer sets. Then the
criterion is that *distinguishable answer
bindings are never counted as mutually
redundant*. Given this, any other redundancy can
be eliminated by a DISTINCT. Thus, if the
original graph is equivalent to the instance
_:x/aa of itself, so is redundant, then this
redundancy can be eliminated in any DISTINCT
answer set; but apparent bnode/URI redundancies
which do not in fact arise from any non-leanness
in the graph itself cannot be eliminated. This
allows all simple duplications to be eliminated,
of course, by definition.
(I think to be correct this should be applied to
vectors of answer bindings, in fact, rather than
pairwise to single bindings.)
I am sure that in the worst case this is a bear
to compute, but then almost any kind of bnode
redundancy minimality is destructively hard in
the worst case because one can code up the
subgraph problem in it.
>
>b. ?x ?y
> :bijan :mochi
> _:x3 :mochi.
> _:x3 :mochi.
>
>This is what we get if take definition 1. There
>are distinct reasons for each BNode (but the
>coreference is significant). If you change _:x
>in the last line of Lean1 to _:y, you get the
>less distasteful:
>
>c. ?x ?y
> :bijan :mochi
> _:x3 :mochi.
> _:x4 :mochi.
>
>So is ii.b redundant or not? It's clearly
>*lexically* redundant. If we allow that why not
>ii.a?
The duplicated _:x3 is redundant, yes. I agree
that c. is right, given the change to _:x.
>Won't this be horribly confusing?
Yes. But the answer you give in d. might be
dangerously *misleading*, which IMO is a lot
worse.
>Remember we have no *access* to the *reason* an
>answer exists, though we could always ask a
>*different* query (like i).
>
>d. ?x ?y
> :bijan :mochi
>
>This is the answer I expect from 2 and 3.
The trouble with this is that it withholds
potentially important information from the
querying agent. Consider the following example to
make the point:
:Mary :met :Bill
:Bill :nationality USA
:Mary :met _:x7
_:x7 :nationality IRAQ
SELECT DISTINCT ?x [:Mary :met ?x]
If you are an government agency trying to keep
track of who talked to whom, it would be less
than helpful to be told about Bill, but not
*anything at all* about the nameless Iraqi who
Mary has been in communication with, just because
a semantic theory says that {:Mary, _:x7} is
technically redundant. DISTINCT in this case was
likely intended to mean, distinct *people*, and
there is enough information here to enable a
human reader to know that _:x7, whoever it is, is
not Bill, even though an RDF engine might be too
dumb to figure this out. Getting the answer set
{Bill, _:x7} in this case tells you that there
are two individuals about whom something
detectable is recorded, and if we have told
bnodes available then it allows a subsequent
query to ask about the nationality of _:x7. (They
might be the same, of course, but then so might
:Bill and :Joe.) But if the answer set simply
does not contain _:x7 at all, then there is no
way that the querier is able to ask for any
further information about this other person who
is known to exist.
I would guess that the effect of eliminating this
much 'redundancy' would simply be a huge pressure
to not use bnodes in RDF data, since their
presence would be liable to introduce fatal
mistakes in query answering; and this in turn
would make RDF into little more than a weak
database technology.
I really do think that cases like this are very
important. In another, parallel, universe I have
been trying to help persuade some agencies to
become more interested in SWeb technology, and
examples like this are central to their concerns,
which (not surprisingly) have a lot to do with
being able to handle incomplete and partial
information.
>
>=========
>Lean2 :bijan :loves :zoe.
> :zoe :eats :mochi.
> _:x :hates _:y.
> _:y :scorns :mochi.
>
>iii) SELECT DISTINCT ?x ?p ?y {?x ?p ?y}
>
>(Same as i, but my numbering is getting confusing enough :))
>
>
> ?x ?p ?y
> :bijan :loves :zoe
> :zoe :eats :mochi.
> _:x3 :hates _:x4.
> _:x4 :scorns :mochi.
>
>Same for all for the same reasons as before.
>
>iv) SELECT DISTINCT ?x ?y {?x ?p ?y}
>
>a. ?x ?y
> :bijan :zoe
> :zoe :mochi.
> _:x3 _:x4.
> _:x4 :mochi.
>
>I think on 1 and 1' this is the correct answer.
>
>b. ?x ?y
> :bijan :zoe
> :zoe :mochi.
>
>I believe on 2 or 3, this is the correct answer.
>Consider pairwise subsumption. Line 1 of iv.a
>simple entails line 3. Line 2 entails line 4.
>Thus, lines 3 and 4 are redundant.
That is exactly why using simple pairwise
subsumption on the answer bindings, in isolation
from the source graph (technically, the scoping
graph) isn't an appropriate mechanism to use to
detect 'real' redundancy. Or at any rate, its not
if we are going to allow told bnodes (which we
currently do). Because the scope of told bnodes
isn't a single answer document: it has to include
the entire session (ie all the answer documents
and queries in the session), which amounts to it
being the scoping graph itself. The entailments
that determine subsumption are then 'inside' this
graph (inside a single existential scope) rather
than between separate graphs (different
existential scopes).
>If we take the coreference into account, lines 3
>and 4 *together* are subsumed by lines 1 and 2.
>
>So, I'm introducing a notion of entailment here
>to define subsumption, in some sense.
No quarrel with that, but you have to be careful
about how your bnodeIDs are scoped. Told bnodes
are *the same* between queries, so you have to
include enough of the 'context' of the scoping
graph when you test the subsumptions. The only
general way I can see of doing this is to include
it all, as we did in the definition of
E-entailment.
>Let's turn an answer set into a graph using the following template:
>
>Each row gets a fresh bnode.
>Each column header gets a uri by a base concated
>with #variable + the variable name
>Each value is itself.
Im not sure what this translation is supposed to
be representing or what its significance is.
>
>Each row is translated as follows:
>
> rowurr columnuri value.
>
>So,
>
>iv.a'
> _:row1 :variable?x :bijan.
> _:row1 :variable?y :zoe.
> _:row2 :variable?x :zoe.
> _:row2 :variable?y :mochi.
> _:row3 :variable?x _:x3.
> _:row3 :variable?y _:x4.
> _:row4 :variable?x _:x4.
> _:row4 :variable?y :mochi.
>
> _:row1 :variable?x :bijan.
> _:row1 :variable?y :zoe.
> _:row2 :variable?x :zoe
> _:row2 :variable?y :mochi.
>
>
>and
>
>iv.b'
> _:row1 :variable?x :zoe.
> _:row1 :variable?y :zoe.
> _:row2 :variable?x :zoe.
> _:row2 :variable?y :mochi.
>
>Is iv.a' lean? No. Is iv.b'? Yes. They are also
>simple equivalent in that each simply entails
>the other.
>
> iv.a' entails iv.b' because iv.b' is a
>subgraph of iv.a'. (Subgraph lemma). Obvious
>since the first 4 lines in the same.
>
> iv.b' entails iv.a' because iv.b' is an
>instance of iv.a' (Instance lemma) with the
>following substituion:
>
> _:row1 for _:row3
> _:row2 for _:row4
> :zoe for _:x4
> :bijan for _:x3 .
>
>Which is what we'd expect.
>
>I hope I've made clear the various positions and
>provided enough test cases to help people
>understand the various distinctions.
>
>Questions, corrections, and clarifications welcome.
A question to clarify intuitions, although it is
not directly applicable to SPARQL. Consider the
following:
:Bill :p :a
:Joe :p :a
:Bill owl:sameAs :Joe
SELECT DISTINCT ?x [?x :p :a]
If we are using OWL-DL entailment, should the
answer set have two bindings or one?
An OWL reasoner should be able to figure out that
there is only one entity here in all models of
the data, so surely being told two names for this
one guy is a semantic redundancy, no? (Why not?)
And yet I think that any useful query regime
should give both answer bindings, simply because
the URIs are distinct. The querying agent might
not know that they are coreferential, so to be
told only one of the names might withhold
valuable information.
Is this case fundamentally different from the
above examples of coreference/redundancy between
a bnode and a URI? Why?
Pat
--
---------------------------------------------------------------------
IHMC (850)434 8903 or (650)494 3973 home
40 South Alcaniz St. (850)202 4416 office
Pensacola (850)202 4440 fax
FL 32502 (850)291 0667 cell
phayesAT-SIGNihmc.us http://www.ihmc.us/users/phayes
Received on Thursday, 17 August 2006 05:00:26 UTC