Re: Summary of BNode redundancy options (at the moment) from Pat Hayes on 2006-08-17 (public-rdf-dawg@w3.org from July to September 2006)

From: Pat Hayes <phayes@ihmc.us>
Date: Wed, 16 Aug 2006 22:00:00 -0700
To: Bijan Parsia <bparsia@cs.man.ac.uk>
Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-Id: <p06230941c108f21c4d3a@[192.168.1.6]>
>(As I understand them.)
>
>This is so y'all down have to plow through the 
>debate. In general, I presume that an answer set 
>is not correct as an answer to a DISTINCT query 
>if it is redundant with respect to the notion of 
>redundancy the particular DISTINCT key answers 
>to.
>==================
>
>BNODE REDUNDANCY:
>
>1) An answer set is redundant (with respect to 
>BNodes) iff the original graph is not lean.
>
>That is, algebraic operations never introduce 
>BNode redundancy, thus no minimization after 
>operations needs to attend to BNodes.
>
>This is roughly the position that I understand Andy and Pat to hold.

Another way to put it is, an answer set is not 
redundant if a 'larger' query (more triples in 
the query pattern) exists which would produce an 
answer set which would not be redundant, from the 
same KB. [Later: a better way to put it is below]

>There is a different formulation:
>
>1') An answer set is redundant iff an answer 
>doesn't tell us something distinct about the 
>graph.
>
>They may want to restrict this to BNodes only, 
>but I'm hard pressed to see why.

? Does applying this to non-bnodes have any effect? A related question below.

>On this reading, actually, there is no difference between lean and non-lean.

Er...not sure I agree with this.

>If we have a graph redundancy then the 
>"redundant" answer tell us something about the 
>graph.

Well, that isn't what I was arguing for. The key 
point that worries me is that a nonredundant, 
lean, graph can still give apparently redundant 
answers, in effect because the query is being 
answered with respect to a non-lean subgraph (the 
triples that match with the query pattern). So, 
in this case, should we say that the answer is 
redundant or not? I would argue not (unless we 
forbid told bnodes, and maybe even then); but we 
should discuss the pros and cons.

>======
>2) An answer set is redundant (with respect to 
>BNode) iff some answer entails another answer
>
>3) An  answer set is redundant (with respect to 
>Bnodes and coreference) iff some "co-referenced" 
>set of answers entails another "co-referenced" 
>set of answers.

Can you explain "co-referenced" here? (There is 
no way to establish coreference without 
owl:sameAs or something equivalent, AFAIK.)

>	So there are two possibilities: We treat 
>each answer independently and look for 
>subsumptions (one answer subsumes another iff 
>the first entails the second; tricky issues with 
>empty bindings to consider) pairwise. Or, we 
>have to consider *sets* of answers, because 
>coreference allows us to see *in the answer set* 
>that an answer isn't redundant. The first is 
>obviously easier to compute, but the answers 
>*might* be a bit hard to interpret. The latter 
>is more difficult to compute (an specify) but 
>the answers might be more "pleasing".
>
>Obviously, the notion of "co-referenced" set of 
>answers (a horrid term) must be defined clearly, 
>but in this message I want to just give some 
>examples and use cases to give folks the gist.
>
>=====================================
>
>Lean1	:bijan :loves :mochi.
>		:bijan :eats :mochi.
>		_:x :hates :mochi.
>		_:x :scorns :mochi.
>
>
>i) SELECT DISTINCT ?x ?p ?y {?x ?p ?y}
>
>	?x		?p		?y
>	:bijan	:loves	:mochi
>	:bijan	:eats	:mochi.
>	_:x3		:hates	:mochi.
>	_:x3	:scorns	:mochi.
>
>This is the same answer for 1, 2 and 3. The 
>second variable is unique in every row, thus 
>each answer is different.

Agreed.

>
>ii) SELECT DISTINCT ?x ?y {?x ?p ?y}
>
>a.	?x		?y
>	:bijan	:mochi
>	:bijan	:mochi.
>	_:x3		:mochi.
>	_:x3 	:mochi.
>
>This is what we get if we take definition 1' as 
>holding uniformly. There are distinct reasons 
>for each line.

I would be, well, not happy, but maybe complacent 
with this, in fact, even though I agree it seems 
kind of silly. But I would be happier with the 
answer

?x	?y
:bijan	:mochi
_:x	:mochi

with the following justification: the DISTINCT 
requests that duplicates be removed from the 
answer set, and ii.a contains two duplications, 
which is gratuitous redundancy that serves no 
useful purpose. The point I was urging was that 
it would not, in this case, be appropriate to 
reduce this to a single answer on the grounds 
that :bijan is an instance of _:x, since the KB 
has enough information in it to distinguish 
:bijan from _:x, and this could be detected by a 
subsequent query using _:x as a told bnode.

Nice example, because it suggests a way to phrase 
the criterion I have been struggling to 
articulate. Say that two terms a and b are 
'distinguishable' (wrt the graph) when there 
exists a pair of queries which differ only in 
that one contains a where the other contains b, 
and give different answer sets. Then the 
criterion is that *distinguishable answer 
bindings are never counted as mutually 
redundant*. Given this, any other redundancy can 
be eliminated by a DISTINCT. Thus, if the 
original graph is equivalent to the instance 
_:x/aa of itself, so is redundant, then this 
redundancy can be eliminated in any DISTINCT 
answer set; but apparent bnode/URI redundancies 
which do not in fact arise from any non-leanness 
in the graph itself cannot be eliminated. This 
allows all simple duplications to be eliminated, 
of course, by definition.

(I think to be correct this should be applied to 
vectors of answer bindings, in fact, rather than 
pairwise to single bindings.)

I am sure that in the worst case this is a bear 
to compute, but then almost any kind of bnode 
redundancy minimality is destructively hard in 
the worst case because one can code up the 
subgraph problem in it.

>
>b.	?x		?y
>	:bijan	:mochi
>	_:x3		:mochi.
>	_:x3 	:mochi.
>
>This is what we get if take definition 1. There 
>are distinct reasons for each BNode (but the 
>coreference is significant). If you change _:x 
>in the last line of Lean1 to _:y, you get the 
>less distasteful:
>
>c.	?x		?y
>	:bijan	:mochi
>	_:x3		:mochi.
>	_:x4 	:mochi.
>
>So is ii.b redundant or not? It's clearly 
>*lexically* redundant. If we allow that why not 
>ii.a?

The duplicated _:x3 is redundant, yes. I agree 
that c. is right, given the change to _:x.

>Won't this be horribly confusing?

Yes. But the answer you give in d. might be 
dangerously *misleading*, which IMO is a lot 
worse.

>Remember we have no *access* to the *reason* an 
>answer exists, though we could always ask a 
>*different* query (like i).
>
>d.	?x		?y
>	:bijan	:mochi
>
>This is the answer I expect from 2 and 3.

The trouble with this is that it withholds 
potentially important information from the 
querying agent. Consider the following example to 
make the point:

:Mary :met :Bill
:Bill :nationality USA
:Mary :met _:x7
_:x7 :nationality IRAQ

SELECT DISTINCT ?x [:Mary :met ?x]

If you are an government agency trying to keep 
track of who talked to whom, it would be less 
than helpful to be told about Bill, but not 
*anything at all* about the nameless Iraqi who 
Mary has been in communication with, just because 
a semantic theory says that {:Mary, _:x7} is 
technically redundant. DISTINCT in this case was 
likely intended to mean, distinct *people*, and 
there is enough information here to enable a 
human reader to know that _:x7, whoever it is, is 
not Bill, even though an RDF engine might be too 
dumb to figure this out. Getting the answer set 
{Bill, _:x7} in this case tells you that there 
are two individuals about whom something 
detectable is recorded, and if we have told 
bnodes available then it allows a subsequent 
query to ask about the nationality of _:x7. (They 
might be the same, of course, but then so might 
:Bill and :Joe.)  But if the answer set simply 
does not contain _:x7 at all, then there is no 
way that the querier is able to ask for any 
further information about this other person who 
is known to exist.

I would guess that the effect of eliminating this 
much 'redundancy' would simply be a huge pressure 
to not use bnodes in RDF data, since their 
presence would be liable to introduce fatal 
mistakes in query answering; and this in turn 
would make RDF into little more than a weak 
database technology.

I really do think that cases like this are very 
important. In another, parallel, universe I have 
been trying to help persuade some agencies to 
become more interested in SWeb technology, and 
examples like this are central to their concerns, 
which (not surprisingly) have a lot to do with 
being able to handle incomplete and partial 
information.

>
>=========
>Lean2	:bijan :loves :zoe.
>		:zoe :eats :mochi.
>		_:x :hates _:y.
>		_:y :scorns :mochi.
>
>iii) SELECT DISTINCT ?x ?p ?y {?x ?p ?y}
>
>(Same as i, but my numbering is getting confusing enough :))
>
>
>	?x		?p		?y
>	:bijan	:loves	:zoe
>	:zoe	:eats	:mochi.
>	_:x3		:hates	_:x4.
>	_:x4	:scorns	:mochi.
>
>Same for all for the same reasons as before.
>
>iv) SELECT DISTINCT ?x ?y {?x ?p ?y}
>
>a.	?x		?y
>	:bijan	:zoe
>	:zoe	:mochi.
>	_:x3		_:x4.
>	_:x4 	:mochi.
>
>I think on 1 and 1' this is the correct answer.
>
>b.	?x		?y
>	:bijan	:zoe
>	:zoe	:mochi.
>
>I believe on 2 or 3, this is the correct answer. 
>Consider pairwise subsumption. Line 1 of iv.a 
>simple entails line 3. Line 2 entails line 4. 
>Thus, lines 3 and 4 are redundant.

That is exactly why using simple pairwise 
subsumption on the answer bindings, in isolation 
from the source graph (technically, the scoping 
graph) isn't an appropriate mechanism to use to 
detect 'real' redundancy. Or at any rate, its not 
if we are going to allow told bnodes (which we 
currently do). Because the scope of told bnodes 
isn't a single answer document: it has to include 
the entire session (ie all the answer documents 
and queries in the session), which amounts to it 
being the scoping graph itself. The entailments 
that determine subsumption are then 'inside' this 
graph (inside a single existential scope) rather 
than between separate graphs (different 
existential scopes).

>If we take the coreference into account, lines 3 
>and 4 *together* are subsumed by lines 1 and 2.
>
>So, I'm introducing a notion of entailment here 
>to define subsumption, in some sense.

No quarrel with that, but you have to be careful 
about how your bnodeIDs are scoped. Told bnodes 
are *the same* between queries, so you have to 
include enough of the 'context' of the scoping 
graph when you test the subsumptions. The only 
general way I can see of doing this is to include 
it all, as we did in the definition of 
E-entailment.

>Let's turn an answer set into a graph using the following template:
>
>Each row gets a fresh bnode.
>Each column header gets a uri by a base concated 
>with #variable + the variable name
>Each value is itself.

Im not sure what this translation is supposed to 
be representing or what its significance is.


>
>Each row is translated as follows:
>
>	rowurr	columnuri	value.
>
>So,
>
>iv.a'
>	_:row1 :variable?x	:bijan.
>	_:row1 :variable?y	:zoe.
>	_:row2 :variable?x	:zoe.
>	_:row2 :variable?y	:mochi.
>	_:row3 :variable?x	_:x3.
>	_:row3 :variable?y	_:x4.
>	_:row4 :variable?x	_:x4.
>	_:row4 :variable?y	:mochi.
>
>	_:row1 :variable?x	:bijan.
>	_:row1 :variable?y	:zoe.
>	_:row2 :variable?x	:zoe
>	_:row2 :variable?y	:mochi.
>
>
>and
>
>iv.b'
>	_:row1 :variable?x	:zoe.
>	_:row1 :variable?y	:zoe.
>	_:row2 :variable?x	:zoe.
>	_:row2 :variable?y	:mochi.
>
>Is iv.a' lean? No. Is iv.b'? Yes. They are also 
>simple equivalent in that each simply entails 
>the other.
>
>	iv.a' entails iv.b' because iv.b' is a 
>subgraph of iv.a'. (Subgraph lemma). Obvious 
>since the first 4 lines in the same.
>
>	iv.b' entails iv.a' because iv.b' is an 
>instance of iv.a' (Instance lemma) with the 
>following substituion:
>
>	_:row1 for _:row3
>	_:row2 for _:row4
>	:zoe for	_:x4
>	:bijan for _:x3 .
>
>Which is what we'd expect.
>
>I hope I've made clear the various positions and 
>provided enough test cases to help people 
>understand the various distinctions.
>
>Questions, corrections, and clarifications welcome.

A question to clarify intuitions, although it is 
not directly applicable to SPARQL. Consider the 
following:

:Bill :p :a
:Joe :p :a
:Bill owl:sameAs :Joe

SELECT DISTINCT ?x [?x :p :a]

If we are using OWL-DL entailment, should the 
answer set have two bindings or one?

An OWL reasoner should be able to figure out that 
there is only one entity here in all models of 
the data, so surely being told two names for this 
one guy is a semantic redundancy, no? (Why not?) 
And yet I think that any useful query regime 
should give both answer bindings, simply because 
the URIs are distinct. The querying agent might 
not know that they are coreferential, so to be 
told only one of the names might withhold 
valuable information.

Is this case fundamentally different from the 
above examples of coreference/redundancy between 
a bnode and a URI? Why?

Pat
-- 
---------------------------------------------------------------------
IHMC		(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32502			(850)291 0667    cell
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Thursday, 17 August 2006 05:00:26 UTC