Summary of BNode redundancy options (at the moment) from Bijan Parsia on 2006-08-16 (public-rdf-dawg@w3.org from July to September 2006)

From: Bijan Parsia <bparsia@cs.man.ac.uk>
Date: Wed, 16 Aug 2006 12:57:40 +0100
To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-Id: <DF0BA59F-7E8C-4CDB-BF2C-C391D05CEB4D@cs.man.ac.uk>
(As I understand them.)

This is so y'all down have to plow through the debate. In general, I  
presume that an answer set is not correct as an answer to a DISTINCT  
query if it is redundant with respect to the notion of redundancy the  
particular DISTINCT key answers to.
==================

BNODE REDUNDANCY:

1) An answer set is redundant (with respect to BNodes) iff the  
original graph is not lean.

That is, algebraic operations never introduce BNode redundancy, thus  
no minimization after operations needs to attend to BNodes.

This is roughly the position that I understand Andy and Pat to hold.

There is a different formulation:

1') An answer set is redundant iff an answer doesn't tell us  
something distinct about the graph.

They may want to restrict this to BNodes only, but I'm hard pressed  
to see why. On this reading, actually, there is no difference between  
lean and non-lean. If we have a graph redundancy then the "redundant"  
answer tell us something about the graph.

======
2) An answer set is redundant (with respect to BNode) iff some answer  
entails another answer

3) An  answer set is redundant (with respect to Bnodes and  
coreference) iff some "co-referenced" set of answers entails another  
"co-referenced" set of answers.

	So there are two possibilities: We treat each answer independently  
and look for subsumptions (one answer subsumes another iff the first  
entails the second; tricky issues with empty bindings to consider)  
pairwise. Or, we have to consider *sets* of answers, because  
coreference allows us to see *in the answer set* that an answer isn't  
redundant. The first is obviously easier to compute, but the answers  
*might* be a bit hard to interpret. The latter is more difficult to  
compute (an specify) but the answers might be more "pleasing".

Obviously, the notion of "co-referenced" set of answers (a horrid  
term) must be defined clearly, but in this message I want to just  
give some examples and use cases to give folks the gist.

=====================================

Lean1	:bijan :loves :mochi.
		:bijan :eats :mochi.
		_:x :hates :mochi.
		_:x :scorns :mochi.


i) SELECT DISTINCT ?x ?p ?y {?x ?p ?y}

	?x		?p		?y
	:bijan	:loves	:mochi
	:bijan 	:eats 	:mochi.
	_:x3		:hates 	:mochi.
	_:x3 	:scorns 	:mochi.

This is the same answer for 1, 2 and 3. The second variable is unique  
in every row, thus each answer is different.

ii) SELECT DISTINCT ?x ?y {?x ?p ?y}

a.	?x		?y
	:bijan	:mochi
	:bijan 	:mochi.
	_:x3	 	:mochi.
	_:x3  	:mochi.

This is what we get if we take definition 1' as holding uniformly.  
There are distinct reasons for each line.

b.	?x		?y
	:bijan	:mochi
	_:x3	 	:mochi.
	_:x3  	:mochi.

This is what we get if take definition 1. There are distinct reasons  
for each BNode (but the coreference is significant). If you change  
_:x in the last line of Lean1 to _:y, you get the less distasteful:

c.	?x		?y
	:bijan	:mochi
	_:x3	 	:mochi.
	_:x4  	:mochi.

So is ii.b redundant or not? It's clearly *lexically* redundant. If  
we allow that why not ii.a? Won't this be horribly confusing?

Remember we have no *access* to the *reason* an answer exists, though  
we could always ask a *different* query (like i).

d. 	?x		?y
	:bijan	:mochi

This is the answer I expect from 2 and 3.

=========
Lean2 	:bijan :loves :zoe.
		:zoe :eats :mochi.
		_:x :hates _:y.
		_:y :scorns :mochi.

iii) SELECT DISTINCT ?x ?p ?y {?x ?p ?y}

(Same as i, but my numbering is getting confusing enough :))


	?x		?p		?y
	:bijan	:loves	:zoe
	:zoe 	:eats 	:mochi.
	_:x3		:hates 	_:x4.
	_:x4 	:scorns 	:mochi.

Same for all for the same reasons as before.

iv) SELECT DISTINCT ?x ?y {?x ?p ?y}

a.	?x		?y
	:bijan	:zoe
	:zoe 	:mochi.
	_:x3	 	_:x4.
	_:x4  	:mochi.

I think on 1 and 1' this is the correct answer.

b.	?x		?y
	:bijan	:zoe
	:zoe 	:mochi.

I believe on 2 or 3, this is the correct answer. Consider pairwise  
subsumption. Line 1 of iv.a simple entails line 3. Line 2 entails  
line 4. Thus, lines 3 and 4 are redundant. If we take the coreference  
into account, lines 3 and 4 *together* are subsumed by lines 1 and 2.

So, I'm introducing a notion of entailment here to define  
subsumption, in some sense. Let's turn an answer set into a graph  
using the following template:

Each row gets a fresh bnode.
Each column header gets a uri by a base concated with #variable + the  
variable name
Each value is itself.

Each row is translated as follows:

	rowurr	columnuri	value.

So,

iv.a'
	_:row1 :variable?x	:bijan.
	_:row1 :variable?y	:zoe.
	_:row2 :variable?x	:zoe.
	_:row2 :variable?y	:mochi.
	_:row3 :variable?x	_:x3.
	_:row3 :variable?y	_:x4.
	_:row4 :variable?x	_:x4.
	_:row4 :variable?y	:mochi.

	_:row1 :variable?x	:bijan.
	_:row1 :variable?y	:zoe.
	_:row2 :variable?x	:zoe
	_:row2 :variable?y	:mochi.


and

iv.b'
	_:row1 :variable?x	:zoe.
	_:row1 :variable?y	:zoe.
	_:row2 :variable?x	:zoe.
	_:row2 :variable?y	:mochi.

Is iv.a' lean? No. Is iv.b'? Yes. They are also simple equivalent in  
that each simply entails the other.

	iv.a' entails iv.b' because iv.b' is a subgraph of iv.a'. (Subgraph  
lemma). Obvious since the first 4 lines in the same.

	iv.b' entails iv.a' because iv.b' is an instance of iv.a' (Instance  
lemma) with the following substituion:

	_:row1 for _:row3
	_:row2 for _:row4
	:zoe for 	_:x4
	:bijan for _:x3 .

Which is what we'd expect.

I hope I've made clear the various positions and provided enough test  
cases to help people understand the various distinctions.

Questions, corrections, and clarifications welcome.

I don't believe this fully discharges my action items, but it may  
handle """*ACTION:* bijan to show that the "strong" version of  
DISTNCT doesn't interfere with intermittent algebraic operations  
[recorded in
http://www.w3.org/2006/08/15-dawg-minutes.html#action01]"""

I mean, it's not a proof, but I can't see how any of my definitions,  
partial though they may be, would ever require touching intermittent  
operations. A counterexample would be helpful! (Or an example where  
the postprocessing approach has unexpected results.) Blanks have to  
be taken into account, but I don't know that they are on any of the  
above views thus far.

Cheers,
Bijan.
Received on Wednesday, 16 August 2006 11:57:58 UTC