W3C home > Mailing lists > Public > public-rdf-dawg@w3.org > July to September 2006

Re: Summary of BNode redundancy options (at the moment)

From: Bijan Parsia <bparsia@cs.man.ac.uk>
Date: Thu, 17 Aug 2006 08:12:53 +0100
Message-Id: <B5D7E7B8-42E8-4235-ADBB-60BB108926E7@cs.man.ac.uk>
Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>
To: Pat Hayes <phayes@ihmc.us>

On Aug 17, 2006, at 6:00 AM, Pat Hayes wrote:

>> (As I understand them.)
>>
>> This is so y'all down have to plow through the debate. In general,  
>> I presume that an answer set is not correct as an answer to a  
>> DISTINCT query if it is redundant with respect to the notion of  
>> redundancy the particular DISTINCT key answers to.
>> ==================
>>
>> BNODE REDUNDANCY:
>>
>> 1) An answer set is redundant (with respect to BNodes) iff the  
>> original graph is not lean.
>>
>> That is, algebraic operations never introduce BNode redundancy,  
>> thus no minimization after operations needs to attend to BNodes.
>>
>> This is roughly the position that I understand Andy and Pat to hold.
>
> Another way to put it is, an answer set is not redundant if a  
> 'larger' query (more triples in the query pattern) exists which  
> would produce an answer set which would not be redundant, from the  
> same KB. [Later: a better way to put it is below]
>
>> There is a different formulation:
>>
>> 1') An answer set is redundant iff an answer doesn't tell us  
>> something distinct about the graph.
>>
>> They may want to restrict this to BNodes only, but I'm hard  
>> pressed to see why.
>
> ? Does applying this to non-bnodes have any effect?

Yes. See ii.a

> A related question below.
>
>> On this reading, actually, there is no difference between lean and  
>> non-lean.
>
> Er...not sure I agree with this.

Well strictly on 1' it doesn't. You could add "doesn't tell us  
something distinct about the lean graph". Which is probably what you  
want.

>> If we have a graph redundancy then the "redundant" answer tell us  
>> something about the graph.
>
> Well, that isn't what I was arguing for. The key point that worries  
> me is that a nonredundant, lean, graph can still give apparently  
> redundant answers,

Yes, I believe that algebraic operations can, and typically do,  
introduce redundancy.

> in effect because the query is being answered with respect to a non- 
> lean subgraph (the triples that match with the query pattern).

I don't see that. Example? The projection case seems safe enough.

> So, in this case, should we say that the answer is redundant or  
> not? I would argue not (unless we forbid told bnodes, and maybe  
> even then); but we should discuss the pros and cons.

I don't understand the case, so I don't have any intuition at the  
moment.

>> ======
>> 2) An answer set is redundant (with respect to BNode) iff some  
>> answer entails another answer
>>
>> 3) An  answer set is redundant (with respect to Bnodes and  
>> coreference) iff some "co-referenced" set of answers entails  
>> another "co-referenced" set of answers.
>
> Can you explain "co-referenced" here? (There is no way to establish  
> coreference without owl:sameAs or something equivalent, AFAIK.)

Sorry, it was short hand for "co-reference of bnodes between answer  
sets"

The idea was that on a pairwise reading:

	?x		?y
	:bijan	:mochi
	_:x		:mochi
	_:x		:gaz

Line two disappears. If we say, instead, that the set of answers that  
have _:x have to be subsumed by the remaining triples, then this  
answer would hold.

[snip]
>> i) SELECT DISTINCT ?x ?p ?y {?x ?p ?y}
>>
>> 	?x		?p		?y
>> 	:bijan	:loves	:mochi
>> 	:bijan	:eats	:mochi.
>> 	_:x3		:hates	:mochi.
>> 	_:x3	:scorns	:mochi.
>>
>> This is the same answer for 1, 2 and 3. The second variable is  
>> unique in every row, thus each answer is different.
>
> Agreed.
>
>>
>> ii) SELECT DISTINCT ?x ?y {?x ?p ?y}
>>
>> a.	?x		?y
>> 	:bijan	:mochi
>> 	:bijan	:mochi.
>> 	_:x3		:mochi.
>> 	_:x3 	:mochi.
>>
>> This is what we get if we take definition 1' as holding uniformly.  
>> There are distinct reasons for each line.
>
> I would be, well, not happy, but maybe complacent with this, in  
> fact, even though I agree it seems kind of silly.

I think it's entirely broken and horribly counterintuitive. It breaks  
with SQL hard, for example.

> But I would be happier with the answer
>
> ?x	?y
> :bijan	:mochi
> _:x	:mochi
>
> with the following justification: the DISTINCT requests that  
> duplicates be removed from the answer set, and ii.a contains two  
> duplications, which is gratuitous redundancy that serves no useful  
> purpose.

It does, by 1'. It tells you that there are two, distinct,  
nonredundant ways to reach each of :bijan :mochi and _:x3 :mochi.  
This seems *exactly* the same justifiction for ii.c.

> The point I was urging was that it would not, in this case, be  
> appropriate to reduce this to a single answer on the grounds  
> that :bijan is an instance of _:x, since the KB has enough  
> information in it to distinguish :bijan from _:x, and this could be  
> detected by a subsequent query using _:x as a told bnode.

Actually, it doesn't really. For all we know, :bijan also hates and  
scorns mochi. But this is an interesting refinement, that is, the  
graph tells us enough to distinguish the two *terms*, rather than the  
two *answers*. (Though, it gives me enough to distiguish that one  
token of :bijan occurs in the graph to support an answer and so does  
another.)

> Nice example, because it suggests a way to phrase the criterion I  
> have been struggling to articulate. Say that two terms a and b are  
> 'distinguishable' (wrt the graph) when there exists a pair of  
> queries which differ only in that one contains a where the other  
> contains b, and give different answer sets. Then the criterion is  
> that *distinguishable answer bindings are never counted as mutually  
> redundant*. Given this, any other redundancy can be eliminated by a  
> DISTINCT. Thus, if the original graph is equivalent to the instance  
> _:x/aa of itself, so is redundant, then this redundancy can be  
> eliminated in any DISTINCT answer set; but apparent bnode/URI  
> redundancies which do not in fact arise from any non-leanness in  
> the graph itself cannot be eliminated. This allows all simple  
> duplications to be eliminated, of course, by definition.

I need to reflect on it, but first cut seems not broken, technically.  
I just very strongly disagree with it :)

> (I think to be correct this should be applied to vectors of answer  
> bindings, in fact, rather than pairwise to single bindings.)
>
> I am sure that in the worst case this is a bear to compute, but  
> then almost any kind of bnode redundancy minimality is  
> destructively hard in the worst case because one can code up the  
> subgraph problem in it.

Well, that's the other advantage of mine, since it can be applied  
post facto, so only on the answer set, which tends to be small. (And  
you only touch relevant redundancy).


>>
>> b.	?x		?y
>> 	:bijan	:mochi
>> 	_:x3		:mochi.
>> 	_:x3 	:mochi.
>>
>> This is what we get if take definition 1. There are distinct  
>> reasons for each BNode (but the coreference is significant). If  
>> you change _:x in the last line of Lean1 to _:y, you get the less  
>> distasteful:
>>
>> c.	?x		?y
>> 	:bijan	:mochi
>> 	_:x3		:mochi.
>> 	_:x4 	:mochi.
>>
>> So is ii.b redundant or not? It's clearly *lexically* redundant.  
>> If we allow that why not ii.a?
>
> The duplicated _:x3 is redundant, yes. I agree that c. is right,  
> given the change to _:x.
>
>> Won't this be horribly confusing?
>
> Yes. But the answer you give in d. might be dangerously  
> *misleading*, which IMO is a lot worse.
>
>> Remember we have no *access* to the *reason* an answer exists,  
>> though we could always ask a *different* query (like i).
>>
>> d.	?x		?y
>> 	:bijan	:mochi
>>
>> This is the answer I expect from 2 and 3.
>
> The trouble with this is that it withholds potentially important  
> information from the querying agent. Consider the following example  
> to make the point:
>
> :Mary :met :Bill
> :Bill :nationality USA
> :Mary :met _:x7
> _:x7 :nationality IRAQ
>
> SELECT DISTINCT ?x [:Mary :met ?x]
>
> If you are an government agency trying to keep track of who talked  
> to whom, it would be less than helpful to be told about Bill, but  
> not *anything at all* about the nameless Iraqi who Mary has been in  
> communication with,

Who, for all you know, is :Bill. You are importing implicit knowledge  
not actually encoded in the kb. This is why I think it's different  
from RDF Semantics, at least in part.

> just because a semantic theory says that {:Mary, _:x7}

{:Bill, _:x7}?

> is technically redundant. DISTINCT in this case was likely intended  
> to mean, distinct *people*, and there is enough information here to  
> enable a human reader to know that _:x7, whoever it is, is not Bill,

But not from the answer set. Another way to consider the projection  
is to say, "I don't *care* about the information projected away". At  
least, in so far as it's distinguishing answers. That seems to be  
what it is in SQL.

> even though an RDF engine might be too dumb to figure this out.

It certainly is :) You need that theory which says that one can't  
have both a USA and an IRAQI nationality (and for all I know, you can!)

> Getting the answer set {Bill, _:x7} in this case tells you that  
> there are two individuals

No it doesn't.

Actually, it doesn't. Right? I mean, it could be Bill. There is a  
model in which _:x7 gets mapped to what Bill gets mapped to.  
Entailment, of course, is about all models and it's possible to have  
a model where they are mapped to distinct elements. But this answer  
really doesn't tell you anything about how many individuals there are  
other than 1.

Oops, _:x7 doesn't get mapped to distinct elements. It could always  
get mapped to :Bill even if it also gets mapped to some other element  
as well. Existential, y'know.

(I believe that RDF graphs under simple interpretations have the "one  
element" model property. If a RDF graph has a simple interpretation  
that is a model (and they all do) then it has a model with a single  
element. With RDF intepretations you need at least two, I think...one  
for the XMLLiterals and one for a not XMLLiteral so that if you have  
both a well formed and illformed XMLliteral, you have distinct  
objects to map them to. Ah! I see not, because you actually map them  
to their XML value. In principle you don't have to, though.)

> about whom something detectable is recorded, and if we have told  
> bnodes available then it allows a subsequent query to ask about the  
> nationality of _:x7.

Well, no. Since we don't have BNode stability through different  
queries. (This might mean that you get different answers if you  
minimize between algebraic operations than just at the end. But I  
think it's safe if you do it just at the end.)

> (They might be the same, of course, but then so might :Bill and :Joe.)

This is true. To distinguish them in every model we need some form of  
negation. One element model property, after all.

>  if the answer set simply does not contain _:x7 at all, then there  
> is no way that the querier is able to ask for any further  
> information about this other person who is known to exist.

Is the difference that while :Bill and :Joe might be mapped to the  
same element (i.e., :Bill = :Joe), _:x7 is *not* (necessarily) mapped  
to a *single* element. That's what this semantic condition says:

	""If E is an RDF graph then I(E) = true if [I+A'](E) = true for some  
mapping A' from blank(E) to IR, otherwise I(E)""

I emphasize that because one way to consider the difference between  
BNodes and URIs/Literals is to their sensitivity to further  
assumptions (whether actually imposed, or just held). Consider the  
unique name assumption (UNA). If we were to add that to RDF,  
then :Bill and :Joe *would* denote two distinct individuals in every  
model. But :Bill _:x7 could have a map that mapped it only to :Bill  
in some model.

I think that the reading of BNodes as distinguishable solely by their  
identity (or their lexical form, in a concrete syntax), and, in  
particular, as supporting the heuristic that there is a distinct  
individual, is to apply a kind of UBnodeA to them. Which is to treat  
them rather like an N.

Personally, I found the answers are distinguishable by their proofs  
more plausible and easy to understand, but also just a way of getting  
redundant answer sets.

> I would guess that the effect of eliminating this much 'redundancy'  
> would simply be a huge pressure to not use bnodes in RDF data,

Then I think you support a "local name" view of BNodes. In fact, I  
don't see why you endorse leaned graphs. A human reader might well  
think that:

	:mary :met _:x.
	:mary :met _:y.

reasonably "suggests" that mary met two people. And we should  
communicate that in the answer set. But that graph is not lean. By  
why is leaning justified if leaning the answers isn't?

I really can't find a principle that gets your preferred answers that  
1) coheres in a non ad hoc way, and 2) doesn't at least implicitly  
regard BNodes as local names. And I think the latter is not what the  
RDF semantics says.

I happen to think we'd all be better off if that was what the RDF  
semantics *did* say.

Note, you can get your behavior by having two graphs, equivalent. One  
lean and one not. SELECT on the non-lean gives you the normal SELECT  
ALL behavior on the non-lean. SELECT on the lean, gives you *your  
desired* SELECT DISTINCT behavior wrt to the non-lean, and SELECT  
bijan DISTINCT gives the same answers in either case.

If it's ok to swtich graphs for entailment, why not switch graphs for  
leanness. You've not lost a thing.

> since their presence would be liable to introduce fatal mistakes in  
> query answering;

Well, obviously I don't agree, though in either case one must take  
care in explaining what's happening. I think mine makes it easier to  
understand the meaning of the data and the query so that you know  
exactly what you are assuming and what you are not. And I think it's  
important to support that or change the RDF semantics to what  
everyone uses and expects.

> and this in turn would make RDF into little more than a weak  
> database technology.

Well, that's how everyone in practice uses it. So let's make these  
cohere!

I would love to know where for you the existential reading makes  
BNodes worth the computational and cognitive cost. (Remember, I'm not  
afraid of high complexity. But everyone else argues that both the  
complexity and the actual implementations are not things they want to  
do, by and large.)

(Bnodes could be much better than nulls without having to be  
existentials.)

> I really do think that cases like this are very important. In  
> another, parallel, universe I have been trying to help persuade  
> some agencies to become more interested in SWeb technology, and  
> examples like this are central to their concerns, which (not  
> surprisingly) have a lot to do with being able to handle incomplete  
> and partial information.

Well, I have to see the use case where you need both the existential  
reading to handle the cases of incomplete and partial information  
you've given. :mary is a case where its not useful and I'd argue is  
not what the writer likely intended. Unless they intended *by that*  
that mary might have met many Iraqis or that it might have been Bob  
who was the iraqi she met.

[snip]
>> iii) SELECT DISTINCT ?x ?p ?y {?x ?p ?y}
>>
>> (Same as i, but my numbering is getting confusing enough :))
>>
>>
>> 	?x		?p		?y
>> 	:bijan	:loves	:zoe
>> 	:zoe	:eats	:mochi.
>> 	_:x3		:hates	_:x4.
>> 	_:x4	:scorns	:mochi.
>>
>> Same for all for the same reasons as before.
>>
>> iv) SELECT DISTINCT ?x ?y {?x ?p ?y}
[snip]
>> b.	?x		?y
>> 	:bijan	:zoe
>> 	:zoe	:mochi.
>>
>> I believe on 2 or 3, this is the correct answer. Consider pairwise  
>> subsumption. Line 1 of iv.a simple entails line 3. Line 2 entails  
>> line 4. Thus, lines 3 and 4 are redundant.
>
> That is exactly why using simple pairwise subsumption

Note that it also works on respecting the bnode coreference between  
answers.

> on the answer bindings, in isolation from the source graph  
> (technically, the scoping graph) isn't an appropriate mechanism to  
> use to detect 'real' redundancy.

?

> Or at any rate, its not if we are going to allow told bnodes (which  
> we currently do).

See above.

> Because the scope of told bnodes isn't a single answer document: it  
> has to include the entire session

Please show me in the document where we have sessions. I wanted them,  
but we don't have them.

> (ie all the answer documents and queries in the session), which  
> amounts to it being the scoping graph itself. The entailments that  
> determine subsumption are then 'inside' this graph (inside a single  
> existential scope) rather than between separate graphs (different  
> existential scopes).

Well, I don't think that's the only way.

>> If we take the coreference into account, lines 3 and 4 *together*  
>> are subsumed by lines 1 and 2.
>>
>> So, I'm introducing a notion of entailment here to define  
>> subsumption, in some sense.
>
> No quarrel with that, but you have to be careful about how your  
> bnodeIDs are scoped. Told bnodes are *the same* between queries,

I really see no evidence of this anywhere in the spec. I was in the  
believe that the current spec did not allow this or even for it  
except by not explicitly ruling out.

> so you have to include enough of the 'context' of the scoping graph  
> when you test the subsumptions. The only general way I can see of  
> doing this is to include it all, as we did in the definition of E- 
> entailment.
>
>> Let's turn an answer set into a graph using the following template:
>>
>> Each row gets a fresh bnode.
>> Each column header gets a uri by a base concated with #variable +  
>> the variable name
>> Each value is itself.
>
> Im not sure what this translation is supposed to be representing or  
> what its significance is.

I'm showing how the redundancy of an answer set (in my sense) is  
detectable by graph redundancy, so I can reduce my notion of answer  
subsumption to graph entailment. And that shows where I think the  
scope of the bnodes is.

Just imagine the appropriate construct.

[snip]
> A question to clarify intuitions, although it is not directly  
> applicable to SPARQL. Consider the following:
>
> :Bill :p :a
> :Joe :p :a
> :Bill owl:sameAs :Joe
>
> SELECT DISTINCT ?x [?x :p :a]
>
> If we are using OWL-DL entailment, should the answer set have two  
> bindings or one?

Good question and not one that I've seen asked or dealt with. But I  
think there's a mild UNA wrt answer set redundancy implicit in most  
treatments of query answering against DLs. Or perhaps, no one has  
thought hard about distinct.

Thinking about it, and how we handle these things in general, I'd  
prefer 1 answer, but of the following form:

	?x
	:Bill = :Joe (or :Bill/:Joe, or some such).

But I wouldn't want two answers. Because there aren't two *answers*.

Actually, I'd use that for without the projection. Consider:

	:Bill :r1 :a.
	:Joe :r2 :a.
	:Bill owl:sameAs :Joe

SELECT DISTINCT ?x ?p {?x ?p :a}
	?x			?p
	:Bill = :Joe 	:r1
	:Bill = :Joe	:r2

But I've not though it through.

> An OWL reasoner should be able to figure out that there is only one  
> entity here in all models of the data, so surely being told two  
> names for this one guy is a semantic redundancy, no?

It is.

> (Why not?)

Well, I didn't say it was not :)

> And yet I think that any useful query regime should give both  
> answer bindings, simply because the URIs are distinct.

Well see above :)

> The querying agent might not know that they are coreferential, so  
> to be told only one of the names might withhold valuable information.

Probably not, since you could always ask for all the instances :Bill  
is sameas, in all OWL systems I know. The problem is that picking  
which name is arbitrary, so could fluctuate without change to the data.

> Is this case fundamentally different from the above examples of  
> coreference/redundancy between a bnode and a URI? Why?

It's just redundancy. I use "coreference" only for appearance of the  
same bnode in different answers. Sorry for that confusion.

I think in the case of the open world assumption (and names),  
DISTINCT has to be something like, wrt to equality, "as far as I know  
distinct". So without the sameas, two answers. With the sameas one  
answer.

Removing names changes the signature, and that's typically a pretty  
significant move. Pruning existentials do not. Ok, I'm sleep deprived  
and just riffing here.

Cheers,
Bijan.
Received on Thursday, 17 August 2006 07:13:15 GMT

This archive was generated by hypermail 2.3.1 : Tuesday, 26 March 2013 16:15:27 GMT