Re: Summary of BNode redundancy options (at the moment) from Pat Hayes on 2006-08-17 (public-rdf-dawg@w3.org from July to September 2006)

From: Pat Hayes <phayes@ihmc.us>
Date: Thu, 17 Aug 2006 02:29:33 -0700
To: Bijan Parsia <bparsia@cs.man.ac.uk>
Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-Id: <p06230945c109d2c0f3bc@[192.168.1.6]>
>
>>The point I was urging was that it would not, in this case, be 
>>appropriate to reduce this to a single answer on the grounds that 
>>:bijan is an instance of _:x, since the KB has enough information 
>>in it to distinguish :bijan from _:x, and this could be detected by 
>>a subsequent query using _:x as a told bnode.
>
>Actually, it doesn't really. For all we know, :bijan also hates and 
>scorns mochi.

Yes, of course. What I meant by 'distinguish' was 'can deliver 
information about one that it cannot about the other', not 'can prove 
they are distinct'. As you note below, its impossible to prove in RDF 
that any two names are distinct or that they are equal. BUt we can 
still 'distinguish' a from b in this graph:

a p c
b p c
b q d

in a way that we cannot in the subgraph without the third triple.

>But this is an interesting refinement, that is, the graph tells us 
>enough to distinguish the two *terms*, rather than the two *answers*.

To distinguish the two terms, exactly. And the terms *are* the 
answers. (Really, by definition, they are.)

>(Though, it gives me enough to distiguish that one token of :bijan 
>occurs in the graph to support an answer and so does another.)
>
>>Nice example, because it suggests a way to phrase the criterion I 
>>have been struggling to articulate. Say that two terms a and b are 
>>'distinguishable' (wrt the graph) when there exists a pair of 
>>queries which differ only in that one contains a where the other 
>>contains b, and give different answer sets. Then the criterion is 
>>that *distinguishable answer bindings are never counted as mutually 
>>redundant*. Given this, any other redundancy can be eliminated by a 
>>DISTINCT. Thus, if the original graph is equivalent to the instance 
>>_:x/aa of itself, so is redundant, then this redundancy can be 
>>eliminated in any DISTINCT answer set; but apparent bnode/URI 
>>redundancies which do not in fact arise from any non-leanness in 
>>the graph itself cannot be eliminated. This allows all simple 
>>duplications to be eliminated, of course, by definition.
>
>I need to reflect on it, but first cut seems not broken, 
>technically. I just very strongly disagree with it :)

[snip]
>>
>>The trouble with this is that it withholds potentially important 
>>information from the querying agent. Consider the following example 
>>to make the point:
>>
>>:Mary :met :Bill
>>:Bill :nationality USA
>>:Mary :met _:x7
>>_:x7 :nationality IRAQ
>>
>>SELECT DISTINCT ?x [:Mary :met ?x]
>>
>>If you are an government agency trying to keep track of who talked 
>>to whom, it would be less than helpful to be told about Bill, but 
>>not *anything at all* about the nameless Iraqi who Mary has been in 
>>communication with,
>
>Who, for all you know, is :Bill.

Might be, might not be. But since you don't know, and since there is 
information about him that isnt available about Bill, you ought to be 
told about him. Now, if the KB could *prove* that _:x7 was Bill, then 
I'd agree with pruning the answers, but that would be a very strong 
notion of DISTINCT (and a bitch to compute, of course, when it meant 
anything at all.)

>You are importing implicit knowledge not actually encoded in the kb.

No, Im not importing anything. Im just saying that there are two 
terms and we know different stuff about whatever they denote. That 
leaves the question of their semantic identity open.

>  This is why I think it's different from RDF Semantics, at least in part.
>
>>just because a semantic theory says that {:Mary, _:x7}
>
>{:Bill, _:x7}?

Yes, sorry.

>
>>is technically redundant. DISTINCT in this case was likely intended 
>>to mean, distinct *people*, and there is enough information here to 
>>enable a human reader to know that _:x7, whoever it is, is not Bill,
>
>But not from the answer set. Another way to consider the projection 
>is to say, "I don't *care* about the information projected away". At 
>least, in so far as it's distinguishing answers. That seems to be 
>what it is in SQL.

Well, maybe it is, but SQL didn't have bnodes to worry about. Im not 
sure that this 'projection' way of thinking even makes the same kind 
of sense in this setting as it does in SQL.

>>even though an RDF engine might be too dumb to figure this out.
>
>It certainly is :) You need that theory which says that one can't 
>have both a USA and an IRAQI nationality (and for all I know, you 
>can!)

True. But my point is that real human users will be very pissed off 
if engines refuse to show them potentially relevant answers.

>>Getting the answer set {Bill, _:x7} in this case tells you that 
>>there are two individuals
>
>No it doesn't.
>
>Actually, it doesn't. Right? I mean, it could be Bill.

Yes, yes. I spoke carelessly. But the same is true for ANY pair of 
answers. :Bill might be the same guy as :Osama. Its quite rare to be 
able to conclusively establish that two ground terms cannot possibly 
denote the same thing. But that isn't usually taken to be a good 
reason to not deliver all the syntactically distinct ground bindings 
to a query. The point is that *different information* is available 
about them (or, if they co-refer, about *it* under its different 
names.)

>There is a model in which _:x7 gets mapped to what Bill gets mapped 
>to. Entailment, of course, is about all models and it's possible to 
>have a model where they are mapped to distinct elements. But this 
>answer really doesn't tell you anything about how many individuals 
>there are other than 1.
>
>Oops, _:x7 doesn't get mapped to distinct elements. It could always 
>get mapped to :Bill

Not if the extension of :nationality doesn't contain a pair <Bill, Iraq>.

>even if it also gets mapped to some other element as well. 
>Existential, y'know.

Existential has nothing to do with it. The same reasoning would apply 
if that were a URIref. It might corefer with :Bill, or it might not, 
depending on what constraints the rest of the interpretation imposes 
on it. There isn't enough information in the graph to make a clear 
determination one way or the other. This is obviously going to be the 
'normal' case.

>(I believe that RDF graphs under simple interpretations have the 
>"one element" model property. If a RDF graph has a simple 
>interpretation that is a model (and they all do) then it has a model 
>with a single element.

The same is true for any ground set of facts without negation, such as all DBs.

>With RDF intepretations you need at least two, I think...one for the 
>XMLLiterals and one for a not XMLLiteral so that if you have both a 
>well formed and illformed XMLliteral, you have distinct objects to 
>map them to. Ah! I see not, because you actually map them to their 
>XML value. In principle you don't have to, though.)
>
>>about whom something detectable is recorded, and if we have told 
>>bnodes available then it allows a subsequent query to ask about the 
>>nationality of _:x7.
>
>Well, no. Since we don't have BNode stability through different queries.

That stability is exactly what I mean by 'told bnodes'. It allows a 
later query to re-use a bnode delivered as an answer binding in a 
previous query, to get more information about the entity the bnode 
denotes. It amounts to treating the scope of the bnodeID as extending 
over several queries. This is allowed by the current spec (you do it 
by the scoping graph being stable for several queries, instead of a 
new equivalent graph with fresh bnodes for each query. The 
definitions only require the scoping graph to be equivalent to the 
source graph, they do not require it to be bnode-renamed from any 
other graph, so that one can re-use the same scoping graph for a 
series of queries.)

>  (This might mean that you get different answers if you minimize 
>between algebraic operations than just at the end. But I think it's 
>safe if you do it just at the end.)
>
>>(They might be the same, of course, but then so might :Bill and :Joe.)
>
>This is true. To distinguish them in every model we need some form 
>of negation. One element model property, after all.

And that is not usually taken to be a reason to not deliver both 
answers and consider them DISTINCT, right? So DISTINCT means 
something more like 'different in syntactic form' than 'provably not 
equal'.

>>  if the answer set simply does not contain _:x7 at all, then there 
>>is no way that the querier is able to ask for any further 
>>information about this other person who is known to exist.
>
>Is the difference that while :Bill and :Joe might be mapped to the 
>same element (i.e., :Bill = :Joe), _:x7 is *not* (necessarily) 
>mapped to a *single* element. That's what this semantic condition 
>says:
>
>	""If E is an RDF graph then I(E) = true if [I+A'](E) = true 
>for some mapping A' from blank(E) to IR, otherwise I(E)""

The key point is what we count here as 'an RDF graph'. If that 
mapping A' applies across the scope of the inferences we are talking 
about, the fact that we are talking about [I+A'] rather than I is 
irrelevant. So, are we dealing with an inference of (exists (xxxx) 
Ans) from (exists (yyyy) KB), or are we dealing with an inference of 
Ans from KB, with the graph in question being (exists (xxx yyy)(Ans & 
KB)) ? If bnode scopes cross several answers, it has to be the 
second. Note that the current definition of E-entailment doesnt say 
that the answer graph is entailed by the KB, it says that the *union 
of the answer graph and the scoping graph* is entailed.

>I emphasize that because one way to consider the difference between 
>BNodes and URIs/Literals is to their sensitivity to further 
>assumptions (whether actually imposed, or just held). Consider the 
>unique name assumption (UNA). If we were to add that to RDF, then 
>:Bill and :Joe *would* denote two distinct individuals in every 
>model. But :Bill _:x7 could have a map that mapped it only to :Bill 
>in some model.

True, with the usual understanding of UNA. BUt UNA plus existentials 
is itself a rather peculiar idea and needs some careful analysis. For 
example, the usual logical inference rules would no longer apply in 
their usual form.

>I think that the reading of BNodes as distinguishable solely by 
>their identity (or their lexical form, in a concrete syntax), and, 
>in particular, as supporting the heuristic that there is a distinct 
>individual, is to apply a kind of UBnodeA to them. Which is to treat 
>them rather like an N.
>
>Personally, I found the answers are distinguishable by their proofs 
>more plausible and easy to understand, but also just a way of 
>getting redundant answer sets.
>
>>I would guess that the effect of eliminating this much 'redundancy' 
>>would simply be a huge pressure to not use bnodes in RDF data,
>
>Then I think you support a "local name" view of BNodes. In fact, I 
>don't see why you endorse leaned graphs. A human reader might well 
>think that:
>
>	:mary :met _:x.
>	:mary :met _:y.
>
>reasonably "suggests" that mary met two people. And we should 
>communicate that in the answer set. But that graph is not lean. By 
>why is leaning justified if leaning the answers isn't?

Im not arguing against leaning the answers. Im arguing that IF (as we 
do) we allow answer scopes to include the entire graph, then the 
entire graph should be involved in what counts as 'leaning'. And that 
this can produce a situation where answers might *seem* to be 
redundant (non-lean) if viewed locally, when in fact they aren't. And 
in response to the reply that , well, we just *should* use the local 
criterion, I have the objection that if you look at what this does to 
the protocol, it means that potentially important information in the 
KB is rendered invisible, which IMO is a bad design for essentially 
pragmatic reasons.

>I really can't find a principle that gets your preferred answers 
>that 1) coheres in a non ad hoc way, and 2) doesn't at least 
>implicitly regard BNodes as local names. And I think the latter is 
>not what the RDF semantics says.

Im not suggesting changing anything in the RDF semantics.

>I happen to think we'd all be better off if that was what the RDF 
>semantics *did* say.
>
>Note, you can get your behavior by having two graphs, equivalent. 
>One lean and one not. SELECT on the non-lean gives you the normal 
>SELECT ALL behavior on the non-lean. SELECT on the lean, gives you 
>*your desired* SELECT DISTINCT behavior wrt to the non-lean, and 
>SELECT bijan DISTINCT gives the same answers in either case.

I don't follow this. The situation Im worried about only arises when 
the graph is lean.

>If it's ok to swtich graphs for entailment, why not switch graphs 
>for leanness. You've not lost a thing.
>
>>since their presence would be liable to introduce fatal mistakes in 
>>query answering;
>
>Well, obviously I don't agree, though in either case one must take 
>care in explaining what's happening. I think mine makes it easier to 
>understand the meaning of the data and the query so that you know 
>exactly what you are assuming and what you are not. And I think it's 
>important to support that or change the RDF semantics to what 
>everyone uses and expects.
>
>>and this in turn would make RDF into little more than a weak 
>>database technology.
>
>Well, that's how everyone in practice uses it. So let's make these cohere!
>
>I would love to know where for you the existential reading makes 
>BNodes worth the computational and cognitive cost. (Remember, I'm 
>not afraid of high complexity. But everyone else argues that both 
>the complexity and the actual implementations are not things they 
>want to do, by and large.)

? Are you on a crusade here to rebuild RDF without bnodes? If so, 
that seems rather outside our charter, to put it mildly.

>
>(Bnodes could be much better than nulls without having to be existentials.)
>
>>I really do think that cases like this are very important. In 
>>another, parallel, universe I have been trying to help persuade 
>>some agencies to become more interested in SWeb technology, and 
>>examples like this are central to their concerns, which (not 
>>surprisingly) have a lot to do with being able to handle incomplete 
>>and partial information.
>
>Well, I have to see the use case where you need both the existential 
>reading to handle the cases of incomplete and partial information 
>you've given. :mary is a case where its not useful and I'd argue is 
>not what the writer likely intended.

I fail to see what possible grounds you could have for this.

>Unless they intended *by that* that mary might have met many Iraqis 
>or that it might have been Bob who was the iraqi she met.

These possibilities are not excluded, but they are also not known. 
Lack of information often really is simply a lack, one that leaves 
many possibilities open. The intention is not to implicitly deny 
these.

The situation is simply one where a large amount of incomplete 
information is available. Often, nothing is known about who various 
people are, deliberately misleading nicknames are being used, and so 
on. Often, all that is known is that someone (or some event, etc) 
exists, and some facts about he/it are known, but not enough to 
attach a real identifier to it reliably. Sometimes, a weak form of 
the UNA is maintained by a systematic process of inventing suitable 
identifiers (this is how OntoWorks reasoner works), but this needs to 
be backed up by special inferences which basically repair UNA faults 
when identities are discovered. In effect, the UNA is a default 
rather than a strict inference, which re-introduces the existential 
reading.

>
>[snip]
>>>iii) SELECT DISTINCT ?x ?p ?y {?x ?p ?y}
>>>
>>>(Same as i, but my numbering is getting confusing enough :))
>>>
>>>
>>>	?x		?p		?y
>>>	:bijan	:loves	:zoe
>>>	:zoe	:eats	:mochi.
>>>	_:x3		:hates	_:x4.
>>>	_:x4	:scorns	:mochi.
>>>
>>>Same for all for the same reasons as before.
>>>
>>>iv) SELECT DISTINCT ?x ?y {?x ?p ?y}
>[snip]
>>>b.	?x		?y
>>>	:bijan	:zoe
>>>	:zoe	:mochi.
>>>
>>>I believe on 2 or 3, this is the correct answer. Consider pairwise 
>>>subsumption. Line 1 of iv.a simple entails line 3. Line 2 entails 
>>>line 4. Thus, lines 3 and 4 are redundant.
>>
>>That is exactly why using simple pairwise subsumption
>
>Note that it also works on respecting the bnode coreference between answers.
>
>>on the answer bindings, in isolation from the source graph 
>>(technically, the scoping graph) isn't an appropriate mechanism to 
>>use to detect 'real' redundancy.
>
>?
>
>>Or at any rate, its not if we are going to allow told bnodes (which 
>>we currently do).
>
>See above.
>
>>Because the scope of told bnodes isn't a single answer document: it 
>>has to include the entire session
>
>Please show me in the document where we have sessions. I wanted 
>them, but we don't have them.

We don't mention them explicitly, but the framework is designed to 
allow them, without violating the spec as it stands.

>>(ie all the answer documents and queries in the session), which 
>>amounts to it being the scoping graph itself. The entailments that 
>>determine subsumption are then 'inside' this graph (inside a single 
>>existential scope) rather than between separate graphs (different 
>>existential scopes).
>
>Well, I don't think that's the only way.

It seems to me to be almost a direct transcription into logical 
terminology of the idea that a bnode retains its meaning across 
queries.

>
>>>If we take the coreference into account, lines 3 and 4 *together* 
>>>are subsumed by lines 1 and 2.
>>>
>>>So, I'm introducing a notion of entailment here to define 
>>>subsumption, in some sense.
>>
>>No quarrel with that, but you have to be careful about how your 
>>bnodeIDs are scoped. Told bnodes are *the same* between queries,
>
>I really see no evidence of this anywhere in the spec. I was in the 
>believe that the current spec did not allow this or even for it 
>except by not explicitly ruling out.

It allows it. It does not mention it explicitly.

>
>>so you have to include enough of the 'context' of the scoping graph 
>>when you test the subsumptions. The only general way I can see of 
>>doing this is to include it all, as we did in the definition of 
>>E-entailment.
>>
>>>Let's turn an answer set into a graph using the following template:
>>>
>>>Each row gets a fresh bnode.
>>>Each column header gets a uri by a base concated with #variable + 
>>>the variable name
>>>Each value is itself.
>>
>>Im not sure what this translation is supposed to be representing or 
>>what its significance is.
>
>I'm showing how the redundancy of an answer set (in my sense) is 
>detectable by graph redundancy, so I can reduce my notion of answer 
>subsumption to graph entailment. And that shows where I think the 
>scope of the bnodes is.
>
>Just imagine the appropriate construct.

My problem is that I already have what seems like an appropriate 
graph construct for this, and its different from yours. Ill try again 
tomorrow.

>[snip]
>>A question to clarify intuitions, although it is not directly 
>>applicable to SPARQL. Consider the following:
>>
>>:Bill :p :a
>>:Joe :p :a
>>:Bill owl:sameAs :Joe
>>
>>SELECT DISTINCT ?x [?x :p :a]
>>
>>If we are using OWL-DL entailment, should the answer set have two 
>>bindings or one?
>
>Good question and not one that I've seen asked or dealt with. But I 
>think there's a mild UNA wrt answer set redundancy implicit in most 
>treatments of query answering against DLs. Or perhaps, no one has 
>thought hard about distinct.
>
>Thinking about it, and how we handle these things in general, I'd 
>prefer 1 answer, but of the following form:
>
>	?x
>	:Bill = :Joe (or :Bill/:Joe, or some such).
>
>But I wouldn't want two answers. Because there aren't two *answers*.

Ah, I'm glad I asked the question, this is what I thought you might 
say. Seems obvious to me that that there are two *answers* here. That 
they co-refer is interesting, but it doesn't make them into the same 
*answer*. An answer is a binding to a variable, not what that binding 
denotes. Clearly our intuitions differ on this.

>
>Actually, I'd use that for without the projection. Consider:
>
>	:Bill :r1 :a.
>	:Joe :r2 :a.
>	:Bill owl:sameAs :Joe
>
>SELECT DISTINCT ?x ?p {?x ?p :a}
>	?x			?p
>	:Bill = :Joe	:r1
>	:Bill = :Joe	:r2
>
>But I've not though it through.
>
>>An OWL reasoner should be able to figure out that there is only one 
>>entity here in all models of the data, so surely being told two 
>>names for this one guy is a semantic redundancy, no?
>
>It is.
>
>>(Why not?)
>
>Well, I didn't say it was not :)
>
>>And yet I think that any useful query regime should give both 
>>answer bindings, simply because the URIs are distinct.
>
>Well see above :)
>
>>The querying agent might not know that they are coreferential, so 
>>to be told only one of the names might withhold valuable 
>>information.
>
>Probably not, since you could always ask for all the instances :Bill 
>is sameas, in all OWL systems I know. The problem is that picking 
>which name is arbitrary, so could fluctuate without change to the 
>data.

Thats another point, but one that we could fix in some arbitrary way 
(alphabetic ordering, say).

>>Is this case fundamentally different from the above examples of 
>>coreference/redundancy between a bnode and a URI? Why?
>
>It's just redundancy. I use "coreference" only for appearance of the 
>same bnode in different answers. Sorry for that confusion.
>
>I think in the case of the open world assumption (and names), 
>DISTINCT has to be something like, wrt to equality, "as far as I 
>know distinct".

Hmm, I think it should be "as far as I know, could be distinct", ie 
"not known to be the same". That is, I would put the onus on the need 
to establish identity rather than to establish non-identity. (I guess 
thats the 'weak UNA' you re referring to above?) Maybe that is where 
all our differences come from on this. (With your formulation, you 
don't need the 'as far as', since if you know they are distinct, then 
finding out more isnt going to make you discover they are the same.)

Pat
-- 
---------------------------------------------------------------------
IHMC		(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32502			(850)291 0667    cell
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Thursday, 17 August 2006 09:29:53 UTC