Re: bnodes as answer bindings from Bijan Parsia on 2006-08-03 (public-rdf-dawg@w3.org from July to September 2006)

From: Bijan Parsia <bparsia@cs.man.ac.uk>
Date: Thu, 3 Aug 2006 19:18:40 +0100
To: Pat Hayes <phayes@ihmc.us>
Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>, Eric Prud'hommeaux <eric@w3.org>, Enrico Franconi <franconi@inf.unibz.it>
Message-Id: <E9CCC601-EC81-4AD8-81E0-C8872EC045AB@cs.man.ac.uk>
I think you're confused too, Pat.

On Aug 1, 2006, at 10:33 PM, Pat Hayes wrote:

> There have been rather strong claims made in recent emails about  
> how bnodes in query patterns should (must?) be handled, in some  
> cases accompanied by inappropriate condescending ad-hominem comments.

Is this a case of an *appropriate* condescending ad-hominem comment? :)

But I shall say no more about that.

[snip]
> (2) There is an objection to this comfortable picture, which is  
> that for more expressive languages (OWL-DL), it is possible for a  
> KB to entail an existential statement without there being any term  
> in the KB for a variable to bind to; and in such a case, the  
> natural way to account for this situation would be for the variable- 
> pattern query to fail, but the corresponding bnode-pattern query to  
> succeed.

I don't know that it is an *objection*. Merely a distinction.

> This would therefore require that the two kinds of pattern be given  
> fundamentally different semantics: intuitively, one might say that  
> a bnode means 'some thing exists', but a query variable means 'some  
> named thing exists', where the presence of the name is what makes  
> the answer binding possible; and a bnodeID does not count as a  
> 'name' in the required sense.

I'll note that restricting the domain of bindings (and thus of  
correct answers) is significant even if you don't return  
"existential" bindings. That is, in general, queries with only  
distinguished variables (or, as it's sometimes said, when variables  
are ranging over the "active domain) are generally easier to answer.  
Even in RDF, considering only simple entailment, this is true (as is  
shown by the work we need to do to control redundancy).

> (3) This idea is given terminological flesh by the distinction,  
> introduced in http://lists.w3.org/Archives/Public/public-rdf-dawg/ 
> 2006AprJun/0192
> between 'distinguished' and 'undistinguished' variables, and the  
> observation that SPARQL currently uses 'semi-distinguished'  
> variables (which can bind to both "names" and bnodeIDs). In http:// 
> lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0018
> this distinction is justified on the grounds that bnodes are not  
> "names", so that it would not be appropriate to supply a bnodeID as  
> an answer binding to a query variable; or at any rate, as is well- 
> known to the cognoscenti, this is simply not the way that things  
> are done:

Maybe *this* is the appropriate condescending comment?

> http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0031.

Do you *deny* that semi-distinguished variables are novel? Note that  
I've never argued against them. I think that BNodes in answers can be  
very useful for expressing coreference between answers. But they are  
novel.

> (4) None of the reasons cited in the previous paragraph seem to me  
> to be convincing objections to the obvious reply to the objection  
> in (2), which is that bnodeIDs *can* be given as bindings to query  
> variables;

Since I don't see that they were given as objections to all that, I  
fail to see why you might have *expected* them to be convincing.

In that debate, the main point was 1) explicating the behavior of  
Pellet and 2) pointing out that whether the variables are semi- 
distinguished or not is a function of the way that the semantic  
framework of sparql is instantiated. If there are no BNodes in the  
scoping set, perforce the query variables are distinguished. There  
*IS NO SPECIFIED SPARQL SEMANTIC FOR OWL DL*, and so Pellet  
implements the behavior that is most common in the DL community, and  
those that are best understood. BNodes are always nondistinguished in  
SPARQL.

(Racer and KAON2 support only distinguished variables. So if we are  
to describe the current STATE OF THE ART by looking at *existing  
implementation*, then we have to at least *support* the notion of  
distinguished variable. Having only distinguished variables makes a  
query *MUCH* easier to answer.)

> and since this is so, any engine which can determine the entailment  
> of a pattern with a bnode, could deliver such a binding as an  
> answer to the similar query pattern with a variable in place of the  
> bnode. That is, if ASK PATTERN[_:x] (with _:x scoped to the  
> pattern) is entailed, then SELECT ?x PATTERN[?x] should succeed,  
> possibly with ?x bound to some suitable bnodeID generated by the  
> querying engine.  Adopting this as a design principle for querying  
> languages in the RDF family would, I suggest, bring some overall  
> regularity and simplicity to the spectrum of SPARQL versions

At some other costs.

> (with different values for the entailment and scoping set  
> parameters.) The fact that bnodes exist in the language means that  
> purely existential answers can indeed be given by the same variable- 
> binding mechanism as is used to deliver name bindings. In other  
> words, there should be no difference between 'a thing exists such  
> that...' and 'a named thing exists such that...', since the "name"  
> can itself be a bnodeID, if necessary one generated for the sole  
> purpose of being a binding for the query variable.

No one denied that this is possible, and that having  
semidistinguished variables can be both interesting and useful.  
Indeed, as I said one of my posts you sneered at. But forcing people  
to always check for unnamed bindings makes query answering *much  
much* harder. It's not even known if it's decidable in full  
generality for OWL DL.

> Let me call this the 'existential name principle', as it amounts to  
> the claim that the RDF language family makes no distinction between  
> existence and named existence; since if something exists, the  
> presence of bnodes means that we can, in effect, validly generate a  
> name for it.

I was going to refute this, but I think I'll just laugh at it.

> In the rest of this message I will defend this idea against the  
> objections raised in the above messages, and try to distinguish  
> this issue from others with which it might be confused.
>
> (5) One objection to this idea, cited earlier, is that it embodies  
> a logical confusion: bnodes are existentially bound *variables*,  
> and are therefore not logical names. Now, while it is of course  
> true that bnodes and URIrefs/literals (which I will collectively  
> call 'names' here) have subtly different truth conditions, they are  
> not so different as this contrast suggests. First, calling one a  
> 'name' and the other a 'variable' is merely a matter of surface  
> terminology.

Simple entailment shows that this is false.

> Many logical syntaxes make a lexical distinction but this is not  
> necessary, and many others, such as ISO Common Logic, do not: a  
> 'variable' then is simply a name that is bound by a quantifier.

That's quite a difference!

> Second, more to the point, they have essentially the same  
> semantics: they both in effect assert that something exists, and  
> use a lexical token to refer to it.

Well, one is a singular term and the other isn't. Big difference to me.

> (It is worth noting that there are logical syntactic conventions  
> which do not use lexical tokens *of any kind* to show existence,  
> such as Peirce's existential graphs.) The only difference is that  
> the scope of a name token might extend beyond the local syntactic  
> context, allowing other assertions to be made about the thing  
> denoted by the name; whereas an existential 'variable' is a lexical  
> token only within the context defined by the scope of the  
> quantifier. In natural deduction logical systems, logical names and  
> existentials are almost interconvertible, using the rules EI and EG:
>
> PHI[name]
> ----------------EG
> (exists (x)PHI[x])
>
> (exists (x)PHI[x])
> --------------------EI
> PHI[name*]
> when name* does not occur free in the derivation, ie is a 'new' name.

And boy, does that make a HUGE DIFFERENCE.

Ok, I stop now.

Aside from this logical sophistry on Pat's part, the brute fact is  
that, even aside from the logical and computational differences,  
BNodes are pragmatically quite different. Try reusing one from an  
answer set in a fresh query. Try pasting one into the address bar of  
a web browser.

[snip a bizarre account of BNodes.]
>
> (6) Another objection seems to be that to allow variables to bind  
> to bnodes somehow introduces great additional complexity to the  
> query-answering process. It seems to me that this must be wrong,

First, the core point is that having any level of non-distinguishness  
most certainly adds to the complexity. Whether fully non- 
distinguished or semi-distinguished, it's just much harder than  
having only fully distinguished variables. For example, it's not  
known if undistinguished variables in the query, plus transitive  
roles in the query, is decidable for OWL DL. Many bright people for  
several years have been working on that. In the context of rules,  
this is very clear, SHOIQ + DL Safe rules (SWRL rules where the  
variables are restricted to distinguished variables) are decidable,  
whereas SWRL is not. Open defaults when combined with a DL may be  
undecidable, whereas open defaults with the variables treated as  
distinguished are (fairly robustly) decidable. I'll let you google  
for DL Safe rules, but here's the defaults link:
	http://citeseer.ist.psu.edu/baader95embedding.html

So, *never* allowing the variables to be purely distinguished makes  
queries harder to answer even when we know how to answer them. This  
is why racer only does distinguished variables. Some optimizations  
are very hard to work for nondistinguished variables. I can speak  
from personal (and Evren's) experience that having only distinguished  
variables makes things much easier. Heck they make it easier even in  
RDF as it avoids the need, on the one hand if you allow full simple  
entailment (i.e., infinite bnodes in the scoping set) having to  
minimize or leanify, or, on the other hand, having to do the extra  
work the semantic framework does to ensure restricting answers to  
told redundancy. If DISTINCT is interpreted to eliminate told  
redundancy (which I would think it should) then having BNodes in  
answers makes things harder.

> because one can define the variable-binding option in terms of the  
> other. Suppose that a query engine is capable of generating answers  
> to ASK queries containing bnodes, then it can deliver appropriate  
> answers to SELECT queries by the following pseudocode, where  
> (gensymBnode) generates a new bnodeID and P[a/b] means the pattern  
> P with a replaced by b:
>
> if (  ASK P[(gensymBnode)/?x] ) then return (bind ?x (gensymBnode))
>
> Of course, this needs to be written more carefully to account for  
> multiple variables, but you get the idea.

I get the idea that you don't know what you're talking about, or, at  
least, haven't thought it through. This would eliminate coreference  
*between answers* which is the coolest feature of semi-distinguished  
variables, but has a cost. Oh right, you've been confused on the  
scope of BNodes in answers before, as I recall.

And accounting for coreference *within* a answer across distinct  
variables is also non-trivial, and different from an ASK query. (For  
one, if you have two answers, one with BNode cross reference between  
distinct variables and one without, you have one more answer than for  
just the ASK query. When each answer has complexity NEXPTIME or much  
worse, well, practically speaking that can be a pain.)

> And as written this does require doing an ASK for each pattern of  
> variables in the query; but this is at worst a small multiple  
> linear extra cost, and need only be done in cases where a real  
> SELECT query on the variable has failed; and I am sure a more  
> efficient method can be defined in practice, since the derivation  
> of the inferences supporting the existential query will in any case  
> involve generating suitable 'names'.

The devil is in that "suitable", and you haven't thought it through.  
Again.

>  (I can kind of see that in the presence of elaborate equality  
> reasoning, problems might arise in determining whether or not  
> multiple bindings are really the same. But even here, the  
> possibility of providing a mere existential name - a bnode - does  
> seem to provide an opportunity for computational savings with a  
> possible loss of completeness but without sacrificing correctness.)

A lot of things "seem" a certain way to you, but alas are not the way  
they seem to you.

> (7) Finally, there seems to be a misapprehension that because any  
> graph entails infinitely many other bnode-rich graphs which have it  
> as an instance, that to allow bnodes as bindings will somehow open  
> an infinite floodgate of redundant answers.

If care is not taken, yes. But really the problem is determining what  
the *right* number of answers is, and that's nontrivial. It's *much*  
easier with only distinguished variables.

> And this would be true, if the spec were written in a very naive  
> way so that entailment alone were a sufficient criterion for  
> correctness of an answer binding. But this, of course, is why the  
> SPARQL specs are written in the way that they are, to restrict  
> answer bindings to a particular scoping set of possible bindings -  
> in the case of normative SPARQL, the URIs and Bnodes which actually  
> occur in the scoping graph.

And this is computationally simpler than requiring true semantic  
minimization, but it does mean that DISTINCT needs to be specified  
correctly. If DISTINCT really does mean DISTINCT, then it will be a  
much harder operation when dealing with BNodes.

If we don't allow that, then people have to work in user code to  
figure out which Bnoded answers are actually redundant...which is  
unfortunate in my opinion.

> The separation between the scoping graph and the scoping set was  
> written into the spec to allow for alternative strategies for  
> handling OWL. Bijan notes that the spec as written therefore allows  
> 'OWL-DL-SPARQL' - which, to emphasize, is not itself defined by the  
> spec - to restrict itself to bindiings to an arbitrarily small set  
> of legal answer bindings, and therefore to restrict itself so as to  
> disallow bindings to bnodes. This is technically correct, but  
> wholly against the design intention of those definitions.

Well, since Enrico was involved in the design of those definition,  
and I know that THIS WAS HIS INTENT, you're just dead wrong.

Again.

And really, you're also full of crap to be pulling the "design  
intention" move. How slimy.

> It was never the intent of that construction to restrict the set of  
> legal answer bindings:

Pif and fle.

Actually, it's a *nice feature* of the scoping set that it can allow  
us to define a variety of types of variables, including this new one  
and even allow us to define told redundancy.

Funny thing, Pat, is that I don't want to get rid of semi- 
distinguished variables...you (and eric) want to get rid of  
distinguished variables. But they are useful. Oh, I suppose they'll  
sneak back in with a hidous isBnode() filter hack, but please,  
goodness me, screw that.

> it was, rather, to allow the scoping set to contain *extra*  
> bnodeIDs, precisely in order to overcome the objection raised in  
> the second paragraph above, i.e. to allow the query engine to  
> invent new bnodes when they are required in order to provide an  
> answer binding to a query which has a purely existential answer,  
> for which no bnode which actually occurs in the scoping graph is an  
> appropriate answer binding.
>
> (8) Although the restricted interpretation (where a scoping set for  
> OWL-DL queries must contain no bnodes) does not violate any actual  
> technical requirement of the current SPARQL draft, I would argue  
> strongly against adopting it as a standard, to the point that I  
> would argue for pre-emptively rendering it illegal by suitable  
> wording to the current spec.

The university of manchester will formally object to this move on the  
following grounds:

	1) There is no need to constrain future groups
	2) Distinguished variables are useful (in particular, if you don't  
care about purely existential answers, you should be able to blow  
them off and reap the computational benefits)
	3) Distinguished variables are a standard notion
	4) Distinguished variables are widely implemented (Racer, Kaon2, and  
Pellet, which covers just about *all* the OWL DL query answering  
implementation; which, in point of fact, covers all the OWL query  
answering implementations, in any reasonable sense of the term);
	5) Distinguished variables are computationally and  
implementationally easier.

It's enormously amusing that you, Pat, who so championed the  
implementors point of view before, would now be so hostile to them  
now. Wait, I guess it's not so surprising after all.

> With the restricted interpretation, for example, we would have the  
> absurd situation that answer bindings could be derived from a graph  
> using RDF entailment which are not legal using OWL-DL entailment,  
> even when the graph itself is legal OWL-DL/RDF.

Entailment is not the only factor in defining a query language  
semantics, as we've seen. So this comparison is nonsensical unless  
you fully spell out the semantics. Once you fully spell out the  
semantics it's not absurd.

> (9) A final quick word on the claim of prior art implicit in the  
> wording of some of the earlier messages.

Oh, HERE comes the appropriate condescending ad hominem!

> The entire subject of RDF querying is too new

Hardly. But even if it were, the debate was about DL querying, which  
is not new, and about the distinction between distinguished vs.  
nondistinguished variables, which is also not new.

> for there to be a substantial body of accepted techniques, accepted  
> ideas, etc.., ; so any appeal, explicit or implicit, to the 'way  
> things are done', or to a body of expert knowledge not vouchsafed  
> to us all but weighty in its implications, should be treated as a  
> symptom of blowhardiness; or at the very least in order to be taken  
> seriously needs to be backed up by actual citations and/or arguments.

Pat, if you would like guidance to the literature from those who  
bother to read it and who are known for their contributions to that  
literature, it generally works better to ask them politely, perhaps  
even off list, instead of hurling petulant outbursts of insult.

I know it works better *on me*.

I don't know what surprises me more your hypocrisy, your weaseliness  
in not addressing me directly (though, my goodness, you link to posts  
written by me! maybe you *do* love BNodes! _:x whoWrotePost  
http://... instead of "Bijan"), your inability to follow a discussion  
(substituting "RDF querying" for DL querying and types of variables  
in the last paragraph), or your technical ignorance.

Maybe it's the combination of them all, and that you would chose to  
expose them in public.

Cheers,
Bijan "Not holding his breath for an apology" Parsia.
Received on Thursday, 3 August 2006 18:18:57 UTC