Re: bnodes as answer bindings from Bijan Parsia on 2006-08-07 (public-rdf-dawg@w3.org from July to September 2006)

From: Bijan Parsia <bparsia@cs.man.ac.uk>
Date: Mon, 7 Aug 2006 16:56:04 +0100
To: Pat Hayes <phayes@ihmc.us>
Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-Id: <61BB4B1E-EDD3-4C72-85F8-A69622F2C0AE@cs.man.ac.uk>
On Aug 7, 2006, at 4:58 AM, Pat Hayes wrote:

>> On Aug 4, 2006, at 11:11 PM, Enrico Franconi wrote:
>>
>>>
>>>> Can you give references for all this terminology that you cite?  
>>>> What exactly is the "active" domain? There is nothing in any  
>>>> semantic theory that I know of that distinguishes *things in the  
>>>> domain* on the basis of the kind of name that is used to refer  
>>>> to them with. The idea does not make sense, in any case: if  
>>>> bnodes were obliged to refer to a non-active domain while names  
>>>> refer to something else, then the troublesome redundancies would  
>>>> be eliminated.
>>>
>>> The first entry in <http://scholar.google.com/scholar?q=%22active 
>>> +domain%22+database> is a survey in DBs written 20 years ago.
>
> OK, thanks for that. I can't actually get the article on-line from  
> this, and the abstract does not use 'distinguished' or 'active'  
> anywhere. But I will continue to search.

I gave you a link that led to a use of it (in a Racer paper).

I can get to that paper when inside the university network. If you  
are doing it from home, often the library research portal will give
you a way to log in.

You could also join the ACM ($200/year for full digital library  
access...which, if you don't have it via the university, you are  
missing out; great stuff in there) or at least register for the free  
limited access bit. I'm not sure if the free registration will give  
it to you and it's hard for me to test now that I'm at work.

Here's the key paragraph:

"""When defining an instance of a semantic
schema, an active domain is associated with
each node of the schema. The active
domain of an atomic type holds all objects
of that type that are currently in the data-
base. This notion of active domain is
extended to type constructor nodes below.
erately narrow and differs from the usage
of that term in some models, including
SDM and TAXIS. The representation of
aggregations in those models is generally
based on attributes and is discussed in the
next section. It should also be noted that """

The key notion, I would say, is being "in the database", i.e., *used*  
in the database. Since no elements of the domain get into a database  
without a name, and the ABox portion of a DL kb is  generally  
considered to be analogous to the *data* of a database, you can see  
why the Racer folks use the term. As you can see from the cases of  
Terminological defaults, horn rules, or cyclic queries with  
transitive rules (heck, even not cyclic queries), focusing on the  
active domain can make a lot of difference.

In DL safe rules, the way they formalize it is to have a special  
predicate O which is true of all and only the named individuals in  
the KB. Then you can use appearance in the body of an O(X) where X is
a variable to get the distinguished/nondistinguished distinction.
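
A minimal sketch of that idea in Python, under toy assumptions: the KB
is a bare set of triples, BNodes are written "_:x", and the names KB,
NAMED and evaluate are made up for illustration (this is not KAON2 or
any real DL-safe-rules engine, and no reasoning is done). Adding the
atom O(?x) to a query body is what restricts ?x to named individuals:

```python
# Toy KB: asserted triples only; "_:f" stands for a BNode (an unnamed individual).
KB = {("John", "friend", "Mary"), ("John", "friend", "_:f")}
# Hypothetical extension of the special predicate O: the named individuals.
NAMED = {"John", "Mary"}

def is_var(term):
    return term.startswith("?")

def evaluate(body, binding=None):
    """Return every variable binding that satisfies all atoms in the body."""
    binding = binding or {}
    if not body:
        return [binding]
    atom, rest = body[0], body[1:]
    results = []
    if atom[0] == "O":  # the atom O(x): x must be a named individual
        term = atom[1]
        value = binding.get(term, term)
        if is_var(value):  # still unbound: enumerate the named individuals
            for name in sorted(NAMED):
                results += evaluate(rest, {**binding, term: name})
        elif value in NAMED:
            results += evaluate(rest, binding)
        return results
    for triple in sorted(KB):  # ordinary triple pattern: match asserted triples
        new = dict(binding)
        ok = True
        for pat, val in zip(atom, triple):
            if is_var(pat):
                if new.setdefault(pat, val) != val:
                    ok = False
                    break
            elif pat != val:
                ok = False
                break
        if ok:
            results += evaluate(rest, new)
    return results

# ?x guarded by O(?x) behaves as distinguished; without the guard it may bind "_:f".
with_O = evaluate([("John", "friend", "?x"), ("O", "?x")])
without_O = evaluate([("John", "friend", "?x")])
```

In this toy, with_O contains only the binding of ?x to Mary, while
without_O also contains the binding to the BNode.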

>>>> I have never previously heard of this terminology of  
>>>> "distinguished" vs. "nondistinguished". (You have everyone's  
>>>> permission at this point to roll your eyes in amusement at my  
>>>> profound ignorance, of course.) I would be interested to see  
>>>> where this terminology was first used, and what its history is.  
>>>> In a database context where there are no bnodes, the distinction  
>>>> would be vacuous.
>>>
>>> Ah. Second and third entries in <http://scholar.google.com/ 
>>> scholar?q=distinguished%20variables> are DB references from  
>>> almost 30 years ago.
>
> Thanks again. Similar access problems.

If you look at Ian and Sergio's paper in ISWC 2002, you'll see the  
terms used. Plus if you look, for example, on the KAON2 site, you'll  
see this terminology used. (This doesn't give you the history.)

>> And Pat's own acquaintance with some variants of the latter  
>> terminology:
>> 	http://daml.semanticweb.org/listarchive/joint-committee/1024.html
>> 	http://pride.daml.org/listarchive/joint-committee/1125.html
>>
>> and
>> 	http://daml.semanticweb.org/listarchive/joint-committee/1027.html
>
> Indeed, I had forgotten that we did use this terminology at one  
> point in DQL; but we used it with a completely different meaning,  
> which had nothing to do with what the variable is allowed to bind  
> to. (In retrospect, a better term for the DQL notion would have  
> been 'selected variable': it meant only a variable whose binding is  
> returned in an answer.)

This is one aspect of distinguished variables, which corresponds to  
their being in the head. That's what being in the head means. (Roughly.)

> I note that in the DATALOG literature the term is used with yet a  
> third meaning, viz. a variable which occurs in the head.

I believe I pointed this out to you. And I don't think it's exactly a  
third meaning. It coincides with both being in the head and being  
restricted to the active domain. It's just that they coincide in  
Datalog.

In SPARQL, not listing variables in the SELECT clause is a  
projection, which now is distinguishable from making a variable  
nondistinguished. That is, in Datalog, if you project away, it's the  
same as making the variable not appear in the head (since it no longer
appears in the answer set). But since *all* variables always are  
restricted to the active domain, it doesn't change anything.

> Not surprisingly, the phrase "distinguished variable" seems to be  
> used for a variety of cases in which someone wishes to distinguish  
> one variable from another.

Not at all. The distinguished/nondistinguished variable distinction  
is clearly a term of art in databases. When you extend it beyond the  
database context, you can choose to emphasize the "returns a binding"  
and come up with bindings for purely existential answers, or you can  
keep with what I think is part of the spirit of the distinction and  
also restrict them to the active domain. Richard Fikes was clearly  
borrowing from the DB literature, but his use of it has not been  
adopted anywhere that I see.

> This however does not make it a widely used technical term, only a  
> common English phrase.

You are joking of course. First off, this is a straw man. Even if it  
WERE not a *widely used* technical term I  never *said* it was. I  
said it was *standard* which it is both in the database and the  
description logic communities. Since we were discussing a description  
logic answering system, all I really need to do is refer to that.  
It's definitely standard there, and in that community, widely used.

Clearly Enrico and I both know the terms this way. Is it an accident?  
I promise that I didn't learn it from Enrico.

> Apparently, in fact, these various uses - and I am sure one could  
> easily find others - have very little, if anything, in common.

Uh...no. They quite clearly have history in common, and they have a  
substantive overlap in meaning.

>> I believe the most deeply nested quote is Richard Fikes, the next  
>> level Pat, and the final line Richard (in spite of the quote mark):
>>
>> """> >answer will include a binding for each distinguished variable.
>> I am
>>>  >referring to the variables in the query pattern that are not
>>>  >distinguished variables as "non-distinguished variables".
>>>
>>>  undistinguished variables?
>>
>>> From a quick check on the Web, I find them being called
>> "nondistinguished variables"."""

See, Richard was getting a standard term!

>> I don't expect Pat to have remembered this. It was, after all, 5  
>> years ago. It seems there is precedent for semi-distinguished  
>> variables in DQL.
>
> This terminology of 'semi-distinguished' is silly.

I  never claimed "semi-" or "quasi-distinguished" was a great name  
for them. But we need some name for them.

> These are simply *variables*, plain vanilla.

No. Variables include distinguished and nondistinguished, at least.  
Since we're defining several different behaviors, it helps not to  
appropriate the general term.

> A variable is a syntactic token whose role is to stand in for, or  
> be replaced by, or be bound to, a piece of syntax so that the  
> resulting expression is well-formed. This notion of variable is  
> used a wide variety of contexts and has been so used for at least  
> 50 years (lambda calculus, substitutional interpretation of  
> quantifiers, production rules):

Citations? Preferably going back 50 years?

That's me being cheeky, btw. In any case, let me try an analogy. Let  
us say you were making a sorted logic where you wanted to also have  
variables which were not sorted, i.e., were not restricted to a type.  
You might call the first kind of variable "sorted" or "typed"  
variables, and the second, "untyped" or "unsorted" variables, even  
though the latter corresponds to the very notion of variable. See, in  
a *context* where one is making certain distinctions, sometimes it  
helps to have a more specialized name for the general concept,  
especially when the generalized concept no longer *quite* covers  
everything.

Note that distinguished variables have two aspects, appearing in the  
head (i.e., appearing in the answer set) *and* ranging over names.  
These two features are related. So "semi" seems appropriate.

> there is nothing new or exotic about it.

In the context of answering queries, it is. The only antecedent that  
I've seen is in that email exchange.

Don't you think it's a bit odd for you to find RDF query (with  
BNodes) so radical and different that supposedly we can say little
authoritative about it, on the one hand, and on the other that one of  
the key aspects of that strangeness is supposed to be completely  
pedestrian? Semi-distinguished variables *bind* and *report* that  
binding (even if projected away) over arbitrary elements of the  
domain (or syntactic elements, if you will). Where are the algorithms  
for this? Complexity? Implementations?

Please cite me a paper for this, other than DQL, which deals with  
this. In fact, I don't think that DQL quite captures what we have in  
SPARQL because of the scope of the variables between answers.

I would review this exchange:
	http://www.daml.org/listarchive/joint-committee/1109.html

Particularly starting here:
	http://www.daml.org/listarchive/joint-committee/1113.html

I found your intuitions as to the answers to such queries as:

"""KB	John rdf:type _:r .
	_:r daml:onProperty friend .
	_:r daml:minCardinality 3 .

Query	John friend ?l .
	?l distinguished"""

vs.

"""KB	John rdf:type _:r .
	_:r daml:onProperty friend .
	_:r daml:minCardinality 3 .
	John friend _:f .

Query	John friend ?l .
	?l distinguished"""

To be *extremely* counterintuitive, given that the KBs are equivalent.
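
To make the worry concrete, here is a small Python sketch, assuming a
matcher that only looks at asserted triples and does no DAML reasoning.
It is not necessarily the behavior argued for in the linked thread, and
the names KB1, KB2 and match_object are invented for illustration. It
shows one way two equivalent KBs can yield different answers:

```python
# The two KBs from the example above, as sets of triples ("_:x" marks a BNode).
KB1 = {
    ("John", "rdf:type", "_:r"),
    ("_:r", "daml:onProperty", "friend"),
    ("_:r", "daml:minCardinality", "3"),
}
# KB2 only adds a triple that the cardinality restriction already implies.
KB2 = KB1 | {("John", "friend", "_:f")}

def match_object(kb, subject, predicate):
    """Purely syntactic matching of 'subject predicate ?l' against asserted triples."""
    return sorted(o for (s, p, o) in kb if s == subject and p == predicate)

answers_kb1 = match_object(KB1, "John", "friend")  # nothing asserted to match
answers_kb2 = match_object(KB2, "John", "friend")  # matches the BNode _:f
```

Here answers_kb1 is empty while answers_kb2 holds the single BNode
binding, even though nothing new was said about John's friends.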

In this message:
	http://www.daml.org/listarchive/joint-committee/1121.html

You basically say that you don't care whether equivalent KBs give the  
same answers. Perhaps you don't think that any longer, but I  
certainly think that for interoperability, we should strive to make  
the answers from different engines to be, at the very least,  
predictable. Frankly, I think that they should be "the same" insofar
as we can assume that.

BTW, in
	<http://www.daml.org/listarchive/joint-committee/1121.html>

you point out that you were acquainted with the concept (by Ian) of a  
distinguished variable:

"""Ian has argued strongly that it should not
include all 'answers' that can be logically inferred from the KB, but
only those which arise from a binding of a query variable to a term
in the KB Herbrand universe, in order to keep the inferential burden
on the server within DL-manageable bounds. I am happy with that; but
given the resulting incompleteness, it seems silly to object to a
proposal on the grounds that logically equivalent KBs may not always
deliver the same answers."""

As for incompleteness in general, there are *always* more expressive  
queries one could ask. I don't think there is a clear notion of  
"completeness" independent of the semantics of the query. So, I think  
that the complaint about incompleteness is, in general, a red  
herring. If you don't like the *expressiveness* of the query (which I  
think is a more helpful way to think about it), that is a different  
story.

> Both 'distinguished' and 'undistinguished' variables, in the sense  
> you are using these qualifiers, are variables which are restricted  
> in some way to bind only to a certain class of syntactic instances.

Er...sort of. I guess. I don't think that's the happiest way to think  
about them. But ok.

> But to call an variable without any such restrictions applied to it  
> 'semi' anything, and to claim it is something new, is daft.

It also has to appear in the head. I.e., to report back bindings.  
This is new, and there are no developed algorithms for it. And, as I  
pointed out to you, I believe in private email, there are subtleties
involved in the DL case, if you are going to allow the sort of  
coreference that you have in the RDF case, that are reasonably  
challenging.

Oh, is "daft" permitted in the set of appropriate snarky/insulting  
terms? I mean, if I say that *your* email was "daft", will I get in  
trouble?

> These are just plain *variables*, and this idea is about as old as  
> algebra.

I hope you see why your terminological rant is misguided. We are  
talking about variables in the specific context of query answering.  
We are talking about variables which are neither distinguished nor  
non-distinguished but share a property of each, that is, with  
distinguished of appearing in the head/answer set and with non- 
distinguished of ranging over arbitrary elements of the domain, that  
is, including ones not named in the KB. These are indeed new, and the  
techniques for dealing with them have not been developed.
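
The three kinds of variable can be lined up in a toy Python sketch. The
assumptions are loud ones: TRIPLES holds asserted data, ENTAILED_SOME
is a hypothetical stand-in for what a reasoner would entail (a subject
known to have some value for a property with no term for it in the KB),
and the semi-distinguished case only reports BNodes actually present in
the KB. It does not attempt the harder entailed case, for which no
technique is claimed here:

```python
# Asserted data: one named friend and one BNode friend for John.
TRIPLES = {
    ("John", "friend", "Mary"),
    ("John", "friend", "_:f"),
}
# Hypothetical stand-in for entailment: Ann has some friend, but no term names it.
ENTAILED_SOME = {("Ann", "friend")}

def is_bnode(term):
    return term.startswith("_:")

def distinguished(subject, prop):
    """In the head, and restricted to names (the active domain)."""
    return {o for (s, p, o) in TRIPLES
            if (s, p) == (subject, prop) and not is_bnode(o)}

def semi_distinguished(subject, prop):
    """In the head, but allowed to bind BNodes that appear in the KB as well."""
    return {o for (s, p, o) in TRIPLES if (s, p) == (subject, prop)}

def nondistinguished(subject, prop):
    """Not in the head: only asks whether some value exists, named or not."""
    return bool(semi_distinguished(subject, prop)) or (subject, prop) in ENTAILED_SOME
```

For John, the distinguished reading reports only Mary and the
semi-distinguished reading also reports _:f. For Ann, only the
nondistinguished reading succeeds, and it reports no binding at all.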

They are a bit less new in the SPARQL/RDF context (if we ignore  
minimization) because, essentially, we have chosen to treat the  
BNodes in the KB as names, thus as identifying part of the active  
domain. I tend to think this is an abuse of the nature of BNodes as  
existential variables, but 1) it has computational benefits, 2) most  
users do not think of BNodes as existential variables, 3) it comports  
with implementations, 4) we can enable the minimizing behavior with
"DISTINCT" which is reasonable. 1 with 4 allows us to give a kind of
SQLy reading, where the "practical language" is allowed to depart  
from the dictates of the relational algebra for computational reasons.

> BTW, in the DQL sense of 'distinguished', "semi-distinguished"  
> would be incoherent. (Do you return half the answer?)

See above for the origin. The "semi-" refers to the  
distinguishedness, not the answers.

To reiterate, I see all three as useful in both the RDF and the DL  
context. Distinguished variables are less useful in the RDF context  
because of the ease of thinking of the BNodes as names. Basically,  
you don't have a very interesting structure to the models induced by  
the BNodes, you pretty much just have to replicate the asserted  
relational structure. They are useful in the context of computing  
redundancies since, if you don't *care* about purely existential  
answers, and you want non-redundant results, it's useful to avoid the  
BNodes. But perhaps that is rare enough in the RDF case to make the  
use of a filter not very painful.

Is there a reason to continue this discussion on list? I mean, it's  
pretty much a debate about the history and use of terms.

Cheers,
Bijan.
Received on Monday, 7 August 2006 15:57:05 UTC