RDF semantics redux from Pat Hayes on 2005-11-08 (public-rdf-dawg@w3.org from October to December 2005)

From: Pat Hayes <phayes@ihmc.us>
Date: Tue, 8 Nov 2005 01:17:41 -0600
To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-Id: <p06230901bf95d666a452@[192.168.2.2]>
This is an attempt to summarize the issues/alternatives currently in 
play regarding what is called the 'RDF semantics' issue. In fact, 
this is turning out to not really be about semantics at all, but 
about the scope of bnodes.

We could decide that bnodes are all locally scoped to graphs and to 
queries, that a bnode in a query has the exact same semantics as a 
pattern variable but does not produce an answer binding (or 
alternatively, that bnodes are simply not used in patterns, only 
variables), and that any bnodes bound to variables in an answer are 
understood to be existentials scoped to that answer only, and have no 
relationship to any bnodes in other queries, in the graph, or in 
other answers. Call this the local option. This is what the spec says 
now. Some comments have regretted this restriction: 
http://lists.w3.org/Archives/Public/public-swbp-wg/2005Nov/0036

With the local option, we can define an answer as a substitution S such that

G entails Q[S]

where 'entails' is a parameter. Putting [entails] = [simply entails] 
gives us the current design, but we can replace this with any other 
entailment and the basic definitions should still work; and this is 
so simple that we can make all the obvious points about RDFS 
entailment being simple entailment form RDFS closures, etc..  And 
just phrasing the definition in this way provides a smooth segway to 
future extensions.

All the interest, however, lies in finding a way to allow bnodes to 
be re-used between queries while keeping the intuitive clarity of 
this 'semantic' definition. Some users seem to want another option, 
call it the ID option. This is where a bnode is supplied as an answer 
binding, and a subsequent query is allowed to ask for more 
information about that thing, using the same identifier. Example:

Graph:
{:Arthur a Person
:Arthur :siblings _:l34
_:l34 type collection
_:l34 first _:P55
_:l34 rest _:l35
_:l35 first :Susan
_:l35 rest _:l36
_:l36 first :Bill
_:l36 rest nil
_:P55 a Person
_:P55 :gender male}

Query 1, to discover whether Arthur has any siblings:
:Arthur :siblings ?L
?L first ?V

In the local option should give, say,

L/_:a, V/_:b

where the nodeIDs are local to the answer. We can take this no 
further: it amounts to the answer "yes", with no further details. To 
get more, we have to compose a more elaborate new query.

In the ID option, the server can choose to provide meaningful 
bnodeIDs as answer bindings:

L/_:l34, V/_:P55

allowing a subsequent query to ask for more information about these things, eg

Query 2:
_:P55 :gender ?S
_:l34 rest _:x
_:x first ?W

giving S/male, W/:Susan.

In order to do this, it seems we need to provide a way for the query 
to distinguish bnodes which are intended to co-refer with bnodes in 
earlier query bindings (and earlier queries) and those, like _:x in 
query 2, which are not; for if the same understanding were applied to 
_:x as to _:P55, then there would be no answer bindings for query 2. 
Enrico's term is that _:l34 and _:P55 here are told-bnodes, while _:x 
is just a bnode.

If we want to support some such mechanism, we need a way to 
distinguish told-bnodes from mere bnodes. Several options have been 
discussed.

(A) some universally agreed format is decided for bnodeIDs in answer 
bindings which are intended to be useable as told-bnodes, and queries 
distinguish told-bnodes by re-using the supplied bnodeID; any other 
bnodes are simple bnodes. So for example, if the convention for a 
told-bnode ID is that is has the prefix _:**, then the above example 
would look like:

Query 1
:Arthur :siblings ?L
?L first ?V
answer
L/_:**l34, V/_:**P55

Query 2
_:**P55 :gender ?S
_:**l34 rest _:x
_:x first ?W
answer
S/male, W/:Susan

but for example

Query 3
_:**l34 rest _:**x
_:**x first ?V
answer
<none>
because there is no bnodeID '_:**x' in G

(B) a told-bnode is indicated by a query pattern variable with a 
bnodeID 'attached' to it, indicating that the variable must be 
replaced by that particular bnodeID. Other bnodes are simply bnodes. 
Answer bindings are not supplied for these variables, as they would 
be redundant. Then the example might be

Query 1
:Arthur :siblings ?L
?L first ?V
answer
L/_:l34, V/_:P55

Query 2
?X[_:P55] :gender ?S
?Z[_:l34] rest _:x
_:x first ?W
answer
S/male, W/:Susan

Notational variations are possible, eg a query variable whose name is 
a bnodeID, so ?_:P55 rather than ?X[_:P55] to indicate a told-bnode.

(C) told-bnodes are indicated by a special kind of answer binding: 
any told-bnode supplied by such a binding is a told-bnode in a 
subsequent query, other bnodes aren't. So the example looks like 
this, where [* *] is how the special answer bindings are indicated 
(not a serious notational suggestion):

Query 1
:Arthur :siblings ?L
?L first ?V
answer
L/[*_:l34*], V/[*_:P55*]

Query 2
_:P55 :gender ?S
_:l34 rest _:x
_:x first ?W
answer
S/male, W/:Susan

The trouble with all of these is that they require the simple 
semantic condition to be re-stated in a more complicated way. We have 
to define (or tweak definitions so that things work out this way) a 
new operation on graphs, something half-way between a merge and a 
union. Merging graphs keeps all the bnodes separate: unioning them 
does no separation at all. What we want is to union on the 
told-bnodes but merge on the other bnodes. Call that semi-merging, or 
smerging. Formally, G smerge H is the graph (G union H[S]), where S 
is a 1:1 bnode-to-bnode mapping whose domain is the non-told-bnodes 
in H and whose range is disjoint from the set of bnodes in G. Then 
the semantic condition is

G entails (G smerge Q[S])

Of course this only makes sense when we have a clear definition of 
what a told bnode is. If Q has no told-bnodes in it, then this is the 
same as

G entails (G merge Q[S])

which is equivalent to the simple condition for the local option.

You can tweak things so as to make things work out without making 
this new definition.
Enrico has done it (several times :-), but it seems  complicated, 
however you phrase it. One way is to describe it in terms of first 
replacing the bnodeIDs with URIrefs, ie skolemizing the told-bnodes 
in the graph and query, then applying entailment, then de-skolemizing 
the answer bindings by replacing the skolemized URIrefs by bnodes 
again. Blech. Another way is to re-state the basic condition as in 
Enrico's document, as the condition

G entails (G merge Q)[S]

which means, when you unpack it, that if you take the bnodes in the 
query, standardize them apart from those in the graph, and THEN apply 
the answer binding (which leaves any bnodes that get introduced at 
that stage alone, so they don't get standardized apart from the 
bnodes in the graph) then what you get is entailed by the graph you 
started with. This only works for the query style (B) above: you have 
to tweak the definition differently for any other way of indicating 
told bnodes.

But, I have to say, all this seems like overkill. The same effect can 
be got by sticking to the local option, and requiring the server to 
supplying a special URIref as an answer binding to indicate an 
'anonymous' entity, along the lines suggested by Andy in

http://lists.w3.org/Archives/Public/public-rdf-dawg/2005OctDec/0159.html.

This corresponds directly to Enrico's original description of this as 
a form of Skolemization, but doesnt require the ugly final step of 
de-skolemizing to get back to a bnode. Then we don't need to invent a 
new graph operation, we don't need any other new bnode conventions, 
and we can stick with the simple, direct (and kind of obvious) 
semantic definition of answer.

Since this is all about allowing names to be used with a wider scope 
than a graph, and since the RDF model has until now consistently kept 
bnode IDs as graph-local identifiers, the only sensible option seems 
to be to provide a new kind of name. Logically there is very little 
to choose between a name and an existential variable, i.e. a bnode, 
in any case. The only difference is the scope. So inventing a URI 
namespace to be the names of 'nameless things' seems like quite a 
clever way around the difficulty, especially if some 
easy-to-implement convention can be provided to map these IDs to and 
from bnodeIDs.

This option has the great merit of semantic simplicity and not 
requiring any special re-interpretation of bnodes or making any 
special effort to re-define the notion of an answer binding. Also it 
puts the control on the server side, which is where it belongs IMO, 
since it doesn't make sense for a query to be able to use a bnodeID 
as a told bnode in an initial query. On balance, I think this is the 
best way to proceed at present. It means that semantically we can 
stick to the local option, which keeps the description simple.

So, my vote is to reply to points like those expressed in

http://lists.w3.org/Archives/Public/public-swbp-wg/2005Nov/0036

by suggesting that 'fake URIs' be used to communicate between server 
and client in cases like this, rather than 'fake variables' in (B) 
above. Real live systems can adopt their own conventions for faking 
the URIs, and still be considered conformant. They can even use real 
bnodeIDs, I would suggest, but if they do then it is up to them to 
keep track of the 'toldness' of these IDs somehow: they can't be 
treated as normal bnodes, but must act 'locally' in such a query 
regime as though they were URIrefs. They have to be 'frozen' into a 
temporary URIref status as far as determining appropriate answers is 
concerned. So as far as the semantics is concerned, we still only 
have two classes of name: those with scope in a graph, and those with 
a wide scope. Any scheme which requires distinguishing two 'kinds' of 
bnode in a query amounts to having three categories of identifier in 
an RDF graph rather than two, all with different scoping rules. This 
muddles the semantics and will likely have knock-on effects 
throughout the algebra. On mature reflection, I think that we do not 
want to go there.

If we agree to this, then Enrico's ingenious work will have been 
wasted, but those are the WG breaks :-). On the plus side, the RDF 
semantics issue will then be trivially solvable since semantically 
the entire transaction will fit within the local option.

Sorry this has taken so long. It needs a WG decision, however, before 
we can move forwards. The semantics issue is hostage to the 
told-bnode decision.

Pat






-- 
---------------------------------------------------------------------
IHMC		(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32502			(850)291 0667    cell
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Tuesday, 8 November 2005 07:18:21 UTC