Fixing BGP matching from Seaborne, Andy on 2006-11-13 (public-rdf-dawg@w3.org from October to December 2006)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Mon, 13 Nov 2006 18:25:48 +0000
To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-ID: <4558B8AC.7000008@hp.com>
We need to fix the definition of BGP matching because the current one has a 
bug where the graph is redundant.

http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0085.html

This is a summary of approaches, gathering together material from all the 
discussions I know about into (roughly) similar terminology.


== Structure

Overall, treat BGP matching as a parameter of SPARQL.  rq24 need only define 
it for simple entailment.  Any other entailment regime can provide it's own 
form of BGP matching appropriate to it's entailment characteristics.  The only 
requirement is that the solutions (mappings) generated bindings for every 
named variable in the BGP.

We could impose on any entailment regime that G |= u(P) as a necessary but not 
sufficient condition (simple entailment).


== Simple Entailment

For simple entailment, the options are:

1/ Define:

   w(P) is a subset of G

for a mapping w of pattern P on graph G to terms of G.  P does not share any 
blank nodes with G.  The mapping w applies to name variables and to blank 
nodes in the BGP; blank nodes behave very much as named variables except there 
is no binding for them; alternatively, they are projected away.  Cardinality 
is one only if the blank nodes are projected and also a distinct applied to 
the pattern matching.  Fortunately, if the whole query is DISTINCT, only the 
query DISTINCT is needed.  Otherwise BGP matching has duplicates.


2/ Similar to 1 except have two mappings, one for blank nodes b(), one for 
named variables u() (see Jorge's "Semantics of SPARQL" sec 5.1).  Now the 
cardinality of the mapping is one always.

   u(b(P)) is a subset of G


3/ Retain the use of an entailment relationship and skolemize the blank nodes 
in the graph.

P does not share any blank nodes with G.  Write f as a skolemization over the 
bnodes in term(G), mapping any occurrences of a bnode in G to a unique URIs. 
Define matching by

   f(G) |= f(u(P))  where |=  is simple entailment.

f is the same on both sides.  u does not map bnodes in P.  f is fixed for the 
BGP match (fixing to be the same for all the query for the same graph gives 
the same results because it's range does not leak out of the BGP match).

A blank node in P behaves as an existential variable.  This is like the 
current definition except that issues to do with bnodes in the scoping set are 
removed from the application of |= by the skolemization.  In other words, 
blank nodes in the graph are "turned off" for the purposes of simple 
entailment in BGP matching.


== Notes

A/ 2 and 3 give the same results.  b() maps blank nodes to terms of G which 
are then matched by identity.  f() produces URIs which are matched by identity 
and neither side of |= involves bnodes.

B/ The difference between 1 and 2/3 is in whether blank nodes cause duplicates 
in BGP matching.  Having distinct(var(w(P))) makes 1=2=3 where var makes it 
clear that w(P) is reduced to the named variables (i.e. set of u(P))

C/ It's not clear to me at the moment that (3) provides much of an upwards 
compatibility path to other entailment regimes unless the skolemization 
process is acceptable to the other entailment regime, and I don't know if it 
is acceptable, universally or just widely.  If it's not a path of any 
compatibility, then it's just a way of writing 2 using the word "entailment" - 
b() captures the simple entailment nature in 2.

D/ It's not clear to me what the LC1 position is on blank nodes in BGPs.  It 
does not consider cardinality so it was a non-issue.


== Background

http://planetmath.org/encyclopedia/Skolemization

http://www.w3.org/TR/rdf-mt/#prf
[[
Skolemization is a syntactic transformation routinely used in automatic 
inference systems in which existential variables are replaced by 'new' 
functions - function names not used elsewhere - applied to any enclosing 
universal variables. In RDF, Skolemization amounts to replacing every blank 
node in a graph by a 'new' name, i.e. a URI reference which is guaranteed to 
not occur anywhere else. In effect, it gives 'arbitrary' names to the 
anonymous entities whose existence was asserted by the use of blank nodes: the 
arbitrariness of the names ensures that nothing can be inferred that would not 
follow from the bare assertion of existence represented by the blank node. 
(Using a literal would not do. Literals are never 'new' in the required sense.)

To be precise, a Skolemization of E (with respect to V) is a ground instance 
of E with respect to V with a 1:1 instance mapping that maps each blank node 
in G to a URI reference that does not appear in G (so the Skolem vocabulary V 
must be disjoint from the vocabulary of E)
]]

 Andy
Received on Monday, 13 November 2006 18:26:14 UTC