# Definitions for option 6 (part 2 of 2)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Fri, 30 Mar 2012 10:32:04 +0100
Message-ID: <4F757D94.8000203@epimorphics.com>
To: SPARQL Working Group <public-rdf-dawg@w3.org>
```== Option 6

The path operators are:

6.A:   /, |, ! as there are in 2LC.
6.B:   *, +, ? are non-counting

No DISTINCT, No {} forms: {n}, {n,m}, {n,}, {,m}

== Definitions

Of all those, the critical one is *. Given that, the machinery used to
define it and the existing transformation rules, the major item is the
definition of * (ZeroOrMorePath/distinct) here.  The description below
is not completely formal.

An arbitrary length path P = (X (path)* Y) is all solutions from
X to Y by repeated use of path.  It collects the set of nodes visited by
repeated applying path.

ZeroOrMorePath includes X.
OneOrMorePath starts after one application of path.

First, we define a function that returns the set of end points reached
by using a path expression N times.  Then we use that to define path* as
the bindings caused by any number of applications of the path.

== Auxiliary definition: PathPoints

The "path points" of a path expression P and start point X (an RDF term)
is the set of nodes reached by exactly N repeated applications of P.

These two definitions are equivalent:

Definition 1 (recursive):

PathPoints(X, P, 0) =
{ X }
PathPoints(X, P, N) =
{ y | N>0 & Exists z in PathPoints(X, P, N-1) such that z P y }

Definition 2 (fill in the dots):

PathPoints(X, P, 0) = { X }

PathPoints(X, P, N) =
SELECT DISTINCT ?V
WHERE { X P ?V1 .     # N patterns with path P in them.
?V1 P ?V2 .
....
?V(N-1) P ?V
}

They are equivalent because they define sets.  If you directly
implemented like this, definition 1 is likely faster as the size
increases.  It reduces intermediate results to distinct as it goes along
because it used the set answer to N-1.  Definition 2 leaves all that to
end unless the optimizer is smart enough.

== Example

:x :p :y .
:y :q :z .

path:  :p/:q

gives:
PathPoints(:x, :p/:q, 1) = { :z }

and not :y because :p/:q does have :y as an endpoint.

== Main Definitions

** Definition: ZeroOrMorePath

This defines ZeroOrMorePath to be the set of nodes reachable by any
number of applications of a path expression.

Let X be an RDF Node, ?V a variable and P a path expression.
Then

ZeroOrMorePath(X, P, ?V) =
{ ?V=y | y in PathPoints(X, P, N) for some N >= 0 }

Now the case of two variables for end points: consider all nodes in the
graph:

Let V1 and V2 be variables:

ZeroOrMorePath(?V1, P, ?V2) =
{ (?V1=z,?V2=y) | z in nodes(G) and y in ZeroOrMorePath(z, P, ?V2) }

and for a path from a variable to a fixed node:

ZeroOrMorePath(V, P, X)
= ZeroOrMorePath(X, reverse P, V)
= { ?V=z | z in nodes(G) and exists y in ZeroOrMorePath(z, P, y) }

and for a path between two fixed ends:

ZeroOrMorePath(X, P, Y)
= { {} } if ZeroOrMorePath(X, P, ?V) contains ?V=y
= { }    if ZeroOrMorePath(X, P, ?V) does not contains ?V=y

Notes: More identities:

ZeroOrMorePath(x, P, V) =
Projection of ZeroOrMorePath(V1, P, V2) of first of pair to a set
= { a | (a,b) in ZeroOrMorePath(V1, P, V2) }

ZeroOrMorePath(V, P, z) =
Projection of ZeroOrMorePath(V1, P, V2) of second of pair to a set
= { b | (a,b) in ZeroOrMorePath(V1, P, V2) }

** Definition: OneOrMorePath

OneOrMorePath(X, P, ?V) =
{ ?V=y | y in PathPoints(X, P, N) for some N >= 1 }

etc etc

== Non-normative text to help implementers

We can also provide a (now informative) function to calculate the
reachable points from x, some term.  This the result of PathPoints for
any N.  It's the same as the function ALP in 2LC with the last line
removed, and using the result set as the set of nodes visited.

# R is the overall result and the visited set

ALP_1(x:term, path, R:set of RDF terms) =
if ( x in R ) return
X = eval(x,path)
For n:term in X
ALP_1(n, path, R)
End
The result is in R.

For comparison, here is function ALP from 2LC:

ALP(x:term, path, R:multiset of RDF terms , V:set of RDF terms) =
if ( x in V ) return