Comments about the semantics of property paths from jorge perez on 2010-10-29 (public-rdf-dawg-comments@w3.org from October 2010)

From: jorge perez <jorge.perez.rojas@gmail.com>
Date: Fri, 29 Oct 2010 15:06:55 -0300
To: public-rdf-dawg-comments@w3.org
Message-ID: <AANLkTinrCQywBzNkG2p=ACT4wFCeLC5=3JStBTwsKGGh@mail.gmail.com>
Dear DAWG members:

I have some observations regarding the definition and evaluation of
Property Paths in the last version of the SPARQL 1.1 Query document. I
haven't followed the discussions thus I apologize if these issues have
been raised before.

My main concern is efficiency of evaluation when counting paths in the
evaluation of property paths. In the current document, the semantics
of some path expressions is given by a translation to the SPARQL
Algebra and then the evaluation of paths can lead to duplicate
answers. For example, consider the graph

:a :p :b
:a :p :c
:b :p :d
:c :p :d

and the path query

:a :p/:p ?X

This query is translated into

:a :p ?V . ?V :p ?X

which produces the set of answers { {?V --> b, ?X --> d}, {?V --> c,
?X --> d} }. Since only the variable ?X is in the original query, then
variable ?V should be discarded, and then the final solution contains
two copies of the mapping {?X --> d}. Thus, the evaluation is actually
"counting" the number of paths from :a to :d that satisfy the path
expression :p/:p. Similarly, the path operator "|" is translated into
UNION, and UNION may generate duplicates.

As I show below even for very simple cases, any implementation would
have an exponential blow-up when counting paths.

I don't know whether there is a use case that states the need of
counting paths.
If the number of paths is not important, then one can define a very
simple semantics for Property Paths that can be efficiently evaluated.
Moreover, when paths are not counted there is no need to detect cycles
(which are separately handled for expressions of the form "path+" in
the current document).

Let me elaborate on these issues. Consider the following RDF graph
with 2n triples

G1
---
:a0  :p  :a1
:a0  :r  :a1

:a1  :p  :a2
:a1  :r  :a2

.... // repeat n times

:a(n-1)  :p  :an
:a(n-1)  :r  :an
---

and the property path query

:a0 (:p | :r)/(:p | :r)/...n times... /(:p | :r) ?X

Notice that both the RDF graph and the query are of size O(n). When
evaluating this query one obtains 2^n copies of the mapping {?X -->
:an}. Thus only to write the answer, an implementation would need an
exponential amount of space. The problem is even more evident when one
uses a property path like

:a0 (:p | :r){n} ?X

This is a query of size O(log n) but the answer is still of
exponential size! (2^n copies of {?X --> :an}). This two examples show
that counting paths would dramatically affect the evaluation of
patterns.

There is another issue on counting paths and how arbitrary length
paths are evaluated according to the specification. Consider the graph

G2
----
:a0 :p :a1
:a0 :r :a1
----

When evaluating the query

:a0 (:p | :r) :a1

one obtains two copies of the empty mapping {}. Given the structure of
G2, one would expect that the evaluation of the following query over
G2 produces the same set of answers

:a0 (:p | :r)+ :a1

Nevertheless, if one looks at the definition of procedure
ArbitraryLengthPath, it says that the result is { {} }, that is, a
single copy of the empty mapping (the empty mapping with cardinality
1). In my opinion this is an error in the specification since both
queries above should give the same result when they are evaluated over
G2. If paths are counted, then they both should return two copies of
the empty mapping. On the other hand, if paths are not counted then
both queries should return a single copy of the empty mapping.

I was also trying to compare the result of the following two queries over G2

:a0 (:p | :r) ?X

and

:a0 (:p | :r)+ ?X

I know that the first one returns two copies of the mapping {?X -->
:a1}, but I was not able to follow the APL algorithm for the second
query since I am not sure what "Path(x, path, y)" means, and also
because I am not sure whether "union" inside the "for" removes
duplicates or not. How APL works in this case?

The main question here is whether an expression of the form (:p | :r)+
counts paths or not. Consider again the graph G1 and these two queries

Q1
---
:a0 (:p | :r){1,n} ?X
---

and

Q2
---
:a0 (:p | :r)+ ?X
---

If the query Q2 counts paths, then this is an example of a query of
*fixed size* that produces an exponential number of solutions. If this
query does not count paths, then the specification is inconsistent
since over G1 both queries should give the same result.

A final consideration. When counting paths, the occurrence of cycles
in the graph can lead to an infinite number of different paths. Thus
one has to "avoid cycles" to count and, thus, count only what are
called "simple paths". I am not sure whether the current way of
evaluating arbitrary length paths in the specification is correctly
avoiding cycles. Or it may be the case that I am not understanding
well the notion of cycle that is used in the specification. For
example, if one has a (triangle) graph like

G3
---
:a :p :b
:b :p :c
:c :p :a
---

and the query

:a (:p/:p)+ ?X

should the mapping {?X --> :b} be part of the evaluation? Notice that
in order to obtain {?X --> :b} as a solution one should complete a
cycle in G3. I think that the APL algorithm in the specification
returns :b as a possible solution, thus, what is the exact meaning of
avoiding cycles in this case? I think that the specification should
give a precise definition of the meaning of avoiding cycles and not an
algorithm that implements how to avoid cycles.

It should also be noticed that the problem of avoiding cycles is very
complex: testing whether there exists a "simple path" (a path without
cycles) conforming to a given regular expression between two nodes is
in general NP-complete (see
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.51.5627
Theorem 2.1).
On the other hand, if paths are not counted then cycles are not a
problem at all.

Summary:
--------
I have shown that counting paths can lead to several issues, the most
important from my point of view, the issue of efficiency of
evaluation.

Thus, I would consider the possibility of not counting paths when
evaluating property paths. In this case, the evaluation of a
property-path pattern like

:a path-exp ?X

should return all the resources :b such that *there exists* a path
conforming to path-exp from :a to :b, and every mapping in the
evaluation should have cardinality 1. It is well known that evaluating
regular expressions over graph databases in this way can be done
efficiently (in linear time with respect to the size of the graph and
the size of the expression).

As a separate but very important issue, notice that the XPath language
does not consider duplicate paths when evaluating expressions (XPath
is evaluated in the "there exists" way that I mentioned before). Thus,
counting paths in SPARQL would be somewhat in contradiction with
previously proposed path languages considered by the W3C.

Please let me know what do you think about not counting paths and the
other issues that I have raised. If DAWG considers the possibility of
having a semantics for property paths that does not consider
duplicates, then I can work on a formalization for it that the group
can use as input for the document.

I hope my comments are helpful.

Best Regards,
- jorge
Received on Saturday, 30 October 2010 15:51:13 UTC