- From: jorge perez <jorge.perez.rojas@gmail.com>
- Date: Fri, 29 Oct 2010 16:11:30 -0300
- To: public-rdf-dawg-comments@w3.org
Dear DAWG members: I have some observations regarding the definition and evaluation of Property Paths in the last version of the SPARQL 1.1 Query document. I haven't followed the discussions thus I apologize if these issues have been raised before. My main concern is efficiency of evaluation when counting paths in the evaluation of property paths. In the current document, the semantics of some path expressions is given by a translation to the SPARQL Algebra and then the evaluation of paths can lead to duplicate answers. For example, consider the graph :a :p :b :a :p :c :b :p :d :c :p :d and the path query :a :p/:p ?X This query is translated into :a :p ?V . ?V :p ?X which produces the set of answers { {?V --> b, ?X --> d}, {?V --> c, ?X --> d} }. Since only the variable ?X is in the original query, then variable ?V should be discarded, and then the final solution contains two copies of the mapping {?X --> d}. Thus, the evaluation is actually "counting" the number of paths from :a to :d that satisfy the path expression :p/:p. Similarly, the path operator "|" is translated into UNION, and UNION may generate duplicates. As I show below even for very simple cases, any implementation would have an exponential blow-up when counting paths. I don't know whether there is a use case that states the need of counting paths. If the number of paths is not important, then one can define a very simple semantics for Property Paths that can be efficiently evaluated. Moreover, when paths are not counted there is no need to detect cycles (which are separately handled for expressions of the form "path+" in the current document). Let me elaborate on these issues. Consider the following RDF graph with 2n triples G1 --- :a0 :p :a1 :a0 :r :a1 :a1 :p :a2 :a1 :r :a2 .... // repeat n times :a(n-1) :p :an :a(n-1) :r :an --- and the property path query :a0 (:p | :r)/(:p | :r)/...n times... /(:p | :r) ?X Notice that both the RDF graph and the query are of size O(n). When evaluating this query one obtains 2^n copies of the mapping {?X --> :an}. Thus only to write the answer, an implementation would need an exponential amount of space. The problem is even more evident when one uses a property path like :a0 (:p | :r){n} ?X This is a query of size O(log n) but the answer is still of exponential size! (2^n copies of {?X --> :an}). This two examples show that counting paths would dramatically affect the evaluation of patterns. There is another issue on counting paths and how arbitrary length paths are evaluated according to the specification. Consider the graph G2 ---- :a0 :p :a1 :a0 :r :a1 ---- When evaluating the query :a0 (:p | :r) :a1 one obtains two copies of the empty mapping {}. Given the structure of G2, one would expect that the evaluation of the following query over G2 produces the same set of answers :a0 (:p | :r)+ :a1 Nevertheless, if one looks at the definition of procedure ArbitraryLengthPath, it says that the result is { {} }, that is, a single copy of the empty mapping (the empty mapping with cardinality 1). In my opinion this is an error in the specification since both queries above should give the same result when they are evaluated over G2. If paths are counted, then they both should return two copies of the empty mapping. On the other hand, if paths are not counted then both queries should return a single copy of the empty mapping. I was also trying to compare the result of the following two queries over G2 :a0 (:p | :r) ?X and :a0 (:p | :r)+ ?X I know that the first one returns two copies of the mapping {?X --> :a1}, but I was not able to follow the APL algorithm for the second query since I am not sure what "Path(x, path, y)" means, and also because I am not sure whether "union" inside the "for" removes duplicates or not. How APL works in this case? The main question here is whether an expression of the form (:p | :r)+ counts paths or not. Consider again the graph G1 and these two queries Q1 --- :a0 (:p | :r){1,n} ?X --- and Q2 --- :a0 (:p | :r)+ ?X --- If the query Q2 counts paths, then this is an example of a query of *fixed size* that produces an exponential number of solutions. If this query does not count paths, then the specification is inconsistent since over G1 both queries should give the same result. A final consideration. When counting paths, the occurrence of cycles in the graph can lead to an infinite number of different paths. Thus one has to "avoid cycles" to count and, thus, count only what are called "simple paths". I am not sure whether the current way of evaluating arbitrary length paths in the specification is correctly avoiding cycles. Or it may be the case that I am not understanding well the notion of cycle that is used in the specification. For example, if one has a (triangle) graph like G3 --- :a :p :b :b :p :c :c :p :a --- and the query :a (:p/:p)+ ?X should the mapping {?X --> :b} be part of the evaluation? Notice that in order to obtain {?X --> :b} as a solution one should complete a cycle in G3. I think that the APL algorithm in the specification returns :b as a possible solution, thus, what is the exact meaning of avoiding cycles in this case? I think that the specification should give a precise definition of the meaning of avoiding cycles and not an algorithm that implements how to avoid cycles. It should also be noticed that the problem of avoiding cycles is very complex: testing whether there exists a "simple path" (a path without cycles) conforming to a given regular expression between two nodes is in general NP-complete (see http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.51.5627 Theorem 2.1). On the other hand, if paths are not counted then cycles are not a problem at all. Summary: -------- I have shown that counting paths can lead to several issues, the most important from my point of view, the issue of efficiency of evaluation. Thus, I would consider the possibility of not counting paths when evaluating property paths. In this case, the evaluation of a property-path pattern like :a path-exp ?X should return all the resources :b such that *there exists* a path conforming to path-exp from :a to :b, and every mapping in the evaluation should have cardinality 1. It is well known that evaluating regular expressions over graph databases in this way can be done efficiently (in linear time with respect to the size of the graph and the size of the expression). As a separate but very important issue, notice that the XPath language does not consider duplicate paths when evaluating expressions (XPath is evaluated in the "there exists" way that I mentioned before). Thus, counting paths in SPARQL would be somewhat in contradiction with previously proposed path languages considered by the W3C. Please let me know what do you think about not counting paths and the other issues that I have raised. If DAWG considers the possibility of having a semantics for property paths that does not consider duplicates, then I can work on a formalization for it that the group can use as input for the document. I hope my comments are helpful. Best Regards, - jorge
Received on Saturday, 30 October 2010 15:45:24 UTC