- From: W Martens <martens.wim@gmail.com>
- Date: Mon, 20 Feb 2012 14:51:55 +0100
- To: public-rdf-dawg-comments@w3.org
Dear DAWG members, we tried to post the following message on public-rdf-dawg last week, but it doesn't seem to appear. Perhaps it was not the correct mailing list for this kind of comment. We have some comments regarding the semantics of property paths, which give some extra information about the issues already raised by Marcelo Arenas, Sebastian Conca, and Jorge Perez in their post a few weeks ago (http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2012Jan/0009.html). We have noticed that the Working Group is already discussing the issue of property path semantics intensely and we would like to share our findings to aid in the discussion. We performed a mostly foundational study of the current semantics of property paths in a paper which has been accepted at PODS 2012. A preprint is available for download at http://www.theoinf.uni-bayreuth.de/download/pods12submission.pdf. This work was conducted independently from the work by Arenas, Conca, and Perez and it is remarkable that, in many cases, we arrive to very similar conclusions. In what comes next, we will try not to repeat what has been said before, but focus on giving new information. On the current semantics of Property Paths ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The current semantics of property paths in SPARQL 1.1 relies on two choices which have a dramatic impact on the complexity of query evaluation. In the paper, we refer to these as (A) the "simple walk requirement" (B) the necessity of "counting paths" for computing query answers To clarify (A) and (B): By simple walk semantics, we refer to the definitions of ZeroOrMorePath and OneOrMorePath in Section 18.4 in the working draft (SPARQL Algebra). These definitions state that the answers to these operators are obtained by "repeated use of path such that any nodes in the graph are traversed once only". A path that traverses any node in the graph only once (and possibly returns to its first node) is a simple walk. Condition (B) has already been mentioned in the post by Arenas, Conca, and Perez. Implications of the current semantics on Property Path evaluation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Each of requirements (A) and (B) make property paths hard to evaluate. More precisely, each of these requirements makes the evaluation problem of property paths NP-hard or #P-hard, already for very, very simple queries. Details are in the paper. Can this be solved? ~~~~~~~~~~~~~~~ We discovered that there could be an alternative semantics for property paths that alleviates the before mentioned efficiency problems. In particular, consider the following conditions, which might be alternatives for (A) and (B): (A') regular path semantics (B') only check the existence of a path Option (A') would be the same than (A), but no longer require that any node in the graph is traversed once only. Option (B') is what [Arenas, Conca, and Perez] refer to as "existential semantics". Our findings regarding the trade-offs between (A) / (A') and (B) / (B') can be summarized as follows: Simple walk semantics vs Regular path semantics ----------------------------------------------- If the default behavior of a SPARQL property path would have conditions (A') and (B'), then our paper gives an algorithm to evaluate SPARQL property paths in polynomial time combined complexity (Theorem 3.2). The algorithm in Theorem 3.2 can be easily adapted to compute all pairs (X,Y) that should be returned by a SELECT X,Y WHERE { X :p Y } query in polynomial time. Notice that this is a *double exponential* improvement wrt. current implementations. Furthermore, :p can be an arbitrary property path, including counters. This means that, for evaluating a property path of the form p[200,300], our algorithm only needs polynomial time in the number of bits needed to represent the number 300. However, this improvement is not possible under requirement (A). Requirement (A) already makes it NP-complete to test whether a given pair (X,Y) should be in the answer of the above SELECT query or not. Therefore, our feeling is that the choice between (A) or (A') is worthwhile reconsidering since (A') could dramatically improve performance. To Count or not? ---------------- >From a user perspective, it makes sense to have the option to choose between (B) or (B'). Sometimes, you like to count duplicates in an answer to a query and sometimes you don't. On this topic, we largely agree with the work of [Arenas, Conca, and Perez]. They suggest to apply (B') as a default semantics to property paths, and to use a keyword in the case that the user desires to have duplicates in the answer. We would also like to send the message to reconsider the issue of property path semantics and we are very happy to observe that the working group is already discussing this matter. Also, we are open to continue the discussion with the group and with our colleagues Marcelo Arenas, Sebastian Conca, and Jorge Perez and, if necessary, cooperate with the group to make a proposal that can have an efficient evaluation method. Best regards, Wim Martens Katja Losemann
Received on Monday, 20 February 2012 13:55:35 UTC