Additional comments on the semantics of property paths from W Martens on 2012-02-20 (public-rdf-dawg-comments@w3.org from February 2012)

From: W Martens <martens.wim@gmail.com>
Date: Mon, 20 Feb 2012 14:51:55 +0100
To: public-rdf-dawg-comments@w3.org
Message-ID: <CAMBaYMZgTN1UgbSM-DgK+n_nMO6QYLWcKk7FzS-_fA0h0o436A@mail.gmail.com>
Dear DAWG members,

we tried to post the following message on public-rdf-dawg last week, but it
doesn't seem to appear. Perhaps it was not the correct mailing list for this
kind of comment.


We have some comments regarding the semantics of property paths, which
give some extra information about the issues already raised by Marcelo
Arenas, Sebastian Conca, and Jorge Perez in their post a few weeks ago
(http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2012Jan/0009.html).


We have noticed that the Working Group is already discussing the issue of
property path semantics intensely and we would like to share our findings
to aid in the discussion. We performed a mostly foundational study of the
current semantics of property paths in a paper which has been accepted at
PODS 2012. A preprint is available for download at
http://www.theoinf.uni-bayreuth.de/download/pods12submission.pdf.

This work was conducted independently from the work by Arenas,
Conca, and Perez and it is remarkable that, in many cases, we arrive to
very similar conclusions. In what comes next, we will try not to repeat
what has been said before, but focus on giving new information.



On the current semantics of Property Paths
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The current semantics of property paths in SPARQL 1.1 relies on two
choices which have a dramatic impact on the complexity of query
evaluation.

In the paper, we refer to these as
(A) the "simple walk requirement"
(B) the necessity of "counting paths" for computing query answers


To clarify (A) and (B):

By simple walk semantics, we refer to the definitions of ZeroOrMorePath
and OneOrMorePath in Section 18.4 in the working draft (SPARQL Algebra).
These definitions state that the answers to these operators are obtained
by "repeated use of path such that any nodes in the graph are traversed
once only". A path that traverses any node in the graph only once (and
possibly returns to its first node) is a simple walk.

Condition (B) has already been mentioned in the post by Arenas, Conca,
and Perez.



Implications of the current semantics on Property Path evaluation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each of requirements (A) and (B) make property paths hard to evaluate.

More precisely, each of these requirements makes the evaluation problem
of property paths NP-hard or #P-hard, already for very, very simple queries.
Details are in the paper.


Can this be solved?
~~~~~~~~~~~~~~~


We discovered that there could be an alternative semantics for property
paths that alleviates the before mentioned efficiency problems. In
particular, consider the following conditions, which might be
alternatives for (A) and (B):

(A') regular path semantics
(B') only check the existence of a path

Option (A') would be the same than (A), but no longer require that any
node in the graph is traversed once only.
Option (B') is what [Arenas, Conca, and Perez] refer to as "existential
semantics".


Our findings regarding the trade-offs between (A) / (A') and (B) / (B')
can be summarized as follows:


Simple walk semantics vs Regular path semantics
-----------------------------------------------
If the default behavior of a SPARQL property path would have conditions
(A') and (B'), then our paper gives an algorithm to evaluate SPARQL
property paths in polynomial time combined complexity (Theorem 3.2). The
algorithm in Theorem 3.2 can be easily adapted to compute all pairs (X,Y)
that should be returned by a SELECT X,Y WHERE { X :p Y } query in
polynomial time. Notice that this is a *double exponential* improvement
wrt. current implementations.
Furthermore, :p can be an arbitrary property path, including counters.
This means that, for evaluating a property path of the form p[200,300],
our algorithm only needs polynomial time in the number of bits needed to
represent the number 300.

However, this improvement is not possible under requirement (A).
Requirement (A) already makes it NP-complete to test whether a given pair
(X,Y) should be in the answer of the above SELECT query or not.

Therefore, our feeling is that the choice between (A) or (A') is
worthwhile reconsidering since (A') could dramatically improve
performance.


To Count or not?
----------------
>From a user perspective, it makes sense to have the option to choose
between (B) or (B').
Sometimes, you like to count duplicates in an answer to a query and
sometimes you don't.

On this topic, we largely agree with the work of [Arenas, Conca, and
Perez]. They suggest to apply (B') as a default semantics to property
paths, and to use a keyword in the case that the user desires to have
duplicates in the answer.






We would also like to send the message to reconsider the issue of
property path semantics and we are very happy to observe that the
working group is already discussing this matter. Also, we are open
to continue the discussion with the group and with our colleagues
Marcelo Arenas, Sebastian Conca, and Jorge Perez and, if
necessary, cooperate with the group to make a proposal that can
have an efficient evaluation method.


Best regards,

Wim Martens
Katja Losemann
Received on Monday, 20 February 2012 13:55:35 UTC