Re: Additional comments on the semantics of property paths from Jeen Broekstra on 2012-02-21 (public-rdf-dawg-comments@w3.org from February 2012)

From: Jeen Broekstra <jeen.broekstra@gmail.com>
Date: Tue, 21 Feb 2012 15:27:17 +1300
To: public-rdf-dawg-comments@w3.org
Message-ID: <4F430105.6080008@gmail.com>
My 2 cents:

I agree with the conclusion that property path semantics are complex and
hard to implement in a scalable fashion. Especially the obligation to
retrieve all possible paths is impractical.

Allow me to provide an anecdotal data point as input: in a previous
version of Sesame property-paths were implemented in accordance with the
current WD spec (that is, returning all possible paths, which leads to
duplicate results).

This, however, proved to be unworkably slow and complex in practice: a
simple query traversing the skos:broader hierarchy of the DBPedia
category taxonomy took over 15 minutes to complete.

After reimplementing the algorithm in a way that removes the need for
keeping track of all paths (thus eliminating duplicate results), that
same query completes in slightly over 150ms.

As a consequence of this, the current version of Sesame can handle
arbitrary-length property paths quite well (at least, better than the
previous version), and the current development branch passes all SPARQL
query conformance tests on property paths, except those tests where
duplicate results are expected.

See
http://rivuli-development.com/2011/11/implementing-sparql-arbitrary-length-property-paths-redux/
for a slightly more detailed description of the implementation issues we
faced and the chosen solution.

Aside from performance issues, I would argue that removing the need to
return duplicate results is also more likely to be a desirable outcome
from a user POV. I'm not convinced that the use case for counting paths
is really all that compelling. YMMV though.

Regards,

Jeen

On 21/02/12 02:51, W Martens wrote:
> Dear DAWG members,
> 
> we tried to post the following message on public-rdf-dawg last week, but it
> doesn't seem to appear. Perhaps it was not the correct mailing list for this
> kind of comment.
> 
> 
> We have some comments regarding the semantics of property paths, which
> give some extra information about the issues already raised by Marcelo
> Arenas, Sebastian Conca, and Jorge Perez in their post a few weeks ago
> (http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2012Jan/0009.html).
> 
> 
> We have noticed that the Working Group is already discussing the issue of
> property path semantics intensely and we would like to share our findings
> to aid in the discussion. We performed a mostly foundational study of the
> current semantics of property paths in a paper which has been accepted at
> PODS 2012. A preprint is available for download at
> http://www.theoinf.uni-bayreuth.de/download/pods12submission.pdf.
> 
> This work was conducted independently from the work by Arenas,
> Conca, and Perez and it is remarkable that, in many cases, we arrive to
> very similar conclusions. In what comes next, we will try not to repeat
> what has been said before, but focus on giving new information.
> 
> 
> 
> On the current semantics of Property Paths
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> The current semantics of property paths in SPARQL 1.1 relies on two
> choices which have a dramatic impact on the complexity of query
> evaluation.
> 
> In the paper, we refer to these as
> (A) the "simple walk requirement"
> (B) the necessity of "counting paths" for computing query answers
> 
> 
> To clarify (A) and (B):
> 
> By simple walk semantics, we refer to the definitions of ZeroOrMorePath
> and OneOrMorePath in Section 18.4 in the working draft (SPARQL Algebra).
> These definitions state that the answers to these operators are obtained
> by "repeated use of path such that any nodes in the graph are traversed
> once only". A path that traverses any node in the graph only once (and
> possibly returns to its first node) is a simple walk.
> 
> Condition (B) has already been mentioned in the post by Arenas, Conca,
> and Perez.
> 
> 
> 
> Implications of the current semantics on Property Path evaluation
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> Each of requirements (A) and (B) make property paths hard to evaluate.
> 
> More precisely, each of these requirements makes the evaluation problem
> of property paths NP-hard or #P-hard, already for very, very simple queries.
> Details are in the paper.
> 
> 
> Can this be solved?
> ~~~~~~~~~~~~~~~
> 
> 
> We discovered that there could be an alternative semantics for property
> paths that alleviates the before mentioned efficiency problems. In
> particular, consider the following conditions, which might be
> alternatives for (A) and (B):
> 
> (A') regular path semantics
> (B') only check the existence of a path
> 
> Option (A') would be the same than (A), but no longer require that any
> node in the graph is traversed once only.
> Option (B') is what [Arenas, Conca, and Perez] refer to as "existential
> semantics".
> 
> 
> Our findings regarding the trade-offs between (A) / (A') and (B) / (B')
> can be summarized as follows:
> 
> 
> Simple walk semantics vs Regular path semantics
> -----------------------------------------------
> If the default behavior of a SPARQL property path would have conditions
> (A') and (B'), then our paper gives an algorithm to evaluate SPARQL
> property paths in polynomial time combined complexity (Theorem 3.2). The
> algorithm in Theorem 3.2 can be easily adapted to compute all pairs (X,Y)
> that should be returned by a SELECT X,Y WHERE { X :p Y } query in
> polynomial time. Notice that this is a *double exponential* improvement
> wrt. current implementations.
> Furthermore, :p can be an arbitrary property path, including counters.
> This means that, for evaluating a property path of the form p[200,300],
> our algorithm only needs polynomial time in the number of bits needed to
> represent the number 300.
> 
> However, this improvement is not possible under requirement (A).
> Requirement (A) already makes it NP-complete to test whether a given pair
> (X,Y) should be in the answer of the above SELECT query or not.
> 
> Therefore, our feeling is that the choice between (A) or (A') is
> worthwhile reconsidering since (A') could dramatically improve
> performance.
> 
> 
> To Count or not?
> ----------------
>>From a user perspective, it makes sense to have the option to choose
> between (B) or (B').
> Sometimes, you like to count duplicates in an answer to a query and
> sometimes you don't.
> 
> On this topic, we largely agree with the work of [Arenas, Conca, and
> Perez]. They suggest to apply (B') as a default semantics to property
> paths, and to use a keyword in the case that the user desires to have
> duplicates in the answer.
> 
> 
> 
> 
> 
> 
> We would also like to send the message to reconsider the issue of
> property path semantics and we are very happy to observe that the
> working group is already discussing this matter. Also, we are open
> to continue the discussion with the group and with our colleagues
> Marcelo Arenas, Sebastian Conca, and Jorge Perez and, if
> necessary, cooperate with the group to make a proposal that can
> have an efficient evaluation method.
> 
> 
> Best regards,
> 
> Wim Martens
> Katja Losemann
> 
>
Received on Tuesday, 21 February 2012 02:27:52 UTC