Re: Additional comments on the semantics of property paths from Lee Feigenbaum on 2012-04-12 (public-rdf-dawg-comments@w3.org from April 2012)

From: Lee Feigenbaum <lee@thefigtrees.net>
Date: Thu, 12 Apr 2012 07:58:31 -0400
To: Jeen Broekstra <jeen.broekstra@gmail.com>
CC: public-rdf-dawg-comments@w3.org
Message-ID: <4F86C367.2080606@thefigtrees.net>
Hello Jeen,

Thanks very much for your comment on the challenges of efficiently 
evaluating property paths given the current semantics. We have received 
several comments along these lines.

The Working Group agreed to adopt an alternative design for property 
paths that we believe balances several of our design goals:

     Use case expressivity
     Users' learning curve / intuition
     Ease of implementation
     Potential for efficient evaluation
     Schedule

The changes from the current Last Call working draft are as follows:

     The semantics of *, +, and ? are changed to be non-counting (they 
no longer preserve duplicates)
     The /, |, and ! remain unchanged as in the current draft (they 
preserve duplicates)
     The curly brace forms -- {n}, {n,m}, {n,}, {,m} -- have all been 
removed

There are several motivations for these design changes:

     Changing the semantics of * and + to non-counting is expected to 
address the evaluation performance concerns raised by your comment.
     The /, |, and ! operators are often used as shortcuts for writing 
out equivalent graph patterns longhand. By leaving these operators with 
counting semantics (just as the equivalent graph pattern expansions 
have), SPARQL 1.1 property paths will continue to yield intuitive 
results. Two examples that the Working Group has considered in coming to 
this conclusion are:

    SELECT (sum(?cost) AS ?total)
    {
      :order :hasItem/:price ?cost
    }

and

    SELECT ?member { ?list rdf:rest*/rdf:first ?member }

...when applied to a list with duplicate items, such as (1 2 1). (This 
type of query was one motivation for the property paths feature that the 
WG documented in the SPARQL 1.1 New Features & Rationale document: 
http://www.w3.org/TR/sparql-features/#Property_paths)

In both cases, users' intuitive expectations demands that '/' preserve 
duplicates.

     The group decided to remove the curly brace forms due to a lack of 
experience with appropriate counting/non-counting semantics for these 
forms. We believe that implementations will likely extend SPARQL 1.1 
Property Paths with support for these forms, and we will gather 
implementation experience to use for future standardization work. We've 
included this item on a list of work items for a future working group - 
http://www.w3.org/2009/sparql/wiki/Future_Work_Items

We would be grateful if you would acknowledge that your comment has been 
answered by sending a reply to this mailing list.

Lee On behalf of the SPARQL Working Group

On 2/20/2012 9:27 PM, Jeen Broekstra wrote:
>
> My 2 cents:
>
> I agree with the conclusion that property path semantics are complex and
> hard to implement in a scalable fashion. Especially the obligation to
> retrieve all possible paths is impractical.
>
> Allow me to provide an anecdotal data point as input: in a previous
> version of Sesame property-paths were implemented in accordance with the
> current WD spec (that is, returning all possible paths, which leads to
> duplicate results).
>
> This, however, proved to be unworkably slow and complex in practice: a
> simple query traversing the skos:broader hierarchy of the DBPedia
> category taxonomy took over 15 minutes to complete.
>
> After reimplementing the algorithm in a way that removes the need for
> keeping track of all paths (thus eliminating duplicate results), that
> same query completes in slightly over 150ms.
>
> As a consequence of this, the current version of Sesame can handle
> arbitrary-length property paths quite well (at least, better than the
> previous version), and the current development branch passes all SPARQL
> query conformance tests on property paths, except those tests where
> duplicate results are expected.
>
> See
> http://rivuli-development.com/2011/11/implementing-sparql-arbitrary-length-property-paths-redux/
> for a slightly more detailed description of the implementation issues we
> faced and the chosen solution.
>
> Aside from performance issues, I would argue that removing the need to
> return duplicate results is also more likely to be a desirable outcome
> from a user POV. I'm not convinced that the use case for counting paths
> is really all that compelling. YMMV though.
>
> Regards,
>
> Jeen
>
> On 21/02/12 02:51, W Martens wrote:
>> Dear DAWG members,
>>
>> we tried to post the following message on public-rdf-dawg last week, but it
>> doesn't seem to appear. Perhaps it was not the correct mailing list for this
>> kind of comment.
>>
>>
>> We have some comments regarding the semantics of property paths, which
>> give some extra information about the issues already raised by Marcelo
>> Arenas, Sebastian Conca, and Jorge Perez in their post a few weeks ago
>> (http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2012Jan/0009.html).
>>
>>
>> We have noticed that the Working Group is already discussing the issue of
>> property path semantics intensely and we would like to share our findings
>> to aid in the discussion. We performed a mostly foundational study of the
>> current semantics of property paths in a paper which has been accepted at
>> PODS 2012. A preprint is available for download at
>> http://www.theoinf.uni-bayreuth.de/download/pods12submission.pdf.
>>
>> This work was conducted independently from the work by Arenas,
>> Conca, and Perez and it is remarkable that, in many cases, we arrive to
>> very similar conclusions. In what comes next, we will try not to repeat
>> what has been said before, but focus on giving new information.
>>
>>
>>
>> On the current semantics of Property Paths
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> The current semantics of property paths in SPARQL 1.1 relies on two
>> choices which have a dramatic impact on the complexity of query
>> evaluation.
>>
>> In the paper, we refer to these as
>> (A) the "simple walk requirement"
>> (B) the necessity of "counting paths" for computing query answers
>>
>>
>> To clarify (A) and (B):
>>
>> By simple walk semantics, we refer to the definitions of ZeroOrMorePath
>> and OneOrMorePath in Section 18.4 in the working draft (SPARQL Algebra).
>> These definitions state that the answers to these operators are obtained
>> by "repeated use of path such that any nodes in the graph are traversed
>> once only". A path that traverses any node in the graph only once (and
>> possibly returns to its first node) is a simple walk.
>>
>> Condition (B) has already been mentioned in the post by Arenas, Conca,
>> and Perez.
>>
>>
>>
>> Implications of the current semantics on Property Path evaluation
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> Each of requirements (A) and (B) make property paths hard to evaluate.
>>
>> More precisely, each of these requirements makes the evaluation problem
>> of property paths NP-hard or #P-hard, already for very, very simple queries.
>> Details are in the paper.
>>
>>
>> Can this be solved?
>> ~~~~~~~~~~~~~~~
>>
>>
>> We discovered that there could be an alternative semantics for property
>> paths that alleviates the before mentioned efficiency problems. In
>> particular, consider the following conditions, which might be
>> alternatives for (A) and (B):
>>
>> (A') regular path semantics
>> (B') only check the existence of a path
>>
>> Option (A') would be the same than (A), but no longer require that any
>> node in the graph is traversed once only.
>> Option (B') is what [Arenas, Conca, and Perez] refer to as "existential
>> semantics".
>>
>>
>> Our findings regarding the trade-offs between (A) / (A') and (B) / (B')
>> can be summarized as follows:
>>
>>
>> Simple walk semantics vs Regular path semantics
>> -----------------------------------------------
>> If the default behavior of a SPARQL property path would have conditions
>> (A') and (B'), then our paper gives an algorithm to evaluate SPARQL
>> property paths in polynomial time combined complexity (Theorem 3.2). The
>> algorithm in Theorem 3.2 can be easily adapted to compute all pairs (X,Y)
>> that should be returned by a SELECT X,Y WHERE { X :p Y } query in
>> polynomial time. Notice that this is a *double exponential* improvement
>> wrt. current implementations.
>> Furthermore, :p can be an arbitrary property path, including counters.
>> This means that, for evaluating a property path of the form p[200,300],
>> our algorithm only needs polynomial time in the number of bits needed to
>> represent the number 300.
>>
>> However, this improvement is not possible under requirement (A).
>> Requirement (A) already makes it NP-complete to test whether a given pair
>> (X,Y) should be in the answer of the above SELECT query or not.
>>
>> Therefore, our feeling is that the choice between (A) or (A') is
>> worthwhile reconsidering since (A') could dramatically improve
>> performance.
>>
>>
>> To Count or not?
>> ----------------
>> > From a user perspective, it makes sense to have the option to choose
>> between (B) or (B').
>> Sometimes, you like to count duplicates in an answer to a query and
>> sometimes you don't.
>>
>> On this topic, we largely agree with the work of [Arenas, Conca, and
>> Perez]. They suggest to apply (B') as a default semantics to property
>> paths, and to use a keyword in the case that the user desires to have
>> duplicates in the answer.
>>
>>
>>
>>
>>
>>
>> We would also like to send the message to reconsider the issue of
>> property path semantics and we are very happy to observe that the
>> working group is already discussing this matter. Also, we are open
>> to continue the discussion with the group and with our colleagues
>> Marcelo Arenas, Sebastian Conca, and Jorge Perez and, if
>> necessary, cooperate with the group to make a proposal that can
>> have an efficient evaluation method.
>>
>>
>> Best regards,
>>
>> Wim Martens
>> Katja Losemann
>>
>>
>
>
>
Received on Thursday, 12 April 2012 11:59:05 UTC