Re: Comments about the semantics of property paths from Rob Vesse on 2012-04-13 (public-rdf-dawg-comments@w3.org from April 2012)

From: Rob Vesse <rav08r@ecs.soton.ac.uk>
Date: Fri, 13 Apr 2012 02:30:06 +0000
To: "Lee Feigenbaum" <lee@thefigtrees.net>
Cc: public-rdf-dawg-comments@w3.org
Message-ID: <1395662661-1334284218-cardhu_decombobulator_blackberry.rim.net-799401330-@b4.c1>
Lee

I appreciate that it will take a while to revise the draft but I would want to see new tests at the same time since per the implementation report you have 4 fully/partially complete implementations per the current test suite you just broke (incl my own) all of which are going to want to update themselves as soon as the draft comes out (or sooner) or at least asses what impact the change has on them

Obviously test cases take time to formulate and be approved but in the interim some simple examples on this or another mailing list/blog post explaining the change in terms of expected results pre and post this change would be most appreciated

Last time the WG changed property paths (which admittedly was a year or so ago) I literally had to throw out and rewrite my implementation and I'd like some way to asses if this is the case this time so I can schedule this appropriately in my release cycle even if I won't necessarily be making the changes for a month or two

Rob

Sent via BlackBerry by AT&T

-----Original Message-----
From: Lee Feigenbaum <lee@thefigtrees.net>
Sender: Lee Feigenbaum <figtree@gmail.com>
Date: Thu, 12 Apr 2012 21:38:52 
To: Rob Vesse<rav08r@ecs.soton.ac.uk>
Cc: <public-rdf-dawg-comments@w3.org>; David Mizell<dmizell@yarcdata.com>
Subject: Re: Comments about the semantics of property paths

Hi Rob,

We're hoping to have this in the editors' draft within the next couple 
of weeks, and barring any change of plan we'll be publishing an updated 
Last Call Working Draft shortly thereafter.

As we move beyond Last Call, we'll also be publishing test cases that 
reflect the final semantics of this part of the language.

Lee

On 4/12/2012 2:07 PM, Rob Vesse wrote:
> Hi Lee
>
> Now the working group has made this decision how soon can we expect to
> see updated drafts and documentation detailing the difference between
> the new semantics and the old semantics.
>
> Personally as an implementer I am not sure that I understand what the
> difference in semantics is as a result of this decision and I'd very
> much like to see some worked examples which demonstrate this
>
> Regards,
>
> Rob
>
> On 4/12/12 4:57 AM, Lee Feigenbaum wrote:
>> Hello Jorge, Marcelo, and Sebastian,
>>
>> Thanks very much for your comment on the challenges of efficiently
>> evaluating property paths given the current semantics. We have
>> received several comments along these lines.
>>
>> The Working Group agreed to adopt an alternative design for property
>> paths that we believe balances several of our design goals:
>>
>>      Use case expressivity
>>      Users' learning curve / intuition
>>      Ease of implementation
>>      Potential for efficient evaluation
>>      Schedule
>>
>> The changes from the current Last Call working draft are as follows:
>>
>>      The semantics of *, +, and ? are changed to be non-counting (they
>> no longer preserve duplicates)
>>      The /, |, and ! remain unchanged as in the current draft (they
>> preserve duplicates)
>>      The curly brace forms -- {n}, {n,m}, {n,}, {,m} -- have all been
>> removed
>>
>> There are several motivations for these design changes:
>>
>>      Changing the semantics of * and + to non-counting is expected to
>> address the evaluation performance concerns raised by your comment.
>>      The /, |, and ! operators are often used as shortcuts for writing
>> out equivalent graph patterns longhand. By leaving these operators
>> with counting semantics (just as the equivalent graph pattern
>> expansions have), SPARQL 1.1 property paths will continue to yield
>> intuitive results. Two examples that the Working Group has considered
>> in coming to this conclusion are:
>>
>>     SELECT (sum(?cost) AS ?total)
>>     {
>>       :order :hasItem/:price ?cost
>>     }
>>
>> and
>>
>>     SELECT ?member { ?list rdf:rest*/rdf:first ?member }
>>
>> ...when applied to a list with duplicate items, such as (1 2 1). (This
>> type of query was one motivation for the property paths feature that
>> the WG documented in the SPARQL 1.1 New Features&  Rationale document:
>> http://www.w3.org/TR/sparql-features/#Property_paths)
>>
>> In both cases, users' intuitive expectations demands that '/' preserve
>> duplicates.
>>
>>      The group decided to remove the curly brace forms due to a lack of
>> experience with appropriate counting/non-counting semantics for these
>> forms. We believe that implementations will likely extend SPARQL 1.1
>> Property Paths with support for these forms, and we will gather
>> implementation experience to use for future standardization work.
>> We've included this item on a list of work items for a future working
>> group - http://www.w3.org/2009/sparql/wiki/Future_Work_Items

>>
>> We would be grateful if you would acknowledge that your comment has
>> been answered by sending a reply to this mailing list.
>>
>> Lee On behalf of the SPARQL Working Group
>>
>> On 1/27/2012 9:51 AM, jorge perez wrote:
>>> Dear DAWG members,
>>>
>>> We have some comments regarding the semantics of property paths. We
>>> know that this issue has been raised before, but we think that we can
>>> provide substantial new information to reconsider it.
>>>
>>> We have conducted a thorough study of the current semantics of
>>> property paths including an empirical analysis. All our results are in
>>> a paper that has been accepted in WWW 2012. You can find a
>>> copy of the extended version of this paper at
>>> http://www.dcc.uchile.cl/~jperez/sub-www-ext.pdf. Given the tight
>>> schedule of the group, we think that it might be useful to make these
>>> results public for the group before we have a final published version.
>>>
>>> As a summary we can provide two main comments, one from a practical
>>> perspective and another from a theoretical perspective.
>>>
>>> -----------
>>>
>>> - Comment 1: Poor performance of current engines.
>>> =================================================
>>> We tested 4 implementations of property paths: Jena, RDF::Query,
>>> Sesame, and KGram-Corese (details on the experimental setting can be
>>> found at http://www.dcc.uchile.cl/~jperez/www_repeatability/). A first
>>> set of experiments was with synthetic data and other with real data.
>>>
>>> In both cases the implementations were not capable to handle even
>>> small data for the most simple property path queries.
>>>
>>> Case A)
>>> We tested RDF data representing complete graphs. No implementation was
>>> able to handle a graph with 13 nodes for a query with a single
>>> property path of the form (:P)*
>>>
>>> data1: http://www.dcc.uchile.cl/~jperez/www_repeatability/clique13.n3

>>> query1: http://www.dcc.uchile.cl/~jperez/www_repeatability/Cliq-1.rq

>>>
>>> See Figure 1 in the paper for the performance of all implementations
>>> below 13 nodes. The figure suggests that the evaluation time for these
>>> implementations growths doubly-exponentially w.r.t. the size of the
>>> input.
>>>
>>> Case B)
>>> We tested real RDF data crawled from a small set of foaf documents. We
>>> started from Axel's foaf document and retrieve friends, friends of
>>> friends, etc. following foaf:knows links, and constructed several test
>>> cases. In this case, no implementation was able to handle an RDF graph
>>> of 14KB for a query with a single property path (foaf:knows)*
>>>
>>> data2: http://www.dcc.uchile.cl/~jperez/www_repeatability/E.n3

>>> query2: http://www.dcc.uchile.cl/~jperez/www_repeatability/Foaf-1.rq

>>>
>>> See Table 1 in the paper for the performance of all implementations.
>>>
>>> - Comment 2: High Computational Complexity.
>>> ===========================================
>>> We prove in the paper that for the current semantics of property paths
>>> in SPARQL the complexity of evaluation is double-exponential (Lemma
>>> 5.4 and Theorem 5.5). Given that property paths require counting
>>> paths, we measure the complexity by making use of counting complexity
>>> classes. The technical result is that SPARQL 1.1 evaluation is not
>>> even inside #P (Theorem 5.5), where #P is the counting complexity
>>> class associated to NP (a prototypical #P-complete problem is the
>>> problem
>>> of computing the number of truth assignments that satisfies a
>>> propositional
>>> formula, which is more complicated than the prototypical NP-complete
>>> problem
>>> which is to verify whether there exists at least one truth assignment
>>> that
>>> satisfies a propositional formula). Thus, in informal terms, we prove
>>> that
>>> SPARQL 1.1 evaluation considering counting is even more complex than
>>> solving an NP-complete problem.
>>>
>>> We also prove that if only the input data is considered to measure the
>>> complexity of the problem, then the evaluation problem is #P-hard.
>>> Notice that without property paths, the evaluation problem for SPARQL
>>> can be solved in polynomial time (if the complexity is measured only in
>>> terms of the size of the data).
>>>
>>> ------------
>>>
>>> Discussion
>>> ==========
>>>
>>> One of the main conclusions that one can draw from our results is that
>>> the poor performance exhibited in Cases A) and B) above is not a
>>> problem of the particular implementations but a problem of the
>>> specification itself, as our theoretical results imply that every
>>> implementation that follows the current specification of SPARQL 1.1
>>> would have the same poor behavior.
>>>
>>> Our results also show that the main source of complexity is the
>>> requirement of counting paths, and in particular the procedure ALP
>>> which is the one that gives the semantics for counting. Essentially,
>>> the counting mechanism produces a number of duplicates that in some
>>> cases are beyond any naturally feasible number. Table 7 in the paper
>>> shows a worst case analysis. For instance, for the case
>>>
>>> data3: http://www.dcc.uchile.cl/~jperez/www_repeatability/clique7.n3

>>> query3: http://www.dcc.uchile.cl/~jperez/www_repeatability/Cliq-2.rq

>>>
>>> we formally prove that any implementation that follows the current
>>> specification should produce an output of 79 Yottabytes (79 trillion
>>> Terabytes), and thus would not fit in any reasonable storage device.
>>> Please notice that unfeasible counting can also be obtained with real
>>> data. For example, for the case
>>>
>>> data4: http://www.dcc.uchile.cl/~jperez/www_repeatability/D.n3

>>> query2: http://www.dcc.uchile.cl/~jperez/www_repeatability/Foaf-1.rq

>>>
>>> ARQ (which was the only implementation that was able to handle this
>>> case in less than one hour) produced an output of 587MB. Notice that
>>> data3 is of only 13.2KB. Table 6 in the paper shows the running time
>>> and the output size. Please notice that this experiment is with real
>>> data and it is highly probable that property paths will be used in
>>> practice with this type of queries.
>>>
>>> It is worth mentioning that our group is not the only one that have
>>> formally studied property-path semantics according to the current
>>> specification, and that have shown negative results about the complexity
>>> of evaluating it. We are aware that Katja Losemann and Wim Martens
>>> obtained similar results independently from us. Wim Martens gave a
>>> talk about this called "The complexity of evaluating path expressions
>>> in SPARQL" in a Dagstuhl seminar. In that work, the authors also
>>> studied property-path expressions of the form :P{m,n}, and show that
>>> the complexity of evaluating them is very high.
>>>
>>> We think that we have provided substantial new information to
>>> reconsider the issue of property path semantics. We have several other
>>> comments, but we think that the two comments above are the most
>>> important to consider, and we are open to continue the discussion with
>>> the group and, if necessary, cooperate with the group to make a proposal
>>> for property path evaluation that can have an efficient evaluation
>>> method.
>>>
>>> Best regards,
>>>
>>> Marcelo Arenas
>>> Sebastián Conca
>>> Jorge Pérez
>>>
>>>
>>
>
>
Received on Friday, 13 April 2012 02:30:54 UTC