Re: Fwd: Comments about the semantics of property paths from Lee Feigenbaum on 2012-01-29 (public-rdf-dawg@w3.org from January to March 2012)

From: Lee Feigenbaum <lee@thefigtrees.net>
Date: Sun, 29 Jan 2012 14:58:42 -0500
To: Andy Seaborne <andy.seaborne@epimorphics.com>
CC: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <4F25A4F2.1090308@thefigtrees.net>
On 1/29/2012 2:28 PM, Andy Seaborne wrote:
> It would be useful but I wasn't expecting a change to the plan of
> approving tests and progressing the graph store protocol discussions.

GSP is high priority.

Tests are medium priority, as they're more directly related to CR.

I do see new information in this latest comment, so I'd like to discuss 
it, as it is high priority to know whether it might affect our schedule.

Lee

> End of last call for query is Feb 6th - I don't expect we will be
> replying until after that point anyway, but, if there's time, we can
> make a start discussing it.
>
> Andy
>
> On 29/01/12 19:03, Lee Feigenbaum wrote:
>> I'd suggest we discuss on our call this week.
>>
>> Lee
>>
>> On 1/28/2012 12:38 PM, Andy Seaborne wrote:
>>> A previous conversation of these issues includes:
>>>
>>> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2011Feb/0005.html
>>>
>>>
>>>
>>>
>>> where the WG points out:
>>> 1/ That aggregation may yield confusing results to a natural query
>>> 2/ That an optimizer may be given further information via a sub-query
>>> and DISTINCT.
>>>
>>> The purchase order example seems to me to be a reasonable expectation of
>>> any spreadsheet user. The same price is arrived at several times (two
>>> ways: via two :item1's and two uses of the same literal 2).
>>>
>>> There are two sets of use cases: one set where duplicates are essential,
>>> and one set where they are redundant.
>>>
>>> Jorge's reply then:
>>>
>>> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2011Feb/0012.html
>>>
>>>
>>>
>>>
>>> acknowledged the points in the response.
>>>
>>> Andy
>>>
>>> -------- Original Message --------
>>> Subject: Comments about the semantics of property paths
>>> Resent-Date: Fri, 27 Jan 2012 14:52:22 +0000
>>> Resent-From: public-rdf-dawg-comments@w3.org
>>> Date: Fri, 27 Jan 2012 11:51:45 -0300
>>> From: jorge perez <jorge.perez.rojas@gmail.com>
>>> To: public-rdf-dawg-comments@w3.org
>>> CC: Marcelo Arenas <marcelo.arenas1@gmail.com>, Sebastián Conca
>>> <sconca87@gmail.com>, jorge perez <jorge.perez.rojas@gmail.com>
>>>
>>> Dear DAWG members,
>>>
>>> We have some comments regarding the semantics of property paths. We
>>> know that this issue has been raised before, but we think that we can
>>> provide substantial new information to reconsider it.
>>>
>>> We have conducted a thorough study of the current semantics of
>>> property paths including an empirical analysis. All our results are in
>>> a paper that has been accepted in WWW 2012. You can find a
>>> copy of the extended version of this paper at
>>> http://www.dcc.uchile.cl/~jperez/sub-www-ext.pdf. Given the tight
>>> schedule of the group, we think that it might be useful to make these
>>> results public for the group before we have a final published version.
>>>
>>> As a summary we can provide two main comments, one from a practical
>>> perspective and another from a theoretical perspective.
>>>
>>> -----------
>>>
>>> - Comment 1: Poor performance of current engines.
>>> =================================================
>>> We tested 4 implementations of property paths: Jena, RDF::Query,
>>> Sesame, and KGram-Corese (details on the experimental setting can be
>>> found at http://www.dcc.uchile.cl/~jperez/www_repeatability/). A first
>>> set of experiments was with synthetic data and other with real data.
>>>
>>> In both cases the implementations were not capable to handle even
>>> small data for the most simple property path queries.
>>>
>>> Case A)
>>> We tested RDF data representing complete graphs. No implementation was
>>> able to handle a graph with 13 nodes for a query with a single
>>> property path of the form (:P)*
>>>
>>> data1: http://www.dcc.uchile.cl/~jperez/www_repeatability/clique13.n3
>>> query1: http://www.dcc.uchile.cl/~jperez/www_repeatability/Cliq-1.rq
>>>
>>> See Figure 1 in the paper for the performance of all implementations
>>> below 13 nodes. The figure suggests that the evaluation time for these
>>> implementations growths doubly-exponentially w.r.t. the size of the
>>> input.
>>>
>>> Case B)
>>> We tested real RDF data crawled from a small set of foaf documents. We
>>> started from Axel's foaf document and retrieve friends, friends of
>>> friends, etc. following foaf:knows links, and constructed several test
>>> cases. In this case, no implementation was able to handle an RDF graph
>>> of 14KB for a query with a single property path (foaf:knows)*
>>>
>>> data2: http://www.dcc.uchile.cl/~jperez/www_repeatability/E.n3
>>> query2: http://www.dcc.uchile.cl/~jperez/www_repeatability/Foaf-1.rq
>>>
>>> See Table 1 in the paper for the performance of all implementations.
>>>
>>> - Comment 2: High Computational Complexity.
>>> ===========================================
>>> We prove in the paper that for the current semantics of property paths
>>> in SPARQL the complexity of evaluation is double-exponential (Lemma
>>> 5.4 and Theorem 5.5). Given that property paths require counting
>>> paths, we measure the complexity by making use of counting complexity
>>> classes. The technical result is that SPARQL 1.1 evaluation is not
>>> even inside #P (Theorem 5.5), where #P is the counting complexity
>>> class associated to NP (a prototypical #P-complete problem is the
>>> problem
>>> of computing the number of truth assignments that satisfies a
>>> propositional
>>> formula, which is more complicated than the prototypical NP-complete
>>> problem
>>> which is to verify whether there exists at least one truth assignment
>>> that
>>> satisfies a propositional formula). Thus, in informal terms, we prove
>>> that
>>> SPARQL 1.1 evaluation considering counting is even more complex than
>>> solving an NP-complete problem.
>>>
>>> We also prove that if only the input data is considered to measure the
>>> complexity of the problem, then the evaluation problem is #P-hard.
>>> Notice that without property paths, the evaluation problem for SPARQL
>>> can be solved in polynomial time (if the complexity is measured only in
>>> terms of the size of the data).
>>>
>>> ------------
>>>
>>> Discussion
>>> ==========
>>>
>>> One of the main conclusions that one can draw from our results is that
>>> the poor performance exhibited in Cases A) and B) above is not a
>>> problem of the particular implementations but a problem of the
>>> specification itself, as our theoretical results imply that every
>>> implementation that follows the current specification of SPARQL 1.1
>>> would have the same poor behavior.
>>>
>>> Our results also show that the main source of complexity is the
>>> requirement of counting paths, and in particular the procedure ALP
>>> which is the one that gives the semantics for counting. Essentially,
>>> the counting mechanism produces a number of duplicates that in some
>>> cases are beyond any naturally feasible number. Table 7 in the paper
>>> shows a worst case analysis. For instance, for the case
>>>
>>> data3: http://www.dcc.uchile.cl/~jperez/www_repeatability/clique7.n3
>>> query3: http://www.dcc.uchile.cl/~jperez/www_repeatability/Cliq-2.rq
>>>
>>> we formally prove that any implementation that follows the current
>>> specification should produce an output of 79 Yottabytes (79 trillion
>>> Terabytes), and thus would not fit in any reasonable storage device.
>>> Please notice that unfeasible counting can also be obtained with real
>>> data. For example, for the case
>>>
>>> data4: http://www.dcc.uchile.cl/~jperez/www_repeatability/D.n3
>>> query2: http://www.dcc.uchile.cl/~jperez/www_repeatability/Foaf-1.rq
>>>
>>> ARQ (which was the only implementation that was able to handle this
>>> case in less than one hour) produced an output of 587MB. Notice that
>>> data3 is of only 13.2KB. Table 6 in the paper shows the running time
>>> and the output size. Please notice that this experiment is with real
>>> data and it is highly probable that property paths will be used in
>>> practice with this type of queries.
>>>
>>> It is worth mentioning that our group is not the only one that have
>>> formally studied property-path semantics according to the current
>>> specification, and that have shown negative results about the complexity
>>> of evaluating it. We are aware that Katja Losemann and Wim Martens
>>> obtained similar results independently from us. Wim Martens gave a
>>> talk about this called "The complexity of evaluating path expressions
>>> in SPARQL" in a Dagstuhl seminar. In that work, the authors also
>>> studied property-path expressions of the form :P{m,n}, and show that
>>> the complexity of evaluating them is very high.
>>>
>>> We think that we have provided substantial new information to
>>> reconsider the issue of property path semantics. We have several other
>>> comments, but we think that the two comments above are the most
>>> important to consider, and we are open to continue the discussion with
>>> the group and, if necessary, cooperate with the group to make a proposal
>>> for property path evaluation that can have an efficient evaluation
>>> method.
>>>
>>> Best regards,
>>>
>>> Marcelo Arenas
>>> Sebastián Conca
>>> Jorge Pérez
>>>
>>>
>>>
>
Received on Sunday, 29 January 2012 19:59:11 UTC