- From: Lee Feigenbaum <lee@thefigtrees.net>
- Date: Sun, 29 Jan 2012 14:03:40 -0500
- To: Andy Seaborne <andy.seaborne@epimorphics.com>
- CC: SPARQL Working Group <public-rdf-dawg@w3.org>
I'd suggest we discuss on our call this week. Lee On 1/28/2012 12:38 PM, Andy Seaborne wrote: > A previous conversation of these issues includes: > > http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2011Feb/0005.html > > > where the WG points out: > 1/ That aggregation may yield confusing results to a natural query > 2/ That an optimizer may be given further information via a sub-query > and DISTINCT. > > The purchase order example seems to me to be a reasonable expectation of > any spreadsheet user. The same price is arrived at several times (two > ways: via two :item1's and two uses of the same literal 2). > > There are two sets of use cases: one set where duplicates are essential, > and one set where they are redundant. > > Jorge's reply then: > > http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2011Feb/0012.html > > > acknowledged the points in the response. > > Andy > > -------- Original Message -------- > Subject: Comments about the semantics of property paths > Resent-Date: Fri, 27 Jan 2012 14:52:22 +0000 > Resent-From: public-rdf-dawg-comments@w3.org > Date: Fri, 27 Jan 2012 11:51:45 -0300 > From: jorge perez <jorge.perez.rojas@gmail.com> > To: public-rdf-dawg-comments@w3.org > CC: Marcelo Arenas <marcelo.arenas1@gmail.com>, Sebastián Conca > <sconca87@gmail.com>, jorge perez <jorge.perez.rojas@gmail.com> > > Dear DAWG members, > > We have some comments regarding the semantics of property paths. We > know that this issue has been raised before, but we think that we can > provide substantial new information to reconsider it. > > We have conducted a thorough study of the current semantics of > property paths including an empirical analysis. All our results are in > a paper that has been accepted in WWW 2012. You can find a > copy of the extended version of this paper at > http://www.dcc.uchile.cl/~jperez/sub-www-ext.pdf. Given the tight > schedule of the group, we think that it might be useful to make these > results public for the group before we have a final published version. > > As a summary we can provide two main comments, one from a practical > perspective and another from a theoretical perspective. > > ----------- > > - Comment 1: Poor performance of current engines. > ================================================= > We tested 4 implementations of property paths: Jena, RDF::Query, > Sesame, and KGram-Corese (details on the experimental setting can be > found at http://www.dcc.uchile.cl/~jperez/www_repeatability/). A first > set of experiments was with synthetic data and other with real data. > > In both cases the implementations were not capable to handle even > small data for the most simple property path queries. > > Case A) > We tested RDF data representing complete graphs. No implementation was > able to handle a graph with 13 nodes for a query with a single > property path of the form (:P)* > > data1: http://www.dcc.uchile.cl/~jperez/www_repeatability/clique13.n3 > query1: http://www.dcc.uchile.cl/~jperez/www_repeatability/Cliq-1.rq > > See Figure 1 in the paper for the performance of all implementations > below 13 nodes. The figure suggests that the evaluation time for these > implementations growths doubly-exponentially w.r.t. the size of the > input. > > Case B) > We tested real RDF data crawled from a small set of foaf documents. We > started from Axel's foaf document and retrieve friends, friends of > friends, etc. following foaf:knows links, and constructed several test > cases. In this case, no implementation was able to handle an RDF graph > of 14KB for a query with a single property path (foaf:knows)* > > data2: http://www.dcc.uchile.cl/~jperez/www_repeatability/E.n3 > query2: http://www.dcc.uchile.cl/~jperez/www_repeatability/Foaf-1.rq > > See Table 1 in the paper for the performance of all implementations. > > - Comment 2: High Computational Complexity. > =========================================== > We prove in the paper that for the current semantics of property paths > in SPARQL the complexity of evaluation is double-exponential (Lemma > 5.4 and Theorem 5.5). Given that property paths require counting > paths, we measure the complexity by making use of counting complexity > classes. The technical result is that SPARQL 1.1 evaluation is not > even inside #P (Theorem 5.5), where #P is the counting complexity > class associated to NP (a prototypical #P-complete problem is the problem > of computing the number of truth assignments that satisfies a propositional > formula, which is more complicated than the prototypical NP-complete > problem > which is to verify whether there exists at least one truth assignment that > satisfies a propositional formula). Thus, in informal terms, we prove that > SPARQL 1.1 evaluation considering counting is even more complex than > solving an NP-complete problem. > > We also prove that if only the input data is considered to measure the > complexity of the problem, then the evaluation problem is #P-hard. > Notice that without property paths, the evaluation problem for SPARQL > can be solved in polynomial time (if the complexity is measured only in > terms of the size of the data). > > ------------ > > Discussion > ========== > > One of the main conclusions that one can draw from our results is that > the poor performance exhibited in Cases A) and B) above is not a > problem of the particular implementations but a problem of the > specification itself, as our theoretical results imply that every > implementation that follows the current specification of SPARQL 1.1 > would have the same poor behavior. > > Our results also show that the main source of complexity is the > requirement of counting paths, and in particular the procedure ALP > which is the one that gives the semantics for counting. Essentially, > the counting mechanism produces a number of duplicates that in some > cases are beyond any naturally feasible number. Table 7 in the paper > shows a worst case analysis. For instance, for the case > > data3: http://www.dcc.uchile.cl/~jperez/www_repeatability/clique7.n3 > query3: http://www.dcc.uchile.cl/~jperez/www_repeatability/Cliq-2.rq > > we formally prove that any implementation that follows the current > specification should produce an output of 79 Yottabytes (79 trillion > Terabytes), and thus would not fit in any reasonable storage device. > Please notice that unfeasible counting can also be obtained with real > data. For example, for the case > > data4: http://www.dcc.uchile.cl/~jperez/www_repeatability/D.n3 > query2: http://www.dcc.uchile.cl/~jperez/www_repeatability/Foaf-1.rq > > ARQ (which was the only implementation that was able to handle this > case in less than one hour) produced an output of 587MB. Notice that > data3 is of only 13.2KB. Table 6 in the paper shows the running time > and the output size. Please notice that this experiment is with real > data and it is highly probable that property paths will be used in > practice with this type of queries. > > It is worth mentioning that our group is not the only one that have > formally studied property-path semantics according to the current > specification, and that have shown negative results about the complexity > of evaluating it. We are aware that Katja Losemann and Wim Martens > obtained similar results independently from us. Wim Martens gave a > talk about this called "The complexity of evaluating path expressions > in SPARQL" in a Dagstuhl seminar. In that work, the authors also > studied property-path expressions of the form :P{m,n}, and show that > the complexity of evaluating them is very high. > > We think that we have provided substantial new information to > reconsider the issue of property path semantics. We have several other > comments, but we think that the two comments above are the most > important to consider, and we are open to continue the discussion with > the group and, if necessary, cooperate with the group to make a proposal > for property path evaluation that can have an efficient evaluation method. > > Best regards, > > Marcelo Arenas > Sebastián Conca > Jorge Pérez > > >
Received on Sunday, 29 January 2012 19:04:11 UTC