From: Lee Feigenbaum <lee@thefigtrees.net>

Date: Sun, 29 Jan 2012 14:58:42 -0500

Message-ID: <4F25A4F2.1090308@thefigtrees.net>

To: Andy Seaborne <andy.seaborne@epimorphics.com>

CC: SPARQL Working Group <public-rdf-dawg@w3.org>

Date: Sun, 29 Jan 2012 14:58:42 -0500

Message-ID: <4F25A4F2.1090308@thefigtrees.net>

To: Andy Seaborne <andy.seaborne@epimorphics.com>

CC: SPARQL Working Group <public-rdf-dawg@w3.org>

On 1/29/2012 2:28 PM, Andy Seaborne wrote: > It would be useful but I wasn't expecting a change to the plan of > approving tests and progressing the graph store protocol discussions. GSP is high priority. Tests are medium priority, as they're more directly related to CR. I do see new information in this latest comment, so I'd like to discuss it, as it is high priority to know whether it might affect our schedule. Lee > End of last call for query is Feb 6th - I don't expect we will be > replying until after that point anyway, but, if there's time, we can > make a start discussing it. > > Andy > > On 29/01/12 19:03, Lee Feigenbaum wrote: >> I'd suggest we discuss on our call this week. >> >> Lee >> >> On 1/28/2012 12:38 PM, Andy Seaborne wrote: >>> A previous conversation of these issues includes: >>> >>> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2011Feb/0005.html >>> >>> >>> >>> >>> where the WG points out: >>> 1/ That aggregation may yield confusing results to a natural query >>> 2/ That an optimizer may be given further information via a sub-query >>> and DISTINCT. >>> >>> The purchase order example seems to me to be a reasonable expectation of >>> any spreadsheet user. The same price is arrived at several times (two >>> ways: via two :item1's and two uses of the same literal 2). >>> >>> There are two sets of use cases: one set where duplicates are essential, >>> and one set where they are redundant. >>> >>> Jorge's reply then: >>> >>> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2011Feb/0012.html >>> >>> >>> >>> >>> acknowledged the points in the response. >>> >>> Andy >>> >>> -------- Original Message -------- >>> Subject: Comments about the semantics of property paths >>> Resent-Date: Fri, 27 Jan 2012 14:52:22 +0000 >>> Resent-From: public-rdf-dawg-comments@w3.org >>> Date: Fri, 27 Jan 2012 11:51:45 -0300 >>> From: jorge perez <jorge.perez.rojas@gmail.com> >>> To: public-rdf-dawg-comments@w3.org >>> CC: Marcelo Arenas <marcelo.arenas1@gmail.com>, Sebastián Conca >>> <sconca87@gmail.com>, jorge perez <jorge.perez.rojas@gmail.com> >>> >>> Dear DAWG members, >>> >>> We have some comments regarding the semantics of property paths. We >>> know that this issue has been raised before, but we think that we can >>> provide substantial new information to reconsider it. >>> >>> We have conducted a thorough study of the current semantics of >>> property paths including an empirical analysis. All our results are in >>> a paper that has been accepted in WWW 2012. You can find a >>> copy of the extended version of this paper at >>> http://www.dcc.uchile.cl/~jperez/sub-www-ext.pdf. Given the tight >>> schedule of the group, we think that it might be useful to make these >>> results public for the group before we have a final published version. >>> >>> As a summary we can provide two main comments, one from a practical >>> perspective and another from a theoretical perspective. >>> >>> ----------- >>> >>> - Comment 1: Poor performance of current engines. >>> ================================================= >>> We tested 4 implementations of property paths: Jena, RDF::Query, >>> Sesame, and KGram-Corese (details on the experimental setting can be >>> found at http://www.dcc.uchile.cl/~jperez/www_repeatability/). A first >>> set of experiments was with synthetic data and other with real data. >>> >>> In both cases the implementations were not capable to handle even >>> small data for the most simple property path queries. >>> >>> Case A) >>> We tested RDF data representing complete graphs. No implementation was >>> able to handle a graph with 13 nodes for a query with a single >>> property path of the form (:P)* >>> >>> data1: http://www.dcc.uchile.cl/~jperez/www_repeatability/clique13.n3 >>> query1: http://www.dcc.uchile.cl/~jperez/www_repeatability/Cliq-1.rq >>> >>> See Figure 1 in the paper for the performance of all implementations >>> below 13 nodes. The figure suggests that the evaluation time for these >>> implementations growths doubly-exponentially w.r.t. the size of the >>> input. >>> >>> Case B) >>> We tested real RDF data crawled from a small set of foaf documents. We >>> started from Axel's foaf document and retrieve friends, friends of >>> friends, etc. following foaf:knows links, and constructed several test >>> cases. In this case, no implementation was able to handle an RDF graph >>> of 14KB for a query with a single property path (foaf:knows)* >>> >>> data2: http://www.dcc.uchile.cl/~jperez/www_repeatability/E.n3 >>> query2: http://www.dcc.uchile.cl/~jperez/www_repeatability/Foaf-1.rq >>> >>> See Table 1 in the paper for the performance of all implementations. >>> >>> - Comment 2: High Computational Complexity. >>> =========================================== >>> We prove in the paper that for the current semantics of property paths >>> in SPARQL the complexity of evaluation is double-exponential (Lemma >>> 5.4 and Theorem 5.5). Given that property paths require counting >>> paths, we measure the complexity by making use of counting complexity >>> classes. The technical result is that SPARQL 1.1 evaluation is not >>> even inside #P (Theorem 5.5), where #P is the counting complexity >>> class associated to NP (a prototypical #P-complete problem is the >>> problem >>> of computing the number of truth assignments that satisfies a >>> propositional >>> formula, which is more complicated than the prototypical NP-complete >>> problem >>> which is to verify whether there exists at least one truth assignment >>> that >>> satisfies a propositional formula). Thus, in informal terms, we prove >>> that >>> SPARQL 1.1 evaluation considering counting is even more complex than >>> solving an NP-complete problem. >>> >>> We also prove that if only the input data is considered to measure the >>> complexity of the problem, then the evaluation problem is #P-hard. >>> Notice that without property paths, the evaluation problem for SPARQL >>> can be solved in polynomial time (if the complexity is measured only in >>> terms of the size of the data). >>> >>> ------------ >>> >>> Discussion >>> ========== >>> >>> One of the main conclusions that one can draw from our results is that >>> the poor performance exhibited in Cases A) and B) above is not a >>> problem of the particular implementations but a problem of the >>> specification itself, as our theoretical results imply that every >>> implementation that follows the current specification of SPARQL 1.1 >>> would have the same poor behavior. >>> >>> Our results also show that the main source of complexity is the >>> requirement of counting paths, and in particular the procedure ALP >>> which is the one that gives the semantics for counting. Essentially, >>> the counting mechanism produces a number of duplicates that in some >>> cases are beyond any naturally feasible number. Table 7 in the paper >>> shows a worst case analysis. For instance, for the case >>> >>> data3: http://www.dcc.uchile.cl/~jperez/www_repeatability/clique7.n3 >>> query3: http://www.dcc.uchile.cl/~jperez/www_repeatability/Cliq-2.rq >>> >>> we formally prove that any implementation that follows the current >>> specification should produce an output of 79 Yottabytes (79 trillion >>> Terabytes), and thus would not fit in any reasonable storage device. >>> Please notice that unfeasible counting can also be obtained with real >>> data. For example, for the case >>> >>> data4: http://www.dcc.uchile.cl/~jperez/www_repeatability/D.n3 >>> query2: http://www.dcc.uchile.cl/~jperez/www_repeatability/Foaf-1.rq >>> >>> ARQ (which was the only implementation that was able to handle this >>> case in less than one hour) produced an output of 587MB. Notice that >>> data3 is of only 13.2KB. Table 6 in the paper shows the running time >>> and the output size. Please notice that this experiment is with real >>> data and it is highly probable that property paths will be used in >>> practice with this type of queries. >>> >>> It is worth mentioning that our group is not the only one that have >>> formally studied property-path semantics according to the current >>> specification, and that have shown negative results about the complexity >>> of evaluating it. We are aware that Katja Losemann and Wim Martens >>> obtained similar results independently from us. Wim Martens gave a >>> talk about this called "The complexity of evaluating path expressions >>> in SPARQL" in a Dagstuhl seminar. In that work, the authors also >>> studied property-path expressions of the form :P{m,n}, and show that >>> the complexity of evaluating them is very high. >>> >>> We think that we have provided substantial new information to >>> reconsider the issue of property path semantics. We have several other >>> comments, but we think that the two comments above are the most >>> important to consider, and we are open to continue the discussion with >>> the group and, if necessary, cooperate with the group to make a proposal >>> for property path evaluation that can have an efficient evaluation >>> method. >>> >>> Best regards, >>> >>> Marcelo Arenas >>> Sebastián Conca >>> Jorge Pérez >>> >>> >>> >Received on Sunday, 29 January 2012 19:59:11 GMT

*
This archive was generated by hypermail 2.3.1
: Tuesday, 26 March 2013 16:15:47 GMT
*