From: Lee Feigenbaum <lee@thefigtrees.net>

Date: Fri, 10 Feb 2012 12:50:40 -0500

Message-ID: <4F3558F0.50208@thefigtrees.net>

To: Axel Polleres <axel.polleres@deri.org>

CC: SPARQL Working Group <public-rdf-dawg@w3.org>

Date: Fri, 10 Feb 2012 12:50:40 -0500

Message-ID: <4F3558F0.50208@thefigtrees.net>

To: Axel Polleres <axel.polleres@deri.org>

CC: SPARQL Working Group <public-rdf-dawg@w3.org>

I prefer option 1, and let's learn from whether implementers end up implementing something like option 2 for future standardization. Lee On 2/9/2012 4:10 PM, Axel Polleres wrote: > (in completion of ACTION-587) > > Dear all, as discussed in the last Telco, we have several options on how to proceed with addressing comment JP-4 [1]. > If possible, I would like to get consensus on how to proceed here in the next Telco. > > In the previous Telco [2], we seemed to have consensus that we do not aim to switch the default behaviour from counting semantics to > distinct paths. > > Now two possibilities to proceed were discussed: > > Option 1... keep everything as it is in the grammar, and explain which DISTINCT path subqueries can be optimized: > As outlined in my email below, it might not be entirely trivial to argue in response to the comment that this > would be equivalent to the JP-4 proposed semantics, I am not 100% sure whether/how to define a rewriting to > wrap all path expressions into DISTINCT subqueries, such that it would be equivalent to their semantics > (e.g. regarding bnode [] shortcuts). > > Option 2 ... add DISTINCT around paths: It seems that sticking to our intended semantics and allowing - orthogonally to their > ALLPATHS keyword proposal the keyword DISTINCT( ) around path expressions switching to existential paths semantics would be > equivalent to the JP-4 existential paths semantics as outlined in Section 7.1 of their paper, and thus optimizable. > > Unlike someone sees a 3rd alternative, I would like to propose to decide between those two options next time > and proceed, discussion prior to the call on email would be appreciated. > > Option 2 might be easier to implement, but also requires us to go for another LC round, as it would change the grammar. > I think, in case we skip PR and manage to republish very soon, we would still manage to stay within time limits, but I would > also like to know the team contacts' opinion on that. > > best, > Axel > > > > > > > > > > > > > 1. http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2012Jan/0009.html > 2. http://www.w3.org/2009/sparql/meeting/2012-02-07#comments > > On 2 Feb 2012, at 22:01, Axel Polleres wrote: > >> To get the discussion going, my personal opinion on that one is as follows: >> >> a) I also think this is new and relevant information. >> b) I think it would be important to point to the fact that a naive implementation >> of property paths may become very inefficient/blowing up on even relatively harmelsssly looking examples, and that Property PATHs >> wrapped into DISTINCT subqueries can be evaluated more efficiently ... I'd be even more than happy >> to point in an informal reference to their work, however I feel honestly very uncomfortable with their >> title. >> c) As for their conclusion, proposing a default semantics that uses distinct paths semantics, >> whereas a separate keyword ALL-PATHS would be indicating the current semantics: >> it seems that (looking at their results in Section 7.1) that this is orthogonally possible with our current semantics >> by just wrapping any TriplesSameSubjectPath containing a property path into a DISTINCT subquery. >> It seems their result in section 7.1 indicates something along these lines, but I need some help there: >> admittedly don't have a formal proof for this equivalence yet (to be cautious, I am not yet 100% clear how/whether there is >> any interference possible with bnodes within TriplesSameSubjectPath and duplicates coming from those bnodes) >> >> This all said, I unfortunately haven't had the time yet to check all their claims in all detail. >> >> Axel >> >> >> On 29 Jan 2012, at 20:58, Lee Feigenbaum wrote: >>> [...] >>> I do see new information in this latest comment, so I'd like to discuss >>> it, as it is high priority to know whether it might affect our schedule. >>> >>> Lee >>> >>>> End of last call for query is Feb 6th - I don't expect we will be >>>> replying until after that point anyway, but, if there's time, we can >>>> make a start discussing it. >>>> >>>> Andy >>>> >>>> On 29/01/12 19:03, Lee Feigenbaum wrote: >>>>> I'd suggest we discuss on our call this week. >>>>> >>>>> Lee >>>>> >>>>> On 1/28/2012 12:38 PM, Andy Seaborne wrote: >>>>>> A previous conversation of these issues includes: >>>>>> >>>>>> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2011Feb/0005.html >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> where the WG points out: >>>>>> 1/ That aggregation may yield confusing results to a natural query >>>>>> 2/ That an optimizer may be given further information via a sub-query >>>>>> and DISTINCT. >>>>>> >>>>>> The purchase order example seems to me to be a reasonable expectation of >>>>>> any spreadsheet user. The same price is arrived at several times (two >>>>>> ways: via two :item1's and two uses of the same literal 2). >>>>>> >>>>>> There are two sets of use cases: one set where duplicates are essential, >>>>>> and one set where they are redundant. >>>>>> >>>>>> Jorge's reply then: >>>>>> >>>>>> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2011Feb/0012.html >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> acknowledged the points in the response. >>>>>> >>>>>> Andy >>>>>> >>>>>> -------- Original Message -------- >>>>>> Subject: Comments about the semantics of property paths >>>>>> Resent-Date: Fri, 27 Jan 2012 14:52:22 +0000 >>>>>> Resent-From: public-rdf-dawg-comments@w3.org >>>>>> Date: Fri, 27 Jan 2012 11:51:45 -0300 >>>>>> From: jorge perez<jorge.perez.rojas@gmail.com> >>>>>> To: public-rdf-dawg-comments@w3.org >>>>>> CC: Marcelo Arenas<marcelo.arenas1@gmail.com>, Sebastián Conca >>>>>> <sconca87@gmail.com>, jorge perez<jorge.perez.rojas@gmail.com> >>>>>> >>>>>> Dear DAWG members, >>>>>> >>>>>> We have some comments regarding the semantics of property paths. We >>>>>> know that this issue has been raised before, but we think that we can >>>>>> provide substantial new information to reconsider it. >>>>>> >>>>>> We have conducted a thorough study of the current semantics of >>>>>> property paths including an empirical analysis. All our results are in >>>>>> a paper that has been accepted in WWW 2012. You can find a >>>>>> copy of the extended version of this paper at >>>>>> http://www.dcc.uchile.cl/~jperez/sub-www-ext.pdf. Given the tight >>>>>> schedule of the group, we think that it might be useful to make these >>>>>> results public for the group before we have a final published version. >>>>>> >>>>>> As a summary we can provide two main comments, one from a practical >>>>>> perspective and another from a theoretical perspective. >>>>>> >>>>>> ----------- >>>>>> >>>>>> - Comment 1: Poor performance of current engines. >>>>>> ================================================= >>>>>> We tested 4 implementations of property paths: Jena, RDF::Query, >>>>>> Sesame, and KGram-Corese (details on the experimental setting can be >>>>>> found at http://www.dcc.uchile.cl/~jperez/www_repeatability/). A first >>>>>> set of experiments was with synthetic data and other with real data. >>>>>> >>>>>> In both cases the implementations were not capable to handle even >>>>>> small data for the most simple property path queries. >>>>>> >>>>>> Case A) >>>>>> We tested RDF data representing complete graphs. No implementation was >>>>>> able to handle a graph with 13 nodes for a query with a single >>>>>> property path of the form (:P)* >>>>>> >>>>>> data1: http://www.dcc.uchile.cl/~jperez/www_repeatability/clique13.n3 >>>>>> query1: http://www.dcc.uchile.cl/~jperez/www_repeatability/Cliq-1.rq >>>>>> >>>>>> See Figure 1 in the paper for the performance of all implementations >>>>>> below 13 nodes. The figure suggests that the evaluation time for these >>>>>> implementations growths doubly-exponentially w.r.t. the size of the >>>>>> input. >>>>>> >>>>>> Case B) >>>>>> We tested real RDF data crawled from a small set of foaf documents. We >>>>>> started from Axel's foaf document and retrieve friends, friends of >>>>>> friends, etc. following foaf:knows links, and constructed several test >>>>>> cases. In this case, no implementation was able to handle an RDF graph >>>>>> of 14KB for a query with a single property path (foaf:knows)* >>>>>> >>>>>> data2: http://www.dcc.uchile.cl/~jperez/www_repeatability/E.n3 >>>>>> query2: http://www.dcc.uchile.cl/~jperez/www_repeatability/Foaf-1.rq >>>>>> >>>>>> See Table 1 in the paper for the performance of all implementations. >>>>>> >>>>>> - Comment 2: High Computational Complexity. >>>>>> =========================================== >>>>>> We prove in the paper that for the current semantics of property paths >>>>>> in SPARQL the complexity of evaluation is double-exponential (Lemma >>>>>> 5.4 and Theorem 5.5). Given that property paths require counting >>>>>> paths, we measure the complexity by making use of counting complexity >>>>>> classes. The technical result is that SPARQL 1.1 evaluation is not >>>>>> even inside #P (Theorem 5.5), where #P is the counting complexity >>>>>> class associated to NP (a prototypical #P-complete problem is the >>>>>> problem >>>>>> of computing the number of truth assignments that satisfies a >>>>>> propositional >>>>>> formula, which is more complicated than the prototypical NP-complete >>>>>> problem >>>>>> which is to verify whether there exists at least one truth assignment >>>>>> that >>>>>> satisfies a propositional formula). Thus, in informal terms, we prove >>>>>> that >>>>>> SPARQL 1.1 evaluation considering counting is even more complex than >>>>>> solving an NP-complete problem. >>>>>> >>>>>> We also prove that if only the input data is considered to measure the >>>>>> complexity of the problem, then the evaluation problem is #P-hard. >>>>>> Notice that without property paths, the evaluation problem for SPARQL >>>>>> can be solved in polynomial time (if the complexity is measured only in >>>>>> terms of the size of the data). >>>>>> >>>>>> ------------ >>>>>> >>>>>> Discussion >>>>>> ========== >>>>>> >>>>>> One of the main conclusions that one can draw from our results is that >>>>>> the poor performance exhibited in Cases A) and B) above is not a >>>>>> problem of the particular implementations but a problem of the >>>>>> specification itself, as our theoretical results imply that every >>>>>> implementation that follows the current specification of SPARQL 1.1 >>>>>> would have the same poor behavior. >>>>>> >>>>>> Our results also show that the main source of complexity is the >>>>>> requirement of counting paths, and in particular the procedure ALP >>>>>> which is the one that gives the semantics for counting. Essentially, >>>>>> the counting mechanism produces a number of duplicates that in some >>>>>> cases are beyond any naturally feasible number. Table 7 in the paper >>>>>> shows a worst case analysis. For instance, for the case >>>>>> >>>>>> data3: http://www.dcc.uchile.cl/~jperez/www_repeatability/clique7.n3 >>>>>> query3: http://www.dcc.uchile.cl/~jperez/www_repeatability/Cliq-2.rq >>>>>> >>>>>> we formally prove that any implementation that follows the current >>>>>> specification should produce an output of 79 Yottabytes (79 trillion >>>>>> Terabytes), and thus would not fit in any reasonable storage device. >>>>>> Please notice that unfeasible counting can also be obtained with real >>>>>> data. For example, for the case >>>>>> >>>>>> data4: http://www.dcc.uchile.cl/~jperez/www_repeatability/D.n3 >>>>>> query2: http://www.dcc.uchile.cl/~jperez/www_repeatability/Foaf-1.rq >>>>>> >>>>>> ARQ (which was the only implementation that was able to handle this >>>>>> case in less than one hour) produced an output of 587MB. Notice that >>>>>> data3 is of only 13.2KB. Table 6 in the paper shows the running time >>>>>> and the output size. Please notice that this experiment is with real >>>>>> data and it is highly probable that property paths will be used in >>>>>> practice with this type of queries. >>>>>> >>>>>> It is worth mentioning that our group is not the only one that have >>>>>> formally studied property-path semantics according to the current >>>>>> specification, and that have shown negative results about the complexity >>>>>> of evaluating it. We are aware that Katja Losemann and Wim Martens >>>>>> obtained similar results independently from us. Wim Martens gave a >>>>>> talk about this called "The complexity of evaluating path expressions >>>>>> in SPARQL" in a Dagstuhl seminar. In that work, the authors also >>>>>> studied property-path expressions of the form :P{m,n}, and show that >>>>>> the complexity of evaluating them is very high. >>>>>> >>>>>> We think that we have provided substantial new information to >>>>>> reconsider the issue of property path semantics. We have several other >>>>>> comments, but we think that the two comments above are the most >>>>>> important to consider, and we are open to continue the discussion with >>>>>> the group and, if necessary, cooperate with the group to make a proposal >>>>>> for property path evaluation that can have an efficient evaluation >>>>>> method. >>>>>> >>>>>> Best regards, >>>>>> >>>>>> Marcelo Arenas >>>>>> Sebastián Conca >>>>>> Jorge Pérez >>>>>> >>>>>> >>>>>> >>>> >>> >>> >> >> >> > > >Received on Friday, 10 February 2012 17:51:11 UTC

*
This archive was generated by hypermail 2.3.1
: Wednesday, 7 January 2015 15:01:06 UTC
*