- From: Lee Feigenbaum <lee@thefigtrees.net>
- Date: Sat, 11 Feb 2012 09:04:57 -0500
- To: Axel Polleres <axel.polleres@deri.org>
- Cc: SPARQL Working Group <public-rdf-dawg@w3.org>
- Message-ID: <CADSkEeAr1E5=09JGW1hg6Xwkr5A4jdbMmcocx4aneDYkaADMcQ@mail.gmail.com>
Complete lack of implementation experience? Lee On Feb 11, 2012 8:26 AM, "Axel Polleres" <axel.polleres@deri.org> wrote: > I see some merits for option 2, particularly, because I am unclear about > how optimizations can be defined in general > for DISTINCT at arbitrary places, whereas it seems to be clear for > DISTINCT around path expressions. > So, if we know that now, why not fix it now? > > Axel > > On 10 Feb 2012, at 18:50, Lee Feigenbaum wrote: > > > I prefer option 1, and let's learn from whether implementers end up > > implementing something like option 2 for future standardization. > > > > Lee > > > > On 2/9/2012 4:10 PM, Axel Polleres wrote: > > > (in completion of ACTION-587) > > > > > > Dear all, as discussed in the last Telco, we have several options on > how to proceed with addressing comment JP-4 [1]. > > > If possible, I would like to get consensus on how to proceed here in > the next Telco. > > > > > > In the previous Telco [2], we seemed to have consensus that we do not > aim to switch the default behaviour from counting semantics to > > > distinct paths. > > > > > > Now two possibilities to proceed were discussed: > > > > > > Option 1... keep everything as it is in the grammar, and explain which > DISTINCT path subqueries can be optimized: > > > As outlined in my email below, it might not be entirely trivial to > argue in response to the comment that this > > > would be equivalent to the JP-4 proposed semantics, I am not 100% sure > whether/how to define a rewriting to > > > wrap all path expressions into DISTINCT subqueries, such that it would > be equivalent to their semantics > > > (e.g. regarding bnode [] shortcuts). > > > > > > Option 2 ... add DISTINCT around paths: It seems that sticking to our > intended semantics and allowing - orthogonally to their > > > ALLPATHS keyword proposal the keyword DISTINCT( ) around path > expressions switching to existential paths semantics would be > > > equivalent to the JP-4 existential paths semantics as outlined in > Section 7.1 of their paper, and thus optimizable. > > > > > > Unlike someone sees a 3rd alternative, I would like to propose to > decide between those two options next time > > > and proceed, discussion prior to the call on email would be > appreciated. > > > > > > Option 2 might be easier to implement, but also requires us to go for > another LC round, as it would change the grammar. > > > I think, in case we skip PR and manage to republish very soon, we > would still manage to stay within time limits, but I would > > > also like to know the team contacts' opinion on that. > > > > > > best, > > > Axel > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1. > http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2012Jan/0009.html > > > 2. http://www.w3.org/2009/sparql/meeting/2012-02-07#comments > > > > > > On 2 Feb 2012, at 22:01, Axel Polleres wrote: > > > > > >> To get the discussion going, my personal opinion on that one is as > follows: > > >> > > >> a) I also think this is new and relevant information. > > >> b) I think it would be important to point to the fact that a naive > implementation > > >> of property paths may become very inefficient/blowing up on even > relatively harmelsssly looking examples, and that Property PATHs > > >> wrapped into DISTINCT subqueries can be evaluated more > efficiently ... I'd be even more than happy > > >> to point in an informal reference to their work, however I feel > honestly very uncomfortable with their > > >> title. > > >> c) As for their conclusion, proposing a default semantics that uses > distinct paths semantics, > > >> whereas a separate keyword ALL-PATHS would be indicating the > current semantics: > > >> it seems that (looking at their results in Section 7.1) that > this is orthogonally possible with our current semantics > > >> by just wrapping any TriplesSameSubjectPath containing a > property path into a DISTINCT subquery. > > >> It seems their result in section 7.1 indicates something along > these lines, but I need some help there: > > >> admittedly don't have a formal proof for this equivalence yet > (to be cautious, I am not yet 100% clear how/whether there is > > >> any interference possible with bnodes within > TriplesSameSubjectPath and duplicates coming from those bnodes) > > >> > > >> This all said, I unfortunately haven't had the time yet to check all > their claims in all detail. > > >> > > >> Axel > > >> > > >> > > >> On 29 Jan 2012, at 20:58, Lee Feigenbaum wrote: > > >>> [...] > > >>> I do see new information in this latest comment, so I'd like to > discuss > > >>> it, as it is high priority to know whether it might affect our > schedule. > > >>> > > >>> Lee > > >>> > > >>>> End of last call for query is Feb 6th - I don't expect we will be > > >>>> replying until after that point anyway, but, if there's time, we can > > >>>> make a start discussing it. > > >>>> > > >>>> Andy > > >>>> > > >>>> On 29/01/12 19:03, Lee Feigenbaum wrote: > > >>>>> I'd suggest we discuss on our call this week. > > >>>>> > > >>>>> Lee > > >>>>> > > >>>>> On 1/28/2012 12:38 PM, Andy Seaborne wrote: > > >>>>>> A previous conversation of these issues includes: > > >>>>>> > > >>>>>> > http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2011Feb/0005.html > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> where the WG points out: > > >>>>>> 1/ That aggregation may yield confusing results to a natural query > > >>>>>> 2/ That an optimizer may be given further information via a > sub-query > > >>>>>> and DISTINCT. > > >>>>>> > > >>>>>> The purchase order example seems to me to be a reasonable > expectation of > > >>>>>> any spreadsheet user. The same price is arrived at several times > (two > > >>>>>> ways: via two :item1's and two uses of the same literal 2). > > >>>>>> > > >>>>>> There are two sets of use cases: one set where duplicates are > essential, > > >>>>>> and one set where they are redundant. > > >>>>>> > > >>>>>> Jorge's reply then: > > >>>>>> > > >>>>>> > http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2011Feb/0012.html > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> acknowledged the points in the response. > > >>>>>> > > >>>>>> Andy > > >>>>>> > > >>>>>> -------- Original Message -------- > > >>>>>> Subject: Comments about the semantics of property paths > > >>>>>> Resent-Date: Fri, 27 Jan 2012 14:52:22 +0000 > > >>>>>> Resent-From: public-rdf-dawg-comments@w3.org > > >>>>>> Date: Fri, 27 Jan 2012 11:51:45 -0300 > > >>>>>> From: jorge perez<jorge.perez.rojas@gmail.com> > > >>>>>> To: public-rdf-dawg-comments@w3.org > > >>>>>> CC: Marcelo Arenas<marcelo.arenas1@gmail.com>, Sebastián Conca > > >>>>>> <sconca87@gmail.com>, jorge perez<jorge.perez.rojas@gmail.com> > > >>>>>> > > >>>>>> Dear DAWG members, > > >>>>>> > > >>>>>> We have some comments regarding the semantics of property paths. > We > > >>>>>> know that this issue has been raised before, but we think that we > can > > >>>>>> provide substantial new information to reconsider it. > > >>>>>> > > >>>>>> We have conducted a thorough study of the current semantics of > > >>>>>> property paths including an empirical analysis. All our results > are in > > >>>>>> a paper that has been accepted in WWW 2012. You can find a > > >>>>>> copy of the extended version of this paper at > > >>>>>> http://www.dcc.uchile.cl/~jperez/sub-www-ext.pdf. Given the tight > > >>>>>> schedule of the group, we think that it might be useful to make > these > > >>>>>> results public for the group before we have a final published > version. > > >>>>>> > > >>>>>> As a summary we can provide two main comments, one from a > practical > > >>>>>> perspective and another from a theoretical perspective. > > >>>>>> > > >>>>>> ----------- > > >>>>>> > > >>>>>> - Comment 1: Poor performance of current engines. > > >>>>>> ================================================= > > >>>>>> We tested 4 implementations of property paths: Jena, RDF::Query, > > >>>>>> Sesame, and KGram-Corese (details on the experimental setting can > be > > >>>>>> found at http://www.dcc.uchile.cl/~jperez/www_repeatability/). A > first > > >>>>>> set of experiments was with synthetic data and other with real > data. > > >>>>>> > > >>>>>> In both cases the implementations were not capable to handle even > > >>>>>> small data for the most simple property path queries. > > >>>>>> > > >>>>>> Case A) > > >>>>>> We tested RDF data representing complete graphs. No > implementation was > > >>>>>> able to handle a graph with 13 nodes for a query with a single > > >>>>>> property path of the form (:P)* > > >>>>>> > > >>>>>> data1: > http://www.dcc.uchile.cl/~jperez/www_repeatability/clique13.n3 > > >>>>>> query1: > http://www.dcc.uchile.cl/~jperez/www_repeatability/Cliq-1.rq > > >>>>>> > > >>>>>> See Figure 1 in the paper for the performance of all > implementations > > >>>>>> below 13 nodes. The figure suggests that the evaluation time for > these > > >>>>>> implementations growths doubly-exponentially w.r.t. the size of > the > > >>>>>> input. > > >>>>>> > > >>>>>> Case B) > > >>>>>> We tested real RDF data crawled from a small set of foaf > documents. We > > >>>>>> started from Axel's foaf document and retrieve friends, friends of > > >>>>>> friends, etc. following foaf:knows links, and constructed several > test > > >>>>>> cases. In this case, no implementation was able to handle an RDF > graph > > >>>>>> of 14KB for a query with a single property path (foaf:knows)* > > >>>>>> > > >>>>>> data2: http://www.dcc.uchile.cl/~jperez/www_repeatability/E.n3 > > >>>>>> query2: > http://www.dcc.uchile.cl/~jperez/www_repeatability/Foaf-1.rq > > >>>>>> > > >>>>>> See Table 1 in the paper for the performance of all > implementations. > > >>>>>> > > >>>>>> - Comment 2: High Computational Complexity. > > >>>>>> =========================================== > > >>>>>> We prove in the paper that for the current semantics of property > paths > > >>>>>> in SPARQL the complexity of evaluation is double-exponential > (Lemma > > >>>>>> 5.4 and Theorem 5.5). Given that property paths require counting > > >>>>>> paths, we measure the complexity by making use of counting > complexity > > >>>>>> classes. The technical result is that SPARQL 1.1 evaluation is not > > >>>>>> even inside #P (Theorem 5.5), where #P is the counting complexity > > >>>>>> class associated to NP (a prototypical #P-complete problem is the > > >>>>>> problem > > >>>>>> of computing the number of truth assignments that satisfies a > > >>>>>> propositional > > >>>>>> formula, which is more complicated than the prototypical > NP-complete > > >>>>>> problem > > >>>>>> which is to verify whether there exists at least one truth > assignment > > >>>>>> that > > >>>>>> satisfies a propositional formula). Thus, in informal terms, we > prove > > >>>>>> that > > >>>>>> SPARQL 1.1 evaluation considering counting is even more complex > than > > >>>>>> solving an NP-complete problem. > > >>>>>> > > >>>>>> We also prove that if only the input data is considered to > measure the > > >>>>>> complexity of the problem, then the evaluation problem is #P-hard. > > >>>>>> Notice that without property paths, the evaluation problem for > SPARQL > > >>>>>> can be solved in polynomial time (if the complexity is measured > only in > > >>>>>> terms of the size of the data). > > >>>>>> > > >>>>>> ------------ > > >>>>>> > > >>>>>> Discussion > > >>>>>> ========== > > >>>>>> > > >>>>>> One of the main conclusions that one can draw from our results is > that > > >>>>>> the poor performance exhibited in Cases A) and B) above is not a > > >>>>>> problem of the particular implementations but a problem of the > > >>>>>> specification itself, as our theoretical results imply that every > > >>>>>> implementation that follows the current specification of SPARQL > 1.1 > > >>>>>> would have the same poor behavior. > > >>>>>> > > >>>>>> Our results also show that the main source of complexity is the > > >>>>>> requirement of counting paths, and in particular the procedure ALP > > >>>>>> which is the one that gives the semantics for counting. > Essentially, > > >>>>>> the counting mechanism produces a number of duplicates that in > some > > >>>>>> cases are beyond any naturally feasible number. Table 7 in the > paper > > >>>>>> shows a worst case analysis. For instance, for the case > > >>>>>> > > >>>>>> data3: > http://www.dcc.uchile.cl/~jperez/www_repeatability/clique7.n3 > > >>>>>> query3: > http://www.dcc.uchile.cl/~jperez/www_repeatability/Cliq-2.rq > > >>>>>> > > >>>>>> we formally prove that any implementation that follows the current > > >>>>>> specification should produce an output of 79 Yottabytes (79 > trillion > > >>>>>> Terabytes), and thus would not fit in any reasonable storage > device. > > >>>>>> Please notice that unfeasible counting can also be obtained with > real > > >>>>>> data. For example, for the case > > >>>>>> > > >>>>>> data4: http://www.dcc.uchile.cl/~jperez/www_repeatability/D.n3 > > >>>>>> query2: > http://www.dcc.uchile.cl/~jperez/www_repeatability/Foaf-1.rq > > >>>>>> > > >>>>>> ARQ (which was the only implementation that was able to handle > this > > >>>>>> case in less than one hour) produced an output of 587MB. Notice > that > > >>>>>> data3 is of only 13.2KB. Table 6 in the paper shows the running > time > > >>>>>> and the output size. Please notice that this experiment is with > real > > >>>>>> data and it is highly probable that property paths will be used in > > >>>>>> practice with this type of queries. > > >>>>>> > > >>>>>> It is worth mentioning that our group is not the only one that > have > > >>>>>> formally studied property-path semantics according to the current > > >>>>>> specification, and that have shown negative results about the > complexity > > >>>>>> of evaluating it. We are aware that Katja Losemann and Wim Martens > > >>>>>> obtained similar results independently from us. Wim Martens gave a > > >>>>>> talk about this called "The complexity of evaluating path > expressions > > >>>>>> in SPARQL" in a Dagstuhl seminar. In that work, the authors also > > >>>>>> studied property-path expressions of the form :P{m,n}, and show > that > > >>>>>> the complexity of evaluating them is very high. > > >>>>>> > > >>>>>> We think that we have provided substantial new information to > > >>>>>> reconsider the issue of property path semantics. We have several > other > > >>>>>> comments, but we think that the two comments above are the most > > >>>>>> important to consider, and we are open to continue the discussion > with > > >>>>>> the group and, if necessary, cooperate with the group to make a > proposal > > >>>>>> for property path evaluation that can have an efficient evaluation > > >>>>>> method. > > >>>>>> > > >>>>>> Best regards, > > >>>>>> > > >>>>>> Marcelo Arenas > > >>>>>> Sebastián Conca > > >>>>>> Jorge Pérez > > >>>>>> > > >>>>>> > > >>>>>> > > >>>> > > >>> > > >>> > > >> > > >> > > >> > > > > > > > > > > > > >
Received on Saturday, 11 February 2012 14:05:25 UTC