Re: summary on options for JP-4 Comment about the semantics of property paths from Axel Polleres on 2012-02-11 (public-rdf-dawg@w3.org from January to March 2012)

From: Axel Polleres <axel.polleres@deri.org>
Date: Sat, 11 Feb 2012 14:26:35 +0100
To: Lee Feigenbaum <lee@thefigtrees.net>
Cc: "SPARQL Working Group" <public-rdf-dawg@w3.org>
Message-Id: <5D9E7774-5E1A-4E9C-8ACB-4512201550B1@deri.org>
I see some merits for option 2, particularly, because I am unclear about how optimizations can be defined in general
for DISTINCT at arbitrary places, whereas it seems to be clear for DISTINCT around path expressions.
So, if we know that now, why not fix it now?

Axel

On 10 Feb 2012, at 18:50, Lee Feigenbaum wrote:

> I prefer option 1, and let's learn from whether implementers end up
> implementing something like option 2 for future standardization.
> 
> Lee
> 
> On 2/9/2012 4:10 PM, Axel Polleres wrote:
> > (in completion of ACTION-587)
> >
> > Dear all, as discussed in the last Telco, we have several options on how to proceed with addressing  comment JP-4 [1].
> > If possible, I would like to get consensus on how to proceed here in the next Telco.
> >
> > In the previous Telco [2], we seemed to have consensus that we do not aim to switch the default behaviour from counting semantics to
> > distinct paths.
> >
> > Now two possibilities to proceed were discussed:
> >
> > Option 1... keep everything as it is in the grammar, and explain which DISTINCT path subqueries can be optimized:
> > As outlined in my email below, it might not be entirely trivial to argue in response to the comment that this
> > would be equivalent to the JP-4 proposed semantics, I am not 100% sure whether/how to define a rewriting to
> > wrap all path expressions into DISTINCT subqueries, such that it would be equivalent to their semantics
> > (e.g. regarding bnode [] shortcuts).
> >
> > Option 2 ... add DISTINCT around paths: It seems that sticking to our intended semantics and allowing - orthogonally to their
> > ALLPATHS keyword proposal the keyword DISTINCT( ) around path expressions switching to existential paths semantics would be
> > equivalent to the JP-4 existential paths semantics as outlined in Section 7.1 of their paper, and thus optimizable.
> >
> > Unlike someone sees a 3rd alternative, I would like to propose to decide between those two options next time
> > and proceed, discussion prior to the call on email would be appreciated.
> >
> > Option 2 might be easier to implement, but also requires us to go for another LC round, as it would change the grammar.
> > I think, in case we skip PR and manage to republish very soon, we would still manage to stay within time limits, but I would
> > also like to know the team contacts' opinion on that.
> >
> > best,
> > Axel
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > 1. http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2012Jan/0009.html
> > 2. http://www.w3.org/2009/sparql/meeting/2012-02-07#comments
> >
> > On 2 Feb 2012, at 22:01, Axel Polleres wrote:
> >
> >> To get the discussion going, my personal opinion on that one is as follows:
> >>
> >>   a) I also think this is new and relevant information.
> >>   b) I think it would be important to point to the fact that a naive implementation
> >>      of property paths may become very inefficient/blowing up on even relatively harmelsssly looking examples, and that Property PATHs
> >>      wrapped into DISTINCT subqueries can be evaluated more efficiently ... I'd be even more than happy
> >>      to point in an informal reference to their work, however I feel honestly very uncomfortable with their
> >>      title.
> >>   c) As for their conclusion, proposing a default semantics that uses distinct paths semantics,
> >>      whereas a separate keyword ALL-PATHS would be indicating the current semantics:
> >>      it seems that (looking at their results in Section 7.1) that this is orthogonally possible with our current semantics
> >>      by just wrapping any TriplesSameSubjectPath containing a property path into a DISTINCT subquery.
> >>      It seems their result in section 7.1 indicates something along these lines, but I need some help there:
> >>      admittedly don't have a formal proof for this equivalence yet (to be cautious, I am not yet 100% clear how/whether there is
> >>      any interference possible with bnodes within TriplesSameSubjectPath and duplicates coming from those bnodes)
> >>
> >> This all said, I unfortunately haven't had the time yet to check all their claims in all detail.
> >>
> >> Axel
> >>
> >>
> >> On 29 Jan 2012, at 20:58, Lee Feigenbaum wrote:
> >>> [...]
> >>> I do see new information in this latest comment, so I'd like to discuss
> >>> it, as it is high priority to know whether it might affect our schedule.
> >>>
> >>> Lee
> >>>
> >>>> End of last call for query is Feb 6th - I don't expect we will be
> >>>> replying until after that point anyway, but, if there's time, we can
> >>>> make a start discussing it.
> >>>>
> >>>> Andy
> >>>>
> >>>> On 29/01/12 19:03, Lee Feigenbaum wrote:
> >>>>> I'd suggest we discuss on our call this week.
> >>>>>
> >>>>> Lee
> >>>>>
> >>>>> On 1/28/2012 12:38 PM, Andy Seaborne wrote:
> >>>>>> A previous conversation of these issues includes:
> >>>>>>
> >>>>>> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2011Feb/0005.html
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> where the WG points out:
> >>>>>> 1/ That aggregation may yield confusing results to a natural query
> >>>>>> 2/ That an optimizer may be given further information via a sub-query
> >>>>>> and DISTINCT.
> >>>>>>
> >>>>>> The purchase order example seems to me to be a reasonable expectation of
> >>>>>> any spreadsheet user. The same price is arrived at several times (two
> >>>>>> ways: via two :item1's and two uses of the same literal 2).
> >>>>>>
> >>>>>> There are two sets of use cases: one set where duplicates are essential,
> >>>>>> and one set where they are redundant.
> >>>>>>
> >>>>>> Jorge's reply then:
> >>>>>>
> >>>>>> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2011Feb/0012.html
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> acknowledged the points in the response.
> >>>>>>
> >>>>>> Andy
> >>>>>>
> >>>>>> -------- Original Message --------
> >>>>>> Subject: Comments about the semantics of property paths
> >>>>>> Resent-Date: Fri, 27 Jan 2012 14:52:22 +0000
> >>>>>> Resent-From: public-rdf-dawg-comments@w3.org
> >>>>>> Date: Fri, 27 Jan 2012 11:51:45 -0300
> >>>>>> From: jorge perez<jorge.perez.rojas@gmail.com>
> >>>>>> To: public-rdf-dawg-comments@w3.org
> >>>>>> CC: Marcelo Arenas<marcelo.arenas1@gmail.com>, Sebastián Conca
> >>>>>> <sconca87@gmail.com>, jorge perez<jorge.perez.rojas@gmail.com>
> >>>>>>
> >>>>>> Dear DAWG members,
> >>>>>>
> >>>>>> We have some comments regarding the semantics of property paths. We
> >>>>>> know that this issue has been raised before, but we think that we can
> >>>>>> provide substantial new information to reconsider it.
> >>>>>>
> >>>>>> We have conducted a thorough study of the current semantics of
> >>>>>> property paths including an empirical analysis. All our results are in
> >>>>>> a paper that has been accepted in WWW 2012. You can find a
> >>>>>> copy of the extended version of this paper at
> >>>>>> http://www.dcc.uchile.cl/~jperez/sub-www-ext.pdf. Given the tight
> >>>>>> schedule of the group, we think that it might be useful to make these
> >>>>>> results public for the group before we have a final published version.
> >>>>>>
> >>>>>> As a summary we can provide two main comments, one from a practical
> >>>>>> perspective and another from a theoretical perspective.
> >>>>>>
> >>>>>> -----------
> >>>>>>
> >>>>>> - Comment 1: Poor performance of current engines.
> >>>>>> =================================================
> >>>>>> We tested 4 implementations of property paths: Jena, RDF::Query,
> >>>>>> Sesame, and KGram-Corese (details on the experimental setting can be
> >>>>>> found at http://www.dcc.uchile.cl/~jperez/www_repeatability/). A first
> >>>>>> set of experiments was with synthetic data and other with real data.
> >>>>>>
> >>>>>> In both cases the implementations were not capable to handle even
> >>>>>> small data for the most simple property path queries.
> >>>>>>
> >>>>>> Case A)
> >>>>>> We tested RDF data representing complete graphs. No implementation was
> >>>>>> able to handle a graph with 13 nodes for a query with a single
> >>>>>> property path of the form (:P)*
> >>>>>>
> >>>>>> data1: http://www.dcc.uchile.cl/~jperez/www_repeatability/clique13.n3
> >>>>>> query1: http://www.dcc.uchile.cl/~jperez/www_repeatability/Cliq-1.rq
> >>>>>>
> >>>>>> See Figure 1 in the paper for the performance of all implementations
> >>>>>> below 13 nodes. The figure suggests that the evaluation time for these
> >>>>>> implementations growths doubly-exponentially w.r.t. the size of the
> >>>>>> input.
> >>>>>>
> >>>>>> Case B)
> >>>>>> We tested real RDF data crawled from a small set of foaf documents. We
> >>>>>> started from Axel's foaf document and retrieve friends, friends of
> >>>>>> friends, etc. following foaf:knows links, and constructed several test
> >>>>>> cases. In this case, no implementation was able to handle an RDF graph
> >>>>>> of 14KB for a query with a single property path (foaf:knows)*
> >>>>>>
> >>>>>> data2: http://www.dcc.uchile.cl/~jperez/www_repeatability/E.n3
> >>>>>> query2: http://www.dcc.uchile.cl/~jperez/www_repeatability/Foaf-1.rq
> >>>>>>
> >>>>>> See Table 1 in the paper for the performance of all implementations.
> >>>>>>
> >>>>>> - Comment 2: High Computational Complexity.
> >>>>>> ===========================================
> >>>>>> We prove in the paper that for the current semantics of property paths
> >>>>>> in SPARQL the complexity of evaluation is double-exponential (Lemma
> >>>>>> 5.4 and Theorem 5.5). Given that property paths require counting
> >>>>>> paths, we measure the complexity by making use of counting complexity
> >>>>>> classes. The technical result is that SPARQL 1.1 evaluation is not
> >>>>>> even inside #P (Theorem 5.5), where #P is the counting complexity
> >>>>>> class associated to NP (a prototypical #P-complete problem is the
> >>>>>> problem
> >>>>>> of computing the number of truth assignments that satisfies a
> >>>>>> propositional
> >>>>>> formula, which is more complicated than the prototypical NP-complete
> >>>>>> problem
> >>>>>> which is to verify whether there exists at least one truth assignment
> >>>>>> that
> >>>>>> satisfies a propositional formula). Thus, in informal terms, we prove
> >>>>>> that
> >>>>>> SPARQL 1.1 evaluation considering counting is even more complex than
> >>>>>> solving an NP-complete problem.
> >>>>>>
> >>>>>> We also prove that if only the input data is considered to measure the
> >>>>>> complexity of the problem, then the evaluation problem is #P-hard.
> >>>>>> Notice that without property paths, the evaluation problem for SPARQL
> >>>>>> can be solved in polynomial time (if the complexity is measured only in
> >>>>>> terms of the size of the data).
> >>>>>>
> >>>>>> ------------
> >>>>>>
> >>>>>> Discussion
> >>>>>> ==========
> >>>>>>
> >>>>>> One of the main conclusions that one can draw from our results is that
> >>>>>> the poor performance exhibited in Cases A) and B) above is not a
> >>>>>> problem of the particular implementations but a problem of the
> >>>>>> specification itself, as our theoretical results imply that every
> >>>>>> implementation that follows the current specification of SPARQL 1.1
> >>>>>> would have the same poor behavior.
> >>>>>>
> >>>>>> Our results also show that the main source of complexity is the
> >>>>>> requirement of counting paths, and in particular the procedure ALP
> >>>>>> which is the one that gives the semantics for counting. Essentially,
> >>>>>> the counting mechanism produces a number of duplicates that in some
> >>>>>> cases are beyond any naturally feasible number. Table 7 in the paper
> >>>>>> shows a worst case analysis. For instance, for the case
> >>>>>>
> >>>>>> data3: http://www.dcc.uchile.cl/~jperez/www_repeatability/clique7.n3
> >>>>>> query3: http://www.dcc.uchile.cl/~jperez/www_repeatability/Cliq-2.rq
> >>>>>>
> >>>>>> we formally prove that any implementation that follows the current
> >>>>>> specification should produce an output of 79 Yottabytes (79 trillion
> >>>>>> Terabytes), and thus would not fit in any reasonable storage device.
> >>>>>> Please notice that unfeasible counting can also be obtained with real
> >>>>>> data. For example, for the case
> >>>>>>
> >>>>>> data4: http://www.dcc.uchile.cl/~jperez/www_repeatability/D.n3
> >>>>>> query2: http://www.dcc.uchile.cl/~jperez/www_repeatability/Foaf-1.rq
> >>>>>>
> >>>>>> ARQ (which was the only implementation that was able to handle this
> >>>>>> case in less than one hour) produced an output of 587MB. Notice that
> >>>>>> data3 is of only 13.2KB. Table 6 in the paper shows the running time
> >>>>>> and the output size. Please notice that this experiment is with real
> >>>>>> data and it is highly probable that property paths will be used in
> >>>>>> practice with this type of queries.
> >>>>>>
> >>>>>> It is worth mentioning that our group is not the only one that have
> >>>>>> formally studied property-path semantics according to the current
> >>>>>> specification, and that have shown negative results about the complexity
> >>>>>> of evaluating it. We are aware that Katja Losemann and Wim Martens
> >>>>>> obtained similar results independently from us. Wim Martens gave a
> >>>>>> talk about this called "The complexity of evaluating path expressions
> >>>>>> in SPARQL" in a Dagstuhl seminar. In that work, the authors also
> >>>>>> studied property-path expressions of the form :P{m,n}, and show that
> >>>>>> the complexity of evaluating them is very high.
> >>>>>>
> >>>>>> We think that we have provided substantial new information to
> >>>>>> reconsider the issue of property path semantics. We have several other
> >>>>>> comments, but we think that the two comments above are the most
> >>>>>> important to consider, and we are open to continue the discussion with
> >>>>>> the group and, if necessary, cooperate with the group to make a proposal
> >>>>>> for property path evaluation that can have an efficient evaluation
> >>>>>> method.
> >>>>>>
> >>>>>> Best regards,
> >>>>>>
> >>>>>> Marcelo Arenas
> >>>>>> Sebastián Conca
> >>>>>> Jorge Pérez
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >>
> >
> >
> >
>
Received on Saturday, 11 February 2012 13:27:07 UTC