Re: summary on options for JP-4 Comment about the semantics of property paths

Complete lack of implementation experience?

Lee
On Feb 11, 2012 8:26 AM, "Axel Polleres" <axel.polleres@deri.org> wrote:

> I see some merits for option 2, particularly, because I am unclear about
> how optimizations can be defined in general
> for DISTINCT at arbitrary places, whereas it seems to be clear for
> DISTINCT around path expressions.
> So, if we know that now, why not fix it now?
>
> Axel
>
> On 10 Feb 2012, at 18:50, Lee Feigenbaum wrote:
>
> > I prefer option 1, and let's learn from whether implementers end up
> > implementing something like option 2 for future standardization.
> >
> > Lee
> >
> > On 2/9/2012 4:10 PM, Axel Polleres wrote:
> > > (in completion of ACTION-587)
> > >
> > > Dear all, as discussed in the last Telco, we have several options on
> how to proceed with addressing  comment JP-4 [1].
> > > If possible, I would like to get consensus on how to proceed here in
> the next Telco.
> > >
> > > In the previous Telco [2], we seemed to have consensus that we do not
> aim to switch the default behaviour from counting semantics to
> > > distinct paths.
> > >
> > > Now two possibilities to proceed were discussed:
> > >
> > > Option 1... keep everything as it is in the grammar, and explain which
> DISTINCT path subqueries can be optimized:
> > > As outlined in my email below, it might not be entirely trivial to
> argue in response to the comment that this
> > > would be equivalent to the JP-4 proposed semantics, I am not 100% sure
> whether/how to define a rewriting to
> > > wrap all path expressions into DISTINCT subqueries, such that it would
> be equivalent to their semantics
> > > (e.g. regarding bnode [] shortcuts).
> > >
> > > Option 2 ... add DISTINCT around paths: It seems that sticking to our
> intended semantics and allowing - orthogonally to their
> > > ALLPATHS keyword proposal the keyword DISTINCT( ) around path
> expressions switching to existential paths semantics would be
> > > equivalent to the JP-4 existential paths semantics as outlined in
> Section 7.1 of their paper, and thus optimizable.
> > >
> > > Unlike someone sees a 3rd alternative, I would like to propose to
> decide between those two options next time
> > > and proceed, discussion prior to the call on email would be
> appreciated.
> > >
> > > Option 2 might be easier to implement, but also requires us to go for
> another LC round, as it would change the grammar.
> > > I think, in case we skip PR and manage to republish very soon, we
> would still manage to stay within time limits, but I would
> > > also like to know the team contacts' opinion on that.
> > >
> > > best,
> > > Axel
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > 1.
> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2012Jan/0009.html
> > > 2. http://www.w3.org/2009/sparql/meeting/2012-02-07#comments
> > >
> > > On 2 Feb 2012, at 22:01, Axel Polleres wrote:
> > >
> > >> To get the discussion going, my personal opinion on that one is as
> follows:
> > >>
> > >>   a) I also think this is new and relevant information.
> > >>   b) I think it would be important to point to the fact that a naive
> implementation
> > >>      of property paths may become very inefficient/blowing up on even
> relatively harmelsssly looking examples, and that Property PATHs
> > >>      wrapped into DISTINCT subqueries can be evaluated more
> efficiently ... I'd be even more than happy
> > >>      to point in an informal reference to their work, however I feel
> honestly very uncomfortable with their
> > >>      title.
> > >>   c) As for their conclusion, proposing a default semantics that uses
> distinct paths semantics,
> > >>      whereas a separate keyword ALL-PATHS would be indicating the
> current semantics:
> > >>      it seems that (looking at their results in Section 7.1) that
> this is orthogonally possible with our current semantics
> > >>      by just wrapping any TriplesSameSubjectPath containing a
> property path into a DISTINCT subquery.
> > >>      It seems their result in section 7.1 indicates something along
> these lines, but I need some help there:
> > >>      admittedly don't have a formal proof for this equivalence yet
> (to be cautious, I am not yet 100% clear how/whether there is
> > >>      any interference possible with bnodes within
> TriplesSameSubjectPath and duplicates coming from those bnodes)
> > >>
> > >> This all said, I unfortunately haven't had the time yet to check all
> their claims in all detail.
> > >>
> > >> Axel
> > >>
> > >>
> > >> On 29 Jan 2012, at 20:58, Lee Feigenbaum wrote:
> > >>> [...]
> > >>> I do see new information in this latest comment, so I'd like to
> discuss
> > >>> it, as it is high priority to know whether it might affect our
> schedule.
> > >>>
> > >>> Lee
> > >>>
> > >>>> End of last call for query is Feb 6th - I don't expect we will be
> > >>>> replying until after that point anyway, but, if there's time, we can
> > >>>> make a start discussing it.
> > >>>>
> > >>>> Andy
> > >>>>
> > >>>> On 29/01/12 19:03, Lee Feigenbaum wrote:
> > >>>>> I'd suggest we discuss on our call this week.
> > >>>>>
> > >>>>> Lee
> > >>>>>
> > >>>>> On 1/28/2012 12:38 PM, Andy Seaborne wrote:
> > >>>>>> A previous conversation of these issues includes:
> > >>>>>>
> > >>>>>>
> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2011Feb/0005.html
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> where the WG points out:
> > >>>>>> 1/ That aggregation may yield confusing results to a natural query
> > >>>>>> 2/ That an optimizer may be given further information via a
> sub-query
> > >>>>>> and DISTINCT.
> > >>>>>>
> > >>>>>> The purchase order example seems to me to be a reasonable
> expectation of
> > >>>>>> any spreadsheet user. The same price is arrived at several times
> (two
> > >>>>>> ways: via two :item1's and two uses of the same literal 2).
> > >>>>>>
> > >>>>>> There are two sets of use cases: one set where duplicates are
> essential,
> > >>>>>> and one set where they are redundant.
> > >>>>>>
> > >>>>>> Jorge's reply then:
> > >>>>>>
> > >>>>>>
> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2011Feb/0012.html
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> acknowledged the points in the response.
> > >>>>>>
> > >>>>>> Andy
> > >>>>>>
> > >>>>>> -------- Original Message --------
> > >>>>>> Subject: Comments about the semantics of property paths
> > >>>>>> Resent-Date: Fri, 27 Jan 2012 14:52:22 +0000
> > >>>>>> Resent-From: public-rdf-dawg-comments@w3.org
> > >>>>>> Date: Fri, 27 Jan 2012 11:51:45 -0300
> > >>>>>> From: jorge perez<jorge.perez.rojas@gmail.com>
> > >>>>>> To: public-rdf-dawg-comments@w3.org
> > >>>>>> CC: Marcelo Arenas<marcelo.arenas1@gmail.com>, Sebastián Conca
> > >>>>>> <sconca87@gmail.com>, jorge perez<jorge.perez.rojas@gmail.com>
> > >>>>>>
> > >>>>>> Dear DAWG members,
> > >>>>>>
> > >>>>>> We have some comments regarding the semantics of property paths.
> We
> > >>>>>> know that this issue has been raised before, but we think that we
> can
> > >>>>>> provide substantial new information to reconsider it.
> > >>>>>>
> > >>>>>> We have conducted a thorough study of the current semantics of
> > >>>>>> property paths including an empirical analysis. All our results
> are in
> > >>>>>> a paper that has been accepted in WWW 2012. You can find a
> > >>>>>> copy of the extended version of this paper at
> > >>>>>> http://www.dcc.uchile.cl/~jperez/sub-www-ext.pdf. Given the tight
> > >>>>>> schedule of the group, we think that it might be useful to make
> these
> > >>>>>> results public for the group before we have a final published
> version.
> > >>>>>>
> > >>>>>> As a summary we can provide two main comments, one from a
> practical
> > >>>>>> perspective and another from a theoretical perspective.
> > >>>>>>
> > >>>>>> -----------
> > >>>>>>
> > >>>>>> - Comment 1: Poor performance of current engines.
> > >>>>>> =================================================
> > >>>>>> We tested 4 implementations of property paths: Jena, RDF::Query,
> > >>>>>> Sesame, and KGram-Corese (details on the experimental setting can
> be
> > >>>>>> found at http://www.dcc.uchile.cl/~jperez/www_repeatability/). A
> first
> > >>>>>> set of experiments was with synthetic data and other with real
> data.
> > >>>>>>
> > >>>>>> In both cases the implementations were not capable to handle even
> > >>>>>> small data for the most simple property path queries.
> > >>>>>>
> > >>>>>> Case A)
> > >>>>>> We tested RDF data representing complete graphs. No
> implementation was
> > >>>>>> able to handle a graph with 13 nodes for a query with a single
> > >>>>>> property path of the form (:P)*
> > >>>>>>
> > >>>>>> data1:
> http://www.dcc.uchile.cl/~jperez/www_repeatability/clique13.n3
> > >>>>>> query1:
> http://www.dcc.uchile.cl/~jperez/www_repeatability/Cliq-1.rq
> > >>>>>>
> > >>>>>> See Figure 1 in the paper for the performance of all
> implementations
> > >>>>>> below 13 nodes. The figure suggests that the evaluation time for
> these
> > >>>>>> implementations growths doubly-exponentially w.r.t. the size of
> the
> > >>>>>> input.
> > >>>>>>
> > >>>>>> Case B)
> > >>>>>> We tested real RDF data crawled from a small set of foaf
> documents. We
> > >>>>>> started from Axel's foaf document and retrieve friends, friends of
> > >>>>>> friends, etc. following foaf:knows links, and constructed several
> test
> > >>>>>> cases. In this case, no implementation was able to handle an RDF
> graph
> > >>>>>> of 14KB for a query with a single property path (foaf:knows)*
> > >>>>>>
> > >>>>>> data2: http://www.dcc.uchile.cl/~jperez/www_repeatability/E.n3
> > >>>>>> query2:
> http://www.dcc.uchile.cl/~jperez/www_repeatability/Foaf-1.rq
> > >>>>>>
> > >>>>>> See Table 1 in the paper for the performance of all
> implementations.
> > >>>>>>
> > >>>>>> - Comment 2: High Computational Complexity.
> > >>>>>> ===========================================
> > >>>>>> We prove in the paper that for the current semantics of property
> paths
> > >>>>>> in SPARQL the complexity of evaluation is double-exponential
> (Lemma
> > >>>>>> 5.4 and Theorem 5.5). Given that property paths require counting
> > >>>>>> paths, we measure the complexity by making use of counting
> complexity
> > >>>>>> classes. The technical result is that SPARQL 1.1 evaluation is not
> > >>>>>> even inside #P (Theorem 5.5), where #P is the counting complexity
> > >>>>>> class associated to NP (a prototypical #P-complete problem is the
> > >>>>>> problem
> > >>>>>> of computing the number of truth assignments that satisfies a
> > >>>>>> propositional
> > >>>>>> formula, which is more complicated than the prototypical
> NP-complete
> > >>>>>> problem
> > >>>>>> which is to verify whether there exists at least one truth
> assignment
> > >>>>>> that
> > >>>>>> satisfies a propositional formula). Thus, in informal terms, we
> prove
> > >>>>>> that
> > >>>>>> SPARQL 1.1 evaluation considering counting is even more complex
> than
> > >>>>>> solving an NP-complete problem.
> > >>>>>>
> > >>>>>> We also prove that if only the input data is considered to
> measure the
> > >>>>>> complexity of the problem, then the evaluation problem is #P-hard.
> > >>>>>> Notice that without property paths, the evaluation problem for
> SPARQL
> > >>>>>> can be solved in polynomial time (if the complexity is measured
> only in
> > >>>>>> terms of the size of the data).
> > >>>>>>
> > >>>>>> ------------
> > >>>>>>
> > >>>>>> Discussion
> > >>>>>> ==========
> > >>>>>>
> > >>>>>> One of the main conclusions that one can draw from our results is
> that
> > >>>>>> the poor performance exhibited in Cases A) and B) above is not a
> > >>>>>> problem of the particular implementations but a problem of the
> > >>>>>> specification itself, as our theoretical results imply that every
> > >>>>>> implementation that follows the current specification of SPARQL
> 1.1
> > >>>>>> would have the same poor behavior.
> > >>>>>>
> > >>>>>> Our results also show that the main source of complexity is the
> > >>>>>> requirement of counting paths, and in particular the procedure ALP
> > >>>>>> which is the one that gives the semantics for counting.
> Essentially,
> > >>>>>> the counting mechanism produces a number of duplicates that in
> some
> > >>>>>> cases are beyond any naturally feasible number. Table 7 in the
> paper
> > >>>>>> shows a worst case analysis. For instance, for the case
> > >>>>>>
> > >>>>>> data3:
> http://www.dcc.uchile.cl/~jperez/www_repeatability/clique7.n3
> > >>>>>> query3:
> http://www.dcc.uchile.cl/~jperez/www_repeatability/Cliq-2.rq
> > >>>>>>
> > >>>>>> we formally prove that any implementation that follows the current
> > >>>>>> specification should produce an output of 79 Yottabytes (79
> trillion
> > >>>>>> Terabytes), and thus would not fit in any reasonable storage
> device.
> > >>>>>> Please notice that unfeasible counting can also be obtained with
> real
> > >>>>>> data. For example, for the case
> > >>>>>>
> > >>>>>> data4: http://www.dcc.uchile.cl/~jperez/www_repeatability/D.n3
> > >>>>>> query2:
> http://www.dcc.uchile.cl/~jperez/www_repeatability/Foaf-1.rq
> > >>>>>>
> > >>>>>> ARQ (which was the only implementation that was able to handle
> this
> > >>>>>> case in less than one hour) produced an output of 587MB. Notice
> that
> > >>>>>> data3 is of only 13.2KB. Table 6 in the paper shows the running
> time
> > >>>>>> and the output size. Please notice that this experiment is with
> real
> > >>>>>> data and it is highly probable that property paths will be used in
> > >>>>>> practice with this type of queries.
> > >>>>>>
> > >>>>>> It is worth mentioning that our group is not the only one that
> have
> > >>>>>> formally studied property-path semantics according to the current
> > >>>>>> specification, and that have shown negative results about the
> complexity
> > >>>>>> of evaluating it. We are aware that Katja Losemann and Wim Martens
> > >>>>>> obtained similar results independently from us. Wim Martens gave a
> > >>>>>> talk about this called "The complexity of evaluating path
> expressions
> > >>>>>> in SPARQL" in a Dagstuhl seminar. In that work, the authors also
> > >>>>>> studied property-path expressions of the form :P{m,n}, and show
> that
> > >>>>>> the complexity of evaluating them is very high.
> > >>>>>>
> > >>>>>> We think that we have provided substantial new information to
> > >>>>>> reconsider the issue of property path semantics. We have several
> other
> > >>>>>> comments, but we think that the two comments above are the most
> > >>>>>> important to consider, and we are open to continue the discussion
> with
> > >>>>>> the group and, if necessary, cooperate with the group to make a
> proposal
> > >>>>>> for property path evaluation that can have an efficient evaluation
> > >>>>>> method.
> > >>>>>>
> > >>>>>> Best regards,
> > >>>>>>
> > >>>>>> Marcelo Arenas
> > >>>>>> Sebastián Conca
> > >>>>>> Jorge Pérez
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>
> > >>>
> > >>>
> > >>
> > >>
> > >>
> > >
> > >
> > >
> >
>
>

Received on Saturday, 11 February 2012 14:05:25 UTC