summary on options for JP-4 Comment about the semantics of property paths from Axel Polleres on 2012-02-09 (public-rdf-dawg@w3.org from January to March 2012)

From: Axel Polleres <axel.polleres@deri.org>
Date: Thu, 9 Feb 2012 22:10:47 +0100
To: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-Id: <DE8DFE24-08E7-487C-B16D-7BAE7E26D3BB@deri.org>
(in completion of ACTION-587)

Dear all, as discussed in the last Telco, we have several options on how to proceed with addressing  comment JP-4 [1]. 
If possible, I would like to get consensus on how to proceed here in the next Telco.

In the previous Telco [2], we seemed to have consensus that we do not aim to switch the default behaviour from counting semantics to 
distinct paths.

Now two possibilities to proceed were discussed:

Option 1... keep everything as it is in the grammar, and explain which DISTINCT path subqueries can be optimized:
As outlined in my email below, it might not be entirely trivial to argue in response to the comment that this 
would be equivalent to the JP-4 proposed semantics, I am not 100% sure whether/how to define a rewriting to 
wrap all path expressions into DISTINCT subqueries, such that it would be equivalent to their semantics 
(e.g. regarding bnode [] shortcuts).

Option 2 ... add DISTINCT around paths: It seems that sticking to our intended semantics and allowing - orthogonally to their
ALLPATHS keyword proposal the keyword DISTINCT( ) around path expressions switching to existential paths semantics would be
equivalent to the JP-4 existential paths semantics as outlined in Section 7.1 of their paper, and thus optimizable.

Unlike someone sees a 3rd alternative, I would like to propose to decide between those two options next time 
and proceed, discussion prior to the call on email would be appreciated.

Option 2 might be easier to implement, but also requires us to go for another LC round, as it would change the grammar.
I think, in case we skip PR and manage to republish very soon, we would still manage to stay within time limits, but I would 
also like to know the team contacts' opinion on that.

best,
Axel












1. http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2012Jan/0009.html
2. http://www.w3.org/2009/sparql/meeting/2012-02-07#comments

On 2 Feb 2012, at 22:01, Axel Polleres wrote:

> To get the discussion going, my personal opinion on that one is as follows:
> 
>  a) I also think this is new and relevant information.
>  b) I think it would be important to point to the fact that a naive implementation
>     of property paths may become very inefficient/blowing up on even relatively harmelsssly looking examples, and that Property PATHs
>     wrapped into DISTINCT subqueries can be evaluated more efficiently ... I'd be even more than happy
>     to point in an informal reference to their work, however I feel honestly very uncomfortable with their
>     title.
>  c) As for their conclusion, proposing a default semantics that uses distinct paths semantics,
>     whereas a separate keyword ALL-PATHS would be indicating the current semantics:
>     it seems that (looking at their results in Section 7.1) that this is orthogonally possible with our current semantics
>     by just wrapping any TriplesSameSubjectPath containing a property path into a DISTINCT subquery.
>     It seems their result in section 7.1 indicates something along these lines, but I need some help there:
>     admittedly don't have a formal proof for this equivalence yet (to be cautious, I am not yet 100% clear how/whether there is
>     any interference possible with bnodes within TriplesSameSubjectPath and duplicates coming from those bnodes)
> 
> This all said, I unfortunately haven't had the time yet to check all their claims in all detail.
> 
> Axel
> 
> 
> On 29 Jan 2012, at 20:58, Lee Feigenbaum wrote:
> > [...]
> > I do see new information in this latest comment, so I'd like to discuss
> > it, as it is high priority to know whether it might affect our schedule.
> >
> > Lee
> >
> > > End of last call for query is Feb 6th - I don't expect we will be
> > > replying until after that point anyway, but, if there's time, we can
> > > make a start discussing it.
> > >
> > > Andy
> > >
> > > On 29/01/12 19:03, Lee Feigenbaum wrote:
> > >> I'd suggest we discuss on our call this week.
> > >>
> > >> Lee
> > >>
> > >> On 1/28/2012 12:38 PM, Andy Seaborne wrote:
> > >>> A previous conversation of these issues includes:
> > >>>
> > >>> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2011Feb/0005.html
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> where the WG points out:
> > >>> 1/ That aggregation may yield confusing results to a natural query
> > >>> 2/ That an optimizer may be given further information via a sub-query
> > >>> and DISTINCT.
> > >>>
> > >>> The purchase order example seems to me to be a reasonable expectation of
> > >>> any spreadsheet user. The same price is arrived at several times (two
> > >>> ways: via two :item1's and two uses of the same literal 2).
> > >>>
> > >>> There are two sets of use cases: one set where duplicates are essential,
> > >>> and one set where they are redundant.
> > >>>
> > >>> Jorge's reply then:
> > >>>
> > >>> http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2011Feb/0012.html
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> acknowledged the points in the response.
> > >>>
> > >>> Andy
> > >>>
> > >>> -------- Original Message --------
> > >>> Subject: Comments about the semantics of property paths
> > >>> Resent-Date: Fri, 27 Jan 2012 14:52:22 +0000
> > >>> Resent-From: public-rdf-dawg-comments@w3.org
> > >>> Date: Fri, 27 Jan 2012 11:51:45 -0300
> > >>> From: jorge perez <jorge.perez.rojas@gmail.com>
> > >>> To: public-rdf-dawg-comments@w3.org
> > >>> CC: Marcelo Arenas <marcelo.arenas1@gmail.com>, Sebastián Conca
> > >>> <sconca87@gmail.com>, jorge perez <jorge.perez.rojas@gmail.com>
> > >>>
> > >>> Dear DAWG members,
> > >>>
> > >>> We have some comments regarding the semantics of property paths. We
> > >>> know that this issue has been raised before, but we think that we can
> > >>> provide substantial new information to reconsider it.
> > >>>
> > >>> We have conducted a thorough study of the current semantics of
> > >>> property paths including an empirical analysis. All our results are in
> > >>> a paper that has been accepted in WWW 2012. You can find a
> > >>> copy of the extended version of this paper at
> > >>> http://www.dcc.uchile.cl/~jperez/sub-www-ext.pdf. Given the tight
> > >>> schedule of the group, we think that it might be useful to make these
> > >>> results public for the group before we have a final published version.
> > >>>
> > >>> As a summary we can provide two main comments, one from a practical
> > >>> perspective and another from a theoretical perspective.
> > >>>
> > >>> -----------
> > >>>
> > >>> - Comment 1: Poor performance of current engines.
> > >>> =================================================
> > >>> We tested 4 implementations of property paths: Jena, RDF::Query,
> > >>> Sesame, and KGram-Corese (details on the experimental setting can be
> > >>> found at http://www.dcc.uchile.cl/~jperez/www_repeatability/). A first
> > >>> set of experiments was with synthetic data and other with real data.
> > >>>
> > >>> In both cases the implementations were not capable to handle even
> > >>> small data for the most simple property path queries.
> > >>>
> > >>> Case A)
> > >>> We tested RDF data representing complete graphs. No implementation was
> > >>> able to handle a graph with 13 nodes for a query with a single
> > >>> property path of the form (:P)*
> > >>>
> > >>> data1: http://www.dcc.uchile.cl/~jperez/www_repeatability/clique13.n3
> > >>> query1: http://www.dcc.uchile.cl/~jperez/www_repeatability/Cliq-1.rq
> > >>>
> > >>> See Figure 1 in the paper for the performance of all implementations
> > >>> below 13 nodes. The figure suggests that the evaluation time for these
> > >>> implementations growths doubly-exponentially w.r.t. the size of the
> > >>> input.
> > >>>
> > >>> Case B)
> > >>> We tested real RDF data crawled from a small set of foaf documents. We
> > >>> started from Axel's foaf document and retrieve friends, friends of
> > >>> friends, etc. following foaf:knows links, and constructed several test
> > >>> cases. In this case, no implementation was able to handle an RDF graph
> > >>> of 14KB for a query with a single property path (foaf:knows)*
> > >>>
> > >>> data2: http://www.dcc.uchile.cl/~jperez/www_repeatability/E.n3
> > >>> query2: http://www.dcc.uchile.cl/~jperez/www_repeatability/Foaf-1.rq
> > >>>
> > >>> See Table 1 in the paper for the performance of all implementations.
> > >>>
> > >>> - Comment 2: High Computational Complexity.
> > >>> ===========================================
> > >>> We prove in the paper that for the current semantics of property paths
> > >>> in SPARQL the complexity of evaluation is double-exponential (Lemma
> > >>> 5.4 and Theorem 5.5). Given that property paths require counting
> > >>> paths, we measure the complexity by making use of counting complexity
> > >>> classes. The technical result is that SPARQL 1.1 evaluation is not
> > >>> even inside #P (Theorem 5.5), where #P is the counting complexity
> > >>> class associated to NP (a prototypical #P-complete problem is the
> > >>> problem
> > >>> of computing the number of truth assignments that satisfies a
> > >>> propositional
> > >>> formula, which is more complicated than the prototypical NP-complete
> > >>> problem
> > >>> which is to verify whether there exists at least one truth assignment
> > >>> that
> > >>> satisfies a propositional formula). Thus, in informal terms, we prove
> > >>> that
> > >>> SPARQL 1.1 evaluation considering counting is even more complex than
> > >>> solving an NP-complete problem.
> > >>>
> > >>> We also prove that if only the input data is considered to measure the
> > >>> complexity of the problem, then the evaluation problem is #P-hard.
> > >>> Notice that without property paths, the evaluation problem for SPARQL
> > >>> can be solved in polynomial time (if the complexity is measured only in
> > >>> terms of the size of the data).
> > >>>
> > >>> ------------
> > >>>
> > >>> Discussion
> > >>> ==========
> > >>>
> > >>> One of the main conclusions that one can draw from our results is that
> > >>> the poor performance exhibited in Cases A) and B) above is not a
> > >>> problem of the particular implementations but a problem of the
> > >>> specification itself, as our theoretical results imply that every
> > >>> implementation that follows the current specification of SPARQL 1.1
> > >>> would have the same poor behavior.
> > >>>
> > >>> Our results also show that the main source of complexity is the
> > >>> requirement of counting paths, and in particular the procedure ALP
> > >>> which is the one that gives the semantics for counting. Essentially,
> > >>> the counting mechanism produces a number of duplicates that in some
> > >>> cases are beyond any naturally feasible number. Table 7 in the paper
> > >>> shows a worst case analysis. For instance, for the case
> > >>>
> > >>> data3: http://www.dcc.uchile.cl/~jperez/www_repeatability/clique7.n3
> > >>> query3: http://www.dcc.uchile.cl/~jperez/www_repeatability/Cliq-2.rq
> > >>>
> > >>> we formally prove that any implementation that follows the current
> > >>> specification should produce an output of 79 Yottabytes (79 trillion
> > >>> Terabytes), and thus would not fit in any reasonable storage device.
> > >>> Please notice that unfeasible counting can also be obtained with real
> > >>> data. For example, for the case
> > >>>
> > >>> data4: http://www.dcc.uchile.cl/~jperez/www_repeatability/D.n3
> > >>> query2: http://www.dcc.uchile.cl/~jperez/www_repeatability/Foaf-1.rq
> > >>>
> > >>> ARQ (which was the only implementation that was able to handle this
> > >>> case in less than one hour) produced an output of 587MB. Notice that
> > >>> data3 is of only 13.2KB. Table 6 in the paper shows the running time
> > >>> and the output size. Please notice that this experiment is with real
> > >>> data and it is highly probable that property paths will be used in
> > >>> practice with this type of queries.
> > >>>
> > >>> It is worth mentioning that our group is not the only one that have
> > >>> formally studied property-path semantics according to the current
> > >>> specification, and that have shown negative results about the complexity
> > >>> of evaluating it. We are aware that Katja Losemann and Wim Martens
> > >>> obtained similar results independently from us. Wim Martens gave a
> > >>> talk about this called "The complexity of evaluating path expressions
> > >>> in SPARQL" in a Dagstuhl seminar. In that work, the authors also
> > >>> studied property-path expressions of the form :P{m,n}, and show that
> > >>> the complexity of evaluating them is very high.
> > >>>
> > >>> We think that we have provided substantial new information to
> > >>> reconsider the issue of property path semantics. We have several other
> > >>> comments, but we think that the two comments above are the most
> > >>> important to consider, and we are open to continue the discussion with
> > >>> the group and, if necessary, cooperate with the group to make a proposal
> > >>> for property path evaluation that can have an efficient evaluation
> > >>> method.
> > >>>
> > >>> Best regards,
> > >>>
> > >>> Marcelo Arenas
> > >>> Sebastián Conca
> > >>> Jorge Pérez
> > >>>
> > >>>
> > >>>
> > >
> >
> >
> 
> 
>
Received on Thursday, 9 February 2012 21:11:23 UTC