Comments about the semantics of property paths from jorge perez on 2012-01-27 (public-rdf-dawg-comments@w3.org from January 2012)

From: jorge perez <jorge.perez.rojas@gmail.com>
Date: Fri, 27 Jan 2012 11:51:45 -0300
To: public-rdf-dawg-comments@w3.org
Cc: Marcelo Arenas <marcelo.arenas1@gmail.com>, Sebastián Conca <sconca87@gmail.com>, jorge perez <jorge.perez.rojas@gmail.com>
Message-ID: <CAMAacUBTzYMfrDJnTi_aKHL+b=-hFN_v-rBow29swSpAD1nXZQ@mail.gmail.com>
Dear DAWG members,

We have some comments regarding the semantics of property paths. We
know that this issue has been raised before, but we think that we can
provide substantial new information to reconsider it.

We have conducted a thorough study of the current semantics of
property paths including an empirical analysis. All our results are in
a paper that has been accepted in WWW 2012. You can find a
copy of the extended version of this paper at
http://www.dcc.uchile.cl/~jperez/sub-www-ext.pdf. Given the tight
schedule of the group, we think that it might be useful to make these
results public for the group before we have a final published version.

As a summary we can provide two main comments, one from a practical
perspective and another from a theoretical perspective.

-----------

- Comment 1: Poor performance of current engines.
=================================================
We tested 4 implementations of property paths: Jena, RDF::Query,
Sesame, and KGram-Corese (details on the experimental setting can be
found at http://www.dcc.uchile.cl/~jperez/www_repeatability/). A first
set of experiments was with synthetic data and other with real data.

In both cases the implementations were not capable to handle even
small data for the most simple property path queries.

Case A)
We tested RDF data representing complete graphs. No implementation was
able to handle a graph with 13 nodes for a query with a single
property path of the form (:P)*

data1: http://www.dcc.uchile.cl/~jperez/www_repeatability/clique13.n3
query1: http://www.dcc.uchile.cl/~jperez/www_repeatability/Cliq-1.rq

See Figure 1 in the paper for the performance of all implementations
below 13 nodes. The figure suggests that the evaluation time for these
implementations growths doubly-exponentially w.r.t. the size of the
input.

Case B)
We tested real RDF data crawled from a small set of foaf documents. We
started from Axel's foaf document and retrieve friends, friends of
friends, etc. following foaf:knows links, and constructed several test
cases. In this case, no implementation was able to handle an RDF graph
of 14KB for a query with a single property path (foaf:knows)*

data2: http://www.dcc.uchile.cl/~jperez/www_repeatability/E.n3
query2: http://www.dcc.uchile.cl/~jperez/www_repeatability/Foaf-1.rq

See Table 1 in the paper for the performance of all implementations.

- Comment 2: High Computational Complexity.
===========================================
We prove in the paper that for the current semantics of property paths
in SPARQL the complexity of evaluation is double-exponential (Lemma
5.4 and Theorem 5.5). Given that property paths require counting
paths, we measure the complexity by making use of counting complexity
classes. The technical result is that SPARQL 1.1 evaluation is not
even inside #P (Theorem 5.5), where #P is the counting complexity
class associated to NP (a prototypical #P-complete problem is the problem
of computing the number of truth assignments that satisfies a propositional
formula, which is more complicated than the prototypical NP-complete problem
which is to verify whether there exists at least one truth assignment that
satisfies a propositional formula). Thus, in informal terms, we prove that
SPARQL 1.1 evaluation considering counting is even more complex than
solving an NP-complete problem.

We also prove that if only the input data is considered to measure the
complexity of the problem, then the evaluation problem is #P-hard.
Notice that without property paths, the evaluation problem for SPARQL
can be solved in polynomial time (if the complexity is measured only in
terms of the size of the data).

------------

Discussion
==========

One of the main conclusions that one can draw from our results is that
the poor performance exhibited in Cases A) and B) above is not a
problem of the particular implementations but a problem of the
specification itself, as our theoretical results imply that every
implementation that follows the current specification of SPARQL 1.1
would have the same poor behavior.

Our results also show that the main source of complexity is the
requirement of counting paths, and in particular the procedure ALP
which is the one that gives the semantics for counting. Essentially,
the counting mechanism produces a number of duplicates that in some
cases are beyond any naturally feasible number. Table 7 in the paper
shows a worst case analysis. For instance, for the case

data3: http://www.dcc.uchile.cl/~jperez/www_repeatability/clique7.n3
query3: http://www.dcc.uchile.cl/~jperez/www_repeatability/Cliq-2.rq

we formally prove that any implementation that follows the current
specification should produce an output of 79 Yottabytes (79 trillion
Terabytes), and thus would not fit in any reasonable storage device.
Please notice that unfeasible counting can also be obtained with real
data. For example, for the case

data4: http://www.dcc.uchile.cl/~jperez/www_repeatability/D.n3
query2: http://www.dcc.uchile.cl/~jperez/www_repeatability/Foaf-1.rq

ARQ (which was the only implementation that was able to handle this
case in less than one hour) produced an output of 587MB. Notice that
data3 is of only 13.2KB. Table 6 in the paper shows the running time
and the output size. Please notice that this experiment is with real
data and it is highly probable that property paths will be used in
practice with this type of queries.

It is worth mentioning that our group is not the only one that have
formally studied property-path semantics according to the current
specification, and that have shown negative results about the complexity
of evaluating it. We are aware that Katja Losemann and Wim Martens
obtained similar results independently from us. Wim Martens gave a
talk about this called "The complexity of evaluating path expressions
in SPARQL" in a Dagstuhl seminar. In that work, the authors also
studied property-path expressions of the form :P{m,n}, and show that
the complexity of evaluating them is very high.

We think that we have provided substantial new information to
reconsider the issue of property path semantics. We have several other
comments, but we think that the two comments above are the most
important to consider, and we are open to continue the discussion with
the group and, if necessary, cooperate with the group to make a proposal
for property path evaluation that can have an efficient evaluation method.

Best regards,

Marcelo Arenas
Sebastián Conca
Jorge Pérez
Received on Friday, 27 January 2012 14:52:22 UTC