Re: SPARQL performance for ORDER BY on large datasets

On 26 Aug 2009, at 11:30 AM, Dan Brickley wrote:

> In case anyone missed this query on the semantic-web list...

Thanks Dan.

> ---------- Forwarded message ----------
> From: Niklas Lindström <lindstream@gmail.com>
> Date: 2009/8/26
> Subject: SPARQL performance for ORDER BY on large datasets
> To: Semantic Web <semantic-web@w3.org>
>
>
> Hi all!

Hi Niklas,

Responding here, a month late, because I no longer follow the
semantic-web@w3 list.


> Is this -- ORDER BY performance -- a commonly known problem, and
> considered an issue of importance (for academia and implementers
> alike)?

Yes, I'd say it's a known problem. The main reason, as I see it, is  
that ORDER BY in SPARQL is hard for implementations to do efficiently.

* SPARQL defines its own ordering for values. That means an
implementation which uses some other ordering (for
implementation-specific reasons such as compression, or because it
supports other query languages or efficient typed range queries, as
AllegroGraph does, or whatever) is at a natural disadvantage compared
to SQL DBs. A SQL DB can just walk an appropriate index and avoid
sorting the results altogether, because its internal representations
align with SQL; that's usually not true of a triple store that
implements SPARQL. (SPARQL is young.)

* SPARQL ordering is *extensible*. Even an implementation that uses  
SPARQL ordering natively for its indices is screwed if you define your  
own operator mapping for <. I'd be surprised if any implementation  
rebuilt custom indices every time a user defined a new operator mapping.

* Descending and ascending orders can be an annoyance if indices are  
built to be walked in one direction (by metaindexing and binary  
search, for example).


If you can't guarantee an in-order walk (which, as I mentioned above,  
you usually can't), the semantics of SPARQL essentially require that  
the implementation generate all possible results, sorted, prior to  
applying the LIMIT.
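
For concreteness, the shape of query we're talking about is roughly
this (I'm guessing at Niklas's prefixes and the dct:modified
predicate):

   PREFIX dct: <http://purl.org/dc/terms/>

   # dct:modified is an assumption; substitute whatever predicate
   # actually holds the modification time.
   SELECT ?doc ?modified
   WHERE {
     ?doc dct:modified ?modified .
   }
   ORDER BY ?modified
   LIMIT 10

The LIMIT looks cheap, but ORDER BY ?modified means the store has to
find and sort every ?modified binding before it can return the first
ten solutions.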

I see this a lot: a customer wonders why adding LIMIT 10 to their  
large, ordered query doesn't cause it to run in a millisecond. Of  
course, all the results must still be computed. (In this case, even  
computing only the ?modified bindings is the majority of the work, so  
partitioning the query into "must run to sort" and "can run after  
LIMIT" doesn't help.)

In short: there are a number of interactions between SPARQL and RDF,
and design decisions in SPARQL itself, which make it tiresome or
impossible for an implementor to make these queries fast. As an
industry we simply haven't had the time to push the specs towards
efficiency, to learn tricks, or to fine-tune. Contrast this with the
decades 'enjoyed' by the SQL crowd: SQL and relational databases have
evolved together for many years.

This is arguably a point against the massive stack of XMLey, webby  
specs that make up the Semantic Web layer cake: all of the semantic  
translations between these things make *every* query, *every* load  
that little bit slower. Defining SPARQL operations using XML Schema  
Datatypes as a base, for example, puts XSD datatype promotion logic  
into every (apparently simple) numeric comparison. Oof.
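
Even a FILTER as innocent-looking as this one (the ?price variable is
just an illustration) drags in that promotion logic, because the value
might be an xsd:integer, xsd:decimal, xsd:float, or xsd:double, and
"<" is defined over all of those:

   # ?price is a made-up variable; before evaluating a single "<", the
   # engine must be prepared to promote across xsd:integer,
   # xsd:decimal, xsd:float, and xsd:double.
   FILTER (?price < 5)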


> Or am I missing something really obvious in the setup of these stores,
> or in my query? I welcome *any* suggestions, such as "use triple store
> X", "for X, make sure to configure indexing on Y". Or do RDF-using
> service builders in general opt out to indexing in something else
> entirely in these cases?

In AllegroGraph, I'd advise something like the following:

* Store your modified times as encoded values, not RDF literals.
Simply defining a datatype mapping for :date-time = xsd:dateTime prior
to loading will suffice. That will immediately make the ordering
comparisons several times faster. (There's a quick sanity-check query
below for verifying that your literals are typed correctly.)

* Ensure your store is completely indexed. Novice AllegroGraph users  
often omit this vital step. It doesn't sound like you did, but best to  
check.

* Use range queries to bound the results. If you know that there are  
more than 100 results between the dawn of time and 1990, let the query  
engine know:

   FILTER (?modified < "1990-01-01T00:00:00Z")

Armed with that information, the store can reduce the amount of work  
it has to do in the ORDER BY stage by eliminating results outside the  
time window. (If you use encoded datetimes, this FILTER will directly  
affect the index walk.)
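
Putting those pieces together, the bounded version of the query would
look something like this (again, dct:modified is my guess at the
predicate; I've typed the cutoff as an xsd:dateTime so it compares
against encoded datetime values):

   PREFIX dct: <http://purl.org/dc/terms/>
   PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

   # The cutoff is whatever bound you can justify from your data.
   SELECT ?doc ?modified
   WHERE {
     ?doc dct:modified ?modified .
     FILTER (?modified < "1990-01-01T00:00:00Z"^^xsd:dateTime)
   }
   ORDER BY ?modified
   LIMIT 100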

You might need to trick the planner to get it to do the right thing,  
but that's easy enough.
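
As promised above, here's the sanity check for the datatype mapping
advice: it lists values of dct:modified (again, my guess at the
predicate) that are *not* typed as xsd:dateTime, which would defeat
both the mapping and the range query.

   PREFIX dct: <http://purl.org/dc/terms/>
   PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

   # Any rows returned here are literals the store can't encode as
   # datetimes.
   SELECT ?doc ?modified
   WHERE {
     ?doc dct:modified ?modified .
     FILTER (datatype(?modified) != xsd:dateTime)
   }
   LIMIT 10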

Feel free to contact me directly if you'd like more advice on this.

You could even build a tree of approximate numbers of entries per  
year, load the data into different named graphs based on year, run  
multiple queries, whatever. There are workarounds, but that doesn't  
address your main point.
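
As a sketch of the named-graph idea (the graph URI is completely made
up, one graph per year of data):

   PREFIX dct: <http://purl.org/dc/terms/>

   # Hypothetical per-year graph; run this against the most promising
   # years first and merge the results yourself.
   SELECT ?doc ?modified
   WHERE {
     GRAPH <http://example.org/graphs/by-year/1989> {
       ?doc dct:modified ?modified .
     }
   }
   ORDER BY ?modified
   LIMIT 100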


> (It seems queries like this are present in the Berlin SPARQL Benchmark
> (e.g. #8), but I haven't analyzed this correlation and possible
> meanings of it in depth.)

My personal opinion: the BSBM serves a limited purpose for people  
evaluating triple stores, but strikes me as very SQL-ey in style: the  
data are the opposite of sparse, and it's not a network. Relational  
databases are a much, much better fit for this problem, and thus it's  
not very interesting. It's a little like benchmarking how well an Excel
spreadsheet can do pixel animation: sure, you can do it, but there are  
other tools which are both mature and more suitable, so why bother?

(Then again, I'm a "right tool for the job" guy; I use a whole toolbox  
of languages, utilities, and services, and I don't expect any one to  
effectively substitute for another. It makes me laugh when I see  
people trying to store 100MB documents in a database or triple store.)

However, SPARQL itself is a very relational language, lacking any kind
of support for graph-walking, transitivity, and the like, which makes
it almost intentionally unsuited to working with sparse, graph-like
networks of assertions. Make of this what you will.

Regards,

-Richard

Received on Saturday, 3 October 2009 02:16:43 UTC