- From: Richard Newman <rnewman@franz.com>
- Date: Fri, 2 Oct 2009 19:16:08 -0700
- To: lindstream@gmail.com
- Cc: public-sparql-dev@w3.org
On 26 Aug 2009, at 11:30 AM, Dan Brickley wrote:

> In case anyone missed this query on the semantic-web list...

Thanks Dan.

> ---------- Forwarded message ----------
> From: Niklas Lindström <lindstream@gmail.com>
> Date: 2009/8/26
> Subject: SPARQL performance for ORDER BY on large datasets
> To: Semantic Web <semantic-web@w3.org>
>
>
> Hi all!

Hi Niklas,

Responding here, a month late, because I no longer follow the
semantic-web@w3 list.

> Is this -- ORDER BY performance -- a commonly known problem, and
> considered an issue of importance (for academia and implementers
> alike)?

Yes, I'd say it's a known problem. The main reason, as I see it, is
that ORDER BY in SPARQL is hard for implementations to do efficiently.

* SPARQL defines its own ordering for values. That means an
  implementation which uses some other ordering (either for
  implementation-specific reasons such as compression, or because it
  supports other query languages, or efficient typed range queries as
  AllegroGraph does, or whatever) is at a natural disadvantage compared
  to SQL DBs. A SQL DB can just walk an appropriate index and avoid
  sorting the results altogether, because its internal representations
  align with SQL; that's usually not true of a triple store that
  implements SPARQL. (SPARQL is young.)

* SPARQL ordering is *extensible*. Even an implementation that uses
  SPARQL ordering natively for its indices is screwed if you define
  your own operator mapping for <. I'd be surprised if any
  implementation rebuilt custom indices every time a user defined a new
  operator mapping.

* Descending and ascending orders can be an annoyance if indices are
  built to be walked in one direction (by metaindexing and binary
  search, for example).

If you can't guarantee an in-order walk (which, as I mentioned above,
you usually can't), the semantics of SPARQL essentially require that
the implementation generate all possible results, sorted, prior to
applying the LIMIT. I see this a lot: a customer wonders why adding
LIMIT 10 to their large, ordered query doesn't cause it to run in a
millisecond. Of course, all the results must still be computed. (In
this case, even computing only the ?modified bindings is the majority
of the work, so partitioning the query into "must run to sort" and
"can run after LIMIT" doesn't help.)

In short: there are a number of interactions between SPARQL and RDF,
and design decisions in SPARQL itself, which make it tiresome or
impossible for an implementor to make these queries fast. As an
industry we simply haven't had the time to push the specs towards
efficiency, to learn tricks, or to fine-tune. Contrast this with the
decades 'enjoyed' by the SQL crowd: SQL and relational databases have
evolved together for many years.

This is arguably a point against the massive stack of XMLey, webby
specs that make up the Semantic Web layer cake: all of the semantic
translations between these things make *every* query, *every* load
that little bit slower. Defining SPARQL operations using XML Schema
Datatypes as a base, for example, puts XSD datatype promotion logic
into every (apparently simple) numeric comparison. Oof.

> Or am I missing something really obvious in the setup of these stores,
> or in my query? I welcome *any* suggestions, such as "use triple store
> X", "for X, make sure to configure indexing on Y". Or do RDF-using
> service builders in general opt out to indexing in something else
> entirely in these cases?
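For concreteness, the kind of query I have in mind throughout this
reply looks roughly like this (just a sketch; I'm assuming the modified
times hang off dcterms:modified, which may not match your actual
schema):

  PREFIX dcterms: <http://purl.org/dc/terms/>

  # Hypothetical "latest documents" query; property name is assumed.
  SELECT ?doc ?modified
  WHERE {
    ?doc dcterms:modified ?modified .
  }
  ORDER BY DESC(?modified)
  LIMIT 10

The LIMIT 10 looks cheap, but as described above the store still has to
find and sort every ?modified binding before it can hand back the top
ten.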
In AllegroGraph, I'd advise something like the following:

* Store your modified times as encoded values, not RDF literals. Simply
  defining a datatype mapping for :date-time = xsd:dateTime prior to
  loading will suffice. That will immediately make the ordering
  comparisons several times faster.

* Ensure your store is completely indexed. Novice AllegroGraph users
  often omit this vital step. It doesn't sound like you did, but it's
  best to check.

* Use range queries to bound the results. If you know that there are
  more than 100 results between the dawn of time and 1990, let the
  query engine know:

    FILTER (?modified < "1990-01-01T00:00:00Z")

  Armed with that information, the store can reduce the amount of work
  it has to do in the ORDER BY stage by eliminating results outside the
  time window. (If you use encoded datetimes, this FILTER will directly
  affect the index walk.) You might need to trick the planner to get it
  to do the right thing, but that's easy enough. Feel free to contact
  me directly if you'd like more advice on this.

You could even build a tree of approximate numbers of entries per year,
load the data into different named graphs based on year, run multiple
queries, whatever. There are workarounds, but that doesn't address your
main point.

> (It seems queries like this are present in the Berlin SPARQL Benchmark
> (e.g. #8), but I haven't analyzed this correlation and possible
> meanings of it in depth.)

My personal opinion: the BSBM serves a limited purpose for people
evaluating triple stores, but strikes me as very SQL-ey in style: the
data are the opposite of sparse, and it's not a network. Relational
databases are a much, much better fit for this problem, and thus it's
not very interesting. It's a little like benchmarking how well an Excel
spreadsheet can do pixel animation: sure, you can do it, but there are
other tools which are both mature and more suitable, so why bother?

(Then again, I'm a "right tool for the job" guy; I use a whole toolbox
of languages, utilities, and services, and I don't expect any one to
effectively substitute for another. It makes me laugh when I see people
trying to store 100MB documents in a database or triple store.)

However, SPARQL itself is a very relational language, lacking any kind
of graph-walking or transitivity support, which makes it almost
intentionally unsuited to working with sparse, graph-like networks of
assertions. Make of this what you will.

Regards,
-Richard
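P.S. To make the range-query suggestion concrete, here's the earlier
sketch with a bounding FILTER added (again hypothetical; the property
name and cutoff are assumptions, and you'd pick a bound you know still
leaves enough results for your LIMIT):

  PREFIX dcterms: <http://purl.org/dc/terms/>
  PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

  # Only consider documents modified after an assumed cutoff.
  SELECT ?doc ?modified
  WHERE {
    ?doc dcterms:modified ?modified .
    FILTER (?modified > "2009-01-01T00:00:00Z"^^xsd:dateTime)
  }
  ORDER BY DESC(?modified)
  LIMIT 10

With encoded datetimes, a FILTER like that narrows the index walk
before the ORDER BY stage ever sees the results, rather than sorting
bindings drawn from the whole store.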
Received on Saturday, 3 October 2009 02:16:43 UTC