Re: LIMIT problem (Paging)

> The results are too large to keep in memory, so I would like to page  
> them using LIMIT and OFFSET. However, it does not work with the above  
> query. The query above needs all results to be loaded into memory  
> when evaluating it. I assume this is because more than one statement  
> is evaluated in the WHERE clause(?).

That's not why: it's because you're imposing an order with ORDER BY.

There are (broadly speaking) two ways this query could be executed.

If a store has an index on my:hasUserID (and that index happens to be  
in SPARQL's defined order!) then results can be generated in ordered  
sequence. Successive pages can be generated by re-running the query,  
skipping more and more results, or somehow holding on to a cursor.  
It's not enough to just skip userIDs: *rows* must be skipped, so the  
query does have to be executed in order to skip to the right point.
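Concretely, the offset-based approach looks like this (sketched against the my:hasUserID predicate from your query; the prefix URI and variable names are assumptions):

```sparql
PREFIX my: <http://example.org/my#>

SELECT ?user ?id
WHERE { ?user my:hasUserID ?id }
ORDER BY ?id     # imposes an ordering; cheap only if a matching index exists
LIMIT 1000       # page size
OFFSET 2000      # skip the first two pages of *rows*
```

Each successive page re-runs this with a larger OFFSET; without a matching index, every run has to sort at least up to the requested offset.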

If a store does not have such an index, or your ORDER BY clause is  
more complicated, then all the results must be gathered in memory to  
be sorted. There's really no way around that.

For a store that doesn't maintain state between queries, generating  
successive pages in this manner will essentially involve running the  
whole query each time, returning a different chunk of the results. If  
you have to sort 100,000 result rows in order to determine the first  
1,000, then the second 1,000, you're going to see pretty poor  
performance.

Each query execution will reflect any changes in the store since the  
last page was generated, which can produce confusing results.


> So, how could I page the above query?

Do it in your application. That way you also avoid the data changing  
between pages.
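A minimal sketch of doing the paging in the application: execute the full query once, hold on to the results, and slice pages locally. The run_query() function here is a stand-in that fabricates rows for illustration; in practice it would be whatever your SPARQL client library returns.

```python
# Sketch: client-side paging over a result set fetched once.

def run_query():
    # Stand-in for executing the full ORDER BY query once and
    # materializing all rows; here we fabricate sorted (user, id) rows.
    return [(f"user{i}", i) for i in range(100_000)]

def pages(rows, page_size):
    """Yield successive fixed-size slices of an in-memory result list."""
    for start in range(0, len(rows), page_size):
        yield rows[start:start + page_size]

results = run_query()            # one execution, one consistent snapshot
pager = pages(results, 1000)
first = next(pager)
second = next(pager)
print(len(first), first[0][1], second[0][1])   # 1000 0 1000
```

Because the query runs only once, every page is cut from the same snapshot, so inserts or deletes in the store can't shift rows between pages.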

I don't think that LIMIT and OFFSET are useful for supporting paging,  
because the spec doesn't place any efficiency requirements on  
implementations (such as the cursors provided by Freebase MQL  
queries). It's odd to say "you could do it using the method the spec  
recommends, but you'd be crazy to do so with real datasets". I  
consider LIMIT's only real use to be constraining the size of the  
result set, not defining a page size.
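That use looks like this (prefix and predicate assumed, as before):

```sparql
PREFIX my: <http://example.org/my#>

# Cap the result set at 100 rows. With no ORDER BY, *which* 100 you
# get is implementation-defined, but no full sort is required.
SELECT ?user ?id
WHERE { ?user my:hasUserID ?id }
LIMIT 100
```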

IMO it would be much more useful to separate SPARQL execution into two  
phases: a query that returns a result set, and then operations on that  
result set (such as serializing slices of it). Conflating the two  
places the burden of efficient paging on the implementation, and  
there's no single solution that works well for all clients.
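To sketch what that separation might look like: a hypothetical client API (nothing like this exists in the SPARQL spec; the class, method names, and stubbed rows are all invented for illustration) where phase one executes the query once and phase two operates on the held result.

```python
# Hypothetical two-phase API: execute() runs the query once and hands
# back a ResultSet; slicing is then a cheap operation on that handle.

class ResultSet:
    def __init__(self, rows):
        self._rows = rows              # materialized once, at query time

    def slice(self, offset, limit):
        """Phase-2 operation: serialize a window of the result."""
        return self._rows[offset:offset + limit]

    def __len__(self):
        return len(self._rows)

def execute(query):
    """Phase 1: run the query, return a handle to its result set.
    Stubbed with fake rows; a real store would evaluate `query`."""
    return ResultSet([(f"user{i}", i) for i in range(5000)])

rs = execute("SELECT ?user ?id WHERE { ?user my:hasUserID ?id } ORDER BY ?id")
page1 = rs.slice(0, 1000)      # no re-execution between pages,
page2 = rs.slice(1000, 1000)   # so the data can't shift underneath
print(len(rs), page1[0][1], page2[0][1])   # 5000 0 1000
```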

-R

Received on Monday, 25 May 2009 18:20:35 UTC