Re: Sorting from Leigh Dodds on 2005-03-09 (public-rdf-dawg-comments@w3.org from March 2005)

From: Leigh Dodds <leigh@ldodds.com>
Date: Wed, 09 Mar 2005 09:46:57 +0000
To: Dan Connolly <connolly@w3.org>
CC: public-rdf-dawg-comments@w3.org, danny.ayers@gmail.com
Message-ID: <422EC611.2000109@ldodds.com>

Dan Connolly wrote:

> You're welcome to elaborate on why you think it's important/required. 
> Use cases are particularly welcome, especially use cases that argue for 
> handling sorting in SPARQL rather than in a downstream component or 
> client or XSLT engine or the like.

OK, most of my examples will come from the bibliographic domain, as
that's the application area I currently work in. We're at present
prototyping a replacement for our content storage systems using an RDF
triple store and are hoping to use SPARQL to query that store.

Sorting of results, e.g. articles in a TOC, issues in a journal, or
items in a user's reading list, has been implemented at both levels:
in the query layer, e.g. when using a SQL database; and in the application
layer, e.g. when sorting criteria are more complex (serial issue release
dates, special ordering for supplements, indexes, etc).

We recently pushed code back from the application layer into the
query, implementing custom comparators where necessary. The result was
an improvement in application performance, as well as a simpler
application: procedural code to invoke a sort became a
declarative aspect of the query.

Use cases for sorting in our application include:

- retrieve all articles associated with an issue and sort them by
page number.

- retrieve all issues associated with a journal and sort them by
publication date.

- retrieve all articles bookmarked by a user and sort them by
journal name or date bookmarked.

- retrieve all journals within a subject area and sort them by
name.

- retrieve all articles written by an author, and sort them by
publication date.

All of these can be implemented at a higher layer at the cost
of implementing custom comparators - one for each data type.
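
To give a feel for that cost, here is a minimal sketch of one such
application-level comparator, in Python purely for illustration. The
record layout, the labels and the rule that supplements follow regular
issues published on the same date are all assumptions made up for the
example, not a description of our actual system.

```python
from datetime import date
from functools import cmp_to_key

# Hypothetical issue records: (label, publication date, is_supplement).
issues = [
    ("Vol 12 Supplement", date(2004, 6, 1), True),
    ("Vol 12 No 2", date(2004, 6, 1), False),
    ("Vol 12 No 1", date(2004, 3, 1), False),
]

def compare_issues(a, b):
    """Order by publication date; on the same date, regular issues precede supplements."""
    if a[1] != b[1]:
        return -1 if a[1] < b[1] else 1
    # False sorts before True, so a supplement follows a regular issue.
    return (a[2] > b[2]) - (a[2] < b[2])

ordered = sorted(issues, key=cmp_to_key(compare_issues))
labels = [label for label, _, _ in ordered]
```

Each further data type (page numbers, journal names, bookmark dates)
would need a comparator of its own along these lines.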

The nice aspect of having a query contain all the application
criteria (specifying "WHERE" clauses, ordering, limits) is that
the query engine has much more information available to it to
allow optimisation. E.g. a triple store backed by a relational
engine may be able to optimise its queries to use native sorting
capabilities. Even manual optimisation becomes easier when the
query is self-contained.

I note the interaction between LIMIT and ORDER BY, but would argue
that LIMIT is unnecessary: I can merely fetch the first n results
that I'm interested in.

Looking through the queries I'd typically write against a relational
store, I find that I'm heavily reliant on ordering and rarely use
anything like LIMIT: it's much more likely that I only want the first
10, then the next 10, etc; unless I'm missing something, paging isn't
possible with LIMIT as specified. Where I do have a use for it,
e.g. as in Danny's use case, it's much simpler to implement at an
application level than sorting.
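
As a rough illustration of how little code an application-level limit
needs, here is a sketch in Python. It assumes the results arrive
already ordered from the query engine; the page() helper and the
sample titles are hypothetical.

```python
from itertools import islice

def page(results, page_number, page_size=10):
    """Return one page from an already ordered iterable (pages count from 0)."""
    start = page_number * page_size
    return list(islice(results, start, start + page_size))

# Hypothetical ordered result set of 25 article titles.
titles = [f"Article {n}" for n in range(1, 26)]

first_page = page(titles, 0)   # the first 10 results
third_page = page(titles, 2)   # the remaining 5 results
```

Taking only the first page is the equivalent of a LIMIT; later pages
need an offset as well, which is the part a LIMIT alone doesn't give me.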

There are systems that support both LIMIT and ORDER BY: search engines.
E.g. order by relevance, but just return the first 20. IIRC Google
applies its PageRank (a sort) to the first 1000 results or so (a limit). 
I know of other implementors that have taken similar approaches. It's 
not "ideal", but I note it as one possible implementation approach.
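
A sketch of that "limit first, then sort" approach, in Python. This is
not a claim about how any particular search engine is implemented; the
top_results() function, its parameters and the scoring function are
all hypothetical.

```python
from itertools import islice

def top_results(matches, score, candidate_limit=1000, page_size=20):
    """Rank only the first candidate_limit matches and return the best page_size."""
    # The limit: stop collecting matches after candidate_limit items.
    candidates = list(islice(matches, candidate_limit))
    # The sort: order just those candidates, highest score first.
    candidates.sort(key=score, reverse=True)
    return candidates[:page_size]

# Hypothetical usage: 5000 matches whose score is simply their number.
best = top_results(iter(range(5000)), score=lambda n: n)
```

The trade-off is that a match beyond the candidate limit never reaches
the sort, however well it would have scored.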

(Aside: I'd also argue that ASK is unnecessary, as I can merely test
for a non-empty result set from a SELECT query; but that's a different
thread)

Hope that's useful.

Cheers,

L.
Received on Wednesday, 9 March 2005 09:47:01 UTC