SPARQL Query Problem - perhaps solvable in 1.1?

OK, I've run into a problem which, at first reading, sounds simple
enough, but which I'm pretty sure is unsolvable in SPARQL 1.0. I'll
outline the problem below. Any suggestions as to how it can be solved in
SPARQL 1.0 would be gratefully received. But if you're as convinced as I
am that it's unsolvable, I also have a suggested feature for SPARQL 1.1
that should solve it.

So the problem. I'm collecting a bunch of RSS feeds into a triple store
and trying to create a single list of articles from them. I need to be
able to filter the list by date range and order it by date. The people
behind the RSS 1.0 spec (in their infinite wisdom) decided that a date
property would not be needed, so RSS feeds will typically contain a
mixed bag of different properties that could describe the publication
date of the items. I'm focusing on the following four:

 <http://purl.org/dc/terms/created>
 <http://purl.org/dc/terms/issued>
 <http://purl.org/dc/terms/date>
 <http://purl.org/dc/elements/1.1/date>

When an RSS item has a dcterms:created date, then I essentially want to
treat that as authoritative and ignore everything else. If there is no
dcterms:created, then I'd fall back to dcterms:issued, treating that as
authoritative and ignoring the other two terms, and so on.

Filtering for a date range - say, all the items published in 2008 - is
pretty tricky, but is achievable. My solution is to bind each date to a
different variable (dcterms:created is ?date1, dcterms:issued is ?date2,
etc) in OPTIONAL clauses and then do a filter like this:

FILTER (
  (bound(?date1) && inRange(?date1))
  ||
  (!bound(?date1) && bound(?date2) && inRange(?date2))
  ||
  (!bound(?date1) && !bound(?date2) && bound(?date3) && inRange(?date3))
  ||
  (!bound(?date1) && !bound(?date2) && !bound(?date3) && bound(?date4) && inRange(?date4))
)

where "inRange" in the above pseudo-code actually represents some
xsd:dateTime type casts and greater-than and less-than comparisons.
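
To make the priority logic concrete outside SPARQL, here is a minimal Python sketch of the same four-branch test. It is only a model, not a query: None stands in for an unbound variable, and in_range is a hypothetical stand-in for the "inRange" pseudo-function (assumed here to mean start <= value < end).

```python
from datetime import datetime

def in_range(value, start, end):
    """Hypothetical stand-in for the inRange pseudo-function."""
    return start <= value < end

def passes_filter(date1, date2, date3, date4, start, end):
    """Mirror the four-branch FILTER: only the first bound date is tested.

    date1..date4 follow the priority order from the text
    (dcterms:created, dcterms:issued, dcterms:date, dc:date).
    """
    return (
        (date1 is not None and in_range(date1, start, end))
        or (date1 is None and date2 is not None
            and in_range(date2, start, end))
        or (date1 is None and date2 is None and date3 is not None
            and in_range(date3, start, end))
        or (date1 is None and date2 is None and date3 is None
            and date4 is not None and in_range(date4, start, end))
    )

START_2008 = datetime(2008, 1, 1)
START_2009 = datetime(2009, 1, 1)

# An item created in 2007 is rejected even though its issued date is in 2008.
rejected = passes_filter(datetime(2007, 6, 1), datetime(2008, 3, 1),
                         None, None, START_2008, START_2009)

# With no created date, the issued date becomes authoritative.
accepted = passes_filter(None, datetime(2008, 3, 1),
                         None, None, START_2008, START_2009)
```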

This is ugly, yes, but it works. A lot of the complexity comes from
the fact that the date property found first in my order of priorities
needs to be treated as completely authoritative. So, for example, if an
item has a dcterms:created date in 2007, then when filtering for items
in 2008, that item will not be found, even if it has dcterms:issued,
dcterms:date and dc:date properties all with values in 2008!

Anyway, as I said, this is ugly, but it works. However, results are of
course returned in no particular order. Right now, I pull these results
into my application and sort them into date order there. But I'd like
the SPARQL query engine to take care of the sorting itself - in
particular, that way I'd be able to get the query engine to apply any
LIMIT and OFFSET I wanted, saving a lot of communications overhead
between the query engine and the application in the case where, say,
there are 500 matching items but I only want the first 10 ordered by
date.

Michael Hausenblas on #swig suggested a solution which at first glance
looks like it might work, and looks really easy:

 ORDER BY (?date1 || ?date2 || ?date3 || ?date4)

However, this doesn't work, as the || operator always returns an
xsd:boolean - not (as is the case in many programming languages) the
first non-false value that was passed to it.
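
The difference between the two behaviours can be sketched in Python, whose "or" happens to be of the value-returning kind. The boolean_or function below is a rough, hypothetical model of a boolean-only operator; it ignores SPARQL's effective-boolean-value and error rules entirely.

```python
def boolean_or(left, right):
    """Rough model of an 'or' that only ever yields a boolean."""
    return bool(left) or bool(right)

date1 = None                      # models an unbound ?date1
date2 = "2008-03-01T00:00:00Z"    # models a bound ?date2

value_result = date1 or date2               # the date string itself
boolean_result = boolean_or(date1, date2)   # just True

print(value_result)    # prints 2008-03-01T00:00:00Z
print(boolean_result)  # prints True
```

Sorting on the second kind of result can at best separate rows with a date from rows without one; it cannot order by the date.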

I'm fairly convinced that ordering my results is not possible without
breaking my filter.

If you need some test data to play with, I've put some here:

http://buzzword.org.uk/2009/sparql-test-data-1.ttl

The expected results should be :inRange4, :inRange3, :inRange2
and :inRange1 - in that order.

A very simple solution, which would solve the ordering problem (and also
greatly simplify the filter) would be for SPARQL to borrow the COALESCE
function from SQL. For those not familiar with COALESCE, it takes a
variable number of arguments, and returns the first of those arguments
which is not null. (In the SPARQL case, it would be the first which is
bound.)

That would make my filter as simple as:

 FILTER (inRange(COALESCE(?date1,?date2,?date3,?date4)))

And my sorting as easy as:

 ORDER BY (COALESCE(?date1,?date2,?date3,?date4))
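
As a sketch of the semantics I have in mind, here is the same idea in Python. None models an unbound variable, and the rows are made-up illustrative data, not the test data linked above.

```python
from datetime import datetime

def coalesce(*values):
    """Return the first argument that is not None, or None if there is none."""
    for value in values:
        if value is not None:
            return value
    return None

# Hypothetical rows: (item, date1, date2, date3, date4)
rows = [
    ("a", None, datetime(2008, 5, 1), None, None),
    ("b", datetime(2007, 12, 31), datetime(2008, 1, 2), None, None),
    ("c", datetime(2008, 2, 1), None, None, datetime(2008, 9, 9)),
    ("d", None, None, None, None),
]

start, end = datetime(2008, 1, 1), datetime(2009, 1, 1)

def effective_date(row):
    """The single authoritative date for a row, by priority order."""
    return coalesce(*row[1:])

# FILTER: keep rows whose effective date exists and falls in 2008.
matching = [
    row for row in rows
    if effective_date(row) is not None and start <= effective_date(row) < end
]

# ORDER BY the effective date, then apply a LIMIT of 10.
ordered = sorted(matching, key=effective_date)
first_page = ordered[:10]

print([row[0] for row in first_page])  # prints ['c', 'a']
```

Row "b" drops out because its first bound date is in 2007, and row "d" because it has no date at all.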

Failing that, even an if-then-else ternary operator would be useful:

 ORDER BY (
   if bound(?date1)
   then ?date1
   else (
     if bound(?date2) ...
   )
 )

-- 
Toby A Inkster
<mailto:mail@tobyinkster.co.uk>
<http://tobyinkster.co.uk>

Received on Tuesday, 25 August 2009 10:40:55 UTC