Re: SPARQL performance for ORDER BY on large datasets

Yes, since D2R now pushes filters down to the database, it will use the
available indexes at the RDB level.

But the data Niklas wants to query with SPARQL are native RDF. To use
D2R over an RDBMS with indexes, he would have to transform all the data
back into SQL tables. How evil... ;-)

The idea was: would it be possible to define partial indexes for
native RDF stores such as TDB?

s   p   o
------------
     :p       ^
     :p       |
     :p       | index over :p
     :p       v
     :q       ^
     :q       | index over :q (and same object ranges, e.g. xsd:dateTime)
     :q       v
     ...
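
To make the idea concrete, here is a minimal sketch in plain Python. It
is not TDB's API or storage layout; the class PartialIndexStore, its
method top_desc and the keep callback are all hypothetical names. It
assumes the object values of an indexed predicate are mutually
comparable (for example ISO dateTime strings that share one timezone
offset, so string order equals time order). Each indexed predicate keeps
its (object, subject) pairs sorted, so ORDER BY DESC ... LIMIT n becomes
a walk from the high end of that list instead of a sort over all matches:

```python
import bisect
from collections import defaultdict
from itertools import islice


class PartialIndexStore:
    """Toy triple store with optional per-predicate indexes sorted by object."""

    def __init__(self, indexed_predicates):
        self.triples = []                    # unordered (s, p, o) tuples
        self.indexed = set(indexed_predicates)
        self.by_pred = defaultdict(list)     # p -> sorted list of (o, s)

    def add(self, s, p, o):
        self.triples.append((s, p, o))
        if p in self.indexed:
            # Keep the partial index for p ordered by object value.
            bisect.insort(self.by_pred[p], (o, s))

    def top_desc(self, p, limit, keep=lambda s: True):
        """Emulate ORDER BY DESC(?o) LIMIT n over triples with predicate p.

        keep(s) stands in for further criteria on the subject, such as a
        type check.
        """
        if p in self.indexed:
            # Walk the index from its high end; stop as soon as limit is met.
            rows = reversed(self.by_pred[p])
        else:
            # No index over p: scan everything and sort all matches.
            rows = sorted(((o, s) for s, q, o in self.triples if q == p),
                          reverse=True)
        matches = ((s, o) for o, s in rows if keep(s))
        return list(islice(matches, limit))
```

With dct:modified indexed, a call such as
store.top_desc("dct:modified", 100, keep=is_concept) only reads as many
index entries as it needs to collect 100 matching subjects (is_concept
being whatever type test the store offers). A predicate without an index
falls back to the full scan and sort.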

regards
AndyL


On Aug 27, 2009, at 11:29 AM, Seaborne, Andy wrote:
>
> Not just triple stores.  Does D2RQ enable this?  The physical data
> organisation could have a dct:modified table with dateTime sorted
> values, and the ORDER BY is then no more than choosing the right end
> of the table to start at.
>
> 	Andy
>
>>
>>
>>>
>>>
>>> Talking of data validity - in the data, all the timezones,
>>> regardless of time of year, are -04:00, which is a strangeness.  The
>>> data was generated June 2009.  Hmm.
>>>
>>> e.g. "2001-01-29T00:00:00-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>
>>>
>>>
>>> 	Andy
>>>
>>>> -----Original Message-----
>>>> From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org]
>>>> On Behalf Of Niklas Lindström
>>>> Sent: 26 August 2009 19:16
>>>> To: Semantic Web
>>>> Subject: SPARQL performance for ORDER BY on large datasets
>>>>
>>>> Hi all!
>>>>
>>>> I have a straightforward use case which seems really hard for triple
>>>> stores to handle using SPARQL (on huge datasets).
>>>>
>>>> The case: select a set of resources (based on simple criteria such as
>>>> type) and order them based on a property value (limiting the results
>>>> to a decent batch size).
>>>>
>>>> Data for test: the Library of Congress RDF dump of subject headings at
>>>> <http://id.loc.gov/authorities/search/> (~365 MB of RDF/XML,
>>>> 3 703 621 statements).
>>>>
>>>> The query:
>>>>
>>>>   PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
>>>>   PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
>>>>   PREFIX dct: <http://purl.org/dc/terms/>
>>>>
>>>>   SELECT * WHERE {
>>>>       ?resource a skos:Concept;
>>>>           dct:modified ?modified .
>>>>   }
>>>>   ORDER BY DESC(?modified)
>>>>   LIMIT 100
>>>>
>>>> I tried this with Sesame (native file store), Virtuoso (opensource
>>>> edition), and AllegroGraph (free edition), and got terrible results
>>>> (on a 2.8 GHz Intel Core 2 Duo). More minutes than worth counting;
>>>> skipping the type match gets it down to about 20 seconds *at best*.
>>>>
>>>> Sure, it's understandable (AFAIK) that this specific case is much
>>>> easier to use an SQL DB for (or something like CouchDB). But I was
>>>> surprised that it was *this* terrible. It seems unfeasible to build
>>>> e.g. chronological feeds from larger RDF stores using SPARQL with
>>>> current tools.
>>>>
>>>>
>>>> Is this -- ORDER BY performance -- a commonly known problem, and
>>>> considered an issue of importance (for academia and implementers
>>>> alike)?
>>>>
>>>> Or am I missing something really obvious in the setup of these stores,
>>>> or in my query? I welcome *any* suggestions, such as "use triple store
>>>> X", "for X, make sure to configure indexing on Y". Or do RDF-using
>>>> service builders in general opt out to indexing in something else
>>>> entirely in these cases?
>>>>
>>>>
>>>> (It seems queries like this are present in the Berlin SPARQL Benchmark
>>>> (e.g. #8), but I haven't analyzed this correlation and possible
>>>> meanings of it in depth.)
>>>>
>>>> Best regards,
>>>> Niklas Lindström


http://www.langegger.at
----------------------------------------------------------------------
Dipl.-Ing.(FH) Andreas Langegger
FAW - Institute for Application-oriented Knowledge Processing
Johannes Kepler University Linz
A-4040 Linz, Altenberger Straße 69

Received on Thursday, 27 August 2009 09:40:32 UTC