Re: SPARQL performance for ORDER BY on large datasets from Andreas Langegger on 2009-08-27 (semantic-web@w3.org from August 2009)

From: Andreas Langegger <al@jku.at>
Date: Thu, 27 Aug 2009 07:53:14 +0200
To: "Seaborne, Andy" <andy.seaborne@hp.com>, Niklas Lindström <lindstream@gmail.com>
Cc: Semantic Web <semantic-web@w3.org>
Message-Id: <8FA89F46-6232-412D-9709-C5B539659243@jku.at>

On Aug 27, 2009, at 12:11 AM, Seaborne, Andy wrote:
> it's a tradeoff of generality and data validity assumptions.

and of the generality of RDF compared to multiple tables, each with its
own indexable attributes, in an RDBMS.

I have been wondering for some time whether any triple store allows
defining custom indexes on specific predicates or on subsets of the
whole set of triples/quads. All the existing stores I know index all
triples in different combinations (spo, pso, ...). Is there any
research going on towards partial indexes over user-defined subsets of
triples? E.g. an index over all xsd:dateTime literals.
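
To make the idea a bit more concrete, here is a rough sketch of such a
partial index in Python. It is purely illustrative and not the API of
any existing store: the class PartialDateTimeIndex and its methods are
hypothetical names, and it assumes the literals carry an explicit
timezone offset, as in the data discussed in this thread. Only triples
whose object is an xsd:dateTime literal are indexed, kept sorted per
predicate, so an "ORDER BY DESC(?modified) LIMIT n" style lookup
becomes a slice rather than a full sort.

```python
import bisect
from datetime import datetime

XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


class PartialDateTimeIndex:
    """Hypothetical partial index covering only xsd:dateTime literals."""

    def __init__(self):
        # predicate -> list of (epoch seconds, subject), kept sorted
        self._by_predicate = {}

    def add(self, subject, predicate, lexical, datatype):
        """Index the triple if its object is an xsd:dateTime literal.

        Returns True if the triple was indexed, False if it was skipped.
        """
        if datatype != XSD_DATETIME:
            return False
        # Assumption: the lexical form includes a timezone offset;
        # offset-less values would be interpreted as local time.
        instant = datetime.fromisoformat(lexical)
        entries = self._by_predicate.setdefault(predicate, [])
        bisect.insort(entries, (instant.timestamp(), subject))
        return True

    def latest(self, predicate, limit):
        """Return up to `limit` subjects, most recent value first."""
        if limit <= 0:
            return []
        entries = self._by_predicate.get(predicate, [])
        return [subject for _, subject in reversed(entries[-limit:])]
```

A real store would still have to join the result with the other
triple patterns (e.g. the type match), keep the index up to date on
deletes, and persist it on disk; the sketch only shows why a sorted
index over a subset of the triples would help here.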

Regards,
Andy


>
>
> Talking of data validity - in the data, all the timezones,  
> regardless of time of year, are -04:00, which is a strangeness.  The  
> data was generated June 2009.  Hmm.
>
> e.g. "2001-01-29T00:00:00-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>
>
> 	Andy
>
>> -----Original Message-----
>> From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org]
>> On Behalf Of Niklas Lindström
>> Sent: 26 August 2009 19:16
>> To: Semantic Web
>> Subject: SPARQL performance for ORDER BY on large datasets
>>
>> Hi all!
>>
>> I have a straightforward use case which seems really hard for triple
>> stores to perform using SPARQL (on huge datasets).
>>
>> The case: select a set of resources (based on simple criteria such as
>> type) and order them based on a property value (limiting the results
>> to a decent batch size).
>>
>> Data for test: the Library of Congress RDF dump of subject headings
>> at <http://id.loc.gov/authorities/search/> (~365 Mb of RDF/XML,
>> 3 703 621 statements).
>>
>> The query:
>>
>>    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
>>    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
>>    PREFIX dct: <http://purl.org/dc/terms/>
>>
>>    SELECT * WHERE {
>>        ?resource a skos:Concept;
>>            dct:modified ?modified .
>>    }
>>    ORDER BY DESC(?modified)
>>    LIMIT 100
>>
>> I tried this with Sesame (native file store), Virtuoso (opensource
>> edition), and AllegroGraph (free edition), and got terrible results
>> (on a 2.8 GHz Intel Core 2 Duo). More minutes than worth counting;
>> skipping the type match gets it down to about 20 seconds *at best*.
>>
>> Sure, it's understandable (AFAIK) that this specific case is much
>> easier to use an SQL DB for (or something like CouchDB). But I was
>> surprised that it was *this* terrible. It seems unfeasible to build
>> e.g. chronological feeds from larger RDF stores using SPARQL with
>> current tools.
>>
>>
>> Is this -- ORDER BY performance -- a commonly known problem, and
>> considered an issue of importance (for academia and implementers
>> alike)?
>>
>> Or am I missing something really obvious in the setup of these
>> stores, or in my query? I welcome *any* suggestions, such as "use
>> triple store X", "for X, make sure to configure indexing on Y". Or do
>> RDF-using
>> service builders in general opt out to indexing in something else
>> entirely in these cases?
>>
>>
>> (It seems queries like this are present in the Berlin SPARQL
>> Benchmark (e.g. #8), but I haven't analyzed this correlation and
>> possible
>> meanings of it in depth.)
>>
>> Best regards,
>> Niklas Lindström
>


http://www.langegger.at
----------------------------------------------------------------------
Dipl.-Ing.(FH) Andreas Langegger
FAW - Institute for Application-oriented Knowledge Processing
Johannes Kepler University Linz
A-4040 Linz, Altenberger Straße 69

Received on Thursday, 27 August 2009 05:53:57 UTC