RE: SPARQL performance for ORDER BY on large datasets from Seaborne, Andy on 2009-08-27 (semantic-web@w3.org from August 2009)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Thu, 27 Aug 2009 09:29:46 +0000
To: Andreas Langegger <al@jku.at>
CC: Semantic Web <semantic-web@w3.org>
Message-ID: <B6CF1054FDC8B845BF93A6645D19BEA3693CC708D3@GVW1118EXC.americas.hpqcorp.net>


> -----Original Message-----
> From: Andreas Langegger [mailto:al@jku.at]
> Sent: 27 August 2009 06:53
> To: Seaborne, Andy; Niklas Lindström
> Cc: Semantic Web
> Subject: Re: SPARQL performance for ORDER BY on large datasets
> 
> On Aug 27, 2009, at 12:11 AM, Seaborne, Andy wrote:
> > it's a tradeoff of generality and data validity assumptions.
> 
> and generality of RDF compared to multiple tables with their own
> indexable attributes in RDBMS.
> 
> I'm wondering for some time already if there is any triple store that
> allows to define custom indexes on special predicates or subsets of
> the whole set of triples/quads? All the existing stores I know index
> over all triples in different combinations (spo, pso, ...). Is there
> any research going on towards partial indexes over user-defined
> subsets of triples? E.g. an index over all xsd:dateTime literals.
> 
> Regards,
> Andy


Not just triple stores.  Does D2RQ enable this?  The physical organisation of the data could have a dct:modified table with sorted dateTime values, and the ORDER BY is then no more than choosing the right end of the table to start at.
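
To make that concrete, here is a minimal sketch in Python of the idea, not of how D2RQ or any particular store works. The class `SortedPredicateIndex`, the function `latest`, and the `ex:` resource names are all hypothetical, made up for illustration. It assumes every value carries the same timezone offset (as the -04:00 values in this dataset do), so plain string comparison of the lexical forms agrees with chronological order.

```python
import bisect
from itertools import islice


class SortedPredicateIndex:
    """Hypothetical per-predicate index: (value, subject) pairs kept sorted."""

    def __init__(self):
        self._entries = []  # always sorted ascending by (value, subject)

    def add(self, subject, value):
        # Pay the ordering cost at load time, not at query time.
        bisect.insort(self._entries, (value, subject))

    def descending(self):
        # ORDER BY DESC(...) becomes a walk from the high end of the table.
        return ((subject, value) for value, subject in reversed(self._entries))


def latest(index, is_match, limit):
    """Emulate ORDER BY DESC(?modified) LIMIT n with a simple type filter."""
    matches = (row for row in index.descending() if is_match(row[0]))
    return list(islice(matches, limit))  # stops after `limit` hits


# Illustrative data only; assumption: all values share the -04:00 offset.
index = SortedPredicateIndex()
index.add("ex:a", "2001-01-29T00:00:00-04:00")
index.add("ex:b", "2009-06-02T00:00:00-04:00")
index.add("ex:c", "2005-03-10T00:00:00-04:00")
index.add("ex:d", "2007-11-20T00:00:00-04:00")

concepts = {"ex:a", "ex:b", "ex:d"}  # stands in for "?resource a skos:Concept"
top = latest(index, concepts.__contains__, 2)
print(top)
```

The point of the sketch is that the scan stops as soon as LIMIT rows have matched, so the cost depends on the batch size and the selectivity of the type filter rather than on the total number of dct:modified values.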

 Andy

> 
> 
> >
> >
> > Talking of data validity - in the data, all the timezones,
> > regardless of time of year, are -04:00, which is a strangeness.  The
> > data was generated June 2009.  Hmm.
> >
> > e.g. "2001-01-29T00:00:00-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>
> >
> >  Andy
> >
> >> -----Original Message-----
> >> From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org] On Behalf Of Niklas Lindström
> >> Sent: 26 August 2009 19:16
> >> To: Semantic Web
> >> Subject: SPARQL performance for ORDER BY on large datasets
> >>
> >> Hi all!
> >>
> >> I have a straightforward use case which seems really hard for triple
> >> stores to perform for using SPARQL (on huge datasets).
> >>
> >> The case: select a set of resources (based on simple criteria such as
> >> type) and order them based on a property value (limiting the results
> >> to a decent batch size).
> >>
> >> Data for test: the Library of Congress RDF dump of subject headings
> >> at <http://id.loc.gov/authorities/search/> (~365 Mb of RDF/XML,
> >> 3 703 621 statements).
> >>
> >> The query:
> >>
> >>    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
> >>    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> >>    PREFIX dct: <http://purl.org/dc/terms/>
> >>
> >>    SELECT * WHERE {
> >>        ?resource a skos:Concept;
> >>            dct:modified ?modified .
> >>    }
> >>    ORDER BY DESC(?modified)
> >>    LIMIT 100
> >>
> >> I tried this with Sesame (native file store), Virtuoso (opensource
> >> edition), and AllegroGraph (free edition), and got terrible results
> >> (on a 2.8 GHz Intel Core 2 Duo). More minutes than worth counting;
> >> skipping the type match gets it down to about 20 seconds *at best*.
> >>
> >> Sure, it's understandable (AFAIK) that this specific case is much
> >> easier to use an SQL DB for (or something like CouchDB). But I was
> >> surprised that it was *this* terrible. It seems unfeasible to build
> >> e.g. chronological feeds from larger RDF stores using SPARQL with
> >> current tools.
> >>
> >>
> >> Is this -- ORDER BY performance -- a commonly known problem, and
> >> considered an issue of importance (for academia and implementers
> >> alike)?
> >>
> >> Or am I missing something really obvious in the setup of these
> >> stores, or in my query? I welcome *any* suggestions, such as "use
> >> triple store X", "for X, make sure to configure indexing on Y". Or do
> >> RDF-using service builders in general opt out to indexing in
> >> something else entirely in these cases?
> >>
> >>
> >> (It seems queries like this are present in the Berlin SPARQL
> >> Benchmark (e.g. #8), but I haven't analyzed this correlation and
> >> possible meanings of it in depth.)
> >>
> >> Best regards,
> >> Niklas Lindström
> >
> 
> 
> http://www.langegger.at

> ----------------------------------------------------------------------
> Dipl.-Ing.(FH) Andreas Langegger
> FAW - Institute for Application-oriented Knowledge Processing
> Johannes Kepler University Linz
> A-4040 Linz, Altenberger Straße 69
Received on Thursday, 27 August 2009 09:31:01 UTC