- From: Seaborne, Andy <andy.seaborne@hp.com>
- Date: Thu, 27 Aug 2009 09:29:46 +0000
- To: Andreas Langegger <al@jku.at>
- CC: Semantic Web <semantic-web@w3.org>
> -----Original Message-----
> From: Andreas Langegger [mailto:al@jku.at]
> Sent: 27 August 2009 06:53
> To: Seaborne, Andy; Niklas Lindström
> Cc: Semantic Web
> Subject: Re: SPARQL performance for ORDER BY on large datasets
>
> On Aug 27, 2009, at 12:11 AM, Seaborne, Andy wrote:
> > it's a tradeoff of generality and data validity assumptions.
>
> and generality of RDF compared to multiple tables with their own
> indexable attributes in RDBMS.
>
> I'm wondering for some time already if there is any triple store that
> allows to define custom indexes on special predicates or subsets of
> the whole set of triples/quads? All the existing stores I know index
> over all triples in different combinations (spo, pso, ...). Is there
> any research going on towards partial indexes over user-defined
> subsets of triples? E.g. an index over all xsd:dateTime literals.
>
> Regards,
> Andy
Not just triple stores. Does D2RQ enable this? The physical data organisation could have a dct:modified table with values sorted by dateTime, and the ORDER BY is then no more than choosing the right end of the table to start at.
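To make the idea concrete, here is a minimal sketch in Python. It is an illustration only, not how D2RQ or any particular store works. The names `SortedPredicateIndex`, `add` and `top_desc` and the `ex:` subjects are invented for the example. It assumes all values for the one predicate fit in an in-memory list kept in ascending order, and it ignores the skos:Concept type check from the original query.

```python
from bisect import insort
from datetime import datetime
from itertools import islice


class SortedPredicateIndex:
    """Hypothetical partial index: the values of one predicate, kept sorted."""

    def __init__(self):
        # (value, subject) pairs in ascending order of value.
        self._entries = []

    def add(self, subject, value):
        # insort places the pair at its sorted position, so no later sort is needed.
        insort(self._entries, (value, subject))

    def top_desc(self, limit):
        # ORDER BY DESC(?value) LIMIT n: start at the high end and stop after n rows.
        return [(s, v) for v, s in islice(reversed(self._entries), limit)]


# Usage: index a few dct:modified values, then read the newest two.
modified = SortedPredicateIndex()
modified.add("ex:a", datetime.fromisoformat("2001-01-29T00:00:00-04:00"))
modified.add("ex:b", datetime.fromisoformat("2009-06-01T00:00:00-04:00"))
modified.add("ex:c", datetime.fromisoformat("2005-03-15T00:00:00-04:00"))
newest = modified.top_desc(2)
```

The point of the sketch is that the query touches only `limit` entries at one end of the index, rather than materialising and sorting every match first.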
Andy
>
>
> >
> >
> > Talking of data validity - in the data, all the timezones,
> > regardless of time of year, are -04:00, which is a strangeness. The
> > data was generated June 2009. Hmm.
> >
> > e.g. "2001-01-29T00:00:00-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>
> >
> > Andy
> >
> >> -----Original Message-----
> >> From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org]
> >> On Behalf Of Niklas Lindström
> >> Sent: 26 August 2009 19:16
> >> To: Semantic Web
> >> Subject: SPARQL performance for ORDER BY on large datasets
> >>
> >> Hi all!
> >>
> >> I have a straightforward use case which seems really hard for triple
> >> stores to perform for using SPARQL (on huge datasets).
> >>
> >> The case: select a set of resources (based on simple criteria such as
> >> type) and order them based on a property value (limiting the results
> >> to a decent batch size).
> >>
> >> Data for test: the Library of Congress RDF dump of subject headings at
> >> <http://id.loc.gov/authorities/search/> (~365 Mb of RDF/XML,
> >> 3 703 621 statements).
> >>
> >> The query:
> >>
> >> PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
> >> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> >> PREFIX dct: <http://purl.org/dc/terms/>
> >>
> >> SELECT * WHERE {
> >>   ?resource a skos:Concept;
> >>     dct:modified ?modified .
> >> }
> >> ORDER BY DESC(?modified)
> >> LIMIT 100
> >>
> >> I tried this with Sesame (native file store), Virtuoso (opensource
> >> edition), and AllegroGraph (free edition), and got terrible results
> >> (on a 2.8 GHz Intel Core 2 Duo). More minutes than worth counting;
> >> skipping the type match gets it down to about 20 seconds *at best*.
> >>
> >> Sure, it's understandable (AFAIK) that this specific case is much
> >> easier to use an SQL DB for (or something like CouchDB). But I was
> >> surprised that it was *this* terrible. It seems unfeasible to build
> >> e.g. chronological feeds from larger RDF stores using SPARQL with
> >> current tools.
> >>
> >>
> >> Is this -- ORDER BY performance -- a commonly known problem, and
> >> considered an issue of importance (for academia and implementers
> >> alike)?
> >>
> >> Or am I missing something really obvious in the setup of these stores,
> >> or in my query? I welcome *any* suggestions, such as "use triple store
> >> X", "for X, make sure to configure indexing on Y". Or do RDF-using
> >> service builders in general opt out to indexing in something else
> >> entirely in these cases?
> >>
> >>
> >> (It seems queries like this are present in the Berlin SPARQL Benchmark
> >> (e.g. #8), but I haven't analyzed this correlation and possible
> >> meanings of it in depth.)
> >>
> >> Best regards,
> >> Niklas Lindström
> >
>
>
> http://www.langegger.at
> ----------------------------------------------------------------------
> Dipl.-Ing.(FH) Andreas Langegger
> FAW - Institute for Application-oriented Knowledge Processing
> Johannes Kepler University Linz
> A-4040 Linz, Altenberger Straße 69
>
>
>
>
>
Received on Thursday, 27 August 2009 09:31:01 UTC