- From: Seaborne, Andy <andy.seaborne@hp.com>
- Date: Thu, 27 Aug 2009 09:29:46 +0000
- To: Andreas Langegger <al@jku.at>
- CC: Semantic Web <semantic-web@w3.org>
> -----Original Message-----
> From: Andreas Langegger [mailto:al@jku.at]
> Sent: 27 August 2009 06:53
> To: Seaborne, Andy; Niklas Lindström
> Cc: Semantic Web
> Subject: Re: SPARQL performance for ORDER BY on large datasets
>
> On Aug 27, 2009, at 12:11 AM, Seaborne, Andy wrote:
> > it's a tradeoff of generality and data validity assumptions.
>
> and generality of RDF compared to multiple tables with their own
> indexable attributes in RDBMS.
>
> I'm wondering for some time already if there is any triple store that
> allows to define custom indexes on special predicates or subsets of
> the whole set of triples/quads? All the existing stores I know index
> over all triples in different combinations (spo, pso, ...). Is there
> any research going on towards partial indexes over user-defined
> subsets of triples? E.g. an index over all xsd:dateTime literals.
>
> Regards,
> Andy

Not just triple stores. Does D2RQ enable this?

The data physical organisation could have a dct:modified table with
dateTime sorted values and the ORDER BY is no more than choosing the
right end of the table to start at.

	Andy

> >
> > Talking of data validity - in the data, all the timezones,
> > regardless of time of year, are -04:00, which is a strangeness. The
> > data was generated June 2009. Hmm.
> >
> > e.g. "2001-01-29T00:00:00-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>
> >
> > Andy
> >
> >> -----Original Message-----
> >> From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org] On
> >> Behalf Of Niklas Lindström
> >> Sent: 26 August 2009 19:16
> >> To: Semantic Web
> >> Subject: SPARQL performance for ORDER BY on large datasets
> >>
> >> Hi all!
> >>
> >> I have a straightforward use case which seems really hard for triple
> >> stores to perform for using SPARQL (on huge datasets).
> >>
> >> The case: select a set of resources (based on simple criteria such as
> >> type) and order them based on a property value (limiting the results
> >> to a decent batch size).
> >>
> >> Data for test: the Library of Congress RDF dump of subject headings at
> >> <http://id.loc.gov/authorities/search/> (~365 Mb of RDF/XML, 3 703 621
> >> statements).
> >>
> >> The query:
> >>
> >> PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
> >> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> >> PREFIX dct: <http://purl.org/dc/terms/>
> >>
> >> SELECT * WHERE {
> >>     ?resource a skos:Concept;
> >>         dct:modified ?modified .
> >> }
> >> ORDER BY DESC(?modified)
> >> LIMIT 100
> >>
> >> I tried this with Sesame (native file store), Virtuoso (opensource
> >> edition), and AllegroGraph (free edition), and got terrible results
> >> (on a 2.8 GHz Intel Core 2 Duo). More minutes than worth counting;
> >> skipping the type match gets it down to about 20 seconds *at best*.
> >>
> >> Sure, it's understandable (AFAIK) that this specific case is much
> >> easier to use an SQL DB for (or something like CouchDB). But I was
> >> surprised that it was *this* terrible. It seems unfeasible to build
> >> e.g. chronological feeds from larger RDF stores using SPARQL with
> >> current tools.
> >>
> >> Is this -- ORDER BY performance -- a commonly known problem, and
> >> considered an issue of importance (for academia and implementers
> >> alike)?
> >>
> >> Or am I missing something really obvious in the setup of these stores,
> >> or in my query? I welcome *any* suggestions, such as "use triple store
> >> X", "for X, make sure to configure indexing on Y". Or do RDF-using
> >> service builders in general opt out to indexing in something else
> >> entirely in these cases?
> >>
> >> (It seems queries like this are present in the Berlin SPARQL Benchmark
> >> (e.g. #8), but I haven't analyzed this correlation and possible
> >> meanings of it in depth.)
> >>
> >> Best regards,
> >> Niklas Lindström
> >
>
> http://www.langegger.at
> ----------------------------------------------------------------------
> Dipl.-Ing.(FH) Andreas Langegger
> FAW - Institute for Application-oriented Knowledge Processing
> Johannes Kepler University Linz
> A-4040 Linz, Altenberger Straße 69
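
For illustration only, the idea in the reply above (a dct:modified table kept sorted by dateTime, so that ORDER BY DESC with a LIMIT just reads from one end) might be sketched as follows. This is a minimal in-memory sketch, not how any of the stores mentioned in the thread work; the class name `ModifiedIndex`, its methods, and the `ex:` resource names are hypothetical, and it assumes every value carries a timezone offset, as the sample data does.

```python
import bisect
from datetime import datetime
from itertools import islice


class ModifiedIndex:
    """Hypothetical partial index holding only dct:modified values, kept sorted."""

    def __init__(self):
        # Sorted ascending list of (modified, resource) pairs.
        self._entries = []

    def add(self, resource, lexical):
        # Parse the xsd:dateTime lexical form; offset-aware values compare by instant.
        modified = datetime.fromisoformat(lexical)
        bisect.insort(self._entries, (modified, resource))

    def latest(self, limit, accept=lambda resource: True):
        # ORDER BY DESC(?modified) LIMIT n: walk from the newest end and stop
        # as soon as `limit` accepted rows have been produced. No sort at query time.
        newest_first = (
            (resource, modified)
            for modified, resource in reversed(self._entries)
            if accept(resource)
        )
        return list(islice(newest_first, limit))


index = ModifiedIndex()
index.add("ex:a", "2001-01-29T00:00:00-04:00")
index.add("ex:b", "2009-06-01T00:00:00-04:00")
index.add("ex:c", "2005-03-15T00:00:00-04:00")

# Stand-in for the `?resource a skos:Concept` pattern: a set membership test.
concepts = {"ex:a", "ex:c"}
top = index.latest(2, accept=concepts.__contains__)
```

The `accept` callback plays the role of the type match in the original query. If few resources pass it, the scan from the newest end runs longer, which is one reason the type match can still dominate the cost even with such an index.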
Received on Thursday, 27 August 2009 09:31:01 UTC