- From: Andreas Langegger <al@jku.at>
- Date: Thu, 27 Aug 2009 11:39:45 +0200
- To: "Seaborne, Andy" <andy.seaborne@hp.com>
- Cc: Semantic Web <semantic-web@w3.org>
Yes, since D2R now pushes down filters, it will use the indexes available at the RDB level. But the data Niklas wants to query with SPARQL is native RDF. To use D2R over an RDBMS with indexes, he would have to transform all the data back into SQL tables, how evil... ;-)

The idea was: would it be possible to define partial indexes for native RDF stores such as TDB?

s    p    o
------------
     :p   ^
     :p   |
     :p   |  index over :p
     :p   v
     :q   ^
     :q   |  index over :q (and same object ranges, e.g. xsd:dateTime)
     :q   v
...

Regards,
AndyL

On Aug 27, 2009, at 11:29 AM, Seaborne, Andy wrote:

> Not just triple stores. Does D2RQ enable this? The data physical
> organisation could have a dct:modified table with dateTime sorted
> values and the ORDER BY is no more than choosing the right end of
> the table to start at.
>
> Andy
>
>>> Talking of data validity - in the data, all the timezones,
>>> regardless of time of year, are -04:00, which is a strangeness. The
>>> data was generated June 2009. Hmm.
>>>
>>> e.g. "2001-01-29T00:00:00-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>
>>>
>>> Andy
>>>
>>>> -----Original Message-----
>>>> From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org] On Behalf Of Niklas Lindström
>>>> Sent: 26 August 2009 19:16
>>>> To: Semantic Web
>>>> Subject: SPARQL performance for ORDER BY on large datasets
>>>>
>>>> Hi all!
>>>>
>>>> I have a straightforward use case which seems really hard for triple
>>>> stores to perform for using SPARQL (on huge datasets).
>>>>
>>>> The case: select a set of resources (based on simple criteria such as
>>>> type) and order them based on a property value (limiting the results
>>>> to a decent batch size).
>>>>
>>>> Data for test: the Library of Congress RDF dump of subject headings at
>>>> <http://id.loc.gov/authorities/search/> (~365 Mb of RDF/XML, 3 703 621
>>>> statements).
>>>>
>>>> The query:
>>>>
>>>> PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
>>>> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
>>>> PREFIX dct: <http://purl.org/dc/terms/>
>>>>
>>>> SELECT * WHERE {
>>>>     ?resource a skos:Concept;
>>>>         dct:modified ?modified .
>>>> }
>>>> ORDER BY DESC(?modified)
>>>> LIMIT 100
>>>>
>>>> I tried this with Sesame (native file store), Virtuoso (opensource
>>>> edition), and AllegroGraph (free edition), and got terrible results
>>>> (on a 2.8 GHz Intel Core 2 Duo). More minutes than worth counting;
>>>> skipping the type match gets it down to about 20 seconds *at best*.
>>>>
>>>> Sure, it's understandable (AFAIK) that this specific case is much
>>>> easier to use an SQL DB for (or something like CouchDB). But I was
>>>> surprised that it was *this* terrible. It seems unfeasible to build
>>>> e.g. chronological feeds from larger RDF stores using SPARQL with
>>>> current tools.
>>>>
>>>> Is this -- ORDER BY performance -- a commonly known problem, and
>>>> considered an issue of importance (for academia and implementers
>>>> alike)?
>>>>
>>>> Or am I missing something really obvious in the setup of these stores,
>>>> or in my query? I welcome *any* suggestions, such as "use triple store
>>>> X", "for X, make sure to configure indexing on Y". Or do RDF-using
>>>> service builders in general opt out to indexing in something else
>>>> entirely in these cases?
>>>>
>>>> (It seems queries like this are present in the Berlin SPARQL Benchmark
>>>> (e.g. #8), but I haven't analyzed this correlation and possible
>>>> meanings of it in depth.)
>>>>
>>>> Best regards,
>>>> Niklas Lindström

http://www.langegger.at
----------------------------------------------------------------------
Dipl.-Ing.(FH) Andreas Langegger
FAW - Institute for Application-oriented Knowledge Processing
Johannes Kepler University Linz
A-4040 Linz, Altenberger Straße 69
Received on Thursday, 27 August 2009 09:40:32 UTC