- From: Alistair Miles <alistair.miles@zoo.ox.ac.uk>
- Date: Fri, 13 Feb 2009 14:15:38 +0000
- To: Andrea Splendiani <andrea.splendiani@bbsrc.ac.uk>
- Cc: public-semweb-lifesci hcls <public-semweb-lifesci@w3.org>
Hi Andrea,

We have some data on query performance for Jena TDB and Virtuoso over a ~9 million triple dataset derived from FlyBase, the Drosophila model organism database, see:

http://code.google.com/p/openflydata/wiki/BenchmarkResultsQueryFlyBaseGeneNames20081027

Some caveats about interpreting the results... The benchmarks were done on quite different hardware, plus TDB queries have a startup cost associated with launching the JVM for the first time. However, they might give you a rough idea at least.

For load performance, we get ~6,000 triples per second (tps) loading the FlyBase dataset to a Jena TDB store using a small Amazon EC2 instance (32-bit hardware) when the data is loaded to an attached EBS volume. Loading to the internal instance storage is slower (1,000-2,000 tps). As a general rule, performance seems to be better if the data is loaded to a separate disk from anything the OS or other processes might be using. Note that Amazon suggest striping large datasets across multiple attached EBS volumes. However, at our current level of scaling, one volume seems to perform fine for several datasets.

Note that TDB is designed for 64-bit platforms, so performs much better on 64-bit hardware -- Andy Seaborne at HP loaded our FlyBase dataset on a 64-bit HP blade server, getting an average load rate of >30,000 triples per second.

I'm afraid we don't have any data on Virtuoso load performance.

The FlyBase dataset doesn't contain any of the sequence data, and we haven't tried querying residue sequences from within SPARQL yet. However, this is something we will need to do for a related project very shortly, so any experience or ideas you have would be very interesting to us. Querying residue sequences is not the most natural thing for SPARQL, so I anticipate some performance issues.

With regard to microarray data, we have done an RDF representation of tissue-specific microarray data from www.flyatlas.org. The dataset is ~2 million triples.

For information on the most recent release of these datasets and SPARQL endpoints see:

http://imageweb.zoo.ox.ac.uk/wiki/index.php/FlyWeb/MilestoneTwo

If you'd like to get a quick feel for how Jena TDB performs you could check out our gene expression data search application for Drosophila at:

http://openflydata.org/search/gene-expression

All data is retrieved on-the-fly (pardon the pun) via SPARQL queries to four separate endpoints backed by TDB.

Cheers,

Alistair

On Thu, Feb 12, 2009 at 03:40:41PM +0100, Andrea Splendiani wrote:
>
> Hi,
>
> In the context of a data-integration project, I'm doing some preliminary
> analysis to see whether it makes sense to use a triple-store to set up a
> backend/repository.
> I have some experience with Jena, and I know projects making use of
> Virtuoso or Sesame.
> However, I'm not aware of a review/benchmark of these systems, both
> regarding performance and features.
> I've seen a few links like:
>
> http://esw.w3.org/topic/LargeTripleStores
>
> or
>
> http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/index.html
>
> But I would like to know how these systems scale with large knowledge
> bases (load/query).
>
> I would also like to get some rough intuition on how much it makes sense
> to store data such as sequences and microarray values in them, and how
> usable SPARQL is for querying based on these values.
>
> Is there anyone that can provide me with some good pointers?
>
> Or is this some area that you think needs more exploration?
> It seems to me that to the question "why did you use this triplestore?",
> the usual answer is "I've tried a few and this worked".
>
> best,
> Andrea Splendiani
>
>

-- 
Alistair Miles
Senior Computing Officer
Image Bioinformatics Research Group
Department of Zoology
The Tinbergen Building
University of Oxford
South Parks Road
Oxford OX1 3PS
United Kingdom
Web: http://purl.org/net/aliman
Email: alistair.miles@zoo.ox.ac.uk
Tel: +44 (0)1865 281993

Received on Friday, 13 February 2009 14:16:14 UTC
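
To put the load rates quoted in the message above in perspective, the following back-of-the-envelope sketch converts them into approximate load times for the ~9 million triple FlyBase dataset. It is an illustration only and not part of the original benchmark: it assumes each quoted rate is a sustained average, and the function name and labels are hypothetical.

```python
# Rough load-time estimates from the load rates quoted in the message.
# Assumptions (not from the original benchmark): each rate is a sustained
# average, and the dataset size is taken as exactly 9 million triples.

DATASET_TRIPLES = 9_000_000  # approximate size of the FlyBase-derived dataset

# Load rates in triples per second (tps), as quoted in the message.
LOAD_RATES_TPS = {
    "EC2 small, instance storage (low end)": 1_000,
    "EC2 small, instance storage (high end)": 2_000,
    "EC2 small, attached EBS volume": 6_000,
    "64-bit HP blade server (>30,000 tps, so an upper bound on time)": 30_000,
}


def estimate_load_minutes(triples: int, rate_tps: float) -> float:
    """Return the estimated load time in minutes at a constant rate."""
    if rate_tps <= 0:
        raise ValueError("rate_tps must be positive")
    return triples / rate_tps / 60


if __name__ == "__main__":
    for label, rate in LOAD_RATES_TPS.items():
        minutes = estimate_load_minutes(DATASET_TRIPLES, rate)
        print(f"{label}: ~{minutes:.0f} min")
```

Under these assumptions the gap is roughly 25 minutes on an attached EBS volume versus 75 to 150 minutes on instance storage, and about 5 minutes or less on the 64-bit server.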