Re: Is there a benchmark of triple-stores with a "bias" to Life Sciences ?

Hi Andrea,

We have some data on query performance for Jena TDB and Virtuoso over
a ~9 million triple dataset derived from FlyBase, the Drosophila model
organism database, see:

http://code.google.com/p/openflydata/wiki/BenchmarkResultsQueryFlyBaseGeneNames20081027

Some caveats about interpreting the results: the benchmarks were
run on quite different hardware, and the TDB queries include a startup
cost from launching the JVM for the first time. However, they
might give you a rough idea at least.

For load performance, we get ~6,000 triples per second loading the
FlyBase dataset into a Jena TDB store on a small Amazon EC2 instance
(32-bit hardware) when the data is loaded to an attached EBS
volume. Loading to the internal instance storage is slower
(1,000-2,000 tps). As a general rule, performance seems to be better
when the data is loaded to a separate disk from anything the OS or
other processes might be using. Note that Amazon suggests striping
large datasets across multiple attached EBS volumes. However, at our
current level of scaling, one volume seems to perform fine for several
datasets.

Note that TDB is designed for 64-bit platforms, so it performs much
better on 64-bit hardware -- Andy Seaborne at HP loaded our FlyBase
dataset on a 64-bit HP blade server, getting an average load rate of
>30,000 triples per second.
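
To put those load rates in perspective, here is a minimal Python
sketch that turns them into rough load times for a ~9 million triple
dataset. It assumes a constant load rate, which real loads won't
have; the function name and the labels are made up for illustration,
and the numbers are just the approximate figures quoted above.

```python
def load_time_seconds(triples: int, triples_per_second: float) -> float:
    """Estimated wall-clock load time, assuming a constant load rate."""
    if triples_per_second <= 0:
        raise ValueError("load rate must be positive")
    return triples / triples_per_second


# Approximate size of the FlyBase-derived dataset (from this message).
DATASET_TRIPLES = 9_000_000

# Approximate rates quoted in this message, in triples per second.
# The labels are illustrative only.
RATES = {
    "EC2 small, attached EBS volume": 6_000,
    "EC2 small, instance storage (low end)": 1_000,
    "EC2 small, instance storage (high end)": 2_000,
    "64-bit HP blade (lower bound on rate)": 30_000,
}

for label, rate in RATES.items():
    minutes = load_time_seconds(DATASET_TRIPLES, rate) / 60
    print(f"{label}: ~{minutes:.0f} min")
```

So under these assumptions the whole dataset loads in roughly 25
minutes at 6,000 tps, and in about 5 minutes or less at 30,000+ tps.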

I'm afraid we don't have any data on Virtuoso load performance.

The FlyBase dataset doesn't contain any of the sequence data, and we
haven't tried querying residue sequences from within SPARQL
yet. However, this is something we will need to do for a related
project very shortly, so any experience or ideas you have would be
very interesting to us. Querying residue sequences is not the most
natural thing for SPARQL, so I anticipate some performance issues.
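
To illustrate the kind of query I have in mind, here is a rough
Python sketch that builds a SPARQL query looking for a motif inside
sequence literals using FILTER regex. We haven't run anything like
this; the ex: namespace and the ex:residues predicate are
hypothetical, not part of our data. My expectation (an assumption,
not a measurement) is that a store would typically have to test the
regex against every sequence literal, which is where I'd expect the
cost to come from.

```python
def motif_query(motif: str, limit: int = 10) -> str:
    """Build a SPARQL query matching sequences that contain `motif`.

    The ex: namespace and the ex:residues predicate are hypothetical,
    chosen only for illustration.
    """
    # Restrict the motif to letters so it cannot break out of the
    # quoted regex pattern in the query string.
    if not motif.isalpha():
        raise ValueError("motif must contain letters only")
    return (
        "PREFIX ex: <http://example.org/schema#>\n"
        "SELECT ?feature WHERE {\n"
        "  ?feature ex:residues ?seq .\n"
        f'  FILTER regex(?seq, "{motif}")\n'
        "}\n"
        f"LIMIT {limit}"
    )


print(motif_query("GAATTC"))
```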

With respect to microarray data, we have produced an RDF
representation of tissue-specific microarray data from
www.flyatlas.org. The dataset is ~2 million triples. For information
on the most recent release of these datasets and SPARQL endpoints,
see:

http://imageweb.zoo.ox.ac.uk/wiki/index.php/FlyWeb/MilestoneTwo

If you'd like to get a quick feel for how Jena TDB performs you could
check out our gene expression data search application for Drosophila
at:

http://openflydata.org/search/gene-expression

All data is retrieved on-the-fly (pardon the pun) via SPARQL queries
to four separate endpoints backed by TDB.
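
If you want to try queries directly rather than through the
application, here is a minimal Python sketch of how a SPARQL protocol
GET request is put together: the query text goes URL-encoded in the
"query" parameter. The endpoint URL below is a placeholder, not one
of our endpoints (see the wiki page above for the real ones), and the
helper name is made up for illustration.

```python
from urllib.parse import urlencode


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build a SPARQL protocol GET URL for `query` against `endpoint`."""
    # Append to an existing query string if the endpoint already has one.
    separator = "&" if "?" in endpoint else "?"
    return endpoint + separator + urlencode({"query": query})


# Placeholder endpoint, for illustration only.
ENDPOINT = "http://example.org/sparql"
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"

url = sparql_get_url(ENDPOINT, QUERY)
print(url)
```

Fetching that URL with an Accept header such as
application/sparql-results+xml should return the result set, assuming
the endpoint supports that format.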

Cheers,

Alistair

On Thu, Feb 12, 2009 at 03:40:41PM +0100, Andrea Splendiani wrote:
>
> Hi,
>
> In the context of a data-integration project, I'm doing some preliminary 
> analysis to see whether it makes sense to use a triple-store to setup a 
> backend/repository.
> I have some experience with Jena, and I know projects making use of
> Virtuoso or Sesame.
> However, I'm not aware of a review/benchmark of these systems, both  
> regarding performances and features.
> I've seen a few links like:
>
> http://esw.w3.org/topic/LargeTripleStores
>
> or
>
> http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/index.html
>
> But I would like to know how these systems scale with large knowledge- 
> base (load/query).
>
> I would also like to get some rough intuition on how much it makes sense
> to store data such as sequences and microarray values in them, and how 
> sparql is usable to query based on these values.
>
> Is there anyone that can provide me with some good pointers ?
>
> Or is this some area that you think needs more exploration ?
> It seems to me that to the question "why did you use this triplestore ?", 
> the usual answer is "I've tried a few and this worked".
>
> best,
> Andrea Splendiani
>
>
>

-- 
Alistair Miles
Senior Computing Officer
Image Bioinformatics Research Group
Department of Zoology
The Tinbergen Building
University of Oxford
South Parks Road
Oxford
OX1 3PS
United Kingdom
Web: http://purl.org/net/aliman
Email: alistair.miles@zoo.ox.ac.uk
Tel: +44 (0)1865 281993

Received on Friday, 13 February 2009 14:16:14 UTC