RE: [BioRDF] Scalability

If I were in your position, I would probably think seriously about
trying the Oracle product, simply because they have one heck of a lot
of experience building performant data-slinging machines, and the
description of their techniques looks pretty interesting.

I have received no remuneration from Oracle for this endorsement, if
that's what it is, nor is it based on any actual experience with the
subject matter at hand. Take it for what it's worth.

-----Original Message-----
From: public-semweb-lifesci-request@w3.org
[mailto:public-semweb-lifesci-request@w3.org] On Behalf Of M. Scott
Marshall
Sent: Monday, April 10, 2006 8:16 AM
To: Ora Lassila
Cc: public-semweb-lifesci@w3.org
Subject: Re: [BioRDF] Scalability


Ora Lassila wrote:
> what kind of an in-memory database do you use? I have done some 
> preliminary experiments with UniProt etc. data with about 2 million 
> triples using our OINK browser (built using the Wilbur toolkit). 
> Performance was very "interactive" (i.e., "snappy", notice my highly 
> precise metrics here ;-) on a 1.67 GHz PowerBook w/ 1 GB RAM.
>
> I'd like to know what kinds of datasets people are using, what kind of
> (RDF triple store) implementations they are using, and what are the
> observations about performance.

We converted tabular data from UCSC[1] into RDF for querying. As I
mentioned in another posting, the files grew by a factor of about 15,
although compression does alleviate the problem (e.g. ~800 MB ->
~30 MB). Building (de)compression into the application should help;
it's on our to-do list. Part of the reason for the size expansion in
our case is the encoding of the tabular structure and the declaration
of datatypes from XML Schema (e.g. xsd:integer), as well as (long)
descriptive namespace URIs. But the size of the RDF isn't the biggest
challenge.
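
For what it's worth, here is a minimal Java sketch of the kind of
streaming (de)compression I mean, using only java.util.zip. The file
name is hypothetical, and any RDF parser that reads from an
InputStream could sit on the receiving end:

    import java.io.*;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class GzipRdf {
        // Open a gzipped RDF file as a plain InputStream, so the parser
        // sees decompressed bytes without a temporary file on disk.
        static InputStream openCompressed(String path) throws IOException {
            return new GZIPInputStream(
                new BufferedInputStream(new FileInputStream(path)));
        }

        // Write serialized RDF straight into a gzipped file.
        static OutputStream createCompressed(String path) throws IOException {
            return new GZIPOutputStream(
                new BufferedOutputStream(new FileOutputStream(path)));
        }

        public static void main(String[] args) throws IOException {
            // "ucsc-tracks.rdf.gz" is a made-up file name.
            try (InputStream in = openCompressed("ucsc-tracks.rdf.gz")) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    // hand buf[0..n) to the RDF parser here
                }
            }
        }
    }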

No surprises here - the performance of loading and querying is not
"snappy" ;) at ~11M triples, although different types of queries have,
of course, dramatically different running times. We've had queries run
for days. In fact, our largest dataset (~11M triples) recently took
just under 24 hours to load into an RDFS main-memory repository (with
inferencing and validation on). Of course, many factors affect load
and query times, so I will avoid quoting too many timings here: I
don't have enough numbers to give a complete picture, and I don't want
people to draw hasty conclusions about the implementation that we
happened to be using (Sesame version 1.2.4, http://openrdf.org).
Among the factors that affect speed are: the type of repository, the
index settings (not available for all repository types), the level of
query optimization (or lack of it), the inferencing settings, and the
query language constructs used in the query. Ideally, a table of
timings would be available, covering factors such as the type of
repository used, whether it is persisted to disk, and whether
inferencing is enabled.
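
Such a table could be collected with a trivial harness. Here's a
hedged sketch in Java - not code we actually run; the CSV file name
and the idea of passing the load as a Runnable are my own invention:

    import java.io.FileWriter;
    import java.io.IOException;

    public class LoadTimings {
        // Run the given load action and append one CSV row with the
        // factors discussed above: repository type, persistence,
        // inferencing, triple count, and elapsed wall-clock seconds.
        static void record(String repoType, boolean persisted,
                           boolean inferencing, long triples,
                           Runnable load) throws IOException {
            long t0 = System.currentTimeMillis();
            load.run();
            double secs = (System.currentTimeMillis() - t0) / 1000.0;
            // "timings.csv" is a placeholder name for the shared table.
            try (FileWriter out = new FileWriter("timings.csv", true)) {
                out.write(String.format("%s,%b,%b,%d,%.1f%n",
                        repoType, persisted, inferencing, triples, secs));
            }
        }
    }

One would call it as, e.g., record("memory", false, true, 11000000,
() -> repo.load(file)), where repo.load stands for whatever bulk-load
call a given store actually exposes.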

I don't think that the scalability problem is limited to Sesame, but I
am still getting to know other libraries such as Jena. It certainly
makes sense to look at Oracle RDF and SWI-Prolog as well, although we
don't have time for a complete benchmarking project. It could be
interesting to come up with a couple of datasets and queries that
HCLSIG members could run on their own repositories, posting the
results to an HCLSIG wiki page to give an idea of the comparative
capabilities of the repositories.
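
To make that concrete: a shared benchmark entry might be nothing more
than a fixed query plus a timing wrapper. A sketch in Java, assuming a
store that exposes a SPARQL HTTP endpoint - the endpoint URL and the
query itself are placeholders, and not every store mentioned here
speaks SPARQL yet:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SharedBenchmark {
        // A deliberately simple shared query; real benchmark queries
        // would exercise joins and inference as well.
        static final String QUERY =
            "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 1000";

        public static void main(String[] args) throws Exception {
            HttpRequest req = HttpRequest.newBuilder()
                // "http://localhost:8080/sparql" is a placeholder URL.
                .uri(URI.create("http://localhost:8080/sparql"))
                .header("Content-Type", "application/sparql-query")
                .POST(HttpRequest.BodyPublishers.ofString(QUERY))
                .build();
            long t0 = System.currentTimeMillis();
            HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
            System.out.printf("status=%d, elapsed=%.1f s%n",
                resp.statusCode(),
                (System.currentTimeMillis() - t0) / 1000.0);
        }
    }

Everyone posting the same few numbers from the same few queries to the
wiki would already be more comparable than anecdotes.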

-scott

p.s. At a talk in Amsterdam last year, David de Roure of the University
of Southampton mentioned figures of around 70M triples in the context
of the Smart Tea[2] project, I believe.

[1] http://genome.ucsc.edu
[2] http://www.smarttea.org/

--
M. Scott Marshall
tel. +31 (0) 20 525 7765
http://staff.science.uva.nl/~marshall
http://integrativebioinformatics.nl/ (website being overhauled; will
be XHTML Strict this month)
Integrative Bioinformatics Unit, University of Amsterdam
