Re: [BioRDF] Scalability

Ora Lassila wrote:
> what kind of an in-memory database do you use? I have done some preliminary
> experiments with UniProt etc. data with about 2 million triples using our
> OINK browser (built using the Wilbur toolkit). Performance was very
> "interactive" (i.e., "snappy", notice my highly precise metrics here ;-) on
> a 1.67 GHZ Powerbook w/ 1 GB RAM.
>
> I'd like to know what kinds of datasets people are using, what kind of (RDF
> triple store) implementations they are using, and what are the observations
> about performance.

We converted tabular data from UCSC[1] into RDF for querying. As I 
mentioned in another posting, the files grew by a factor of about 15, 
although compression does alleviate the problem considerably (e.g. 
~800 MB -> ~30 MB). Building (de)compression into your applications 
should help; it's on our to-do list. Part of the size expansion in our 
case comes from encoding the tabular structure, declaring XML Schema 
datatypes (e.g. xsd:integer), and using (long) descriptive namespace 
tags. But the size of the RDF isn't the biggest challenge.
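
To illustrate the "(de)compression built into the application" idea: a 
minimal sketch in Python, reading a gzipped RDF dump as a text stream 
without ever materializing the uncompressed file on disk. The filename 
is hypothetical; any ~30 MB compressed dump (~800 MB uncompressed) 
would do.

```python
import gzip

def count_lines(path):
    """Stream a gzip-compressed text file (e.g. an RDF/XML dump) and
    count its lines, decompressing on the fly rather than on disk."""
    n = 0
    # "rt" opens the gzip stream in text mode, so we iterate lines directly
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for _ in f:
            n += 1
    return n

# e.g. count_lines("uniprot.rdf.gz")  -- hypothetical filename
```

The same pattern (gzip.open in place of open) works for feeding a 
parser, so the uncompressed 800 MB never needs to exist as a file.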

No surprises here - the performance of loading and querying is not 
"snappy" ;) for ~11M triples, although different types of queries 
have, of course, dramatically different running times. We've had 
queries run for days. In fact, our largest dataset (~11M triples) 
recently took just under 24 hours to load into an RDFS main-memory 
repository (with inferencing and validation on). Many factors affect 
load and query times, so I will avoid quoting too many timings here: I 
don't have enough numbers to give a complete picture, and I don't want 
people to draw hasty conclusions about the implementation we happened 
to be using (Sesame version 1.2.4, http://openrdf.org ). Among the 
factors that affect speed: the type of repository, the index settings 
(not available for all repository types), the level of query 
optimization (or lack of it), the inferencing settings, and the query 
language constructions used in the query. Ideally, a table of timings 
would be available, covering factors such as the type of repository 
used, whether it is persisted to disk, and inferencing.

I don't think that the scalability problem is limited to Sesame, but I 
am only just getting to know other libraries such as Jena. It 
certainly makes sense to look at Oracle RDF and SWI-Prolog as well, 
although we don't have time for a complete benchmarking project. It 
could be interesting to come up with a couple of datasets and queries 
that HCLSIG members could run on their own repositories, posting the 
results to an HCLSIG wiki page to give an idea of the comparative 
capabilities of repositories.
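
A minimal sketch of what such a shared benchmark could look like, in 
Python. Everything here is hypothetical: run_query stands in for 
whatever callable wraps a member's own triple store, and the output 
rows are meant to be pasted into a wiki table.

```python
import time

def benchmark(run_query, datasets, queries):
    """Time each (dataset, query) pair against one repository.

    run_query(dataset, query) is a hypothetical adapter around a
    member's own triple store; this harness only measures wall time.
    """
    rows = []
    for dataset in datasets:
        for qname, query in queries.items():
            start = time.perf_counter()
            run_query(dataset, query)
            elapsed = time.perf_counter() - start
            # one row per run: (dataset, query name, seconds)
            rows.append((dataset, qname, round(elapsed, 3)))
    return rows
```

Each site would run the same datasets and queries through its own 
adapter and post the resulting rows, making the timings roughly 
comparable across repositories.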

-scott

p.s. At a talk in Amsterdam last year, David de Roure of the University 
of Southampton mentioned numbers like ~70M in the context of the Smart 
Tea[2] project, I believe.

[1] http://genome.ucsc.edu
[2] http://www.smarttea.org/

-- 
M. Scott Marshall
tel. +31 (0) 20 525 7765
http://staff.science.uva.nl/~marshall
http://integrativebioinformatics.nl/ (website being overhauled and will 
be XHTML strict this month)
Integrative Bioinformatics Unit, University of Amsterdam

Received on Monday, 10 April 2006 13:16:16 UTC