- From: Cutler, Roger (RogerCutler) <RogerCutler@chevron.com>
- Date: Mon, 10 Apr 2006 09:43:15 -0500
- To: "M. Scott Marshall" <marshall@science.uva.nl>, "Ora Lassila" <ora.lassila@nokia.com>
- cc: public-semweb-lifesci@w3.org
If I were in your position I would probably think seriously about trying the Oracle product, simply because they have one heck of a lot of experience building performant data slinging machines and the description of their techniques looks pretty interesting.

I have received no remuneration from Oracle for this endorsement, if that's what it is, nor is it based on any actual experience with the subject matter at hand. Take it for what it's worth.

-----Original Message-----
From: public-semweb-lifesci-request@w3.org [mailto:public-semweb-lifesci-request@w3.org] On Behalf Of M. Scott Marshall
Sent: Monday, April 10, 2006 8:16 AM
To: Ora Lassila
Cc: public-semweb-lifesci@w3.org
Subject: Re: [BioRDF] Scalability

Ora Lassila wrote:
> what kind of an in-memory database do you use? I have done some
> preliminary experiments with UniProt etc. data with about 2 million
> triples using our OINK browser (built using the Wilbur toolkit).
> Performance was very "interactive" (i.e., "snappy", notice my highly
> precise metrics here ;-) on a 1.67 GHZ Powerbook w/ 1 GB RAM.
>
> I'd like to know what kinds of datasets people are using, what kind of
> (RDF triple store) implementations they are using, and what are the
> observations about performance.

We converted tabular data from UCSC[1] into RDF for querying. As I mentioned in another posting, the files increased by a factor of about 15, although compression does indeed alleviate the problem (e.g. ~800Mb -> ~30Mb). Building (de)compression into your applications should help - it's on our to-do list. Part of the reason for the size expansion in our case is the encoding of tabular structure and declaration of datatypes in XML Schema (e.g. xsd:integer), as well as (long) descriptive namespace tags. But the size of the RDF isn't the biggest challenge..
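
As a rough illustration of the "(de)compression built into the application" idea above, here is a minimal Python sketch that writes and reads RDF statements through gzip so that no uncompressed copy needs to sit on disk. It is not what the UCSC conversion actually did: the line-based N-Triples-style serialization, the example.org URIs, and the helper names (write_compressed, read_compressed, sample_triples) are all assumptions made for the example.

```python
import gzip


def write_compressed(path, lines):
    """Stream text lines into a gzip file, one statement per line."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")


def read_compressed(path):
    """Lazily yield lines from a gzip file without unpacking it to disk first."""
    with gzip.open(path, "rt", encoding="utf-8") as src:
        for line in src:
            yield line.rstrip("\n")


def sample_triples(n):
    """Generate made-up N-Triples-style statements (hypothetical URIs)."""
    for i in range(n):
        yield (
            f"<http://example.org/feature/{i}> "
            "<http://example.org/schema#chromStart> "
            f'"{i * 100}"^^<http://www.w3.org/2001/XMLSchema#integer> .'
        )


if __name__ == "__main__":
    import os
    import tempfile

    path = os.path.join(tempfile.mkdtemp(), "sample.nt.gz")
    write_compressed(path, sample_triples(10000))
    # A loader would consume read_compressed(path) statement by statement.
    count = sum(1 for _ in read_compressed(path))
    print(count, "statements,", os.path.getsize(path), "bytes on disk")
```

Repeated long URIs and datatype declarations are exactly the kind of redundancy that general-purpose compression removes well, which fits the size reduction reported above; the actual ratio will depend on the data and the serialization.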

No surprises here - the performance of loading and querying is not "snappy" ;) for ~11M triples, although different types of queries have, of course, dramatically different performance times. We've had queries run for days. In fact, our largest data set (~11M triples) recently took just under 24 hours to load into an RDFS main memory repository (with inferencing and validation on).

Of course, many factors affect the time that it takes to load and/or query, so I will avoid naming too many timings here because I don't have enough numbers to give a complete impression and I don't want people to draw hasty conclusions about the implementation that we happened to be using (Sesame version 1.2.4 http://openrdf.org ). Several of the factors that affect the speed are: what type of repository, type of index settings (not available for all repository types), the level of query optimization (or lack of it), inferencing settings, and query language constructions used by the user in the query. Ideally, a table of timings would be available that would include factors such as type of repository used, whether it is persisted to disk, and inferencing.

I don't think that the scalability problem is limited to Sesame, but I am still getting to know other libraries such as Jena. It certainly makes sense to look at Oracle RDF and SWI Prolog as well, although we don't have time for a complete benchmarking project. It could be interesting to come up with a couple of datasets and queries that HCLSIG members could run on their own repositories and post the results to an HCLSIG wiki page to give an idea about the comparative capabilities of repositories.

-scott

p.s. At a talk in Amsterdam last year, David de Roure of the University of Southampton mentioned numbers like ~70M in the context of the Smart Tea[2] project, I believe.

[1] http://genome.ucsc.edu
[2] http://www.smarttea.org/

--
M. Scott Marshall
tel. +31 (0) 20 525 7765
http://staff.science.uva.nl/~marshall
http://integrativebioinformatics.nl/
(website being overhauled and will be XHTML strict this month)
Integrative Bioinformatics Unit, University of Amsterdam
Received on Monday, 10 April 2006 14:44:02 UTC