- From: M. Scott Marshall <marshall@science.uva.nl>
- Date: Mon, 10 Apr 2006 15:16:04 +0200
- To: Ora Lassila <ora.lassila@nokia.com>
- CC: public-semweb-lifesci@w3.org
Ora Lassila wrote:
> what kind of an in-memory database do you use? I have done some preliminary
> experiments with UniProt etc. data with about 2 million triples using our
> OINK browser (built using the Wilbur toolkit). Performance was very
> "interactive" (i.e., "snappy", notice my highly precise metrics here ;-) on
> a 1.67 GHz Powerbook w/ 1 GB RAM.
>
> I'd like to know what kinds of datasets people are using, what kind of (RDF
> triple store) implementations they are using, and what are the observations
> about performance.

We converted tabular data from UCSC [1] into RDF for querying. As I mentioned in another posting, the files grew by a factor of about 15, although compression does indeed alleviate the problem (e.g. ~800 MB -> ~30 MB). Building (de)compression into your applications should help - it's on our to-do list. Part of the size expansion in our case comes from encoding the tabular structure and declaring datatypes in XML Schema (e.g. xsd:integer), as well as from long, descriptive namespace tags. But the size of the RDF isn't the biggest challenge.

No surprises here - the performance of loading and querying ~11M triples is not "snappy" ;), although different types of queries have, of course, dramatically different running times. We've had queries run for days. In fact, our largest dataset (~11M triples) recently took just under 24 hours to load into an RDFS main-memory repository (with inferencing and validation on). Of course, many factors affect the time that it takes to load and/or query, so I will avoid citing too many timings here: I don't have enough numbers to give a complete picture, and I don't want people to draw hasty conclusions about the implementation that we happened to be using (Sesame version 1.2.4, http://openrdf.org).
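To make the compression point concrete, here is a minimal Python sketch of the kind of (de)compression we have in mind: gzip the RDF/XML on disk and decompress it as a stream while feeding the parser, so the full file never needs to be materialised. The sample data is hypothetical (a stand-in for the UCSC conversion, not the actual triples), and the loader hook is only indicated by a comment.

```python
import gzip
import io

# A tiny, repetitive RDF/XML fragment standing in for the ~800 MB UCSC
# conversion (hypothetical sample data, not the real triples). The
# repeated tags and namespace URIs are exactly what compresses so well.
rdf_xml = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/ucsc#">
""" + b"".join(
    b'  <rdf:Description rdf:about="http://example.org/ucsc#row%d">\n'
    b'    <ex:txStart rdf:datatype='
    b'"http://www.w3.org/2001/XMLSchema#integer">%d</ex:txStart>\n'
    b'  </rdf:Description>\n' % (i, i * 1000)
    for i in range(500)
) + b"</rdf:RDF>\n"

compressed = gzip.compress(rdf_xml)
print("raw: %d bytes, gzipped: %d bytes (%.1fx smaller)"
      % (len(rdf_xml), len(compressed), len(rdf_xml) / len(compressed)))

# Streaming decompression: read the gzipped RDF line by line and hand
# each line (or buffered chunk) to the repository's parser, without
# ever writing the decompressed file back to disk.
with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as fh:
    for line in fh:
        pass  # feed to the RDF parser / repository loader here
```

On real, highly regular converted tables the ratio is far better than on this toy fragment, which is why the ~800 MB file shrinks to ~30 MB.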
Several of the factors that affect the speed are: the type of repository, the index settings (not available for all repository types), the level of query optimization (or lack of it), the inferencing settings, and the query language constructs used in the query. Ideally, a table of timings would be available that includes factors such as the type of repository used, whether it is persisted to disk, and inferencing. I don't think that the scalability problem is limited to Sesame, but I am only now getting to know other libraries such as Jena. It certainly makes sense to look at Oracle RDF and SWI-Prolog as well, although we don't have time for a complete benchmarking project. It could be interesting to come up with a couple of datasets and queries that HCLSIG members could run on their own repositories and post the results to a HCLSIG wiki page, to give an idea of the comparative capabilities of repositories.

-scott

p.s. At a talk in Amsterdam last year, David de Roure of the University of Southampton mentioned numbers like ~70M in the context of the Smart Tea [2] project, I believe.

[1] http://genome.ucsc.edu
[2] http://www.smarttea.org/

--
M. Scott Marshall
tel. +31 (0) 20 525 7765
http://staff.science.uva.nl/~marshall
http://integrativebioinformatics.nl/ (website being overhauled and will be XHTML strict this month)
Integrative Bioinformatics Unit, University of Amsterdam
Received on Monday, 10 April 2006 13:16:16 UTC