Re: Jena database performance

Dennis sat down and profiled, and found that rendering the Ozone
homepage required 45,000 queries to the RDF store.  Cholesterol
handles this in a few seconds; sleepycat is about 60 times slower.  

As I mentioned in my last email, this doesn't mean cholesterol is the
answer to our problems: since it is an in-memory system, I do not
think it will scale beyond the tiny corpora we are now working with.

I can imagine the kind of system we need---basically, something like
cholesterol acting as an in-memory cache for something like sleepycat
as the persistent store---but I don't think we have the manpower to
build it.
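To make the idea concrete, here is a minimal sketch of the read-through arrangement I have in mind. The names (`CachedStore`, `backing`) are my own illustration, not anything from cholesterol or sleepycat:

```python
class CachedStore:
    """Read-through cache: an in-memory dict (the cholesterol role)
    fronting a slower persistent store (the sleepycat role)."""

    def __init__(self, backing):
        self.backing = backing  # persistent store; hypothetical get/put interface
        self.cache = {}         # in-memory layer

    def get(self, key):
        if key not in self.cache:
            # Cache miss: fall through to the persistent store once,
            # then serve all later reads from memory.
            self.cache[key] = self.backing.get(key)
        return self.cache[key]

    def put(self, key, value):
        # Write-through: the persistent store stays authoritative.
        self.backing.put(key, value)
        self.cache[key] = value
```

The hard part, of course, is not this toy but cache invalidation and keeping the two layers consistent under concurrent writers.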

Any thoughts?

d

   Date: Fri, 12 Jul 2002 17:55:55 +0100
   From: Dave Reynolds <der@hplb.hpl.hp.com>
   CC: www-rdf-dspace@w3.org,
      "Nick Wainwright (E-mail)" <Nick_Wainwright@hplb.hpl.hp.com>

   At yesterday's DSpace telecon we discussed the question of whether RDF databases
   as they currently exist could support the "several hundred" small queries per
   second needed for Haystack implementations.

   To give a ball park test of this I set up the following test configuration:
 o A Jena test application that creates a tree-shaped set of RDF assertions with
   variable depth and branching factor and then does a set of timed repeated random
   walks from root to leaf of the tree. Each step on the walk requires a separate
   (very small) database query - no query batching. The randomization of repeated
   walks hopefully stresses the caching mechanisms sufficiently to make the test
   somewhat realistic.
    o I used a branching factor of 10 and depths from 4-6 to test the 10k - 1m
   triple range.
    o The application and database were running on the same machine (requests still
   go through the TCP stack but not out onto the LAN itself).
 o The main test machine was 700MHz, single CPU, 512MB, Red Hat Linux 7.2.
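
   The test setup above can be sketched in a few lines. This is my own
   reconstruction from the description, not Dave's actual Jena code; the tree is
   a plain parent-to-children index, and each step of a walk stands in for one
   micro-query against the store:

   ```python
   import random

   def build_tree(branching, depth):
       """Child index: parent -> list of children, one 'triple' per edge."""
       children = {}
       next_id = [1]  # node 0 is the root
       def grow(node, d):
           if d == 0:
               return
           kids = list(range(next_id[0], next_id[0] + branching))
           next_id[0] += branching
           children[node] = kids
           for k in kids:
               grow(k, d - 1)
       grow(0, depth)
       return children

   def random_walk(children, root=0):
       """One root-to-leaf walk; each step models one small database query."""
       node, steps = root, 0
       while node in children:
           node = random.choice(children[node])  # randomization defeats trivial caching
           steps += 1
       return steps

   # Branching 10, depth 4 gives 11,110 edges -- the ~11k statement case below;
   # depths 5 and 6 give ~111k and ~1,111k the same way.
   tree = build_tree(10, 4)
   assert sum(len(v) for v in tree.values()) == 11110
   ```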

   The average time for one micro-query (one step in the random walk) was:
    Config        #statements    time
    MySQL                 11k   2.8ms
    MySQL                111k   3.1ms
    MySQL              1,111k   3.8ms

   This is partially CPU bound; preliminary tests on a similarly configured 2GHz
   machine were about twice as fast.

   Preliminary figures using postgresql are 2-3 times slower than this.

   If these trivial query patterns are indeed representative of Haystack's
   requirements then this suggests that 300-600 accesses per second can be achieved
   on sub-$1k PCs (ignoring networking issues).

   Loading up 1m statements into a database is another matter however!

   Dave

Received on Friday, 2 August 2002 18:22:37 UTC