- From: David R. Karger <karger@theory.lcs.mit.edu>
- Date: Fri, 9 Aug 2002 00:07:09 -0400
- To: der@hplb.hpl.hp.com
- CC: www-rdf-dspace@w3.org, Nick_Wainwright@hplb.hpl.hp.com, dquan@theory.lcs.mit.edu
I'm going to feed back some intuitions, but I hope Dennis will follow up with some responses better grounded in the code.

Date: Mon, 05 Aug 2002 12:15:50 +0100
From: Dave Reynolds <der@hplb.hpl.hp.com>
CC: www-rdf-dspace@w3.org, Nick_Wainwright@hplb.hpl.hp.com

An in-memory cache fronting an RDB-backed persistent store is a common requirement which has come up in several places and would be a good addition to Jena.

On the technical front the core issue is how locality of reference can be defined in order to perform any useful cache management. For example, in the ePerson case our queries are formulated as tree-structured "query by example" patterns. This gives greater selectivity than simple triple patterns, and we can do efficient subsumption checking of QBE patterns (in theory at least), so we could in principle build an effective QBE cache. However, for the general low-level, pointer-following-like API that is not true - the cache reuse would be poor unless we generalize queries and cache the results of those widened queries, and yet that widening is hard in the general case. Do you have any characterization of these 45,000 queries which would support some sort of cache hit analysis?

My intuition tells me that the right cache for our application is a "graph cache"---namely, a set of resources and the relations incident on those resources. (A rough illustrative sketch appears after the quoted thread below.)

Also could you provide more details on how those queries are generated and then sent to the store?

This intuition follows from the idea that most of the queries being issued are of the form "now that I have object X, give me the resource at the other end of predicate P from X". For example, "now that I am holding object X and want to display it, look up X.type. Now that I have T=X.type, find an element that can be used to display T by finding T.viewers, etc." In the presence of an LRU cache, this would naturally over time cache all the data types (not very many) and all the viewer elements for those types (also not very many).

I have the impression you are starting from some high-level query formulation, translating those into fine-grain requests and then passing those fine-grain requests to the store. It might be that if we could move the interface abstraction up a little (as we tried to do with the QBE approach) then caching would be easier.

On the non-technical front it is not clear what timescales we could commit to given all the current turmoil. When do you need a scalable solution by?

My feeling is that we can continue to develop haystack as a CONCEPT indefinitely using the current nonscalable database. However, we cannot actually PROVE the concept without giving the system to some users to work with their own data. And just about ANY attempt to insert real data into the system is going to crash into the memory limits. (I could be wrong about this, which would be great.)

Dave

"David R. Karger" wrote:
>
> Dennis sat down and profiled, and found that rendering the Ozone
> homepage required 45,000 queries to the RDF store. Cholesterol
> handles this in a few seconds; sleepycat is about 60 times slower.
>
> As I mentioned in my last email, this doesn't mean cholesterol is the
> answer to our problems: as an in-memory system, I do not think it will
> scale beyond the tiny corpora we are now working with.
> I can imagine the kind of system we need---basically, something like
> cholesterol acting as an in-memory cache for something like sleepycat
> as the persistent store---but don't think we have the manpower to
> build it.
>
> Any thoughts?
>
> d
>
> Date: Fri, 12 Jul 2002 17:55:55 +0100
> From: Dave Reynolds <der@hplb.hpl.hp.com>
> CC: www-rdf-dspace@w3.org,
>     "Nick Wainwright (E-mail)" <Nick_Wainwright@hplb.hpl.hp.com>
>
> At yesterday's DSpace telecon we discussed the question of whether RDF
> databases as they currently exist could support the "several hundred"
> small queries per second needed for Haystack implementations.
>
> To give a ballpark test of this I set up the following test configuration:
> o A Jena test application that creates a tree-shaped set of RDF assertions
>   with variable depth and branching factor and then does a set of timed
>   repeated random walks from root to leaf of the tree. Each step on the
>   walk requires a separate (very small) database query - no query batching.
>   The randomization of repeated walks hopefully stresses the caching
>   mechanisms sufficiently to make the test somewhat realistic.
> o I used a branching factor of 10 and depths from 4-6 to test the
>   10k - 1m triple range.
> o The application and database were running on the same machine (requests
>   still go through the TCP stack but not out onto the LAN itself).
> o The main test machine was 700MHz, single CPU, 512MB, Linux Red Hat 7.2.
>
> The average time for one micro-query (one step in the random walk) was:
>
>   Config    #statements    time
>   MySQL          11k       2.8ms
>   MySQL         111k       3.1ms
>   MySQL       1,111k       3.8ms
>
> This is partially CPU bound; preliminary tests on a similarly configured
> 2GHz machine were about twice as fast.
>
> Preliminary figures using postgresql are 2-3 times slower than this.
>
> If these trivial query patterns are indeed representative of Haystack's
> requirements then this suggests that 300-600 accesses per second can be
> achieved on sub-$1k PCs (ignoring networking issues).
>
> Loading up 1m statements into a database is another matter however!
>
> Dave
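
For concreteness, here is a minimal sketch of the graph-cache-plus-LRU idea above: an access-ordered map keyed by resource, where each entry holds all the statements incident on that resource, so a chain of lookups like X.type followed by T.viewers mostly hits the cache. The names here (TripleStore, statementsAbout, Edge, GraphCache, objectsOf) are hypothetical stand-ins for whatever persistent back end and API the real system would use; this is an illustration, not the actual Haystack or Jena code.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical handle on the persistent store (e.g. an RDB-backed model). */
interface TripleStore {
    /** Fetch every (predicate, object) pair whose subject is the given resource. */
    List<Edge> statementsAbout(String resourceUri);
}

/** One statement incident on a resource: predicate URI and object value. */
class Edge {
    final String predicate;
    final String object;
    Edge(String predicate, String object) { this.predicate = predicate; this.object = object; }
}

/**
 * The "graph cache": an LRU map keyed by resource URI whose entries hold all
 * statements incident on that resource, so repeated "X.P" micro-queries about
 * a recently touched resource never go back to the database.
 */
class GraphCache {
    private final TripleStore store;
    private final Map<String, List<Edge>> lru;

    GraphCache(TripleStore store, final int maxResources) {
        this.store = store;
        // An access-ordered LinkedHashMap evicts the least recently used resource.
        this.lru = new LinkedHashMap<String, List<Edge>>(16, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<String, List<Edge>> eldest) {
                return size() > maxResources;
            }
        };
    }

    /** "Now that I have object X, give me the values at the other end of predicate P." */
    List<String> objectsOf(String resourceUri, String predicateUri) {
        List<Edge> incident = lru.get(resourceUri);
        if (incident == null) {
            // One round trip loads *all* statements about X; the follow-up
            // lookups (X.type, T.viewers, ...) on the same resource are hits.
            incident = store.statementsAbout(resourceUri);
            lru.put(resourceUri, incident);
        }
        List<String> objects = new ArrayList<String>();
        for (Edge e : incident) {
            if (e.predicate.equals(predicateUri)) {
                objects.add(e.object);
            }
        }
        return objects;
    }
}

The trade-off is that a cache miss pulls in every statement about X, including ones the renderer never asks for; but under the access pattern described above (many tiny follow-up queries about a resource just fetched) that extra work should be repaid many times over.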
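
And, for what it's worth, here is roughly how I read the random-walk micro-benchmark described in the quoted July 12 message. This is only a sketch: it uses the current Apache Jena API with a plain in-memory model as a stand-in, whereas the actual test ran against an RDB-backed model over MySQL, and the names in it (RandomWalkBench, the urn:x-bench: namespace, the "child" property) are made up for illustration.

import java.util.List;
import java.util.Random;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.rdf.model.Statement;

public class RandomWalkBench {
    static final String NS = "urn:x-bench:";

    public static void main(String[] args) {
        int branching = 10, depth = 4, walks = 10000;

        // Stand-in for the RDB-backed model used in the real test.
        Model m = ModelFactory.createDefaultModel();
        Property child = m.createProperty(NS, "child");
        Resource root = m.createResource(NS + "root");
        buildTree(m, child, root, branching, depth);

        Random rnd = new Random(42);
        long steps = 0;
        long start = System.currentTimeMillis();
        for (int w = 0; w < walks; w++) {
            Resource node = root;
            for (int d = 0; d < depth; d++) {
                // One "micro-query" per step: list the children of the current node.
                List<Statement> out = m.listStatements(node, child, (RDFNode) null).toList();
                node = out.get(rnd.nextInt(out.size())).getObject().asResource();
                steps++;
            }
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(steps + " micro-queries, "
                + ((double) elapsed / steps) + " ms per micro-query");
    }

    // Build a tree with the given branching factor and depth under 'parent'.
    static void buildTree(Model m, Property child, Resource parent, int branching, int depth) {
        if (depth == 0) return;
        for (int i = 0; i < branching; i++) {
            Resource c = m.createResource(parent.getURI() + "-" + i);
            m.add(parent, child, c);
            buildTree(m, child, c, branching, depth - 1);
        }
    }
}

Branching factor 10 at depth 4 gives about 11k statements, matching the smallest configuration in the table above.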
Received on Friday, 9 August 2002 00:05:56 UTC