Re: Jena database performance

An in-memory cache fronting an RDB-backed persistent store is a common
requirement: it has come up in several places and would be a good addition to
Jena.

On the technical front, the core issue is how locality of reference can be
defined so that cache management can do anything useful. For example, in the
ePerson case our queries are formulated as tree-structured "query by example"
(QBE) patterns. These give greater selectivity than simple triple patterns, and
we can do efficient subsumption checking of QBE patterns (in theory at least),
so we could in principle build an effective QBE cache. However, that is not
true of the general low-level, pointer-following style of API: cache reuse
would be poor unless we generalized the queries and cached the results of those
widened queries, and that widening is hard in the general case.
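
To make this concrete, here is a minimal sketch (not existing Jena code) of what
such a QBE cache could look like: a lookup succeeds if some previously cached
pattern subsumes the incoming one. The Pattern interface and its subsumes() test
are hypothetical stand-ins for whatever QBE representation the store actually
uses.

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    // Hypothetical QBE result cache: a hit is any cached pattern at least as
    // general as the incoming query.
    interface Pattern {
        boolean subsumes(Pattern other);   // true if this pattern covers 'other'
    }

    class QbeCache {
        private final Map entries = new HashMap();   // Pattern -> cached result

        // Return a cached result whose pattern covers 'query', or null on a miss;
        // a hit may still need filtering down to exactly what 'query' asked for.
        Object lookup(Pattern query) {
            for (Iterator it = entries.entrySet().iterator(); it.hasNext(); ) {
                Map.Entry e = (Map.Entry) it.next();
                if (((Pattern) e.getKey()).subsumes(query)) {
                    return e.getValue();
                }
            }
            return null;
        }

        void store(Pattern pattern, Object result) {
            entries.put(pattern, result);
        }
    }

Such a cache only pays off if subsumption is cheap to decide and if incoming
patterns actually repeat (or widen) in practice - which is exactly what the
fine-grain pointer-following calls fail to guarantee.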

Do you have any characterization of these 45,000 queries which would support
some sort of cache hit analysis? 

Also, could you provide more details on how those queries are generated and then
sent to the store? I have the impression you are starting from some high-level
query formulation, translating it into fine-grain requests, and then passing
those fine-grain requests to the store. It might be that if we could move the
interface abstraction up a little (as we tried to do with the QBE approach) then
caching would be easier.
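
Purely to illustrate that distinction (these interfaces are hypothetical, not
part of Jena or Haystack; Pattern is the same stand-in used in the sketch
above):

    import java.util.List;

    // One call per pointer-following step - little locality for a cache to exploit.
    interface FineGrainStore {
        List objectsOf(String subjectUri, String propertyUri);
    }

    // One call per higher-level pattern - a natural unit to key a result cache on.
    interface CoarseGrainStore {
        Object query(Pattern qbePattern);
    }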

On the non-technical front, it is not clear what timescales we could commit to
given all the current turmoil. When do you need a scalable solution by?

Dave

"David R. Karger" wrote:
> 
> Dennis sat down and profiled, and found that rendering the Ozone
> homepage required 45,000 queries to the RDF store.  Cholesterol
> handles this in a few seconds; Sleepycat is about 60 times slower.
> 
> As I mentioned in my last email, this doesn't mean Cholesterol is the
> answer to our problems: as an in-memory system, it is unlikely to scale
> beyond the tiny corpora we are now working with.
> 
> I can imagine the kind of system we need---basically, something like
> Cholesterol acting as an in-memory cache for something like Sleepycat
> as the persistent store---but don't think we have the manpower to
> build it.
> 
> Any thoughts?
> 
> d
> 
>    Date: Fri, 12 Jul 2002 17:55:55 +0100
>    From: Dave Reynolds <der@hplb.hpl.hp.com>
>    CC: www-rdf-dspace@w3.org,
>       "Nick Wainwright (E-mail)" <Nick_Wainwright@hplb.hpl.hp.com>
> 
>    At yesterday's DSpace telecon we discussed the question of whether RDF databases
>    as they currently exist could support the "several hundred" small queries per
>    second needed for Haystack implementations.
> 
>    To give a ball-park test of this I set up the following test configuration:
>     o A Jena test application that creates a tree-shaped set of RDF assertions with
>    variable depth and branching factor and then does a set of timed repeated random
>    walks from root to leaf of the tree. Each step on the walk requires a separate
>    (very small) database query - no query batching. The randomization of repeated
>    walks hopefully stresses the caching mechanisms sufficiently to make the test
>    somewhat realistic (a rough sketch of one walk step follows this list).
>     o I used a branching factor of 10 and depths from 4-6 to test the 10k - 1m
>    triple range.
>     o The application and database were running on the same machine (requests still
>    go through the TCP stack but not out onto the LAN itself).
>     o The main test machine was 700MHz, single CPU, 512MB, Red Hat Linux 7.2.
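> 
>    One step of the walk looks roughly like this (a sketch only, not the actual
>    test code - TripleStore, Resource and childrenOf() are just stand-ins for the
>    Jena model/database layer):
> 
>        import java.util.List;
>        import java.util.Random;
> 
>        // Stand-in types for the store being walked over.
>        interface Resource {}
> 
>        interface TripleStore {
>            List childrenOf(Resource node);   // one small database query per call
>        }
> 
>        class RandomWalk {
>            // Walk from the root down to a leaf, picking a random child at each
>            // level; every step is a separate micro-query, with no batching.
>            static Resource walk(TripleStore store, Resource root, int depth, Random rnd) {
>                Resource node = root;
>                for (int level = 0; level < depth; level++) {
>                    List children = store.childrenOf(node);
>                    node = (Resource) children.get(rnd.nextInt(children.size()));
>                }
>                return node;
>            }
>        }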
> 
>    The average time for one micro-query (one step in the random walk) was:
>           Config           #statements  time
>     MySQL                      11k       2.8ms
>     MySQL                     111k       3.1ms
>     MySQL                   1,111k       3.8ms
> 
>    This is partially CPU bound; preliminary tests on a similarly configured 2GHz
>    machine were about twice as fast.
> 
>    Preliminary figures using PostgreSQL are 2-3 times slower than this.
> 
>    If these trivial query patterns are indeed representative of Haystack's
>    requirements, then this suggests that 300-600 accesses per second can be achieved
>    on sub-$1k PCs (ignoring networking issues).
> 
>    Loading up 1m statements into a database is another matter, however!
> 
>    Dave

Received on Monday, 5 August 2002 07:16:18 UTC