- From: David R. Karger <karger@theory.lcs.mit.edu>
- Date: Fri, 9 Aug 2002 00:07:09 -0400
- To: der@hplb.hpl.hp.com
- CC: www-rdf-dspace@w3.org, Nick_Wainwright@hplb.hpl.hp.com, dquan@theory.lcs.mit.edu
I'm going to feed back some intuitions, but I hope Dennis will follow up with some responses better grounded in the code.

Date: Mon, 05 Aug 2002 12:15:50 +0100
From: Dave Reynolds <der@hplb.hpl.hp.com>
CC: www-rdf-dspace@w3.org, Nick_Wainwright@hplb.hpl.hp.com

An in-memory cache fronting an RDB-backed persistent store is a common requirement which has come up in several places and would be a good addition to Jena.

On the technical front the core issue is how locality of reference can be defined in order to perform any useful cache management. For example, in the ePerson case our queries are formulated as tree-structured "query by example" patterns. This gives greater selectivity than simple triple patterns, and we can do efficient subsumption checking of QBE patterns (in theory at least), so we could in principle build an effective QBE cache. However, for the general low-level, pointer-following-like API that is not true - the cache reuse would be poor unless we generalize queries and cache the results of those widened queries, and yet that widening is hard in the general case. Do you have any characterization of these 45,000 queries which would support some sort of cache hit analysis?

My intuition tells me that the right cache for our application is a "graph cache"---namely, a set of resources and the relations incident on those resources. (A rough illustrative sketch appears after the quoted thread below.)

Also could you provide more details on how those queries are generated and then sent to the store?

This intuition follows from the idea that most of the queries being issued are of the form "now that I have object X, give me the resource at the other end of predicate P from X". For example, "now that I am holding object X and want to display it, look up X.type. Now that I have T=X.type, find an element that can be used to display T by finding T.viewers, etc." In the presence of an LRU cache, this would naturally over time cache all the data types (not very many) and all the viewer elements for those types (also not very many).

I have the impression you are starting from some high-level query formulation, translating those into fine-grain requests and then passing those fine-grain requests to the store. It might be that if we could move the interface abstraction up a little (as we tried to do with the QBE approach) then caching would be easier.

On the non-technical front it is not clear what timescales we could commit to given all the current turmoil. When do you need a scalable solution by?

My feeling is that we can continue to develop haystack as a CONCEPT indefinitely using the current nonscalable database. However, we cannot actually PROVE the concept without giving the system to some users to work with their own data. And just about ANY attempt to insert real data into the system is going to crash into the memory limits. (I could be wrong about this, which would be great.)

Dave

"David R. Karger" wrote:
>
> Dennis sat down and profiled, and found that rendering the Ozone
> homepage required 45,000 queries to the RDF store. Cholesterol
> handles this in a few seconds; sleepycat is about 60 times slower.
>
> As I mentioned in my last email, this doesn't mean cholesterol is the
> answer to our problems: as an in-memory system, I do not think it will
> scale beyond the tiny corpora we are now working with.
> I can imagine the kind of system we need---basically, something like
> cholesterol acting as an in-memory cache for something like sleepycat
> as the persistent store---but don't think we have the manpower to
> build it.
>
> Any thoughts?
>
> d
>
> Date: Fri, 12 Jul 2002 17:55:55 +0100
> From: Dave Reynolds <der@hplb.hpl.hp.com>
> CC: www-rdf-dspace@w3.org,
>     "Nick Wainwright (E-mail)" <Nick_Wainwright@hplb.hpl.hp.com>
>
> At yesterday's DSpace telecon we discussed the question of whether RDF
> databases as they currently exist could support the "several hundred"
> small queries per second needed for Haystack implementations.
>
> To give a ballpark test of this I set up the following test configuration:
> o A Jena test application that creates a tree-shaped set of RDF assertions
>   with variable depth and branching factor and then does a set of timed
>   repeated random walks from root to leaf of the tree. Each step on the
>   walk requires a separate (very small) database query - no query batching.
>   The randomization of repeated walks hopefully stresses the caching
>   mechanisms sufficiently to make the test somewhat realistic.
> o I used a branching factor of 10 and depths from 4-6 to test the
>   10k - 1m triple range.
> o The application and database were running on the same machine (requests
>   still go through the TCP stack but not out onto the LAN itself).
> o The main test machine was 700MHz, single CPU, 512MB, Linux Red Hat 7.2.
>
> The average time for one micro-query (one step in the random walk) was:
>
>   Config    #statements    time
>   MySQL          11k       2.8ms
>   MySQL         111k       3.1ms
>   MySQL       1,111k       3.8ms
>
> This is partially CPU bound; preliminary tests on a similarly configured
> 2GHz machine were about twice as fast.
>
> Preliminary figures using postgresql are 2-3 times slower than this.
>
> If these trivial query patterns are indeed representative of Haystack's
> requirements then this suggests that 300-600 accesses per second can be
> achieved on sub-$1k PCs (ignoring networking issues).
>
> Loading up 1m statements into a database is another matter however!
>
> Dave
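
For concreteness, here is a minimal sketch of the graph-cache-plus-LRU idea above: an access-ordered map keyed by resource, where each entry holds all the statements incident on that resource, so a chain of lookups like X.type followed by T.viewers mostly hits the cache. The names here (TripleStore, statementsAbout, Edge, GraphCache, objectsOf) are hypothetical stand-ins for whatever persistent back end and API the real system would use; this is an illustration, not the actual Haystack or Jena code.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical handle on the persistent store (e.g. an RDB-backed model). */
interface TripleStore {
    /** Fetch every (predicate, object) pair whose subject is the given resource. */
    List<Edge> statementsAbout(String resourceUri);
}

/** One statement incident on a resource: predicate URI and object value. */
class Edge {
    final String predicate;
    final String object;
    Edge(String predicate, String object) { this.predicate = predicate; this.object = object; }
}

/**
 * The "graph cache": an LRU map keyed by resource URI whose entries hold all
 * statements incident on that resource, so repeated "X.P" micro-queries about
 * a recently touched resource never go back to the database.
 */
class GraphCache {
    private final TripleStore store;
    private final Map<String, List<Edge>> lru;

    GraphCache(TripleStore store, final int maxResources) {
        this.store = store;
        // An access-ordered LinkedHashMap evicts the least recently used resource.
        this.lru = new LinkedHashMap<String, List<Edge>>(16, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<String, List<Edge>> eldest) {
                return size() > maxResources;
            }
        };
    }

    /** "Now that I have object X, give me the values at the other end of predicate P." */
    List<String> objectsOf(String resourceUri, String predicateUri) {
        List<Edge> incident = lru.get(resourceUri);
        if (incident == null) {
            // One round trip loads *all* statements about X; the follow-up
            // lookups (X.type, T.viewers, ...) on the same resource are hits.
            incident = store.statementsAbout(resourceUri);
            lru.put(resourceUri, incident);
        }
        List<String> objects = new ArrayList<String>();
        for (Edge e : incident) {
            if (e.predicate.equals(predicateUri)) {
                objects.add(e.object);
            }
        }
        return objects;
    }
}

The trade-off is that a cache miss pulls in every statement about X, including ones the renderer never asks for; but under the access pattern described above (many tiny follow-up queries about a resource just fetched) that extra work should be repaid many times over.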
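
And, for what it's worth, here is roughly how I read the random-walk micro-benchmark described in the quoted July 12 message. This is only a sketch: it uses the current Apache Jena API with a plain in-memory model as a stand-in, whereas the actual test ran against an RDB-backed model over MySQL, and the names in it (RandomWalkBench, the urn:x-bench: namespace, the "child" property) are made up for illustration.

import java.util.List;
import java.util.Random;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.rdf.model.Statement;

public class RandomWalkBench {
    static final String NS = "urn:x-bench:";

    public static void main(String[] args) {
        int branching = 10, depth = 4, walks = 10000;

        // Stand-in for the RDB-backed model used in the real test.
        Model m = ModelFactory.createDefaultModel();
        Property child = m.createProperty(NS, "child");
        Resource root = m.createResource(NS + "root");
        buildTree(m, child, root, branching, depth);

        Random rnd = new Random(42);
        long steps = 0;
        long start = System.currentTimeMillis();
        for (int w = 0; w < walks; w++) {
            Resource node = root;
            for (int d = 0; d < depth; d++) {
                // One "micro-query" per step: list the children of the current node.
                List<Statement> out = m.listStatements(node, child, (RDFNode) null).toList();
                node = out.get(rnd.nextInt(out.size())).getObject().asResource();
                steps++;
            }
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(steps + " micro-queries, "
                + ((double) elapsed / steps) + " ms per micro-query");
    }

    // Build a tree with the given branching factor and depth under 'parent'.
    static void buildTree(Model m, Property child, Resource parent, int branching, int depth) {
        if (depth == 0) return;
        for (int i = 0; i < branching; i++) {
            Resource c = m.createResource(parent.getURI() + "-" + i);
            m.add(parent, child, c);
            buildTree(m, child, c, branching, depth - 1);
        }
    }
}

Branching factor 10 at depth 4 gives about 11k statements, matching the smallest configuration in the table above.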
Received on Friday, 9 August 2002 00:05:56 UTC