- From: Seaborne, Andy <Andy_Seaborne@hplb.hpl.hp.com>
- Date: Thu, 29 Jan 2004 15:18:46 -0000
- To: www-rdf-dspace@w3.org
> ----Original Message----
> From: Butler, Mark <mailto:Mark_Butler@hplb.hpl.hp.com>
> Date: 28 January 2004 13:58
>
> Hi Team,
>
> I now have a pure RDQL query version of the Longwell query engine.
> Here are some performance figures comparing Lucene, the Jena API and
> pure RDQL for a local, in-memory model containing 1/34th of the
> Artstor data:
>
>                       RDQL       Jena       Lucene
> Time to load data     39139 ms   39000 ms   39057 ms
> Time to index data      291 ms      60 ms   54747 ms
> Time to query data    23748 ms     940 ms    1224 ms

For your inner loop, Jena access will incur maybe one or two hash table
lookups per value, i.e. it is very low overhead, given there are no direct
access patterns.

When you use RDQL, every single value access is incurring parser overhead:
JavaCC and object construction. Parsing a query is many times more
expensive than a couple of hash table lookups.

You are doing the equivalent of selecting each cell (row/col combination)
one access at a time. Better to get the whole row.

You can also use preparsed queries - there is an API call to set the
context of the query so you can bind some variables later than parse time
(client template queries - same idea as most SQL systems).

> So as you can see, using RDQL is around 24 times as slow as using the
> Jena API. The main reason for this is we are dealing with
> semistructured data, so as we are not certain that all properties will
> exist we have to query those properties serially. To recap from my
> previous post, first we determine which resources meet our
> constraints. For example if our constraints are
>
> rdf:type = vra:Image
> vra:subject = vraTopic:Architecture_artist
>
> then we use a query like this (omitting USING for brevity) ...
>
> SELECT ?a
> WHERE (?a , rdf:type , vra:Image ) ,
>       (?a , vra:subject, vraTopic:Architecture_artist)
>
> then we query each member of ?a serially e.g.
>
> foreach uri in ?a
>   SELECT ?b WHERE ( uri , rdf:type , ?b )
>   SELECT ?b WHERE ( uri , ims:aggregationlevel, ?b)
>   SELECT ?b WHERE ( uri , vra:subject, ?b)
>   SELECT ?b WHERE ( uri , vra:creator, ?b)
>   SELECT ?b WHERE ( uri , art:topic, ?b)
>   SELECT ?b WHERE ( uri , art:geographic, ?b)
>   SELECT ?b WHERE ( uri , dc:subject, ?b)
> end
>
> rather than in parallel e.g.
>
> foreach uri in ?a
>   SELECT ?b, ?c, ?d, ?e, ?f, ?g, ?h
>   WHERE ( uri , rdf:type , ?b ) ,
>         ( uri , ims:aggregationlevel, ?c) ,
>         ( uri , vra:subject, ?d) ,
>         ( uri , vra:creator, ?e) ,
>         ( uri , art:topic, ?f) ,
>         ( uri , art:geographic, ?g) ,
>         ( uri , dc:subject, ?h)
> end

Quick fix after a look at the code: do them in parallel:

  SELECT ?b WHERE ( uri , ?p , ?b )

It is better to get a larger chunk with more information because the
system will be able to grab that chunk in one go, rather than pick things
out without any help to the system as to the overall intent. This will be
very important when you move to a database.

This applies throughout Longwell, not just at this point. The proper change
is to review the access patterns and get larger chunks from the DB. For
in-memory use, this won't make a difference; moving to a disk DB, it will.

- - - - - -

The Joseki "fetch" style access is for exactly this case - get everything
known about a resource. The exact data returned can be controlled by the
"fetch" module invoked - if you need application-specific code, you can
write a module, although a better approach is to find out about a resource
and then select the information needed at the client. The overhead of any
extra data will have to be quite high before it is noticeable.

>
> Stefano was asking for some whitepapers here, so some good references
> on semistructured databases and query languages are [1] [2] [3]. For
> more references related to SIMILE, see [4].
> One language, Lorel, described in [1], makes it possible to perform
> parallel queries because SELECT is more flexible: you use SELECT to
> specify what you want to retrieve and you put the constraints in
> WHERE. A more Lorel-like query syntax would allow us to perform our
> task in a single query e.g.
>
> SELECT (?a , ims:aggregationlevel, ?b) ,
>        (?a , vra:subject, ?c) ,
>        (?a , vra:creator, ?d) ,
>        (?a , art:topic, ?e) ,
>        (?a , art:geographic, ?f) ,
>        (?a , dc:subject, ?g)
> WHERE (?a , rdf:type , vra:Image ) ,
>       (?a , vra:subject, vraTopic:Architecture_artist)
>
> Comments or other proposals?
>
> [1] "An Overview of Semistructured Data." Suciu. SIGACTN: SIGACT News
> (ACM Special Interest Group on Automata and Computability Theory).
> vol. 29. 1998. http://citeseer.nj.nec.com/160105.html
>
> [2] "Querying Semi-Structured Data." Serge Abiteboul. ICDT. 1997.
> pp. 1-18. http://citeseer.nj.nec.com/abiteboul97querying.html
>
> [3] "Semistructured data." Peter Buneman. 1997. pp. 117-121.
> http://citeseer.nj.nec.com/buneman97semistructured.html
>
> [4] http://web.mit.edu/simile/www/documents/researchDrivers/simileBib.html
>
> Mark Butler
> Research Scientist
> HP Labs Bristol
> http://www-uk.hpl.hp.com/people/marbut
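
A toy illustration of the advice in the reply about fetching the whole "row" for a resource rather than one cell at a time. This is only a minimal Python sketch, not Jena, RDQL or Joseki code: `ToyStore`, `describe_serially`, `describe_in_one_go` and the sample triples are all made up for illustration, and the query counter merely stands in for the per-query parse and execution overhead.

```python
from collections import defaultdict

# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("img1", "rdf:type", "vra:Image"),
    ("img1", "vra:subject", "vraTopic:Architecture_artist"),
    ("img1", "vra:creator", "creator:A"),
    ("img2", "rdf:type", "vra:Image"),
    ("img2", "dc:subject", "subject:B"),
]

# The seven properties queried in the example from the original message.
PROPERTIES = [
    "rdf:type", "ims:aggregationlevel", "vra:subject", "vra:creator",
    "art:topic", "art:geographic", "dc:subject",
]


class ToyStore:
    """In-memory triple list that counts how many queries it has served."""

    def __init__(self, triples):
        self.triples = list(triples)
        self.queries = 0  # stand-in for per-query parse/execute overhead

    def match(self, s=None, p=None, o=None):
        """Return all triples matching the pattern; None acts as a variable."""
        self.queries += 1
        return [
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        ]


def describe_serially(store, uri, properties):
    """One query per property: the 'one cell at a time' access pattern."""
    result = {}
    for prop in properties:
        values = [o for _, _, o in store.match(s=uri, p=prop)]
        if values:  # semistructured data: a property may be absent
            result[prop] = values
    return result


def describe_in_one_go(store, uri, properties):
    """One query with the predicate left as a variable, filtered client-side."""
    wanted = set(properties)
    result = defaultdict(list)
    for _, prop, obj in store.match(s=uri):
        if prop in wanted:
            result[prop].append(obj)
    return dict(result)
```

With the seven properties above, the serial version issues seven queries per resource and the single-pattern version issues one, while both return the same values; missing properties simply do not appear in the result.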
Received on Thursday, 29 January 2004 10:19:52 UTC