- From: Butler, Mark <Mark_Butler@hplb.hpl.hp.com>
- Date: Wed, 28 Jan 2004 13:57:16 -0000
- To: " (www-rdf-dspace@w3.org)" <www-rdf-dspace@w3.org>
Hi Team,

I now have a pure RDQL query version of the Longwell query engine. Here are some performance figures comparing Lucene, the Jena API and pure RDQL for a local, in-memory model containing 1/34th of the Artstor data:

                      RDQL        Jena        Lucene
Time to load data     39139 ms    39000 ms    39057 ms
Time to index data      291 ms       60 ms    54747 ms
Time to query data    23748 ms      940 ms     1224 ms

So as you can see, using RDQL is around 25 times as slow as using the Jena API (23748 ms versus 940 ms to query). The main reason for this is that we are dealing with semistructured data: as we cannot be certain that every property exists for every resource, we have to query those properties serially.

To recap from my previous post, first we determine which resources meet our constraints. For example, if our constraints are

  rdf:type = vra:Image
  vra:subject = vraTopic:Architecture_artist

then we use a query like this (omitting USING for brevity):

  SELECT ?a
  WHERE (?a , rdf:type , vra:Image ) ,
        (?a , vra:subject, vraTopic:Architecture_artist)

then we query each member of ?a serially e.g.

  foreach uri in ?a
    SELECT ?b WHERE ( uri , rdf:type , ?b )
    SELECT ?b WHERE ( uri , ims:aggregationlevel, ?b)
    SELECT ?b WHERE ( uri , vra:subject, ?b)
    SELECT ?b WHERE ( uri , vra:creator, ?b)
    SELECT ?b WHERE ( uri , art:topic, ?b)
    SELECT ?b WHERE ( uri , art:geographic, ?b)
    SELECT ?b WHERE ( uri , dc:subject, ?b)
  end

rather than in parallel e.g.

  foreach uri in ?a
    SELECT ?b, ?c, ?d, ?e, ?f, ?g, ?h
    WHERE ( uri , rdf:type , ?b ) ,
          ( uri , ims:aggregationlevel, ?c) ,
          ( uri , vra:subject, ?d) ,
          ( uri , vra:creator, ?e) ,
          ( uri , art:topic, ?f) ,
          ( uri , art:geographic, ?g) ,
          ( uri , dc:subject, ?h)
  end

The parallel form does not work for us because every triple pattern in the WHERE clause must match: a resource that lacks any one of these properties returns no results at all.

Stefano was asking for some whitepapers here, so some good references on semistructured databases and query languages are [1] [2] [3]. For more references related to SIMILE, see [4].

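To make the serial-versus-parallel distinction above concrete, here is a minimal sketch in Python over a toy in-memory list of triples. It is not Jena or RDQL code: the resources (img1, img2), their values, the reduced property list and the helper function names are all hypothetical, chosen only to show what happens when one resource is missing a property.

```python
from itertools import product

# Hypothetical toy data: img2 deliberately lacks vra:creator and dc:subject.
TRIPLES = [
    ("img1", "rdf:type", "vra:Image"),
    ("img1", "vra:subject", "vraTopic:Architecture_artist"),
    ("img1", "vra:creator", "creator:A"),
    ("img1", "dc:subject", "Architecture"),
    ("img2", "rdf:type", "vra:Image"),
    ("img2", "vra:subject", "vraTopic:Architecture_artist"),
]

# Assumed subset of the properties we want to retrieve for each match.
PROPERTIES = ["rdf:type", "vra:subject", "vra:creator", "dc:subject"]


def objects(subject, predicate):
    """All objects of triples with the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def select_resources(constraints):
    """Step 1: subjects that satisfy every (predicate, object) constraint."""
    subjects = {s for s, _, _ in TRIPLES}
    return sorted(
        s for s in subjects
        if all(o in objects(s, p) for p, o in constraints)
    )


def serial_query(uri):
    """One lookup per property; a missing property yields an empty list."""
    return {p: objects(uri, p) for p in PROPERTIES}


def parallel_query(uri):
    """One conjunctive lookup: every property must bind, so the result is
    the cross product of the per-property values and is empty as soon as
    any single property is missing."""
    return list(product(*(objects(uri, p) for p in PROPERTIES)))
```

With this data, select_resources finds both img1 and img2. serial_query("img2") still reports its rdf:type and vra:subject values and leaves the other two empty, whereas parallel_query("img2") returns an empty list, so the resource would be lost entirely from the results.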
One language, Lorel, described in [1], makes it possible to perform such parallel queries because its SELECT is more flexible: you use SELECT to specify what you want to retrieve and you put the constraints in WHERE. A more Lorel-like query syntax would allow us to perform our task in a single query e.g.

  SELECT (?a , ims:aggregationlevel, ?b) ,
         (?a , vra:subject, ?c) ,
         (?a , vra:creator, ?d) ,
         (?a , art:topic, ?e) ,
         (?a , art:geographic, ?f) ,
         (?a , dc:subject, ?g)
  WHERE (?a , rdf:type , vra:Image ) ,
        (?a , vra:subject, vraTopic:Architecture_artist)

Comments or other proposals?

[1] "An Overview of Semistructured Data." Suciu. SIGACTN: SIGACT News (ACM Special Interest Group on Automata and Computability Theory). vol. 29. 1998. http://citeseer.nj.nec.com/160105.html
[2] "Querying Semi-Structured Data." Serge Abiteboul. ICDT. 1997. pp. 1-18. http://citeseer.nj.nec.com/abiteboul97querying.html
[3] "Semistructured data." Peter Buneman. 1997. pp. 117-121. http://citeseer.nj.nec.com/buneman97semistructured.html
[4] http://web.mit.edu/simile/www/documents/researchDrivers/simileBib.html

Mark Butler
Research Scientist
HP Labs Bristol
http://www-uk.hpl.hp.com/people/marbut
Received on Wednesday, 28 January 2004 08:57:59 UTC