RE: Longwell custom browser prototype, RDQL and Joseki from Butler, Mark on 2004-01-28 (www-rdf-dspace@w3.org from January 2004)

From: Butler, Mark <Mark_Butler@hplb.hpl.hp.com>
Date: Wed, 28 Jan 2004 13:57:16 -0000
To: " (www-rdf-dspace@w3.org)" <www-rdf-dspace@w3.org>
Message-ID: <E864E95CB35C1C46B72FEA0626A2E808ED21AE@0-mail-br1.hpl.hp.com>

Hi Team,

I now have a pure RDQL query version of the Longwell query engine. Here are
some performance figures comparing Lucene, the Jena API and pure RDQL for a
local, in-memory model containing 1/34th of the Artstor data:

                        RDQL        Jena        Lucene
Time to load data       39139 ms    39000 ms    39057 ms
Time to index data        291 ms       60 ms    54747 ms
Time to query data      23748 ms      940 ms     1224 ms

So as you can see, querying with RDQL is around 25 times slower than querying
with the Jena API (23748 ms against 940 ms). The main reason is that we are
dealing with semistructured data: because we cannot be certain that every
property exists on every resource, we have to query the properties serially.
To recap from my previous post, first we determine which resources meet our
constraints. For example, if our constraints are

rdf:type = vra:Image
vra:subject = vraTopic:Architecture_artist

then we use a query like this (omitting the USING clause for brevity):

SELECT ?a
WHERE  (?a , rdf:type , vra:Image ) ,
       (?a , vra:subject , vraTopic:Architecture_artist )

then we query the properties of each resource bound to ?a serially, e.g.

foreach uri in ?a
  SELECT ?b WHERE ( uri , rdf:type , ?b )
  SELECT ?b WHERE ( uri , ims:aggregationlevel, ?b)
  SELECT ?b WHERE ( uri , vra:subject, ?b)
  SELECT ?b WHERE ( uri , vra:creator, ?b)
  SELECT ?b WHERE ( uri , art:topic, ?b)
  SELECT ?b WHERE ( uri , art:geographic, ?b)
  SELECT ?b WHERE ( uri , dc:subject, ?b)
end

rather than in parallel, because a single conjunctive query returns no rows
for a resource that lacks any one of the properties, e.g.

foreach uri in ?a
  SELECT ?b, ?c, ?d, ?e, ?f, ?g, ?h
  WHERE ( uri , rdf:type , ?b ) ,
        ( uri , ims:aggregationlevel, ?c) ,
        ( uri , vra:subject, ?d) ,
        ( uri , vra:creator, ?e) ,
        ( uri , art:topic, ?f) ,
        ( uri , art:geographic, ?g) ,
        ( uri , dc:subject, ?h)
end
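
To illustrate the difference between the two forms on semistructured data,
here is a minimal Python sketch over a hypothetical in-memory set of triples.
It is not RDQL or the Jena API: the resource names (img1, img2), the property
values and the helper functions are all assumptions made up for illustration.

```python
# Hypothetical data: (subject, predicate, object) tuples. Neither resource
# carries all seven properties, which is the semistructured case.
TRIPLES = {
    ("img1", "rdf:type", "vra:Image"),
    ("img1", "vra:subject", "vraTopic:Architecture_artist"),
    ("img1", "vra:creator", "creator:A"),
    ("img2", "rdf:type", "vra:Image"),
    ("img2", "vra:subject", "vraTopic:Architecture_artist"),
    ("img2", "dc:subject", "Architecture"),
}

PROPERTIES = [
    "rdf:type", "ims:aggregationlevel", "vra:subject", "vra:creator",
    "art:topic", "art:geographic", "dc:subject",
]


def objects(triples, subject, predicate):
    """Values of one property, like SELECT ?b WHERE (uri, predicate, ?b)."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def matching(triples, constraints):
    """Subjects that satisfy every (predicate, object) constraint."""
    subjects = {s for s, _, _ in triples}
    return sorted(
        s for s in subjects
        if all((s, p, o) in triples for p, o in constraints)
    )


def serial(triples, uri):
    """One lookup per property; a missing property yields an empty list."""
    return {p: objects(triples, uri, p) for p in PROPERTIES}


def conjunctive(triples, uri):
    """All patterns in one WHERE clause: every pattern must match."""
    rows = [{}]
    for p in PROPERTIES:
        # Joining against a property with no values empties the result.
        rows = [{**row, p: o} for row in rows for o in objects(triples, uri, p)]
    return rows
```

With this data, serial(TRIPLES, "img1") still reports the creator and leaves
the absent properties empty, whereas conjunctive(TRIPLES, "img1") returns no
rows at all because img1 has no ims:aggregationlevel.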

Stefano was asking for some whitepapers here, so some good references on
semistructured databases and query languages are [1], [2] and [3]. For more
references related to SIMILE, see [4]. One language, Lorel, described in [1],
makes it possible to perform parallel queries because its SELECT is more
flexible: you use SELECT to specify what you want to retrieve and you put
the constraints in WHERE. A more Lorel-like query syntax would allow us to
perform our task in a single query, e.g.

SELECT (?a , ims:aggregationlevel, ?b) ,
       (?a , vra:subject, ?c) ,
       (?a , vra:creator, ?d) ,
       (?a , art:topic, ?e) ,
       (?a , art:geographic, ?f) ,
       (?a , dc:subject, ?g)
WHERE  (?a , rdf:type , vra:Image ) ,
       (?a , vra:subject, vraTopic:Architecture_artist)
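
One possible reading of this Lorel-like form is that the WHERE patterns are
mandatory while each SELECT pattern is optional. The Python sketch below
follows that reading only; it is an assumption for discussion, not Lorel's
actual semantics, and the data, resource names and the lorel_like function
are hypothetical.

```python
# Hypothetical data: img2 fails the vra:subject constraint, img1 passes
# but lacks most of the properties named in SELECT.
TRIPLES = {
    ("img1", "rdf:type", "vra:Image"),
    ("img1", "vra:subject", "vraTopic:Architecture_artist"),
    ("img1", "vra:creator", "creator:A"),
    ("img2", "rdf:type", "vra:Image"),
    ("img2", "vra:subject", "vraTopic:Painting"),
}

SELECT_PROPS = [
    "ims:aggregationlevel", "vra:subject", "vra:creator",
    "art:topic", "art:geographic", "dc:subject",
]

WHERE = [
    ("rdf:type", "vra:Image"),
    ("vra:subject", "vraTopic:Architecture_artist"),
]


def lorel_like(triples, select_props, where):
    """Filter subjects by WHERE, then collect whatever SELECT values exist."""
    results = {}
    for subject in sorted({s for s, _, _ in triples}):
        # WHERE patterns are mandatory: skip subjects missing any of them.
        if not all((subject, p, o) in triples for p, o in where):
            continue
        # SELECT patterns are optional: absent properties give empty lists.
        results[subject] = {
            prop: sorted(o for s, p, o in triples if s == subject and p == prop)
            for prop in select_props
        }
    return results
```

Calling lorel_like(TRIPLES, SELECT_PROPS, WHERE) makes a single pass over the
matching resources and keeps img1 even though it has no art:topic, which is
the behaviour the serial queries currently emulate.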

Comments or other proposals?

[1] "An Overview of Semistructured Data." Suciu. SIGACT News (ACM Special
Interest Group on Automata and Computability Theory). vol. 29. 1998.
http://citeseer.nj.nec.com/160105.html

[2] "Querying Semi-Structured Data." Serge Abiteboul. ICDT. 1997. pp. 1-18.
http://citeseer.nj.nec.com/abiteboul97querying.html

[3] "Semistructured data." Peter Buneman. 1997. pp. 117-121.
http://citeseer.nj.nec.com/buneman97semistructured.html

[4] http://web.mit.edu/simile/www/documents/researchDrivers/simileBib.html

Mark Butler
Research Scientist 
HP Labs Bristol
http://www-uk.hpl.hp.com/people/marbut

Received on Wednesday, 28 January 2004 08:57:59 UTC