RE: Longwell custom browser prototype, RDQL and Joseki from Seaborne, Andy on 2004-01-29 (www-rdf-dspace@w3.org from January 2004)

From: Seaborne, Andy <Andy_Seaborne@hplb.hpl.hp.com>
Date: Thu, 29 Jan 2004 15:18:46 -0000
To: www-rdf-dspace@w3.org
Message-ID: <E864E95CB35C1C46B72FEA0626A2E808712FEC@0-mail-br1.hpl.hp.com>
> ----Original Message----
> From: Butler, Mark <mailto:Mark_Butler@hplb.hpl.hp.com>
> Date: 28 January 2004 13:58
> 
> Hi Team,
> 
> I now have a pure RDQL query version of the Longwell query engine. 
> Here are some performance figures comparing Lucene, the Jena API and 
> pure RDQL for a local, in-memory model containing 1/34th of the
> Artstor data:   
> 
>                       RDQL        Jena        Lucene
> Time to load data     39139 ms    39000 ms    39057 ms
> Time to index data      291 ms       60 ms    54747 ms
> Time to query data    23748 ms      940 ms     1224 ms

For your inner loop, Jena access will incur maybe one or two hash table
lookups per value, i.e. it is very low overhead, given there are no direct
access patterns.

When you use RDQL, every single value access is incurring parser overhead -
JavaCC and object construction.  Parsing a query is many times more
expensive than a couple of hash table lookups.  You are doing the equivalent
of selecting each cell (row/col combination) one access at a time.  Better
to get the whole row.
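
To make the cell-versus-row point concrete, here is a minimal sketch in
Python.  It is a toy model, not Jena or RDQL: the ToyStore class, its
"subject predicate" query strings and the img1 data are all made up for
illustration, and the parse counter merely stands in for the parser and
object construction cost.

```python
from collections import defaultdict


class ToyStore:
    """Hypothetical in-memory store; not the Jena API."""

    def __init__(self, triples):
        # subject -> predicate -> [objects]: two hash lookups per value
        self.index = defaultdict(lambda: defaultdict(list))
        for s, p, o in triples:
            self.index[s][p].append(o)
        self.parse_count = 0

    def get(self, subject, predicate):
        """Direct API-style access: dictionary lookups only, no parsing."""
        return list(self.index.get(subject, {}).get(predicate, []))

    def query(self, text):
        """String query of the made-up form 'subject predicate' or 'subject *'."""
        self.parse_count += 1  # stands in for parsing and object construction
        subject, predicate = text.split()
        props = self.index.get(subject, {})
        if predicate == "*":
            return {p: list(objs) for p, objs in props.items()}
        return {predicate: list(props.get(predicate, []))}


store = ToyStore([
    ("img1", "rdf:type", "vra:Image"),
    ("img1", "vra:creator", "creator1"),
    ("img1", "dc:subject", "subject1"),
])
properties = ["rdf:type", "vra:creator", "dc:subject", "art:topic"]

# Cell at a time: one parse for every property of every resource.
cells = {p: store.query(f"img1 {p}")[p] for p in properties}
parses_per_cell = store.parse_count

# Row at a time: one parse for the whole resource.
store.parse_count = 0
row = store.query("img1 *")
parses_per_row = store.parse_count
```

The absolute counts mean nothing in themselves; the point is only that the
cell-at-a-time loop pays the parse cost once per property per resource,
while the row-at-a-time version pays it once per resource.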

You can also use preparsed queries - there is an API call to set the
context of the query so you can bind some variables later than parse time
(client template queries - same idea as most SQL systems).
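
The parse-once, bind-later idea can be sketched as follows.  This is a toy
in Python under stated assumptions, not the Jena call itself:
PreparedPattern, its run method and the whitespace-separated pattern syntax
are hypothetical names invented for the example, and the data is made up.

```python
class PreparedPattern:
    """Hypothetical single-triple-pattern template; not the Jena RDQL API."""

    parse_calls = 0  # class-wide counter so the saving is visible

    def __init__(self, text):
        # "Parsing": split '?x vra:creator ?b' into three terms, done once.
        PreparedPattern.parse_calls += 1
        self.terms = tuple(text.split())
        if len(self.terms) != 3:
            raise ValueError("expected 'subject predicate object'")

    def run(self, triples, **bindings):
        """Match the pattern, with some variables pre-bound by the caller."""
        # Replace pre-bound variables such as ?x with the supplied values.
        terms = [bindings.get(t[1:], t) if t.startswith("?") else t
                 for t in self.terms]
        results = []
        for triple in triples:
            row = {}
            for term, value in zip(terms, triple):
                if term.startswith("?"):
                    # A variable: bind it, or check it against an earlier binding.
                    if row.setdefault(term[1:], value) != value:
                        break
                elif term != value:
                    break  # a constant that does not match this triple
            else:
                results.append(row)
        return results


triples = [
    ("img1", "vra:creator", "creator1"),
    ("img2", "vra:creator", "creator2"),
    ("img2", "vra:creator", "creator3"),
]

template = PreparedPattern("?x vra:creator ?b")  # parsed once
first = template.run(triples, x="img1")          # bound at execution time
second = template.run(triples, x="img2")
```

Both executions reuse the one parsed template; only the binding for ?x
changes between them.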

> So as you can see, using RDQL is around 25 times as slow as using the 
> Jena API. The main reason for this is we are dealing with 
> semistructured data, so as we are not certain that all properties will 
> exist we have to query those properties serially. To recap from my 
> previous post, first we determine which resources meet our
> constraints. For example if our constraints are     
> 
> rdf:type = vra:Image
> vra:subject vraTopic:Architecture_artist
> 
> then we use a query like this (omitting USING for brevity) ...
> 
> SELECT ?a
> WHERE  (?a , rdf:type , vra:Image ) ,
>        (?a , vra:subject, vraTopic:Architecture_artist)
> 
> then we query each member of ?a serially e.g.
> 
> foreach uri in ?a
>   SELECT ?b WHERE ( uri , rdf:type , ?b )
>   SELECT ?b WHERE ( uri , ims:aggregationlevel, ?b)
>   SELECT ?b WHERE ( uri , vra:subject, ?b)
>   SELECT ?b WHERE ( uri , vra:creator, ?b)
>   SELECT ?b WHERE ( uri , art:topic, ?b)
>   SELECT ?b WHERE ( uri , art:geographic, ?b)
>   SELECT ?b WHERE ( uri , dc:subject, ?b)
> end
> 
> rather than in parallel e.g.
> 
> foreach uri in ?a
>   SELECT ?b, ?c, ?d, ?e, ?f, ?g, ?h
>   WHERE ( uri , rdf:type , ?b ) ,
>         ( uri , ims:aggregationlevel, ?c) ,
>         ( uri , vra:subject, ?d) ,
>         ( uri , vra:creator, ?e) ,
>         ( uri , art:topic, ?f) ,
>         ( uri , art:geographic, ?g) ,
>         ( uri , dc:subject, ?h)
> end

Quick fix after a look at the code: do them in parallel with one query that
returns each property together with its value:

SELECT ?p, ?b WHERE ( uri , ?p , ?b )

It is better to get a larger chunk with more information because the system
will be able to grab that chunk in one go, rather than pick things out
without any help to the system as to the overall intent.
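
A rough sketch of why the one-query form also copes with semistructured
data.  Again this is a toy in Python with made-up triples, not Jena code;
describe and conjunctive are hypothetical helpers standing in for a
( uri , ?p , ?b ) pattern that returns both ?p and ?b, and for a
multi-pattern WHERE clause respectively.

```python
from collections import defaultdict

# Made-up data: img2 has no vra:creator, as happens with semistructured data.
TRIPLES = [
    ("img1", "rdf:type", "vra:Image"),
    ("img1", "vra:creator", "creator1"),
    ("img1", "dc:subject", "subject1"),
    ("img2", "rdf:type", "vra:Image"),
    ("img2", "dc:subject", "subject2"),
]


def describe(triples, uri):
    """Stand-in for ( uri , ?p , ?b ): one pass, every property that exists."""
    found = defaultdict(list)
    for s, p, o in triples:
        if s == uri:
            found[p].append(o)
    return dict(found)


def conjunctive(triples, uri, predicates):
    """Stand-in for one triple pattern per predicate, all required to match."""
    props = describe(triples, uri)
    if any(p not in props for p in predicates):
        return []  # a single missing property empties the whole result
    rows = [{}]
    for p in predicates:
        rows = [{**r, p: o} for r in rows for o in props[p]]
    return rows


wanted = ["rdf:type", "vra:creator", "dc:subject"]

img1_joined = conjunctive(TRIPLES, "img1", wanted)
img2_joined = conjunctive(TRIPLES, "img2", wanted)

# One wildcard access, then pick out the wanted properties at the client.
img2_all = describe(TRIPLES, "img2")
img2_selected = {p: img2_all.get(p, []) for p in wanted}
```

The conjunctive form returns nothing for img2 because vra:creator is
absent, which is what pushes people towards one query per property.  The
wildcard form returns whatever is there in a single access and leaves the
selection to the client.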

This will be very important when you move to a database. This applies
throughout Longwell, not just at this point.  The proper change is to review
the access patterns and get larger chunks from the DB.  For in-memory use,
this won't make a difference; moving to a disk DB, it will.

- - - - - -

The Joseki "fetch" style access is for exactly this case - get everything
known about a resource.  The exact data returned can be controlled by the
"fetch" module invoked - if you need application-specific code, you can
write a module, although a better approach is to find out about a resource
and then select the information needed at the client.  The overhead of any
extra data will have to be quite high before it is noticeable.

> 
> Stefano was asking for some whitepapers here, so some good references 
> on semistructured databases and query languages are [1] [2] [3]. For 
> more references related to SIMILE, see [4]. One language, Lorel, 
> described in [1], makes it possible to perform such queries in parallel
> because SELECT is more flexible: you use SELECT to specify what you
> want to retrieve and 
> you put the constraints in WHERE. A more Lorel-like query syntax would 
> allow us to perform our task in a single query e.g.
> 
> SELECT (?a , ims:aggregationlevel, ?b) ,
>        (?a , vra:subject, ?c) ,
>        (?a , vra:creator, ?d) ,
>        (?a , art:topic, ?e) ,
>        (?a , art:geographic, ?f) ,
>        (?a , dc:subject, ?g)
> WHERE  (?a , rdf:type , vra:Image ) ,
>        (?a , vra:subject, vraTopic:Architecture_artist)
> 
> Comments or other proposals?
> 
> [1] "An Overview of Semistructured Data." Suciu. SIGACTN: SIGACT News 
> (ACM Special Interest Group on Automata and Computability Theory). 
> vol. 29. 1998. http://citeseer.nj.nec.com/160105.html
> 
> [2] "Querying Semi-Structured Data." Serge Abiteboul. ICDT. 1997. pp. 
> 1-18. http://citeseer.nj.nec.com/abiteboul97querying.html
> 
> [3] "Semistructured data." Peter Buneman. 1997. pp. 117--121. 
> http://citeseer.nj.nec.com/buneman97semistructured.html
> 
> [4] 
> http://web.mit.edu/simile/www/documents/researchDrivers/simileBib.html
> 
> Mark Butler
> Research Scientist
> HP Labs Bristol
> http://www-uk.hpl.hp.com/people/marbut
Received on Thursday, 29 January 2004 10:19:52 UTC