Longwell custom browser prototype, RDQL and Joseki

Hi team,

A while back we agreed that in an ideal world, there are three stages to the
demo:

1) browse and query a subset of the Artstor and the OCW data

2) browse and query the entire Artstor and OCW data from a local persistent
store

3) browse and query the entire Artstor and OCW data from an HTTP-based store

At the telecon last week, I stated that I think stage 3 may be difficult in
the timescale of the demo. I want to explain exactly why this is, and then
discuss the implications for the demo.


BACKGROUND

Hopefully by now other people will have had a chance to download the custom
browser prototype, which I have named "Longwell", from CVS. Early versions of
Longwell used Lucene to index and query data, but I've now finished a
version of Longwell that uses a mixture of RDQL and Jena to query the data.
I want to explain the process that Longwell uses to query the data and the
implications of this for 3. I'm guessing that Vineet's faceted browser in
Haystack must work in a similar way, so these points may also be relevant to
Haystack although perhaps Vineet can explain how his browser works. 

Longwell is a faceted browser, so its UI is configured with three bits of
information - see ArtstorRepository.java:

- the types of the RDF data objects to display (for example vra:Image and
lomEdu:learningResourceType)
- which fields to display (a field is a property with a literal value)
- which facets to display (a facet is a property that points to a controlled
term represented by another RDF data object)

Its UI consists of two separate elements (see the enclosed image):

- the results set (the left hand pane) of all the data objects that match
the current set of facet restrictions

- a facet navigator (the right hand pane) that lists the current
restrictions along with further restrictions which may be placed on the
results set

Longwell creates the UI in three stages:

- First it queries the data to identify the URIs of all the resources
corresponding to data objects that belong to registered types and match the
current restrictions. 

- Then it builds the result set using a subset of these URIs - initially the
results 1 to 10 although the user can page through the results set - and
queries all the registered fields and facets in order to generate the
results set. Some of these fields and facets can have multiple values e.g.

London: 	Bank of England drawing interior of Banking Hall by J.M.
Gandy, 1798 
creation:  	1732-1833  
period:  	18th C. A.D  
type:  	VRA Image  
subject:  	London (England)--Bank of England 
		Banks (buildings) 
		financial buildings 
		banking halls 
		Drawing  
topic:  	Architecture: Artist  
geographic: England  
creator:  	Soane, Sir John,1753-1837  

- Then it builds the facet navigator by querying the facet values for the
entire results set, to build a list of unique facet values for each
configured facet type e.g.

Topic:  Architecture  | Ceramics | Drawing | Geography | Glass | History |
Manuscripts | Maps | Metalwork | Mosaics | Music Composers | Painting |
Photography | Posters | Prints | Sculpture: Coins | Special Societal Groups
| Tapestry | Vase Paintings 

Geographic: Byzantine | Celtic |  China | England | Finland | Flanders
(Belgium) | France | Germany | Greece | Greek | India: Late Andhra Period |
Italy | Japan: Meiji Period | Late Antique/Early Christian | Mexico: Aztec |
Minoan (Knossos) | Netherlands | Palaeolithic | Portugal | Prehistoric

Currently Longwell does not calculate frequency of occurrence as Vineet's
browser does but this is on the "to do" list. 
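
To make the three stages above concrete, here is a minimal Python sketch over
a toy in-memory list of triples. It is not Longwell's actual code (Longwell is
Java and uses RDQL / Jena); the ex: URIs, the data and the function names are
all made up for illustration.

```python
# Toy data: (subject, property, value) triples. All ex: URIs are invented.
TRIPLES = [
    ("ex:img1", "rdf:type", "vra:Image"),
    ("ex:img1", "ex:title", "Bank of England"),
    ("ex:img1", "ex:topic", "Architecture"),
    ("ex:img1", "ex:geographic", "England"),
    ("ex:img2", "rdf:type", "vra:Image"),
    ("ex:img2", "ex:topic", "Painting"),
    ("ex:img2", "ex:geographic", "France"),
    ("ex:img3", "rdf:type", "vra:Image"),
    ("ex:img3", "ex:topic", "Architecture"),
    ("ex:img3", "ex:geographic", "Italy"),
]
TYPES = ["vra:Image"]                   # registered object types
FIELDS = ["ex:title"]                   # properties with literal values
FACETS = ["ex:topic", "ex:geographic"]  # properties pointing at controlled terms


def values(subject, prop):
    """All values of one property for one subject (may be empty)."""
    return [o for s, p, o in TRIPLES if s == subject and p == prop]


def stage1(restrictions):
    """URIs of objects of a registered type that match every restriction."""
    subjects = sorted({s for s, p, o in TRIPLES
                       if p == "rdf:type" and o in TYPES})
    return [s for s in subjects
            if all(v in values(s, p) for p, v in restrictions)]


def stage2(uris, start=0, size=10):
    """Fields and facets for one page of the results set."""
    return {s: {p: values(s, p) for p in FIELDS + FACETS}
            for s in uris[start:start + size]}


def stage3(uris):
    """Unique facet values across the entire results set."""
    return {p: sorted({v for s in uris for v in values(s, p)})
            for p in FACETS}


uris = stage1([("ex:topic", "Architecture")])
page = stage2(uris)
navigator = stage3(uris)
```

Note that stage 2 only touches one page of URIs, whereas stage 3 walks the
whole results set, which is the point picked up again below.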


COMPARING RDQL / JENA with LUCENE / JENA

Currently Longwell takes 1142 ms to perform stages 1, 2 and 3 in the worst
case, i.e. no restrictions, for 3300 records (1/34 of the entire dataset)
when querying an in-memory model. Task manager indicates Longwell takes 130
megabytes of RAM. We can therefore extrapolate likely performance for the
entire Artstor dataset: assuming linear scaling, a machine with enough RAM
(say 5 GB), and that Java could deal with that amount of memory (which it
might not; I remember Nick talking about this when I was in Boston), it
should take around 40 seconds in the worst case. If we assume a ten-fold
slowdown for a persistent model, then we are talking of the order of minutes.
I'm guessing that querying via HTTP could potentially add another ten-fold
slowdown here.
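
For anyone who wants to check the arithmetic, the extrapolation can be written
out as below. It simply multiplies the measured figures by 34 and applies the
two guessed ten-fold factors; linear scaling is an assumption, not something I
have measured, and the variable names are just for illustration.

```python
SAMPLE_TIME_MS = 1142   # measured worst case for the 3300-record sample
SAMPLE_RAM_MB = 130     # figure reported by task manager
SCALE = 34              # the sample is 1/34 of the entire dataset

# Assumes time and memory grow linearly with the number of records.
in_memory_s = SAMPLE_TIME_MS * SCALE / 1000   # roughly 39 seconds
ram_gb = SAMPLE_RAM_MB * SCALE / 1000         # roughly 4.4 GB

# Guessed slowdown factors, not measurements.
persistent_s = in_memory_s * 10               # roughly 6.5 minutes
http_s = persistent_s * 10                    # roughly 65 minutes

print(f"in memory:  {in_memory_s:.1f} s, {ram_gb:.2f} GB")
print(f"persistent: {persistent_s / 60:.1f} min")
print(f"over HTTP:  {http_s / 60:.1f} min")
```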

Interestingly, RDQL/Jena do compare favourably against Lucene e.g.

                                    Jena / Lucene    Jena / RDQL
Time to read in data and schemas         37000 ms       37000 ms
Time to build index                      52000 ms           0 ms
Worst case time to build UI               1582 ms        1142 ms

although I suspect I'm not giving Lucene the chance to cache indexers so I
need to check my code. 


ISSUES WITH USING RDQL OVER HTTP

Now I am still getting to grips with RDQL, so perhaps Andy can correct me if
I am wrong, but it seems to me there are a number of potential difficulties
in implementing the three stages used in Longwell via RDQL over HTTP:

a) In stage 1 we create x RDQL queries, where x corresponds to the number of
registered object types (unless we have a query that contains a type
constraint) e.g.

SELECT ?a WHERE (?a, p1, v1), (?a, p2, v2), ... (?a, pn, vn), (?a, rdf:type,
vra:Image)
SELECT ?a WHERE (?a, p1, v1), (?a, p2, v2), ... (?a, pn, vn), (?a, rdf:type,
lomEdu:LearningResourceType)

Where p1 ... pn are the properties and v1 ... vn are the values in the
restrictions. For example, to create the initial results set, we use the
queries

SELECT ?a WHERE (?a, rdf:type, vra:Image)
SELECT ?a WHERE (?a, rdf:type, lomEdu:LearningResourceType)

Now this is fine when querying a local store, but when querying over HTTP
the most efficient way to get the results is a list of resources rather than
a subgraph, as the subgraph will be of the form

a
	p1		v1 ;
	p2		v2 ;
	...
	pn		vn ;
	rdf:type	vra:Image .

i.e. it will mainly consist of information we already have. Does Joseki
provide a way to just retrieve resources, or does it return the subgraph?
Better still, we may not want to retrieve this results set at all, as it is
an intermediate computation, so it is better to keep it on the server and use
it as the basis for the queries corresponding to stages 2 and 3.
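
As a rough illustration of how the stage 1 queries in a) are put together, the
Python sketch below generates one query string per registered type. The
function name is hypothetical, and the output follows the abbreviated notation
used in this email; a query sent to a real RDQL processor would, as far as I
understand it, also need the URIs quoted and the namespace prefixes declared.

```python
def stage1_queries(restrictions, types):
    """Build one query per registered type, in this email's shorthand.

    restrictions: list of (property, value) pairs currently in force.
    types: the registered object types.
    """
    queries = []
    for rdf_type in types:
        # Every restriction becomes a required triple pattern on ?a ...
        patterns = [f"(?a, {prop}, {value})" for prop, value in restrictions]
        # ... followed by the type constraint for this registered type.
        patterns.append(f"(?a, rdf:type, {rdf_type})")
        queries.append("SELECT ?a WHERE " + ", ".join(patterns))
    return queries


initial = stage1_queries([], ["vra:Image", "lomEdu:LearningResourceType"])
for query in initial:
    print(query)
```

With no restrictions this yields the two initial queries shown above, so the
number of HTTP round trips in stage 1 grows with the number of registered
types.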

b) In stage 2, we have to query each property corresponding to a field or
facet individually for a subset of the results set acquired in the previous
stage. Querying individually is important because a given property may not
exist for every subject: if we query all the properties in a single
conjunctive query then we will only get the objects that correspond to
subjects that have all the properties, when we actually want the objects for
subjects that have only a subset of the properties as well.

This creates a problem when querying over HTTP as we want to minimize the
number of queries, so really we want to query these properties in a single
query that can express that these properties are optional rather than
constraining. Dave Reynolds's paper on QBE
http://www.hpl.hp.com/semweb/publications/DaveR-www2003.pdf
discusses the use of predicate patterns to solve this problem, and I guess
they can be used in RDQL as it supports REGEX expressions, but using
namespaces in this way is not sufficiently specific, as ideally we want to
specify a subset of properties corresponding to the descriptive metadata and
omit the technical metadata.

Second, here the query processor must return subgraphs, otherwise we will
potentially have problems with properties with multiple values, as Dave
Reynolds describes in section 2.1.1 of the QBE paper.

Third, as far as I know, RDQL does not provide functionality similar to
cursors so it is not possible to retrieve a subset of the results set. Nor
does it provide functionality similar to sort by, so providing cursors is
potentially meaningless as in Jena there is no guarantee about statement
ordering in models, so the ordering of results sets may vary between
subsequent queries. 
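
The first problem in b), missing properties, can be shown with a small Python
sketch over toy data. The ex: names and both functions are invented for
illustration; the "conjunctive" function stands in for a single query in which
every property is a required pattern.

```python
TRIPLES = [
    ("ex:a", "ex:creator", "Soane"),
    ("ex:a", "ex:period", "18th C."),
    ("ex:b", "ex:creator", "Gandy"),   # ex:b has no ex:period at all
]
SUBJECTS = ["ex:a", "ex:b"]
PROPS = ["ex:creator", "ex:period"]


def values(subject, prop):
    """All values of one property for one subject (may be empty)."""
    return [o for s, p, o in TRIPLES if s == subject and p == prop]


def conjunctive(subjects, props):
    """One query, all properties required: subjects missing any are lost."""
    return [s for s in subjects if all(values(s, p) for p in props)]


def per_property(subjects, props):
    """One query per property: a missing property is just an empty list."""
    result = {s: {} for s in subjects}
    for p in props:            # each iteration stands for a separate query
        for s in subjects:
            result[s][p] = values(s, p)
    return result


print(conjunctive(SUBJECTS, PROPS))
print(per_property(SUBJECTS, PROPS)["ex:b"])
```

The conjunctive version silently drops ex:b, while the per-property version
keeps it but costs one query per field or facet, which is exactly what we
would like to avoid over HTTP.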

c) In stage 3, we have the same problems of having to query facets
individually and multiple values for the same property. Note stage 3 is much
more critical than stage 2, as in stage 2 we just work on a fixed size
subset of the results set whereas in stage 3 we work across the entire
results set, so at a minimum the time taken to calculate stage 3 will
increase linearly with the size of the dataset. 

In addition, as we are interested in summary statistics rather than every
individual facet value, it would be useful to have a mechanism similar to
count in SQL so we just get back unique facet values and their frequency. I
spoke to Andy about this a while ago, and he suggested precomputing the
frequencies of the facet values. Unfortunately precomputing in this way
won't help when processing restrictions, as the frequencies of the available
facet values change depending on the restrictions currently in force, i.e. we
have to recalculate. However, the worst possible case for calculating facet
values is when no restrictions are in place, and the majority of restrictions
drastically partition the dataset, so there may still be some advantage in
precomputing frequencies for this worst-case situation.
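
A minimal Python sketch of that compromise is below: facet value frequencies
are precomputed once for the unrestricted case and recalculated whenever
restrictions are in force. The data and function names are made up, and this
is client-side counting rather than anything RDQL or Joseki provides.

```python
from collections import Counter

# subject -> {facet: [values]}; toy data for illustration only.
FACET_VALUES = {
    "ex:img1": {"topic": ["Architecture"], "geographic": ["England"]},
    "ex:img2": {"topic": ["Painting"], "geographic": ["France"]},
    "ex:img3": {"topic": ["Architecture"], "geographic": ["Italy"]},
}


def facet_counts(subjects):
    """Frequency of each facet value across the given subjects."""
    counts = {}
    for s in subjects:
        for facet, vals in FACET_VALUES[s].items():
            counts.setdefault(facet, Counter()).update(vals)
    return counts


# Worst case (no restrictions) is computed once, up front.
UNRESTRICTED = facet_counts(FACET_VALUES)


def counts_for(restrictions):
    """Use the precomputed counts if unrestricted, otherwise recalculate."""
    if not restrictions:
        return UNRESTRICTED
    matching = [s for s, facets in FACET_VALUES.items()
                if all(v in facets.get(f, []) for f, v in restrictions)]
    return facet_counts(matching)


print(counts_for([])["topic"])
print(counts_for([("topic", "Architecture")])["geographic"])
```

The recalculation still walks every matching subject, so this only removes
the unrestricted case; the restricted cases rely on the restrictions cutting
the results set down.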


SUMMARY

So in summary it is not clear whether RDQL will support a number of things
that we would like:

- caching intermediate results on the server, and using them as the basis of
further queries.
- distinguishing between predicates that must be matched and predicates that
are optional.
- not returning information which is already implicit in the queries.
- some way to sort results so we can guarantee ordering.
- retrieving results subsets (cursors in SQL).
- some way to count the number of times that a data object is the object of a
particular property.


WORKAROUNDS

Clearly this is a lot of work to do in time for the demo but I would like to
propose a number of workarounds, and I would be interested to hear other
suggestions: 

1. We could have a browser that does not support a faceted navigation pane.
This will impact user functionality, because as we've discussed before
providing a summary of the collection as a faceted navigation pane makes it
much easier for the user to query collections that they are unfamiliar with.
However it may simply not be feasible to use faceted navigation above a
certain dataset size.

2. Currently Longwell consists of a servlet with a Velocity template which
serves results as HTML. It would be possible to create another Velocity
template that served results as RDF/XML, and as Longwell uses query strings
to describe restrictions this provides the basis of a specialized query API.
It would be fairly straightforward to return the results set as RDF, but
returning the faceted navigation pane would require the creation of a small
schema / ontology. This approach is very specialized, but will be much more
efficient in its use of HTTP than implementing the queries as step-by-step
HTTP requests as outlined above. Alternatively we could have a single
template that created XML, and then XSLT to style that to HTML or RDF/XML. 

3. We could instead (or additionally) provide an RDQL query API by giving
Joseki access to the model used by Longwell, so although the HTML interface
is being generated on the server, this does not preclude the use of RDF/XML
SW-enabled APIs.
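
To illustrate workaround 2, here is a small Python sketch of treating a query
string as the list of restrictions in force. The parameter layout
(facet=value pairs) is an assumption made for this example, not necessarily
the format Longwell actually uses, and the function names are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode


def parse_restrictions(query_string):
    """Decode a query string into a list of (facet, value) restrictions."""
    return parse_qsl(query_string)


def add_restriction(query_string, facet, value):
    """Query string for the same view narrowed by one more restriction."""
    restrictions = parse_restrictions(query_string)
    if (facet, value) not in restrictions:
        restrictions.append((facet, value))
    return urlencode(restrictions)


current = "topic=Architecture&geographic=England"
print(parse_restrictions(current))
print(add_restriction(current, "period", "18th C. A.D"))
```

Each entry in the facet navigator could then carry a link built this way, so
one HTTP request returns a complete restricted view whichever template (HTML
or RDF/XML) renders it.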


Any other suggestions? Feedback? Thanks are due to Dave Banks for discussing
these issues with me.

Mark Butler
Research Scientist 
HP Labs Bristol
http://www-uk.hpl.hp.com/people/marbut 

Received on Monday, 26 January 2004 14:29:22 UTC