SIMILE : Implications on the SWEB Programme

August 2003

This note is a follow-on from the SIMILE plenary of 23-24 July 2003. It describes the work requirements of the plan as they impact Jena and Jena's relational database-backed storage subsystem provided by Genesis.

The main deliverable, for a SIMILE demonstrator at the end of Dec 2003, is a metadata store with query access and a system for mappings between identified vocabularies.

SIMILE is an opportunity to further increase the maturity and robustness of HP's semantic web tools and to add new features for usability and performance by regarding SIMILE as a key user group. While it may be possible to produce some features by exploiting the SIMILE environment (example: forward-chaining over metadata in the SIMILE data ingestion process), our approach will be to produce reusable technology which will form part of the Jena open source releases. This is compatible with the SIMILE project's own open source objective.

Plenary Outputs

Description of demonstrators included in the plenary write-up.

See also the SIMILE Work Plan assignments.

Demonstrators

Details in the plenary write-up but in summary:

Demo #1a calls for a metadata store with two independent vocabularies. Queries are first shown finding only things from the vocabulary the query is expressed in. A mapping is then introduced and, without taking the server out of use for a period of time (cf. a database rebuild), the same queries return more results, with records from the other vocabulary as well.
Demo #1b is the same but with three vocabularies and three 1-1 mappings.
Demo #2a is Schema-driven instance editing and visualization.
Demo #2b is producing and introducing a vocabulary with local extensions (e.g. VRAcore + local shelf number).

Demo #1a is the most significant in work requirements.

Work Areas for the Semantic Web Programme

The Semantic Web Programme is to provide a metadata store built on Jena to meet the functional and performance requirements of the demonstrators.

Unknowns and Assumptions

The following assumptions are currently made for the purposes of this note.

The target, for the SIMILE project, is to build demonstrator #1a by end CY2003.
The scale is of the order of 40k records of metadata (total? per-vocabulary?)
The demonstrator will take the form of a metastore and an instance of DSpace to store the content i.e. the metadata store is not storing content
A mapping is added during the demo but the metadata records are already in the store initially.
The descriptions of the systems call for a "Mapping Repository". This is not necessary for the end CY2003 demonstrator.
There is no visual presentation interface provided as part of the metadata store.
The visual presentation is either a standard architecture web application (N-tier) or a remote client such as Haystack.
The two vocabularies are VRAcore v3 and IMS. Although the metastore is not dependent on which vocabularies are used, the development of the mapping engine will be directed to meet the needs of interworking between these two vocabularies in the first instance.
In order to provide realistic operation, the metadata (in RDF; for both corpus and vocabularies) is provided by other groups within the SIMILE programme.
Again, for realistic operation, the mapping between VRAcoreV3 and IMS will initially be designed by other sections of the SIMILE programme. The Semantic Web programme will assist as needed: the initial semantic relationships will need to come from a domain expert.
The connection to the metastore is one or both of Joseki or Java code (or compatible).
The metadata store is based on Jena, with a relational database used as a Jena persistent store. The relational database is one of the ones supported by Jena (specifically, the database work by the Genesis team).
All the code is, or will be, part of the Jena open source distribution.
The remote interface is Joseki, possibly with additions
The query language is RDQL, possibly with additions
The mapping functionality is a Jena inference engine
The performance goal is to adequately support the demonstrator presentation

Unknowns:

The detailed demonstrator description for demo #1a
Is translation between different controlled vocabularies required as part of query access?

We need to confirm these assumptions and pin down the unknowns. There are probably others. Now would be a good time to test any you have.

Work Areas

In addition to the Jena 2.0 distribution, the metadata store will need to provide:

Mechanism for the vocabulary-vocabulary mappings
Remote access for clients such as Haystack
Suitable query language modifications
Store changes in support of the query language

The relevant part of the Jena architecture is that we have a stack:

Connection/Presentation API
Query language
Inference engines
Storage, including query optimization

Specifically, the query language does not have mapping (inference) features built in; it assumes that inference manifests itself through "virtual triples" in an RDF graph presented by the inference layer.
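The layering can be illustrated with a small Python sketch. This is a toy model, not Jena code: the class names, the find() method and the property names (vra:title, ims:title) are all hypothetical placeholders chosen for illustration. It shows a query function that only knows about find(), and an inference layer that presents extra "virtual" triples through that same interface.

```python
# Illustrative only: a toy model of the layering, not Jena's API.
# All class, function and property names here are hypothetical.

class BaseGraph:
    """Storage layer: holds only the triples that were explicitly asserted."""

    def __init__(self, triples):
        self.triples = set(triples)

    def find(self, s=None, p=None, o=None):
        # None acts as a wildcard for that position.
        for t in self.triples:
            if ((s is None or t[0] == s) and
                    (p is None or t[1] == p) and
                    (o is None or t[2] == o)):
                yield t


class InferenceGraph:
    """Inference layer: same find() interface, plus virtual triples."""

    def __init__(self, base, mapped_properties):
        self.base = base
        # Maps a property to the set of properties it should also appear as.
        self.mapped = mapped_properties

    def find(self, s=None, p=None, o=None):
        seen = set()
        # Asserted triples pass straight through.
        for t in self.base.find(s, p, o):
            seen.add(t)
            yield t
        # Virtual triples: restate asserted triples under mapped properties.
        for t in self.base.find(s, None, o):
            for q in self.mapped.get(t[1], ()):
                v = (t[0], q, t[2])
                if (p is None or q == p) and v not in seen:
                    seen.add(v)
                    yield v


def titles(graph, prop):
    """Query layer: knows nothing about inference, only about find()."""
    return sorted(o for (_, _, o) in graph.find(p=prop))


base = BaseGraph([
    ("item1", "vra:title", "Mona Lisa"),
    ("item2", "ims:title", "Intro to Art"),
])
# One-way mapping for the example: ims:title values also appear as vra:title.
inferred = InferenceGraph(base, {"ims:title": {"vra:title"}})

print(titles(base, "vra:title"))
print(titles(inferred, "vra:title"))
```

The same titles() function runs unchanged against either graph; only the graph it is handed decides whether mapped records are visible.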

Vocabulary-vocabulary mapping

One of the main areas of investigation for SIMILE is the use of metadata from many vocabularies (taxonomies, schema, ontologies). The first demonstrator requires queries to retrieve items that have metadata records from two vocabularies (IMS and VRA core v3).

There are three approaches:

(1) OWL to define equivalence
(2) Custom rule set
(3) Custom (hardwired) reasoner

(1) is ideal, in the sense that it is a standard. It may not be possible [untested: need to evaluate, particularly with respect to translation of elements in controlled vocabularies] and may yield an unmaintainable mapping, as we have no graphical tools available. (2) might be clearer, being a plain text file, and would be nearly the same in effect. (3) would allow parallel access to the database with some care but is much, much more work.

We will try (1) and (2).

Each of these would enable an RDQL query expressed in user terms only, without any knowledge of the underlying vocabularies used, to be executed by querying an inference graph providing the mapping rules.
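As a rough illustration of the custom rule set approach, the following Python sketch applies property-mapping rules by forward chaining until nothing new is derived. It is a toy, not the Jena rule engine: the rule format (from_prop, to_prop), the function names and the property names (vocabA:creator, vocabB:author) are assumptions made for the example and are not real VRA Core or IMS terms.

```python
# Toy forward-chaining over property-mapping rules; not the Jena rule engine.
# Property names are placeholders, not real VRA Core or IMS terms.

def forward_chain(triples, rules):
    """Apply rules of the form (from_prop, to_prop) until no new triples appear."""
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        for (s, p, o) in list(closure):
            for (src, dst) in rules:
                if p == src and (s, dst, o) not in closure:
                    closure.add((s, dst, o))
                    changed = True
    return closure


def equivalence(a, b):
    """An equivalence between two properties is a pair of one-way rules."""
    return [(a, b), (b, a)]


def subjects_with(triples, prop, value):
    """Subjects having the given property with the given value."""
    return sorted(s for (s, p, o) in triples if p == prop and o == value)


records = {
    ("rec1", "vocabA:creator", "Leonardo"),
    ("rec2", "vocabB:author", "Leonardo"),
}

# Without a mapping, a query in vocabulary A sees only vocabulary A records.
before = subjects_with(forward_chain(records, []), "vocabA:creator", "Leonardo")

# With the mapping, the same query also returns the vocabulary B record.
rules = equivalence("vocabA:creator", "vocabB:author")
after = subjects_with(forward_chain(records, rules), "vocabA:creator", "Leonardo")

print(before, after)
```

Because the rules run to a fixed point, chained 1-1 mappings across three vocabularies (as in demo #1b) are covered by the same loop.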

Remote Access

The remote access mechanism is assumed to be based on Joseki.

We need to ensure early on that the Haystack-Joseki connection can be achieved along the ideas already discussed of passing queries (conjunctive triple patterns - RDQL is based on such patterns).
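To make "conjunctive triple patterns" concrete, here is a minimal Python sketch of evaluating a list of patterns that must all hold at once. It is not RDQL syntax and not Jena's query engine; the "?name" variable convention, the function names and the ex: property names are assumptions for illustration only.

```python
# Toy evaluator for conjunctive triple patterns; not RDQL or Jena code.
# Terms starting with "?" are variables; everything else is a constant.

def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if is_var(term):
            # Bind the variable, or check it against its existing binding.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def solve(patterns, triples):
    """Return every variable binding that satisfies all patterns at once."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            for extended in [match_pattern(pattern, triple, binding)]
            if extended is not None
        ]
    return bindings


data = [
    ("item1", "ex:title", "Mona Lisa"),
    ("item1", "ex:creator", "Leonardo"),
    ("item2", "ex:title", "The Scream"),
    ("item2", "ex:creator", "Munch"),
]

# "Find the title of everything created by Leonardo."
query = [("?x", "ex:creator", "Leonardo"), ("?x", "ex:title", "?t")]
results = solve(query, data)
print(results)
```

A client only needs to ship the pattern list and receive the bindings back, which is the shape of exchange the Haystack-Joseki connection would have to support.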

It is possible that a converter from a Haystack-targeted language to RDQL, or direct access to Jena's underlying query execution mechanism, will be needed.

Query Language

The current RDQL query language is assumed to be sufficient, based on example query systems (e.g. OCLC RDF Interoperability Testbed and Semantic Merge Demo) discussed so far. However, query expression and efficiency can be improved:

Specific operators for text matching of literals (rather than general regular expressions)
Optimization of value constraints by executing in the database engine.
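The difference between the two improvements can be sketched in Python using an in-memory SQLite database. The single triples table is an assumption made for illustration and is not the Jena database schema; the function names are likewise hypothetical. The first function fetches all candidate rows and filters with a regular expression in the client, while the second hands a plain substring test to the database engine.

```python
# Sketch of executing a value constraint inside the database engine.
# The one-table layout is an assumption for illustration, not Jena's schema.
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("item1", "ex:title", "Mona Lisa"),
        ("item2", "ex:title", "The Scream"),
        ("item3", "ex:title", "Lisa Gherardini portrait study"),
    ],
)


def filter_in_application(conn, prop, needle):
    """Fetch every candidate row, then apply a regular expression in the client."""
    rows = conn.execute(
        "SELECT subj, obj FROM triples WHERE prop = ?", (prop,)
    ).fetchall()
    pattern = re.compile(re.escape(needle))
    return sorted(s for (s, o) in rows if pattern.search(o))


def filter_in_database(conn, prop, needle):
    """Let the database evaluate a substring test, so only matches are returned."""
    rows = conn.execute(
        "SELECT subj FROM triples WHERE prop = ? AND instr(obj, ?) > 0",
        (prop, needle),
    ).fetchall()
    return sorted(s for (s,) in rows)


print(filter_in_application(conn, "ex:title", "Lisa"))
print(filter_in_database(conn, "ex:title", "Lisa"))
```

Both functions give the same answer; the second transfers only the matching rows, which matters once the store holds tens of thousands of records.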

Storage

Execution of query constraints, especially for text contained in literal values, will be needed at some time.

Misc

The "no server downtime" requirement may need some engineering changes, especially for Joseki (it only reads configurations at startup). This does not seem to be serious.
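One possible shape for such a change, sketched in Python under stated assumptions: the server keeps a reference to the active mapping rules and swaps that reference when a new mapping is introduced, so queries in flight always see a complete rule set. This is a hypothetical design, not a description of how Joseki works; MappingHolder, answer() and the property names are invented for the example.

```python
# Sketch of swapping the active mapping rules while a server keeps answering.
# Hypothetical design; this is not how Joseki is implemented.
import threading


class MappingHolder:
    """Holds the current rule set; readers always see a complete rule set."""

    def __init__(self, rules=()):
        self._lock = threading.Lock()
        self._rules = tuple(rules)

    def current(self):
        with self._lock:
            return self._rules

    def replace(self, rules):
        # Build the new immutable rule set first, then swap the reference.
        new_rules = tuple(rules)
        with self._lock:
            self._rules = new_rules


def answer(holder, triples, prop):
    """Answer a query using whichever rule set is active at call time."""
    rules = holder.current()
    # Properties that map onto `prop` under the active rules, plus prop itself.
    wanted = {prop} | {src for (src, dst) in rules if dst == prop}
    return sorted(s for (s, p, o) in triples if p in wanted)


holder = MappingHolder()
triples = [
    ("rec1", "vocabA:creator", "Leonardo"),
    ("rec2", "vocabB:author", "Leonardo"),
]

before = answer(holder, triples, "vocabA:creator")
# Introduce the mapping without restarting anything.
holder.replace([("vocabB:author", "vocabA:creator")])
after = answer(holder, triples, "vocabA:creator")
print(before, after)
```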

Dependencies

In addition to validation against the CY2003 demo plan, the following items are needed as soon as possible:

Access to an RDF metadata corpus
RDF versions of the vocabularies to be used
Initial semantics of the mapping between vocabularies

We will also need:

Haystack-Joseki experimentation
Deployment/demo environment

Timescale

For the HP Semantic Web programme, the majority of the work is involved in the CY2003 demonstrator. Until the dependencies are available, it is not possible to estimate delivery times. It is assumed that the complete metadata store subsystem will need to be finished by 1 Dec 2003, with earlier versions necessary for integration and testing.

There are no currently identified roadblocks; a functional system, without mappings, could be produced by the SIMILE team now with the current Jena and Joseki toolkits. This could establish the base deployment environment and creation of the metadata corpus as no database schema changes are anticipated. Further demonstrator application work could be built with explicitly generated inferences using the existing Jena2 system; the mapping system does not change the query-based API.

In practice, performance improvements, like fast-path, constraint execution in the database, and possibly in the rules engine, would be sensible.

The main technical risk is in producing the mapping rules and in the performance of these rules. Until we have realistic data and realistic mappings (ones not chosen by us to suit our system!) we cannot judge how serious this risk is.