Andy Seaborne <andy.seaborne@hp.com>
August 2003
This note is a follow-on from the SIMILE plenary of 23-24 July 2003. It describes the work requirements of the plan as it impacts Jena and Jena's relational database-backed storage subsystem provided by Genesis.
The main deliverable, for a SIMILE demonstrator at the end of December 2003, is a metadata store with query access and a system for mappings between identified vocabularies.
SIMILE is an opportunity to further increase the maturity and robustness of HP's semantic web tools and to add new features for usability and performance by regarding SIMILE as a key user group. While it may be possible to produce some features by exploiting the SIMILE environment (example: forward-chaining over metadata in the SIMILE data ingestion process), our approach will be to produce reusable technology which will form part of the Jena open source releases. This is compatible with the SIMILE project's own open source objective.
Description of demonstrators included in the plenary write-up.
See also the SIMILE Work Plan assignments.
Details in the plenary write-up but in summary:
Demo #1a is the most significant in work requirements.
The Semantic Web Programme is to provide a metadata store built on Jena to meet the functional and performance requirements of the demonstrators.
The following assumptions are currently made for the purposes of this note.
Unknowns:
We need to confirm these assumptions and pin down the unknowns. There are probably others. Now would be a good time to test any you have.
In addition to the Jena 2.0 distribution, the metadata store will need to provide:
The relevant part of the Jena architecture is that we have a stack:
specifically, the query language does not have mapping (inference) features built-in; it assumes that inference manifests itself through "virtual triples" in an RDF graph presented by the inference layer.
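The idea of an inference layer presenting "virtual triples" can be sketched as below. This is a minimal illustration in Python, not Jena's API: the class names `BaseGraph` and `MappedGraph`, the `find` method, and the property names `vocabA:title`, `vocabB:title` and `user:title` are all hypothetical, chosen only for the example. The point is that the query layer sees one graph interface and cannot tell stored triples from inferred ones.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]


class BaseGraph:
    """Stored triples only (stands in for the database-backed layer)."""

    def __init__(self, triples: Iterable[Triple]) -> None:
        self._triples = set(triples)

    def find(self, s: Optional[str] = None, p: Optional[str] = None,
             o: Optional[str] = None) -> Iterator[Triple]:
        # None acts as a wildcard for that position.
        for t in self._triples:
            if ((s is None or t[0] == s) and
                    (p is None or t[1] == p) and
                    (o is None or t[2] == o)):
                yield t


class MappedGraph:
    """Hypothetical inference layer: same find() interface as BaseGraph,
    but also yields virtual triples derived from a property mapping."""

    def __init__(self, base: BaseGraph, property_map: Dict[str, str]) -> None:
        self._base = base
        self._map = property_map  # source property -> user-level property

    def find(self, s: Optional[str] = None, p: Optional[str] = None,
             o: Optional[str] = None) -> Iterator[Triple]:
        seen = set()
        # Stored triples pass through unchanged.
        for t in self._base.find(s, p, o):
            seen.add(t)
            yield t
        # Virtual triples: rewrite mapped source properties on the fly.
        for src, dst in self._map.items():
            if p is None or p == dst:
                for subj, _, obj in self._base.find(s, src, o):
                    virtual = (subj, dst, obj)
                    if virtual not in seen:
                        seen.add(virtual)
                        yield virtual


# Made-up sample data: two records described in two different vocabularies.
base = BaseGraph([
    ("item1", "vocabA:title", "Bridge"),
    ("item2", "vocabB:title", "Tower"),
])
mapped = MappedGraph(base, {
    "vocabA:title": "user:title",
    "vocabB:title": "user:title",
})

stored_hits = sorted(base.find(p="user:title"))
virtual_hits = sorted(mapped.find(p="user:title"))
```

Querying `base` for `user:title` finds nothing, while the same query against `mapped` returns one triple per item. Nothing is written back to the store, which is why the query language itself needs no mapping features.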
One of the main areas of investigation for SIMILE is the use of metadata from many vocabularies (taxonomies, schema, ontologies). The first demonstrator requires queries to retrieve items that have metadata records from two vocabularies (IMS and VRA core v3).
There are 3 approaches:
(1) is ideal, in the sense it is a standard. It may not be possible [untested: need to evaluate, particularly with respect to translation of elements in controlled vocabularies] and may yield an unmaintainable mapping as we have no graphical tools available. (2) might be clearer as a text file and be nearly the same. (3) would allow parallel access to the database with some care but is much, much more work.
We will try (1) and (2).
Each of these would enable an RDQL query expressed in user terms only, without any knowledge of the underlying vocabularies used, to be executed by querying an inference graph providing the mapping rules.
The remote access mechanism is assumed to be based on Joseki.
We need to ensure early on that the Haystack-Joseki connection can be achieved along the ideas already discussed of passing queries (conjunctive triple patterns - RDQL is based on such patterns).
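To make "conjunctive triple patterns" concrete, the sketch below evaluates a list of patterns that must all match with consistent variable bindings. It is a naive illustration in Python under stated assumptions, not the RDQL engine or the Joseki protocol: the function names `match_pattern` and `solve`, the `?name` variable syntax as plain strings, and the `ex:` sample data are all hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Sequence, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]


def match_pattern(pattern: Triple, triple: Triple,
                  binding: Binding) -> Optional[Binding]:
    """Try to match one pattern against one triple.

    Terms starting with '?' are variables; anything else must match exactly.
    Returns an extended copy of the binding, or None if there is a conflict.
    """
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check it against its existing value.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def solve(patterns: Sequence[Triple], triples: List[Triple],
          binding: Optional[Binding] = None) -> Iterator[Binding]:
    """Yield every binding that satisfies all patterns (a conjunction)."""
    binding = {} if binding is None else binding
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for t in triples:
        extended = match_pattern(first, t, binding)
        if extended is not None:
            yield from solve(rest, triples, extended)


# Made-up sample data.
data = [
    ("ex:doc1", "ex:title", "Bridge"),
    ("ex:doc1", "ex:creator", "ex:alice"),
    ("ex:doc2", "ex:title", "Tower"),
    ("ex:doc2", "ex:creator", "ex:bob"),
]

# "Titles of items created by ex:alice", as two patterns sharing ?item.
query = [
    ("?item", "ex:title", "?title"),
    ("?item", "ex:creator", "ex:alice"),
]

results = list(solve(query, data))
```

A client and server that agree on this pattern form only need to exchange the list of patterns and the resulting bindings, which is the sense in which a pattern-based connection is independent of the concrete query syntax on either side.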
It is possible that a converter from a Haystack-targeted language to RDQL, or a way to access Jena's underlying query execution mechanism directly, will be needed.
The current RDQL query language is assumed to be sufficient, based on example query systems (e.g. OCLC RDF Interoperability Testbed and Semantic Merge Demo) discussed so far. However, query expression and efficiency can be improved:
Execution of query constraints, especially for text contained in literal values, will be needed at some time.
In addition to validation against the CY2003 demo plan, the following items are needed as soon as possible:
We will also need:
For the HP Semantic Web programme, the majority of the work is involved in the CY2003 demonstrator. Until the dependencies are available, it is not possible to estimate delivery times. It is assumed that the complete metadata store subsystem will need to be finished by 1 Dec 2003, with earlier versions necessary for integration and testing.
There are no currently identified roadblocks; a functional system, without mappings, could be produced by the SIMILE team now with the current Jena and Joseki toolkits. This could establish the base deployment environment and creation of the metadata corpus as no database schema changes are anticipated. Further demonstrator application work could be built with explicitly generated inferences using the existing Jena2 system; the mapping system does not change the query-based API.
In practice, performance improvements such as a fast path and constraint execution in the database, and possibly in the rules engine, would be sensible.
The main technical risk is in producing the mapping rules and in the performance of these rules. Until we have realistic data and realistic mappings (ones not chosen by us to suit our system!) we do not know how large this risk is.