Data Integration: some enabling steps

Andy Seaborne <andy.seaborne@hp.com>
Dave Reynolds <der@hplb.hpl.hp.com>

September 2004

Introduction

The Semantic Web provides an architecture for the exchange and use of information. RDF provides the information exchange format, and OWL provides ways to classify information and to publish the conceptual models behind the information.

Life sciences is an information-rich domain, and there is strong value in being able to integrate data across its many heterogeneous sources (genomics, proteomics, molecular pathways, assay results, toxicology, clinical trials data, patent searches, publications, etc.). Publishing RDF-based information with today's W3C recommendations and today's toolkits is just the start of information exchange. It can be done, but the costs are significant, and new issues arise now that the base-level formats are in place. Addressing deployment issues for large datasets would lower the cost of building and publishing data while increasing the availability of information on the internet, within enterprise intranets, and across inter-company extranets.

In the rest of the paper we outline a few of the technology issues that we see as relevant in this domain.

The issues that arise are not fundamental limitations of the base recommendations; any system will have to address them. Community agreements and shared practice will allow information publishers to concentrate on the information and allow high-quality toolkits to be used and reused.

Some Issues with Large Datasets

Large datasets present deployment and data management issues. While query-based access allows remote working with such databases and gives client applications a common technology, the needs of dataset publishers and republishers are less well served.

Caching and Replication

Caching and replication are important for efficiency and continuity of working. These bring with them the need to maintain such copies. It should be possible to exchange dataset updates in a common fashion, with both event driven and polled verification. RSS 1.0 provides a common RDF-based framework for recording the changes but needs an agreed vocabulary for recording dataset updates.

A way to define interest profiles can further assist in monitoring for changes. Only changes that match the interests of the client (application or person) need be recorded in a customized RSS feed. This removes large numbers of unrelated changes and means the client is not limited by a general feed that records only the most recent changes.
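As a minimal sketch of the interest-profile idea, the following Python filters a list of change records against a client profile. The Change and InterestProfile types, their field names, and the example.org URIs are all assumptions made for illustration; they are not taken from RSS 1.0 or from any agreed vocabulary for dataset updates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One dataset update; the field names are illustrative only."""
    subject: str    # URI of the resource that changed
    predicate: str  # URI of the property that changed
    kind: str       # "add" or "remove"


@dataclass(frozen=True)
class InterestProfile:
    """Hypothetical profile: match on subject namespaces and/or properties."""
    subject_prefixes: tuple = ()
    predicates: frozenset = frozenset()

    def matches(self, change: Change) -> bool:
        # An empty criterion places no restriction on the change.
        subject_ok = (not self.subject_prefixes
                      or change.subject.startswith(self.subject_prefixes))
        predicate_ok = (not self.predicates
                        or change.predicate in self.predicates)
        return subject_ok and predicate_ok


def customized_feed(changes, profile):
    """Keep only the changes a client has declared an interest in."""
    return [c for c in changes if profile.matches(c)]


changes = [
    Change("http://example.org/protein/P1", "http://example.org/vocab#sequence", "add"),
    Change("http://example.org/patent/7", "http://example.org/vocab#title", "add"),
    Change("http://example.org/protein/P2", "http://example.org/vocab#name", "remove"),
]
profile = InterestProfile(subject_prefixes=("http://example.org/protein/",))
feed = customized_feed(changes, profile)
```

Here the feed keeps the two protein changes and drops the patent change. A real profile language would presumably need richer matching (for example on classes or query patterns), which is where community agreement would help.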

Ingestion Techniques

The creation of valuable datasets calls for care in ensuring that data incorporated into the datasets is correct. This is true even at the base RDF level: are URIs syntactically valid, and are bNode identifiers repeated within the same large RDF/XML file?
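The two base-level checks mentioned above might be sketched as follows, using only the Python standard library. This is an illustration under stated assumptions rather than a complete validator: it assumes URIs in rdf:about and rdf:resource are meant to be absolute (RDF/XML also permits relative references resolved against a base), it applies only a coarse character test, and the function names and example.org data are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlsplit

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Whitespace and characters that may not appear unescaped in a URI.
FORBIDDEN = re.compile(r'[\s<>"{}|\\^`]')


def uri_problem(uri):
    """Return a description of a basic syntax problem, or None if none is found."""
    if FORBIDDEN.search(uri):
        return "contains whitespace or a forbidden character"
    if not urlsplit(uri).scheme:
        return "is not absolute (no scheme)"
    return None


def check_rdfxml(text):
    """Collect suspicious URIs and count how often each bNode label is used."""
    bad_uris, node_ids = [], Counter()
    for element in ET.fromstring(text).iter():
        for attr in ("about", "resource"):
            uri = element.get(f"{{{RDF_NS}}}{attr}")
            if uri is not None and uri_problem(uri):
                bad_uris.append((uri, uri_problem(uri)))
        label = element.get(f"{{{RDF_NS}}}nodeID")
        if label is not None:
            node_ids[label] += 1
    return bad_uris, node_ids


sample = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                  xmlns:ex="http://example.org/vocab#">
  <rdf:Description rdf:about="http://example.org/gene/BRCA 1">
    <ex:encodes rdf:nodeID="b1"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="b1">
    <ex:seeAlso rdf:resource="http://example.org/protein/P38398"/>
  </rdf:Description>
</rdf:RDF>"""
bad_uris, node_ids = check_rdfxml(sample)
```

A bNode label used more than once is not necessarily an error, since repeating a label is how RDF/XML refers to the same node twice. The count only flags labels for review, for instance when a large file has been assembled from separately generated parts whose labels might collide.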

Many of the basic problems are common to other communities that also create significant, valuable datasets, such as the library community. Projects such as SIMILE have already explored such practical matters and have also experimented with various systems for large-scale RDF storage. Exchanging experience of building systems and information repositories with such communities would accelerate the deployment of a life sciences information web.

Update Language

The W3C RDF Data Access Working Group (DAWG) is covering remote access. Building on this work with an update language would enable toolsets to be developed that work with different datastores. This would lower development costs for information publishers because they would not need to develop store-specific applications to handle updates.

Distributed Query

The W3C RDF Data Access Working Group (DAWG) is developing a query language and remote access protocol for querying RDF databases over the web. This allows a client to issue queries against a single database. Beyond this, there should be at least a minimal way to issue queries across databases. Today, this requires custom application code in the client system. Building on the work of DAWG to allow the description of multi-database queries, while allowing a range of implementation approaches, would enable researchers to create powerful information tools more readily.
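The kind of custom client code needed today can be sketched in a few lines of Python. This is not the DAWG query language or protocol: in-memory sets of triples stand in for remote databases, a single triple pattern stands in for a query, and the store names and ex: identifiers are invented for the example.

```python
def match(store, pattern):
    """Yield triples in `store` matching `pattern`; None acts as a wildcard."""
    for triple in store:
        if all(p is None or p == t for p, t in zip(pattern, triple)):
            yield triple


def federated_match(stores, pattern):
    """Run the same pattern against every named store and merge the answers."""
    results = []
    for name, store in stores.items():
        # Keep the store name with each answer so its origin is not lost.
        results.extend((name, triple) for triple in match(store, pattern))
    return results


stores = {
    "pathways": {("ex:P1", "ex:participatesIn", "ex:Glycolysis")},
    "assays": {("ex:P1", "ex:testedIn", "ex:Assay42"),
               ("ex:P2", "ex:testedIn", "ex:Assay43")},
}
answers = federated_match(stores, ("ex:P1", None, None))
```

Each answer carries the name of the store that supplied it. Sending the whole pattern to every store is only one possible implementation approach; a shared way of describing multi-database queries would leave room for others, such as routing parts of a query to the stores known to hold the relevant data.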

Provenance

Aggregation of data into a single dataset reduces the effort needed to find information and enables researchers to make connections between different pieces of information. The value of such connections depends on knowing the origin of the information, how it was aggregated, and what processing has been applied to it.

The original RDF Recommendation provided reification as the basic building block for provenance systems. This has not been uniformly accepted, and many systems provide their own mechanisms for this, going outside the RDF Model.

It would be advantageous if some mechanism became standardized so that provenance information can be exchanged between datasets and as part of the data management of datasets, including caching and replication.

Reification is based on the individual RDF statement, yet the unit of information exchange is often some logical collection of statements making up a "data object". Provenance should work on this unit of information.

Reification is also seen as expensive in terms of storage. The reification quad can be compressed, but this still leaves the overhead of using the reification. For example, creating a group of statements using reification costs at least one extra statement per statement to record group membership. Systems may be able to reduce the impact of this, but investing in local solutions discourages exchange of information. A local solution also risks being superseded by some de facto standard, which would cause rework when repurposing the dataset; often such repurposing is not done, restricting the use of that information repository.
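The storage cost can be made concrete with a small count. The sketch below writes out the standard reification quad (rdf:type, rdf:subject, rdf:predicate, rdf:object) for each statement and adds one membership statement linking it to a group. The ex:partOf property, the group identifier, and the sample data are assumptions for the example, not part of any recommendation.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def reify(triple, statement_id):
    """Return the four statements of the reification quad for one triple."""
    s, p, o = triple
    return [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", s),
        (statement_id, RDF + "predicate", p),
        (statement_id, RDF + "object", o),
    ]


def group_with_reification(triples, group, member_property):
    """Reify every triple and record its membership of `group`."""
    extra = []
    for i, triple in enumerate(triples):
        statement_id = f"_:stmt{i}"
        extra.extend(reify(triple, statement_id))
        # One further statement per triple just to say which group it is in.
        extra.append((statement_id, member_property, group))
    return extra


statements = [
    ("ex:P1", "ex:name", "hexokinase"),
    ("ex:P1", "ex:participatesIn", "ex:Glycolysis"),
    ("ex:P1", "ex:testedIn", "ex:Assay42"),
]
extra = group_with_reification(statements, "ex:record1", "ex:partOf")
quad_cost = 4 * len(statements)        # reification quads
membership_cost = len(statements)      # group membership statements
```

For the three statements here, grouping adds 15 statements: 12 for the quads and 3 for membership. A store that compresses the quad internally can remove the first figure, but the per-statement membership cost remains.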

Because there has been much discussion of, and experimentation with, the lower levels of provenance, there is an opportunity to create community agreement around these experimental solutions.

Annotation

An interesting feature of the semantic web design approach is its emphasis on the open world assumption. This, together with the use of global URIs, makes it very easy to support rich third-party annotation of data items. Rather than the data model having to build in explicit placeholders for annotations, annotations become simply extra RDF assertions that link to the data sources by virtue of using the appropriate URIs. This makes it possible for the same data source to be viewed with different local and global annotations and cross-links by changing which annotation sources to include within the (distributed) query. There is no limitation on what can be included in such third-party annotations: the full expressive power of RDF and OWL is available.

This does add to the deployment issues of managing distributed queries and provenance tracking. As well as recording the provenance of information within a data store, it becomes necessary to be able to report which of the data sources included in the distributed query delivered a given subset of the information. Provenance becomes not just a "source" label but a traceback of an entire chain of sources.
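The difference between a source label and a traceback can be illustrated with a small data structure. In this sketch each source records the source it was derived from and the process that produced it; the Source type, its fields, and the source names are hypothetical and chosen only to show the shape of a chain.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    """A data source plus a link to the source it was derived from, if any."""
    name: str
    derived_from: Optional["Source"] = None
    process: str = "published"  # how this source was produced


def traceback(source):
    """Walk from a source back to its origin, nearest source first."""
    chain = []
    while source is not None:
        chain.append((source.name, source.process))
        source = source.derived_from
    return chain


origin = Source("lab-assay-db")
curated = Source("curated-annotations", derived_from=origin, process="curated")
cache = Source("local-cache", derived_from=curated, process="replicated")
chain = traceback(cache)
```

A query response labelled only "local-cache" would hide the curation step and the original database; the chain makes both visible. A linear chain is the simplest case, since aggregation from several sources would turn it into a graph.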

The user interface challenges are just as significant as the technical challenges. Allowing users to control the way sources are combined and to trace back the provenance of the sources in a transparent and intuitive fashion is not easy. It is hard to find the right balance between overloading with too much detailed meta level information and risking unintentionally misleading the user into treating all parts of a query response as equally definitive.

Ontology Management

Use of shared vocabularies is a key to successful information integration. The W3C OWL recommendations provide a solid foundation for the publication of ontologies and several key life science ontologies are already available in RDF or directly in OWL.

As with dataset publication, the scale and rate of evolution of ontologies in the life sciences raise practical deployment issues.

Firstly, there is a need for agreed best practice on version management of ontologies in a semantic web setting, for example policies for when concept URIs should be changed. This is an area of active interest within the W3C Semantic Web Best Practices and Deployment Working Group.

Secondly, there is the issue of remote access to ontology structures. Life science ontologies are too large for routine replication and downloading to be feasible, so remote access protocols are required. The DAWG query language will be a good foundation for this, but querying at the level of RDF triples is not always convenient, and developers might want additional support for the higher-level OWL abstractions. The interaction of remote access with version control is outside the DAWG remit but will again benefit from shared best practice solutions.
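To illustrate why triple-level access can be inconvenient, consider the question "what are all the superclasses of this class?". Answered from triples alone, it takes a loop of repeated lookups, sketched below over an in-memory list. The ex: class names are invented, and the list stands in for whatever remote store would actually hold the ontology.

```python
SUBCLASS_OF = "rdfs:subClassOf"


def superclasses(triples, cls):
    """All direct and indirect superclasses of `cls`, via repeated triple lookups."""
    found, frontier = set(), [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            # Each pass is one triple-level lookup for the current class.
            if s == current and p == SUBCLASS_OF and o not in found:
                found.add(o)
                frontier.append(o)
    return found


ontology = [
    ("ex:Kinase", SUBCLASS_OF, "ex:Enzyme"),
    ("ex:Enzyme", SUBCLASS_OF, "ex:Protein"),
    ("ex:Protein", SUBCLASS_OF, "ex:Macromolecule"),
    ("ex:Kinase", "rdfs:label", "kinase"),
]
ancestors = superclasses(ontology, "ex:Kinase")
```

Against a remote store, each pass of the loop would be a separate round trip. An access protocol that understood the class hierarchy could answer the same question in one request, which is the kind of higher-level support suggested above.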

Who we are

The Semantic Web Research Group of HP Labs is based in Bristol, UK.

One output of the group is the Jena RDF Framework, including its RDF publishing server Joseki. Jena provides comprehensive support for parsing, storing and accessing RDF data in a Java environment. It provides a fine-grained API to RDF, with storage implementations ranging from in-memory to relational databases such as MySQL and Oracle.

Jena also provides support for working with and using ontologies. It has an ontology API for OWL, RDFS and DAML+OIL. It has reasoners for RDFS and for various profiles of OWL (each making different tradeoffs), and an external connection to Description Logic reasoners through a DIG interface. Jena has a large and active user community.

Open source tools published under a BSD-style licence:

Jena - an RDF Framework for Java
Joseki - an RDF publishing server
Nuin - an agent platform for Jena