Lightweight Knowledge Aggregation With RDF/OWL

Position Paper for W3C Semantic Web and Life-Science Workshop, SWLS’04

Andreas Schneider

Department for Informatics, University of Zurich

Zurich, Switzerland

andreas.schneider@access.unizh.ch

A Local Semantic Web Index for PubMed Retrieval

Searching for peer-reviewed articles on a topic is an important part of the workflow in scientific research. Even for an experienced researcher who has mastered Boolean expressions and the advanced features of PubMed, the task of systematically collecting relevant articles can be tedious and time-consuming.

We explore the potential of Semantic Web-based systems to improve this process. Because no RDF resources are available for this problem, we build RDF resources locally and then use them as index structures for further exploration and searching.

In this way a local semantic index is built semi-automatically to explicitly represent a concept within a domain, with the purpose of improving a retrieval system's ability to retrieve further documents related to that concept. For the user, the RDF models play a role similar to that of Web directories or a personal collection of bookmarks.

Building Model Based Systems

Based on this idea of a local Semantic Web, the Model Based Systems (MBS) architecture has been designed as a flexible framework. It combines Semantic Web technology with full-text indexing to build specialized retrieval and knowledge aggregation systems.

The Model Based Systems architecture includes the following components:

[Figure: architecture.jpg, components of the Model Based Systems architecture]

All these components use a common RDF Model Base repository. In addition to providing access to the model through RDQL, the Indexer component maintains a full-text index over all RDF resource nodes.
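The combination of a shared model and a full-text index over its resource nodes can be sketched as follows. This is a minimal illustration in plain Python, not the prototype's Jena and Lucene code: the class `ModelBase`, its methods, the `urn:example:` URIs, and the property names are all hypothetical, and every statement object is treated as literal text.

```python
from collections import defaultdict
import re


class ModelBase:
    """Toy stand-in for a shared RDF model with a full-text index (hypothetical)."""

    def __init__(self):
        self.triples = set()           # (subject, predicate, object) statements
        self.index = defaultdict(set)  # lower-cased token -> subjects whose values contain it

    def add(self, subject, predicate, obj):
        """Store a statement and index the object text under its subject node."""
        self.triples.add((subject, predicate, obj))
        for token in re.findall(r"\w+", obj.lower()):
            self.index[token].add(subject)

    def match(self, subject=None, predicate=None, obj=None):
        """Pattern query: None acts as a wildcard, loosely like a query variable."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )

    def search(self, text):
        """Full-text lookup: resource nodes whose values contain every query token."""
        tokens = re.findall(r"\w+", text.lower())
        if not tokens:
            return set()
        hits = [self.index.get(token, set()) for token in tokens]
        return set.intersection(*hits)


# Hypothetical sample data.
base = ModelBase()
base.add("urn:example:article/1", "title", "Notch signaling in neural development")
base.add("urn:example:article/2", "title", "Wnt signaling and cell fate")
base.add("urn:example:article/1", "mentionsGene", "NOTCH1")
```

Structured access (`match`) and keyword access (`search`) operate on the same set of statements, which is the property the common repository is meant to provide.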

MBS prototype: A PubMed Retrieval System

To validate the Model Based Systems (MBS) architecture, a prototype system has been implemented. The Jena Semantic Web framework is used for the Semantic Web functionality, and Jakarta Lucene is used for the Indexer component. In our view, the three components connected to the user interface are of great importance. They are implemented with Apple's Cocoa Java framework for Mac OS X.

Given a list of target gene names, this system is used to retrieve ranked lists of PubMed articles that are related to the user's project. RDF resource nodes for the following resource types play an important role in this process:

Additional resource nodes are built for information on queries, ranking information, and term vectors representing local concepts of interest.
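How a term vector for a local concept could be used to rank candidate articles is sketched below. The paper does not specify the term weighting or the similarity measure, so raw term counts and cosine similarity are assumptions here, and the function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a bag-of-words vector of raw term counts (assumed weighting)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term vectors; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def rank(concept, articles):
    """Return article ids ordered by decreasing similarity to the concept vector."""
    scores = {aid: cosine(concept, term_vector(text)) for aid, text in articles.items()}
    return sorted(scores, key=lambda aid: (-scores[aid], aid))


# Hypothetical concept description and article texts.
concept = term_vector("notch signaling neural development")
articles = {
    "a1": "Notch signaling in neural development",
    "a2": "Wnt signaling and cell fate",
    "a3": "Protein folding kinetics",
}
ranked = rank(concept, articles)
```

Storing the concept vector as a resource node would let the same vector be reused for later queries instead of being rebuilt from scratch.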

Such a system has the potential to speed up the literature research process on PubMed. New information is automatically aggregated and can be efficiently reused for further searching.

Experiences/Results

While the evaluation of our MBS prototype is still ongoing, first results encourage us to further investigate the benefits of Semantic Web technologies for specialized retrieval and knowledge aggregation systems. In particular, we think that the RDF data model is well suited to support research activities in a dynamic environment. In our view, one reason for this is that RDF relaxes the identity constraint that the relational model puts on keys.
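One possible reading of this point is illustrated below: statements about the same resource URI, collected from independent sources, can be aggregated by simple set union, and a property may carry several values without violating any key constraint. This is a sketch under that assumption, not the author's implementation; the functions `merge` and `describe`, the URIs, and the property names are hypothetical.

```python
def merge(*models):
    """Aggregate models by set union; statements sharing a subject URI simply accumulate."""
    merged = set()
    for model in models:
        merged |= model
    return merged


def describe(model, subject):
    """Collect all property values for one resource; a property may have several values."""
    description = {}
    for s, p, o in sorted(model):
        if s == subject:
            description.setdefault(p, []).append(o)
    return description


# Two independently built models about the same (hypothetical) resource URI.
gene = "urn:example:gene/NOTCH1"
pubmed_model = {
    (gene, "label", "NOTCH1"),
    (gene, "mentionedIn", "urn:example:article/1"),
}
local_model = {
    (gene, "label", "NOTCH1"),
    (gene, "mentionedIn", "urn:example:article/2"),
    (gene, "note", "target gene"),
}

combined = merge(pubmed_model, local_model)
```

The duplicate `label` statement collapses into one, while the two `mentionedIn` values coexist; no schema change or key reconciliation is needed when a new source is added.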

Some of the benefits of a Semantic Web approach for a specialized retrieval process like PubMed literature research include:

Need For Coordination - Modifications

The following use cases from the RDF Data Access Use Cases and Requirements document are to some extent relevant to the tasks our prototype addresses:

At this point, we are not aware of any new need for coordination or of any required modifications to the specifications.

Future Work

At this stage, we have identified a number of open issues whose resolution would improve the Model Based Systems architecture. They might be of interest to people engaged in further developing the Semantic Web specifications.

Acknowledgements

This position paper is part of the author's diploma thesis. He would like to thank his supervisors Esther Kaufmann and Abraham Bernstein (Dynamic and Distributed Systems Group) for discussions leading to the submission of this position paper. Special thanks go to Ned Mantei (Institute of Cell Biology, Swiss Federal Institute of Technology) for his support with Life-Science domain knowledge.