Lightweight Knowledge Aggregation With RDF/OWL

Position Paper for W3C Semantic Web and Life-Science Workshop, SWLS’04

Andreas Schneider

Department for Informatics, University of Zurich

Zurich, Switzerland

andreas.schneider@access.unizh.ch

A Local Semantic Web Index for PubMed Retrieval

Searching for peer-reviewed articles on a topic is an important part of the workflow in scientific research. Even for an experienced researcher who has mastered Boolean expressions and the advanced features of PubMed, the task of systematically collecting relevant articles can be tedious and time-consuming.

We explore the potential of Semantic Web-based systems to improve this process. Given the absence of RDF resources available for this problem, we build RDF resources locally and then use them as index structures for further exploration and searching.

In this way a local semantic index is built semi-automatically to explicitly represent a concept within a domain, with the purpose of improving a retrieval system's ability to retrieve further documents related to that concept. The RDF models play a similar role for the user as Web directories or a personal collection of bookmarks.

Building Model Based Systems

Based on this idea of a local Semantic Web, the Model Based Systems (MBS) architecture has been designed as a flexible framework. It combines Semantic Web technology with full-text indexing to build specialized retrieval and knowledge aggregation systems.

The Model Based Systems architecture includes the following components:

Gatherer
ModelBuilder
Indexer
QueryEngine
ModelBrowser
OntologyEditor

All these components use a common RDF Model Base repository. In addition to access to the model via RDQL, the Indexer component maintains a full-text index on all RDF resource nodes.
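As an illustration only, the combination of a shared model base with a full-text index on its resource nodes can be sketched as follows. This is a minimal, standard-library Python sketch and not the prototype's implementation, which uses Jena and Lucene; the class name `ModelBase`, its methods, and the `ex:` identifiers are hypothetical, and the wildcard `match` method merely stands in for RDQL-style pattern queries.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ModelBase:
    """Toy stand-in for a shared RDF model base (hypothetical, not Jena)."""

    def __init__(self):
        self.triples = set()            # (subject, predicate, object) statements
        self.index = defaultdict(set)   # token -> subjects whose literals contain it

    def add(self, subject, predicate, obj, literal=False):
        """Store a statement; literal objects are also added to the full-text index."""
        self.triples.add((subject, predicate, obj))
        if literal:
            for token in tokenize(obj):
                self.index[token].add(subject)

    def match(self, subject=None, predicate=None, obj=None):
        """Pattern query over the statements; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if subject in (None, t[0])
            and predicate in (None, t[1])
            and obj in (None, t[2])
        )

    def search(self, text):
        """Return the resource nodes whose indexed literals contain every query token."""
        tokens = tokenize(text)
        if not tokens:
            return set()
        return set.intersection(*(self.index.get(tok, set()) for tok in tokens))
```

A full-text hit returned by `search` identifies a resource node, which can then serve as the starting point for a structured `match` query over the same model.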

MBS prototype: A PubMed Retrieval System

To validate the Model Based Systems (MBS) architecture, a prototype system has been implemented. The Jena Semantic Web framework is used to implement Semantic Web technology. Jakarta Lucene is used for the Indexer component. In our view, the three components connected to the user interface are of great importance. They are implemented with Apple's Cocoa Java framework for Mac OS X.

Given a list of target gene names, this system is used to retrieve ranked lists of PubMed articles that are related to the user's project. RDF resource nodes for the following resource types play an important role in this process:

PubMed. A local library of relevant PubMed articles.
Entrez Gene. Dynamically mirrored NCBI information on selected gene names.
Mouse Genome Informatics. Complete list of marker names for the mouse genome.
Gene Ontology. Complete list of gene ontology terms.

Additional resource nodes are built for information on queries, ranking information, and term vectors representing local concepts of interest.

Such a system has the potential to speed up the literature research process on PubMed. New information is automatically aggregated and can be efficiently reused for further searching.

Experiences/Results

While the evaluation of our MBS prototype is still ongoing, first results encourage us to further investigate the benefits of Semantic Web technologies for specialized retrieval and knowledge aggregation systems. In particular, we think that the RDF data model is well suited to support research activities in a dynamic environment. In our view, one reason for this is that RDF relaxes the identity constraint put on keys by the relational model.

Some of the benefits of a Semantic Web approach for a specialized retrieval process like PubMed literature research include:

Explicit Semantic Indexing. A local RDF model can be built both of the user's information need and of the resources retrieved in previous retrieval cycles. This semantic model is then reused as a semantic index, which can be accessed to prepare for and proceed with further retrieval cycles.
Ontology Driven Configuration. A simple OWL ontology editor interface allows the user to configure the retrieval process with concepts specifying his information needs.
Combining Search and Browse Modes. Building a local model on a topic and making it accessible for browsing with a model browser interface enhances a user's possibilities to explore that topic. A full-text search engine helps to find interesting starting points for browsing.
Local and Global Context Analysis. Automated local and global context analysis depends on means to let the user inspect the document space and provide him with clues for further search cycles. Our prototype includes two such means. The complete set of Gene Ontology terms is made available both to the user and to search engines. Information from previous retrievals is used to expand gene names with their synonyms.
Enabling Efficient Retrieval. Using a local Semantic Web model, we demonstrate the feasibility of a retrieval system combining high precision and recall. As new articles arrive at the system, a ranking algorithm calculates their degree of matching with a local concept. This concept is the OWL/RDF representation of the user's information need.
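The last two points, synonym expansion and ranking against a local concept, can be illustrated with a small sketch. The paper does not specify the ranking algorithm, so cosine similarity over bag-of-words term vectors is an assumption made here purely for illustration; the function names, the gene name `gena`, and its synonym `gena1` are likewise hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text, synonyms=None):
    """Bag-of-words vector; tokens with a known synonym entry map to a canonical name."""
    synonyms = synonyms or {}
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(synonyms.get(tok, tok) for tok in tokens)


def cosine(a, b):
    """Cosine similarity of two term vectors (0.0 if either vector is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank(concept, articles, synonyms=None):
    """Order article ids by similarity to the concept text, dropping non-matches.

    Assumed scoring: cosine similarity; ties are broken by article id.
    """
    concept_vec = term_vector(concept, synonyms)
    scored = [
        (cosine(concept_vec, term_vector(text, synonyms)), art_id)
        for art_id, text in articles.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [art_id for score, art_id in scored if score > 0]
```

With a synonym table that maps `gena1` to `gena`, an article mentioning only the synonym still matches a concept stated in terms of the canonical gene name; without the table it would be dropped from the ranking.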

Need For Coordination - Modifications

The following use cases from RDF Data Access Use Cases and Requirements are relevant to the tasks our prototype addresses:

3.4 subgraph results,
3.5 local queries,
3.10 result limits,
4.2 aggregation graphs,
4.3 non-existent triples,
4.5 aggregate queries,
4.5.1 querying multiple sources,
4.6 additional semantic information,
4.8 literal search,
4.10 addressable query results

At this point, we are not aware of a new need for coordination or of any modifications needed to the specifications.

Future Work

At this stage we have identified a number of open issues whose solution will improve the Model Based Systems architecture. They might be of interest to people engaged in further developing the Semantic Web specifications.

Similarity measures for ontologies. Measures should support efficient establishment of similarity relationships between two resources.
Methods for effective model maintenance. With our approach the local model may grow very fast. Automated maintenance mechanisms should be added to reduce the amount of superfluous information in the system. Maintenance mechanisms might become even more important with the addition of automatic statements made by reasoning mechanisms.
Improving the user interface. In a Semantic Web browser the user is confronted with three URIs simultaneously as well as with unsorted lists of statements from unified resources. This is a challenge for the design of intuitive user interfaces.
Runtime ontology compilation. Runtime translation from ontology documents to Java classes would be beneficial for dynamic configuration of system behaviour.

Acknowledgements

This position paper is part of the author's diploma thesis. He would like to thank his supervisors Esther Kaufmann and Abraham Bernstein (Dynamic and Distributed Systems Group) for discussions leading to the submission of this position paper. Special thanks go to Ned Mantei (Institute of Cell Biology, Swiss Federal Institute of Technology) for his support with Life-Science domain knowledge.