Position Paper for Semantic Web for Life Sciences Workshop

Call for a Gene Definition Ontology

 

Jianjun Zhang

Genentech, Inc

 

The Semantic Web holds the promise of intelligent systems on the Web. Relying on machine-readable resource descriptions and web ontologies, a software agent can automatically retrieve, aggregate, and analyze information scattered around the world. This is especially useful in the life sciences, where annotations and experimental findings are inherently distributed. Many individual manual efforts have been made to integrate such data for knowledge discovery. With the advent of the Semantic Web, it is conceivable that most of these efforts, if not all, may eventually be achieved through automatic means. This would greatly enhance and accelerate the life sciences discovery process.

 

For such knowledge discovery systems to function, biological information on the web must be presented in a way that makes correlation and inference across multiple independent sources possible. To reach this goal we need, at the most basic level, common syntactic constructs so that the information can be machine-processed. Furthermore, for inference and reasoning to happen, we also need agreement on the meanings of the terms used in the field. Defining a commonly used Web Ontology for the life sciences would go a long way toward building this foundation.

 

Many biology-specific ontology efforts have been attempted, and many projects are ongoing in this arena. One of the most widely used ontologies within the biomedical research communities is the Gene Ontology (GO). However, as a developer of bioinformatics software and tools, I have found that the most needed ontology is probably a universal definition of genes. Many organizations have spent much time and manpower linking various findings on the same gene together, simply because the gene or its product is named differently by different sources. If we had a universal gene name definition that was adopted by the various resources, this kind of task would become trivial. Of course, a smart software agent would still need to sort out the aggregated information and perform inference and reasoning, but at least the first step would be much easier. Existing ontologies, such as the Gene Ontology, define terms that describe a gene's biological properties, but not the genes themselves. As a first version of my wish list, a Gene Definition Ontology should provide the following features:

 

1. Define clearly what a gene is.

 

2. Define universally unique identifiers for genes and their products. The Life Science Identifier (LSID) already provides a syntactic construct that may be used to build such identifiers. The identifiers could be developed by a consortium and adopted by participating organizations. Alternatively, a service could be built on top of an LSID implementation to link other identifiers to this set of unique identifiers.

 

3. Allow relationships between genes to be represented. For example, the product of gene A up-regulates the transcription of gene B. Interaction with GO terms should be possible in this respect.

 

4. Allow relationships between gene definitions and sequences, including genomic sequences, cDNAs, and probesets, to be represented. Sequence ontology projects are already under way, and coordination among the ontologies is important to ensure compatibility and interoperability.

 

5. The ontology may cover multiple species, and relationships between genes of different species (e.g. orthologs) may be represented. Again, interaction with existing, widely used taxonomies is desirable.

 

6. Define a core set of terms for the life sciences domain. GO terms probably already cover many of them, but more may be needed. These terms would be used to describe relationships among entities in this domain, much as Dublin Core metadata does for publishing.

 

7. Adding annotations (new properties on entities or new relationships between entities) should be permitted in a distributed fashion. Well-known services may be provided so that the additions can be checked for well-formedness. Modifications to the unique identifiers themselves, on the other hand, should be centrally controlled by a governing body.
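
Features 2, 3, and 7 can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not a proposed implementation: the authority genedef.example.org, the namespace "gene", the source names sourceA and sourceB with their local identifiers, and the predicate upRegulatesTranscriptionOf are all hypothetical. The only thing taken from LSID itself is the general URN layout urn:lsid:authority:namespace:object[:revision].

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class Lsid:
    """Identifier of the form urn:lsid:authority:namespace:object[:revision]."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None

    @classmethod
    def parse(cls, text: str) -> "Lsid":
        parts = text.split(":")
        prefix = [p.lower() for p in parts[:2]]
        if prefix != ["urn", "lsid"] or len(parts) not in (5, 6) or not all(parts[2:5]):
            raise ValueError(f"not a well-formed LSID: {text!r}")
        revision = parts[5] if len(parts) == 6 and parts[5] else None
        return cls(parts[2], parts[3], parts[4], revision)

    def __str__(self) -> str:
        fields = ["urn", "lsid", self.authority, self.namespace, self.object_id]
        if self.revision:
            fields.append(self.revision)
        return ":".join(fields)


class GeneRegistry:
    """Hypothetical registry: identifiers are minted centrally, while
    aliases and relationships may be contributed by anyone."""

    def __init__(self, authority: str) -> None:
        self.authority = authority
        self._minted: Set[Lsid] = set()
        self._aliases: Dict[Tuple[str, str], Lsid] = {}
        self._triples: Set[Tuple[Lsid, str, Lsid]] = set()

    def mint(self, object_id: str) -> Lsid:
        # Centrally controlled step: only the registry creates identifiers.
        gene = Lsid(self.authority, "gene", object_id)
        self._minted.add(gene)
        return gene

    def link(self, source: str, local_id: str, gene: Lsid) -> None:
        # Map a source-specific name or accession to the universal identifier.
        self._require_minted(gene)
        self._aliases[(source, local_id)] = gene

    def resolve(self, source: str, local_id: str) -> Optional[Lsid]:
        return self._aliases.get((source, local_id))

    def relate(self, subject: Lsid, predicate: str, obj: Lsid) -> None:
        # Distributed annotation, checked for well-formedness:
        # both ends must be known identifiers.
        self._require_minted(subject)
        self._require_minted(obj)
        self._triples.add((subject, predicate, obj))

    def objects(self, subject: Lsid, predicate: str) -> Set[Lsid]:
        return {o for s, p, o in self._triples if s == subject and p == predicate}

    def _require_minted(self, gene: Lsid) -> None:
        if gene not in self._minted:
            raise ValueError(f"unknown gene identifier: {gene}")


# Two sources name the same gene differently; both resolve to one identifier.
registry = GeneRegistry("genedef.example.org")
gene_a = registry.mint("0000001")
gene_b = registry.mint("0000002")
registry.link("sourceA", "ABC1", gene_a)
registry.link("sourceB", "12345", gene_a)
registry.relate(gene_a, "upRegulatesTranscriptionOf", gene_b)
print(registry.resolve("sourceB", "12345"))  # urn:lsid:genedef.example.org:gene:0000001
```

In this sketch, linking findings on the same gene reduces to a dictionary lookup once every source maps its local name to the shared identifier. In an actual ontology the relationships would be expressed with ontology terms rather than free-form strings.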

 

It appears that OWL-DL is probably sufficient to define such an ontology. However, issues may arise during implementation that require a more expressive language. I hope that more insights can be gained through discussions at the workshop. I am also interested in alternative approaches that may address the same need for a universal gene definition ontology.