Position Paper for Semantic Web for Life Sciences Workshop
Call for a Gene Definition Ontology
Jianjun Zhang
Genentech, Inc
The Semantic Web holds the promise of intelligent systems on
the Web. Relying on machine-readable resource descriptions and web ontologies, a
software agent can automatically retrieve, aggregate, and analyze information
scattered around the world. This is especially useful in the life sciences
field, where annotations and experimental findings are inherently distributed. Many
individual manual efforts have been carried out to integrate such data for
knowledge discovery purposes. With the advent of the Semantic Web, it is
conceivable that most of these efforts, if not all, may be automated
in the future. This would greatly enhance and accelerate the life sciences
discovery process.
For such knowledge discovery systems to be
functional, the biological information on the Web must be presented in such a way
that correlation and inference across multiple independent sources are possible.
To reach this goal, we need, at the very least, common syntactic constructs
so that the information can be machine-processed. Furthermore, for
inference and reasoning to happen, we also need agreement on the meanings of the
terms used in the field. Defining a commonly used Web ontology for the life
sciences would go a long way toward building this foundation.
Many biology-specific ontology efforts have been attempted,
and many projects are ongoing in this arena. The Gene Ontology (GO) is among the
most widely used ontologies within the biomedical research community.
However, as a bioinformatics software and tools developer, I have found that
the most needed ontology is probably a universal definition of genes. Much time
and effort have been spent in many organizations on linking various findings
on the same gene, simply because the gene or its product is named
differently by different sources. If we had a universal gene
name definition adopted by the various resources, this kind of
task would become trivial. Of course, a smart software agent would still need
to sort out the aggregated information and perform inference and reasoning, but
at least the first step would be much easier. The existing ontologies,
such as the Gene Ontology, define terms that describe a gene's
biological properties, but not the genes themselves. As a first version of my wish
list, a Gene Definition Ontology should provide the following features:
1. Define clearly what a gene is.
2. Define universal unique identifiers for genes and their
products. LSID already provides a syntactic construct that may be used to build
such unique identifiers. These identifiers may be developed by a consortium,
and adopted by participating organizations. Alternatively, a service could be
created on top of an LSID implementation that links other identifiers to this
set of unique identifiers.
3. Allow relationships between genes to be represented. For
example, the product of gene A up-regulates transcription of gene B.
Interoperation with GO terms should be possible in this respect.
4. Allow relationships between gene definitions and
sequences, including genomic sequences, cDNAs, and probe sets, to be represented. There are already ongoing
sequence ontology projects. Coordination among the ontologies
is important to ensure compatibility and interoperability.
5. The ontology may cover multiple species, and the
relationships between genes of different species (e.g. orthologs) may be represented. Again, interoperability with existing, widely used
taxonomies is desirable.
6. Define a core set of terms for the life sciences domain.
GO terms probably already cover many of them, but more may be needed. These
terms would be used to describe relationships among entities in this domain, much as
Dublin Core metadata does for publishing.
7. Addition of annotations (adding new properties to
entities or creating new relationships between entities) should be permitted in
a distributed fashion. Well-known services may be provided so that the
additions can be checked for well-formedness.
Modifications to the unique identifiers themselves, on the other hand, should
be centrally controlled by a governing body.
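To make features 2 and 3 more concrete, the following is a minimal sketch of how source-specific gene names might be linked to a single LSID-style identifier, and how a relationship between two genes might be recorded. It is an illustration only, not a proposed implementation: the authority `example.org`, the namespace `gene`, the source names, the aliases, the predicate `upRegulatesTranscriptionOf`, and the class and function names are all hypothetical. Case-insensitive alias matching is likewise an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneId:
    """A gene identifier in LSID form: urn:lsid:authority:namespace:object."""
    authority: str
    namespace: str
    object_id: str

    def __str__(self) -> str:
        return f"urn:lsid:{self.authority}:{self.namespace}:{self.object_id}"


def parse_lsid(text: str) -> GeneId:
    """Parse an LSID string; an optional trailing revision part is ignored here."""
    parts = text.split(":")
    well_formed = (
        len(parts) in (5, 6)
        and [p.lower() for p in parts[:2]] == ["urn", "lsid"]
        and all(parts[2:5])
    )
    if not well_formed:
        raise ValueError(f"not an LSID: {text!r}")
    return GeneId(parts[2], parts[3], parts[4])


class GeneRegistry:
    """Hypothetical service that maps source-specific names to one identifier."""

    def __init__(self) -> None:
        self._aliases = {}    # (source, local name) -> GeneId
        self._triples = set() # (subject GeneId, predicate, object GeneId)

    def link(self, source: str, local_name: str, gene_id: GeneId) -> None:
        # Assumption: names are matched case-insensitively.
        self._aliases[(source.lower(), local_name.lower())] = gene_id

    def resolve(self, source: str, local_name: str):
        """Return the universal identifier for a source-specific name, or None."""
        return self._aliases.get((source.lower(), local_name.lower()))

    def relate(self, subject: GeneId, predicate: str, obj: GeneId) -> None:
        self._triples.add((subject, predicate, obj))

    def relations_of(self, subject: GeneId):
        """List (predicate, object LSID) pairs recorded for a gene."""
        return sorted((p, str(o)) for s, p, o in self._triples if s == subject)


# Hypothetical data: two sources name the same gene differently.
registry = GeneRegistry()
gene_a = parse_lsid("urn:lsid:example.org:gene:A")
gene_b = parse_lsid("urn:lsid:example.org:gene:B")
registry.link("source_one", "geneA-alias", gene_a)
registry.link("source_two", "GA1", gene_a)
registry.relate(gene_a, "upRegulatesTranscriptionOf", gene_b)
```

With such a mapping in place, findings reported under `geneA-alias` by one source and `GA1` by another resolve to the same identifier, which is the linking step that currently consumes so much manual effort.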
OWL-DL is probably sufficient to
define such an ontology. However, issues may arise during implementation that
require more expressive languages. I hope that more insight can be
gained through discussions at the workshop. I am also interested in alternative
approaches that may address the same need for a universal
gene definition ontology.