Systems Biology Use Case from Eric Neumann on 2003-11-04 (public-semweb-lifesci@w3.org from November 2003)

From: Eric Neumann <ENeumann@BeyondGenomics.com>
Date: Mon, 3 Nov 2003 23:35:54 -0500
To: <public-semweb-lifesci@w3.org>
Message-ID: <FC5C355B8AE9F2499A5220CCEDC34756CF8715@bgmail.lifescience.com>
Hello,
 
I'm posting a set of examples of how an RDF-based system could assist in
describing and sharing structured information related to Systems
Biology. There are many examples of using RDF as a descriptive language,
but I'm interested in whether it can be used to describe not only facts,
but also assumptions and hypotheses regarding mechanisms of disease from
a multi-component perspective, as is typical of data analysis from a
Systems Biology point of view.
 
First, one needs a mechanism for encoding most molecular data: proteins,
genes, transcripts, metabolites, interactions. The genome and its
constituents need to be made accessible through a descriptive system
that also supports a distributed annotation model (such as DAS,
http://biodas.org), but is extensible to all molecular species and
descriptors.
 
Second, there needs to be a robust and extensible model for accessing
and including pathway data (e.g., KEGG, AfCS, BioCYC, BIND) and merging
it with sets of annotated molecular data. It is hoped that efforts such as
the BioPAX Ontology (http://biopax.org) will be the foundation for
describing any group of pathway information, independent of source. The
formalized relations between molecules, interactions, and reactions will
be necessary for bridging the wide variety of biomolecular phenomena.
Just as important will be the ability to aptly describe the "context" in
which such phenomena are known to occur (e.g., tissue, disease,
developmental stage).
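
A minimal sketch of the merging step, assuming triples are held as plain
(subject, predicate, object) tuples. The example.org URIs, the predicate
names, and the tissue context are hypothetical placeholders, not BioPAX
terms or real database identifiers. When two sources use the same URI for
a molecule, merging reduces to a set union:

```python
# All URIs below are hypothetical placeholders for illustration only.
EX = "http://example.org/"

# Triples as they might come from a pathway source.
pathway = {
    (EX + "STAT1", EX + "participatesIn", EX + "JAK_STAT_signalling"),
    (EX + "JAK1", EX + "participatesIn", EX + "JAK_STAT_signalling"),
}

# Triples as they might come from a molecular annotation source,
# including a "context" statement (a made-up tissue).
annotations = {
    (EX + "STAT1", EX + "type", EX + "Protein"),
    (EX + "STAT1", EX + "observedInTissue", EX + "SomeTissue"),
}

def merge(*graphs):
    """Merge graphs that share URIs by taking the union of their triples."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged

def describe(graph, subject):
    """Return every (predicate, object) pair known about one subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)

merged = merge(pathway, annotations)
for predicate, obj in describe(merged, EX + "STAT1"):
    print(predicate, obj)
```

The point of the sketch is only that shared identifiers make the merge
trivial; agreeing on those identifiers is the hard part.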
 
Third, much causal evidence is not yet in pathway databases, but exists
in millions of articles of unstructured scientific text (i.e.,
publications). These not only need to be referenced, but the essential
molecular mechanisms they describe need to be encoded in a semantic
format; for example: 
 
<The Authors> <propose that>:
<the termination and modulation of>
<the JAK/STAT signalling pathway>
<is mediated by>
<tyrosine phosphatases>,
<the SOCS (suppressor of cytokine signalling) feedback inhibitors> and
<PIAS (protein inhibitor of activated STAT) proteins>
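
One way such a statement might be encoded is with the standard RDF
reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate,
rdf:object), so that the mechanism is recorded as something the authors
propose rather than asserted as fact. This is a sketch only: the
example.org namespace and the names isMediatedBy, proposedBy and
TheAuthors are hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/sb#"  # hypothetical namespace

def reify(statement_id, subject, predicate, obj, proposed_by):
    """Describe a triple as an rdf:Statement instead of asserting it,
    and attribute it to whoever proposed it."""
    node = EX + statement_id
    return [
        (node, RDF + "type", RDF + "Statement"),
        (node, RDF + "subject", subject),
        (node, RDF + "predicate", predicate),
        (node, RDF + "object", obj),
        (node, EX + "proposedBy", proposed_by),  # hypothetical property
    ]

def to_ntriples(triples):
    """Serialize URI-only triples, one N-Triples line per triple."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)

# One reified claim per proposed mediator from the example above.
mediators = ["TyrosinePhosphatases", "SOCS_FeedbackInhibitors", "PIAS_Proteins"]
claims = []
for i, mediator in enumerate(mediators, start=1):
    claims += reify(
        f"claim{i}",
        EX + "JAK_STAT_TerminationAndModulation",
        EX + "isMediatedBy",
        EX + mediator,
        EX + "TheAuthors",
    )

print(to_ntriples(claims))
```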

 
This is a nontrivial exercise, but it is possible in time, and could be
achieved incrementally, in stages, until it becomes part of the
publication process. This structured evidence would then be merged with,
or layered on top of, other pathway and mechanistic information, so that
inferences could be performed on the combined set. Publishers such as
Nature are already beginning to explore the use of RDF in the publication
space. Personally, I think text-mining can help us with legacy text, but
publishers in the future should require authors to encode the model
semantics along with their text and figures using some form of wizard
tool...
 
Fourth, additional biological knowledge regarding anatomy, physiology,
tissues, and diseases also needs to be represented in RDF in order to
describe biological systems. A practical way to begin this process would
be to translate the National Library of Medicine's UMLS (Unified Medical
Language System) into RDF/OWL
(http://www.nlm.nih.gov/pubs/factsheets/umls.html). From then on, all
other databases should refer to biological entities through RDF
references to these concepts. UMLS already contains nearly a million
concepts and about 134 semantic relations, so conversion into RDF/OWL
should be fairly automatic (e.g., namespace umls:).
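
A rough sketch of what such an automatic conversion could look like,
assuming the source can be read as rows of (concept, relation, concept).
The rows below are placeholders rather than real UMLS records, and the
namespace URI bound to umls: is hypothetical:

```python
# Hypothetical namespace URI for the umls: prefix.
UMLS_NS = "http://example.org/umls#"

# Placeholder rows (concept id, relation, concept id); not real UMLS data.
rows = [
    ("CONCEPT_A", "part_of", "CONCEPT_B"),
    ("CONCEPT_B", "isa", "CONCEPT_C"),
]

def to_turtle(rows, prefix="umls", namespace=UMLS_NS):
    """Emit one Turtle triple per row, with every term in one namespace."""
    lines = [f"@prefix {prefix}: <{namespace}> .", ""]
    for subject, relation, obj in rows:
        lines.append(f"{prefix}:{subject} {prefix}:{relation} {prefix}:{obj} .")
    return "\n".join(lines)

print(to_turtle(rows))
```

Other databases could then point at a concept by its umls: URI instead of
repeating a free-text name.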
 
Fifth, once an interesting set of observations can be related to
existing published data, the researchers should be able to propose the
new relations and/or bio-mechanisms using RDF. The gathered facts and
assumptions should be sufficient for any other scientist to validate the
proposed hypothesis for themselves, based on the presented RDF material.
This is an important requirement for systems biology research, since
describing and sharing complex relations and hypothetical mechanisms is
at the heart of elucidating a biosystem. A distributed annotation model
would also greatly enhance the sharing of models and insights. This will
test how much expressivity RDF can formally provide in a field as
encompassing as systems biology. How well the systems biology community
is able to apply RDF in advancing SB research will be the true test of
its utility.
 
I intend to post more systems biology use cases and a few possible
strategies for solving them in the next few weeks. I also welcome any
ideas and suggestions from the life science community on what to
consider and develop as part of a collaborative effort towards these
goals.
 
 
Eric
 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
    Eric K. Neumann PhD 
    VP Strategic Informatics, 
    Head of Knowledge Research 

   Beyond Genomics 


    40 Bear Hill Road 
    Waltham, MA 
     tel: 781-434-0222 
     fax: 781-895-1119 
     www.beyondgenomics.com
Received on Monday, 3 November 2003 23:36:24 UTC