- From: Eric Neumann <ENeumann@BeyondGenomics.com>
- Date: Mon, 3 Nov 2003 23:35:54 -0500
- To: <public-semweb-lifesci@w3.org>
- Message-ID: <FC5C355B8AE9F2499A5220CCEDC34756CF8715@bgmail.lifescience.com>
Hello,

I'm posting a set of examples of how an RDF-based system could assist in describing and sharing structured information related to Systems Biology. There are many examples of using RDF as a descriptive language, but I'm interested in whether it can be used to describe not only facts, but also assumptions and hypotheses regarding mechanisms of disease from a multi-component perspective, as is typical of data analysis from a Systems Biology point of view.

First, one needs a mechanism for encoding most molecular data: proteins, genes, transcripts, metabolites, interactions. The genome and its constituents need to be made accessible through a descriptive system that also supports a distributed annotation model (DAS, http://biodas.org), but is extensible to all molecular species and descriptors.

Second, there needs to be a robust and extensible model for accessing and including pathway data (e.g., KEGG, AfCS, BioCYC, BIND) and merging it with sets of annotated molecular data. It is hoped that efforts such as the BioPAX Ontology (http://biopax.org) will be the foundation for describing any group of pathway information, independent of source. The formalized relations between molecules, interactions, and reactions will be necessary for bridging the wide variety of biomolecular phenomena. Just as important will be the ability to aptly describe the "context" in which such phenomena are known to occur (e.g., tissue, disease, developmental stages).

Third, much causal evidence is not yet in pathway databases, but exists in millions of articles of unstructured scientific text (i.e., publications).
These not only need to be referenced, but the essential molecular mechanisms they describe need to be encoded in a semantic format; for example:

  <The Authors> <propose that>:
    <the termination and modulation of> <the JAK/STAT signalling pathway>
    <is mediated by>
    <tyrosine phosphatases>
    <the SOCS (suppressor of cytokine signalling) feedback inhibitors>
    and <PIAS (protein inhibitor of activated STAT) proteins>

This is a nontrivial exercise, but it is possible in time, and could be achieved incrementally in stages until it becomes part of the publication process. This structured evidence would then be merged/layered on top of other pathway and mechanistic information, so that inferences could be performed on the combined set. Publishers such as Nature are already beginning to explore the use of RDF in the publication space. Personally, I think text-mining can help us with legacy text, but publishers in the future should require authors to encode the model semantics along with their text and figures using some form of wizard tool...

Fourth, additional biological knowledge regarding anatomy, physiology, tissues, and diseases also needs to be represented in RDF in order to describe biological systems. A practical way to begin this process would be to translate the National Library of Medicine's UMLS language into RDF/OWL (http://www.nlm.nih.gov/pubs/factsheets/umls.html). From then on, all other databases should refer to biological entities through RDF references to these UMLS concepts. UMLS already contains nearly a million concepts and about 134 semantic relations, so conversion into RDF/OWL should be fairly automatic (e.g., namespace umls:).

Fifth, once an interesting set of observations can be related to existing published data, the researchers should be able to propose the new relations and/or bio-mechanisms using RDF. The gathered facts and assumptions, presented as RDF material, should be sufficient for any other scientist to validate the proposed hypothesis for themselves.
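
To make the third and fifth points a little more concrete, here is a minimal sketch in plain Python of how the JAK/STAT statement above might be held as triples, with the claim kept as its own node so that it stays attributed to the authors rather than being asserted as fact. Everything named here is an assumption for illustration only: the example.org namespace, the predicate names (proposes, subject, predicate, object, isMediatedBy), and the helper functions are hypothetical and are not taken from BioPAX, UMLS, or any RDF library.

```python
# Hypothetical namespace; not part of BioPAX, UMLS, or any published vocabulary.
EX = "http://example.org/sysbio#"


def uri(local):
    """Build a full URI string from a local name in the hypothetical namespace."""
    return EX + local


def encode_claim(claim_id, source, subject, predicate, objects):
    """Return triples saying that `source` proposes `subject predicate obj`
    for every obj in `objects`. The claim is a separate node, so the
    statement is recorded as a hypothesis, not as an asserted fact."""
    claim = uri(claim_id)
    triples = [
        (uri(source), uri("proposes"), claim),
        (claim, uri("subject"), uri(subject)),
        (claim, uri("predicate"), uri(predicate)),
    ]
    triples += [(claim, uri("object"), uri(obj)) for obj in objects]
    return triples


def mediators_proposed_by(triples, source):
    """List the local names of all objects of claims proposed by `source`."""
    claims = {o for s, p, o in triples
              if s == uri(source) and p == uri("proposes")}
    return sorted(o[len(EX):] for s, p, o in triples
                  if s in claims and p == uri("object"))


# The example statement from the text, encoded as one attributed claim.
claim_triples = encode_claim(
    "claim1",
    "TheAuthors",
    "JAK_STAT_termination_and_modulation",
    "isMediatedBy",
    ["TyrosinePhosphatases", "SOCS_FeedbackInhibitors", "PIAS_Proteins"],
)

# A made-up "pathway database" fact, layered with the claim by set union.
pathway_triples = {
    (uri("SOCS_FeedbackInhibitors"), uri("participatesIn"),
     uri("JAK_STAT_pathway")),
}
merged = set(claim_triples) | pathway_triples

print(mediators_proposed_by(merged, "TheAuthors"))
```

Keeping the claim as a node of its own is only one possible design; it is chosen here because it lets a second group attach its own evidence or a competing claim to the same subject without overwriting the first.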

Being able to validate a proposed hypothesis from shared RDF material is an important requirement for systems biology research, since describing and sharing complex relations and hypothetical mechanisms is at the heart of elucidating a biosystem. A distributed annotation model would also greatly enhance the sharing of models and insights.

It will be a test of the expressivity RDF can formally provide within a field as encompassing as systems biology. How well the systems biology community is able to apply RDF in advancing SB research will be the true test of its utility.

I intend to post more systems biology use cases and a few possible strategies for solving them in the next few weeks. I also welcome any ideas and suggestions from the life science community on what to consider and develop as part of a collaborative effort towards these goals.

Eric

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Eric K. Neumann PhD
VP Strategic Informatics, Head of Knowledge Research
Beyond Genomics
40 Bear Hill Road
Waltham, MA
tel: 781-434-0222
fax: 781-895-1119
www.beyondgenomics.com
Received on Monday, 3 November 2003 23:36:24 UTC