In this position paper, we briefly present our approach to exploring Semantic Web technologies to augment our ongoing work in NLP and knowledge-base systems for the Life Sciences. The Hunter lab is developing tools that can (a) automatically link items with equivalent meaning from uncoordinated knowledge sources, and (b) use natural language processing techniques to extract new information from Entrez GeneRIFs and from the Gene Ontology definition fields. Although our project is still in its early development stages, we have done preliminary work mapping NCBI resources to RDF/OWL and developing an architectural plan for further development.
Historically, the interpretation of experimental results has been an entirely cognitive task, done on the basis of the experience and expertise of the investigators. Computational assistance is generally limited to searching databases for information about each gene individually, e.g., in the SOURCE system [8]. However, the large numbers of genes and gene products implicated simultaneously, and the enormous and diverse body of prior work relevant to all of these genes, make the interpretation of results increasingly vulnerable to errors and oversights that might be ameliorated by computational methods. Another challenge arises from the breakdown of disciplinary boundaries in biology. Genes previously characterized in one context, say with respect to a particular disease process, are increasingly appearing as relevant in completely different contexts. For a few particularly striking examples of a very widespread phenomenon, consider the surprising apparent role of the well-studied tumor suppressor p53 in normal aging [9], and the role of embryonic and pregnancy-related genes in the cardiac response to heart failure (e.g., [10]).
The experimental community uses a great number of databases and annotation tools that contain genome-wide information to help interpret the results of high-throughput data; such databases are surveyed annually in the January issue of Nucleic Acids Research. Gene expression array analysis tools, such as NetAffx [11], provide web links to a plethora of database entries for each gene in an analysis. This approach, while certainly a key step in the interpretation of high-throughput results, tends to compound the information overload problem rather than solve it. Each analysis may generate hundreds of genes associated with the phenomenon under study; each of these genes may have useful entries in dozens of databases and may have been discussed in thousands of relevant publications from quite distinct biological subspecialties.
The failure of existing approaches to deal with information overload is likely to be the cause of lost or delayed opportunities for insights in any area in which high-throughput technology has been applied, ranging from developmental defects [12] to aging [13], and including major public health problems such as heart disease [14] and cancer [15]. Innovative computational tools (e.g. knowledge-bases) that facilitate the interpretation of the large and diverse sets of changing genes and gene products identified by such instrumentation would be a transformative technology, helping to achieve some of the as yet unfulfilled promise of the post-genomic era.
A Nature article [16] introduced the idea of 'conceptual biology': the generation of new knowledge by reviewing, synthesizing and connecting findings that have already been published.
Since the introduction of the Mycin system more than 25 years ago [18], it has been widely hypothesized that extensive, well-represented computer knowledge-bases would facilitate a wide variety of scientific and clinical tasks. However, despite the investment of tens of millions of dollars and hundreds of person-years, general-purpose knowledge-bases such as CYC [21] have yet to see widespread use. In contrast, recently created domain-specific knowledge-bases in genomics and related areas of contemporary biology, such as the Gene Ontology [19], EcoCyc [20] and PharmGKB [22], have begun to be integrated into the laboratory practices of a growing number of molecular biologists. Two complementary explanations of this newfound acceptance seem likely: the growing knowledge-management challenges arising from the proliferation of high-throughput instrumentation, and the relevance, quality and usefulness of the information in these knowledge-bases.
However, these successful molecular biology knowledge-bases (MBKBs) have two drawbacks that impede their broader application to the challenges of the new era of high-throughput molecular biology. First, each is narrowly tailored to a special purpose, either in its domain of applicability (e.g. EcoCyc represents E. coli metabolism; PharmGKB represents interactions between genotypes and drug activity) or in the scope of knowledge represented (e.g. the Gene Ontology is a broad taxonomy of gene functions, but without attributes or non-hierarchical relationships among the functions represented). Second, each of these knowledge-bases was constructed largely on the basis of expensive and scarce human expertise, rather than by primarily automated systems.
Although it is possible that the success of these MBKBs is due precisely to their narrowly tailored purposes and largely manual construction, we propose to test a contrary hypothesis, namely: current computational technology and existing human-curated knowledge resources are sufficient to build an extensive, high-quality computational knowledge-base of molecular biology. At this phase of our research, we are not proposing to build a comprehensive MBKB, but rather to create and evaluate automated tools and to quantify the effort necessary to construct such an MBKB.
Human-curated information resources relevant to the interpretation of high-throughput results are becoming increasingly interoperable, both in structure and in semantics; some publishers (e.g. UniProt) have recently begun providing their data as RDF/OWL [1] [2]. We are therefore hopeful that the 'Semantic Web' will provide a standardized, open architecture for our research.
Despite our optimism, some aspects of the current Semantic Web architecture do concern us:
RDF reification allows a statement to be used as the subject of another statement, but reification is not a complete solution to the problem. Reification can quickly lead to extreme triple bloat (e.g. when asserting provenance); some estimates place the bloat as high as a tenfold increase. Additionally, the reified statements are often not what the user intended, since reification does not function as a quoting mechanism [7]. One alternative is to encode the additional 'context' information within the statement itself, creating 'quads': an RDF triple plus a URIref, blank node, or ID. This is more efficient than reification. The semantics of the fourth element vary considerably; it has been used to refer to information sources, model/statement IDs, and, more generally, 'contexts'. Quads are a better solution than reification, but both still suffer from what the RDF Semantics [7] describes as a 'two-stage interpretation process'.
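To make the triple bloat concrete, the following minimal sketch (using the Jena API [34]; the namespace, gene names, and assertedBy property are hypothetical illustrations, not part of any real schema) attaches a single provenance fact to a single assertion. One underlying fact ends up costing six triples: the base triple, four reification bookkeeping triples, and the provenance triple itself.

```java
import com.hp.hpl.jena.rdf.model.*;
import com.hp.hpl.jena.vocabulary.RDF;

public class ReificationBloat {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        String ns = "http://example.org/bio#"; // hypothetical namespace

        Resource gene = model.createResource(ns + "TP53");
        Property regulates = model.createProperty(ns, "regulates");
        Resource target = model.createResource(ns + "CDKN1A");

        // The base assertion: a single triple.
        model.add(gene, regulates, target);

        // Reifying that assertion so provenance can be attached requires
        // four additional bookkeeping triples (rdf:type, rdf:subject,
        // rdf:predicate, rdf:object)...
        Resource stmt = model.createResource()
                .addProperty(RDF.type, RDF.Statement)
                .addProperty(RDF.subject, gene)
                .addProperty(RDF.predicate, regulates)
                .addProperty(RDF.object, target);

        // ...before the provenance fact itself is finally asserted.
        stmt.addProperty(model.createProperty(ns, "assertedBy"),
                         model.createResource(ns + "GeneRIF-12345"));

        System.out.println(model.size()); // 6 triples for one annotated fact
    }
}
```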
Another solution is to use 'named graphs' to define a collection of RDF graphs, each named by a URIref. Named graphs are functionally similar to quads, but approach the problem differently. A named graph defines the entire scope of its RDF graph, so the open world assumption is not relevant (as it is with quads). Named graphs function as a quoting mechanism for RDF, and thus avoid the two-stage interpretation inefficiencies associated with reification and quads. Several triple stores support various approaches to the RDF context problem; e.g. Kowari [6] uses named graphs to derive quads, where the fourth element of each tuple is the group/model with which a triple is associated.
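The correspondence between the two representations can be illustrated in plain Java: grouping quads by their fourth element recovers a set of named graphs, which is essentially the relationship Kowari exploits. This is an illustrative sketch with hypothetical identifiers, not Kowari's actual implementation.

```java
import java.util.*;

public class QuadVsNamedGraph {
    public static void main(String[] args) {
        // Each quad is {subject, predicate, object, graph URIref};
        // the identifiers below are hypothetical.
        String[][] quads = {
            {"ex:TP53", "ex:regulates", "ex:CDKN1A", "ex:generifGraph"},
            {"ex:TP53", "rdfs:label", "\"p53\"",     "ex:ncbiGraph"}
        };

        // Grouping quads by their fourth element recovers named graphs;
        // conversely, naming a graph tags each of its triples with a
        // fourth element.
        Map<String, List<String[]>> namedGraphs = new HashMap<>();
        for (String[] q : quads) {
            namedGraphs.computeIfAbsent(q[3], k -> new ArrayList<>())
                       .add(new String[] { q[0], q[1], q[2] });
        }

        for (Map.Entry<String, List<String[]>> e : namedGraphs.entrySet()) {
            System.out.println(e.getKey() + " contains "
                               + e.getValue().size() + " triple(s)");
        }
    }
}
```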
In many knowledge-bases, default inference is universal, in that any property associated with a concept is also 'inherited' by all children ('kinds-of') of that concept. In the Gene Ontology, this is expressed as the True Path Rule: the path from any child term up to its top-level parent(s) must always hold true.
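Read operationally, the rule means that an annotation to a term implicitly holds for every ancestor of that term. The following minimal sketch of this inference uses hypothetical term names rather than real GO accessions:

```java
import java.util.*;

public class TruePathRule {
    // Child term -> its direct parents (hypothetical term names).
    static final Map<String, List<String>> PARENTS = Map.of(
        "GO:apoptosis",  List.of("GO:cell_death"),
        "GO:cell_death", List.of("GO:biological_process"));

    /** Collect a term and all of its ancestors by walking up the hierarchy. */
    static Set<String> ancestorClosure(String term) {
        Set<String> closure = new LinkedHashSet<>();
        Deque<String> todo = new ArrayDeque<>(List.of(term));
        while (!todo.isEmpty()) {
            String t = todo.pop();
            if (closure.add(t)) {
                todo.addAll(PARENTS.getOrDefault(t, List.of()));
            }
        }
        return closure;
    }

    public static void main(String[] args) {
        // An annotation to the most specific term implies the whole path.
        System.out.println(ancestorClosure("GO:apoptosis"));
        // -> [GO:apoptosis, GO:cell_death, GO:biological_process]
    }
}
```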
Our preliminary work has included benchmarking several Java-based RDF persistence options (Kowari [6], Jena/Joseki [34], Sesame [35]). We are concerned that the current crop of triple stores (providing RDF persistence, query, and inference) may not be up to the task of efficiently handling the large number of triples we anticipate.
Perhaps the most basic challenge for a large-scale knowledge-base is the computational capacity it requires. After many years of primarily theoretical dispute, the underlying representational system now most commonly used is a combination of modular frame-based systems and first-order predicate calculus assertions. Although such systems provide a highly expressive language for representation and inference, they are challenging to store efficiently. The essence of the problem is poor locality of reference: once a few "links" (e.g. slot/value pairs, or default inferences) are followed, the target is effectively in a random place in memory relative to the source. If the knowledge-base is larger than the available RAM, speed of access rapidly degrades to disk access rates, which is impractically slow. Visual interfaces for graph structures present similar scalability challenges: the computational requirements for laying out 'huge graphs' (10,000+ elements) surpass the capabilities of common hardware [5]. Since greater amounts of data accumulate every year, optimization strategies for various RDF operations will need to be developed.
There are two types of graphs to consider in a knowledge-base: the actual physical graph data structures, and the conceptual graphs that comprise the alternate (abstract) views that different users (e.g. biologists) employ when analyzing the data. We think many hypothetical paths through an RDF graph will need to be pre-computed, through indexing strategies and multiple conceptual graphs, to allow for more efficient navigation and query. Index structures are efficient representations for graphs, but when and where to create and modify indexes for complex graphs is not always apparent. Given that many 'common views' of genomic data can be anticipated ahead of time, the search problem can likely be improved for many conserved graph operations through a series of index lookups mapped from the various conceptual graphs, as sketched below.
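As a minimal illustration of such a pre-computed conceptual view, consider collapsing a two-hop gene-to-GO traversal into a single hash lookup. The path and all identifiers here are merely illustrative, not a prescribed index design:

```java
import java.util.*;

public class ConceptualViewIndex {
    public static void main(String[] args) {
        // Base graph edges (illustrative identifiers).
        Map<String, List<String>> geneToProducts = Map.of(
            "gene:TP53", List.of("protein:P04637"));
        Map<String, List<String>> productToGoTerms = Map.of(
            "protein:P04637", List.of("GO:0006915", "GO:0006355"));

        // Pre-compute the two-hop path gene -> product -> GO term once,
        // so later queries are a single index lookup instead of a traversal.
        Map<String, Set<String>> geneToGo = new HashMap<>();
        geneToProducts.forEach((gene, products) -> {
            for (String product : products) {
                geneToGo.computeIfAbsent(gene, k -> new LinkedHashSet<>())
                        .addAll(productToGoTerms.getOrDefault(product, List.of()));
            }
        });

        System.out.println(geneToGo.get("gene:TP53"));
        // -> [GO:0006915, GO:0006355]
    }
}
```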
Molecular biology data exists in a chaotic space. As of January 2004, there were 548 molecular biology databases, 162 more than the year before [3].
Although there are already linkages between these resources, primarily in references to genes and/or products, capturing all of the relationships among the entities in these resources in a rich knowledge-representation is a non-trivial task. Two broad classes of issues arise: those related to multidatabase integration generally, and those specific to importing data into a more richly represented knowledge-base.
Integrating information from the plethora of databases relevant to the interpretation of high-throughput results poses problems at several levels. Identifying references to the same entity in multiple databases, although a fundamental requirement, has only recently become resolvable automatically, primarily through a concerted effort by database providers to supply cross-reference information; as documented below, substantial problems remain in this area. Most database integration systems, whether data warehouses or database federations, assume that this problem, often called foreign-key translation, is solved externally; systems such as SOURCE, SRS, GeneCards, and DiscoveryLink provide a uniform query interface but do not address the semantic data fusion problem. Even a solution to the foreign-key translation problem does not fully address the semantic compatibility of the information represented in the various databases. For data from multiple sources to be effectively integrated in the analysis of high-throughput data, semantic compatibility must exist among the conditions under which the data was gathered, the terms and schemata used to represent it, and the implicit assumptions behind it. Efforts by database providers, community data-deposition standards such as MAGE-ML/MIAME [28] and their adoption as a publication requirement by key journals in the field, and the growing adoption of ontologies for standardized terms and their meanings all show preliminary evidence of success in addressing this issue.
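In its simplest form, foreign-key translation amounts to a curated cross-reference table that collapses superficially different identifiers onto a single internal entity. The canonical internal identifiers in this sketch are hypothetical; the cross-database identifiers are merely illustrative:

```java
import java.util.*;

public class ForeignKeyTranslation {
    // External identifier -> canonical internal identifier
    // (the "kb:" identifiers are hypothetical).
    static final Map<String, String> XREFS = Map.of(
        "UniProt:P04637",  "kb:gene-TP53",
        "EntrezGene:7157", "kb:gene-TP53",
        "PDB:1TUP",        "kb:gene-TP53");

    public static void main(String[] args) {
        // Three superficially different references resolve to one entity.
        System.out.println(XREFS.get("UniProt:P04637")
                .equals(XREFS.get("EntrezGene:7157"))); // true
    }
}
```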
Initially, we are attempting semantic data integration across the NCBI [24], UniProt [23], PDB [25], GO [19], BIND [26] and Mouse Genome Informatics [27] databases. The first step toward this goal is a data integration "plug-in" framework that allows seamless access to, and interaction between, the databases. Since these databases run on different platforms and expose different query systems, such a "middleware" layer is required. The data integration models for this middleware layer are expressed in OWL/RDF and implemented using Java and XSLT; we are using the NCBI databases for our proof of concept.
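A hypothetical sketch of the kind of plug-in contract such a middleware layer could expose follows; the interface and method names are illustrative assumptions, not the actual design of our framework:

```java
import com.hp.hpl.jena.rdf.model.Model;

/** Illustrative plug-in contract for one backing database. */
public interface DataSourcePlugin {
    /** Short name of the backing database, e.g. "NCBI" or "UniProt". */
    String sourceName();

    /** Whether this source recognizes the given identifier. */
    boolean canResolve(String identifier);

    /** Fetch the record and map it (e.g. via XSLT) into an RDF model. */
    Model fetchAsRdf(String identifier);
}
```

Under such a contract, the integration layer can dispatch an identifier to every plug-in that claims to resolve it and merge the returned RDF models into a single graph.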
An even greater increase in the computationally manipulable information about molecular function could be obtained by extracting information from the new Entrez-Gene feature, GeneRIFs. GeneRIFs are short textual excerpts from the peer-reviewed literature, each making an assertion about the function of a particular gene or gene product. Since the spring of 2002, all articles cataloged by the National Library of Medicine have been scanned by human indexers for statements about the function(s) of particular genes. Each GeneRIF entry contains a concise phrase describing a function or functions (less than 255 characters in length) and a reference to the published paper describing that function, as a PubMed identifier (PMID). As of September 2004, there were 80,612 GeneRIFs associated with 18,137 Entrez-Gene entries; in January 2003, there were 17,622 GeneRIFs associated with 6,912 Entrez-Gene entries. That is a rate of increase of more than 3,300 GeneRIFs per month, and it more than doubled the extent of coverage in Entrez-Gene.
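The information carried by each entry, as described above, is simple to capture directly. The following sketch uses field names of our own choosing; it is not NCBI's schema:

```java
/** Illustrative container for one GeneRIF entry. */
public class GeneRif {
    public final int geneId;   // Entrez-Gene identifier
    public final int pmid;     // PubMed identifier of the source paper
    public final String text;  // concise functional phrase

    public GeneRif(int geneId, int pmid, String text) {
        // GeneRIF phrases are less than 255 characters in length.
        if (text.length() >= 255) {
            throw new IllegalArgumentException("GeneRIF text too long");
        }
        this.geneId = geneId;
        this.pmid = pmid;
        this.text = text;
    }
}
```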
Within the last four years, the first widely adopted knowledge resource with genomic breadth was created: the Gene Ontology [19]. The goal of the Gene Ontology project is “to produce a dynamic controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.” As evidence of its substantial impact, the papers describing it have been cited more than 200 times according to the ISI Science Citation Index, and 229 articles indexed in Medline contain the phrase “Gene Ontology” in their titles or abstracts.
Due to an extensive effort completed only recently, nearly all GO entries now have a definition field containing a short natural-language text. We contend that it is possible to bring the information in those definitions into computationally accessible form, and that doing so would significantly increase the value of the resource.
Although ontologies and knowledge-bases are not always clearly distinguished, in the artificial intelligence literature an ontology is differentiated from a knowledge-base in terms of the types of properties and relationships that are included. An ontology specifies a set of well-defined concepts, generally related in a hierarchical taxonomy (i.e. the inverse relationships has-kinds/is-a) and possibly also in a partonomy (i.e. the inverse relationships has-parts/part-of).
The existence of GO could help with the ontological problem of knowledge-base creation by specifying a carefully curated list of concepts and linking those concepts to specific genes and gene products. However, a knowledge-base extends an ontology to include additional relationships and properties of the represented concepts which are domain-specific, and not generally applicable to all elements of the knowledge-base. For example, in molecular biology, the relationships phosphorylates and phosphorylated-by could relate kinases to their substrates, but would not be applicable to other entities (see the sketch below). The task of explicitly representing the relationships that human beings appreciate among all of those concepts, which is central to the effective computational use of a knowledge-base, still remains. In order to create a knowledge-base with the depth of EcoCyc (which represents dozens of different relationships among a few hundred upper-level concepts) and the breadth of GO (which represents just two types of relationships, but among tens of thousands of upper-level concepts), we propose to use natural language processing techniques and semantic database integration to substantially enrich the GO with many new relationships, both among existing GO concepts, and between gene products and the elaborated knowledge-base.
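As a sketch of what such an enrichment might look like in RDF (using the Jena API [34]; the namespace and resource URIs are hypothetical illustrations), a domain-specific, non-hierarchical relationship can be asserted directly between a kinase and its substrate:

```java
import com.hp.hpl.jena.rdf.model.*;

public class KinaseSubstrate {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        String ns = "http://example.org/mbkb#"; // hypothetical namespace

        Resource kinase = model.createResource(ns + "CDK2");
        Resource substrate = model.createResource(ns + "RB1");
        Property phosphorylates = model.createProperty(ns, "phosphorylates");

        // A relationship that applies to kinases but not to arbitrary
        // concepts: exactly what a bare taxonomy cannot express.
        model.add(kinase, phosphorylates, substrate);

        model.write(System.out, "N-TRIPLE");
    }
}
```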
Our investigation and NCBI RDF mappings are still a work in progress. Further development of our data integration framework, and the alignment of our efforts with other ontologies and additional data sources, will reveal whether our approach is useful and viable.
[1] Eric Miller, Jim Hendler eds. (2004): Web Ontology Language (OWL). Available at http://www.w3.org/2004/OWL/#specs
[2] Frank Manola, Eric Miller, eds. (2004): Resource Description Framework (RDF). Available at http://www.w3.org/2004/RDF
[3] Michael Y. Galperin, NCBI, Bethesda, MD (2004): The Molecular Biology Database Collection: 2004 update. Available at http://nar.oupjournals.org/cgi/content/full/32/suppl_1/D3
[4] Andy Seaborne, HP Laboratories, Bristol, UK (2003): RDQL - RDF Data Query Language. Available at http://www.hpl.hp.com/semweb/rdql.htm
[5] M.S. Marshall, I. Herman, G. Melançon (2003): An Object Oriented Design for Graph Visualization. Available at http://gvf.sourceforge.net/GVF.pdf
[6] Kowari - RDF Store. Available at http://www.kowari.org/
[7] P. Hayes, W3C (2004): RDF Semantics. Available at http://www.w3.org/TR/rdf-mt/
[8] Diehn M, Sherlock G, Binkley G, Jin H, Matese JC, Hernandez-Boussard T, Rees CA, Cherry JM, Botstein D, Brown PO, Alizadeh AA., Nucleic Acids Res. (2003): SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data.
[9] Tyner SD, Venkatachalam S, Choi J, Jones S, Ghebranious N, Igelmann H, Lu X, Soron G, Cooper B, Brayton C, Hee Park S, Thompson T, Karsenty G, Bradley A, Donehower LA., Nature (2002): p53 mutant mice that display early ageing-associated phenotypes.
[10] Dschietzig T, Bartsch C, Richter C, Laule M, Baumann G, Stangl K., Circ Res. (2003): Relaxin, a pregnancy hormone, is a functional endothelin-1 antagonist: attenuation of endothelin-1-mediated vasoconstriction by stimulation of endothelin type-B receptor expression via ERK-1/2 and nuclear factor-kappaB.
[11] Liu G, Loraine AE, Shigeta R, Cline M, Cheng J, Valmeekam V, Sun S, Kulp D, Siani-Rose MA., Nucleic Acids Res. (2003): NetAffx: Affymetrix probesets and annotations.
[12] Srivastava D., Curr Opin Cardiol. (1999): Developmental and genetic aspects of congenital heart disease.
[13] Kirschner M, Pujol G, Radu A., Biochem Biophys Res Commun. (2002): Oligonucleotide microarray data mining: search for age-dependent gene expression.
[14] Hwang JJ, Dzau VJ, Liew CC., Curr Cardiol Rep. (2001): Genomics and the pathophysiology of heart failure.
[15] Yeatman TJ., Am Surg. (2003): The future of clinical cancer management: one tumor, one chip.
[16] Blagosklonny MV, Pardee AB., Nature (2002): Conceptual biology: unearthing the gems.
[17] Barnes JC., Nature (2002): Conceptual biology: a semantic issue and more.
[18] Wraith SM, Aikins JS, Buchanan BG, Clancey WJ, Davis R, Fagan LM, Hannigan JF, Scott AC, Shortliffe EH, van Melle WJ, Yu VL, Axline SG, Cohen SN., Am J Hosp Pharm. (1976): Computerized consultation system for selection of antimicrobial therapy.
[19] Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G., Nat Genet. (2000): Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.
[20] Karp PD, Riley M, Paley SM, Pellegrini-Toole A., Nucleic Acids Res. (1996): EcoCyc: an encyclopedia of Escherichia coli genes and metabolism.
[21] Douglas B. Lenat, Commun. ACM (1995): CYC: A Large-Scale Investment in Knowledge Infrastructure.
[22] Hewett M, Oliver DE, Rubin DL, Easton KL, Stuart JM, Altman RB, Klein TE., Nucleic Acids Res. (2002): PharmGKB: the Pharmacogenetics Knowledge Base.
[23] Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS., Nucleic Acids Res. (2004): UniProt: the Universal Protein knowledgebase.
[24] Wheeler DL, Church DM, Edgar R, Federhen S, Helmberg W, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, Suzek TO, Tatusova TA, Wagner L., Nucleic Acids Res. (2004): Database resources of the National Center for Biotechnology Information: update.
[25] Westbrook J, Feng Z, Chen L, Yang H, Berman HM., Nucleic Acids Res. (2003): The Protein Data Bank and structural genomics.
[26] Bader GD, Betel D, Hogue CW., Nucleic Acids Res. (2003): BIND: the Biomolecular Interaction Network Database.
[27] Blake JA, Richardson JE, Bult CJ, Kadin JA, Eppig JT; Mouse Genome Database Group., Nucleic Acids Res. (2003): MGD: the Mouse Genome Database.
[28] Spellman PT, Miller M, Stewart J, Troup C, Sarkans U, Chervitz S, Bernhart D, Sherlock G, Ball C, Lepage M, Swiatek M, Marks WL, Goncalves J, Markel S, Iordan D, Shojatalab M, Pizarro A, White J, Hubley R, Deutsch E, Senger M, Aronow BJ, Robinson A, Bassett D, Stoeckert CJ Jr, Brazma A., Genome Biol. (2002): Design and implementation of microarray gene expression markup language (MAGE-ML).
[29] Hanisch D, Fluck J, Mevissen HT, Zimmer R., Pac Symp Biocomput. (2003): Playing biology's name game: identifying protein names in scientific text.
[30] Gaizauskas R, Demetriou G, Artymiuk PJ, Willett P., Bioinformatics (2003): Protein structures and information extraction from biological texts: the PASTA system.
[31] George Acquaah-Mensah et al., Pre-Press (2004): From Molecules to Behavior: Biognosticopoeia, a Knowledge Base for Pharmacology.
[32] Protege 2000 Editor. Available at http://protege.stanford.edu
[33] SWRL. Available at http://www.w3.org/Submission/SWRL/
[34] Jena - Semantic Web Framework for Java. Available at http://jena.sourceforge.net/
[35] Sesame - RDF database. Available at http://www.openrdf.org/
[36] Libbus B, Rindflesch TC., Proc AMIA Symp. (2002): NLP-based information extraction for managing the molecular biology literature.
[37] Jess - Java Rule Engine. Available at http://herzberg.ca.sandia.gov/jess/