Exploring Semantic Web Infrastructure for Life Science Knowledge-bases

Position Paper for the W3C Workshop on Semantic Web for Life Sciences

27-28 October 2004, Cambridge, Massachusetts USA

Authors:

Last Changed:

15 September 2004

Abstract

In this position paper, we briefly present our approach to exploring Semantic Web technologies to augment our ongoing work in NLP and knowledge-base systems for the Life Sciences. The Hunter lab is developing tools that can (a) automatically link items with equivalent meaning from uncoordinated knowledge sources, and (b) use natural language processing techniques to extract new information from Entrez GeneRIFs and from the Gene Ontology definition fields. Although our project is still in its early stages, we have done preliminary work mapping NCBI resources to RDF/OWL and have developed an architectural plan for further discovery.

1. Introduction

Historically, the interpretation of experimental results has been an entirely cognitive task, performed on the basis of the experience and expertise of the investigators. Computational assistance is generally limited to searching databases for information about each gene individually, e.g., in the SOURCE system [8]. However, the large numbers of genes and gene products implicated simultaneously, and the enormous and diverse body of prior work relevant to all of these genes, make the interpretation of results increasingly vulnerable to errors and oversights that might be ameliorated by computational methods. Another challenge arises from the breakdown of disciplinary boundaries in biology. Genes that had previously been characterized in one context, say with respect to a particular disease process, are increasingly appearing as relevant in completely different contexts. For a few particularly striking examples of a very widespread phenomenon, consider the surprising apparent role of the well-studied tumor suppressor p53 in normal aging [9], and the role of embryonic and pregnancy-related genes in the cardiac response to heart failure (e.g., [10]).

The experimental community uses a great number of databases and annotation tools that contain genome-wide information to help interpret the results of high-throughput experiments; such databases are surveyed annually in the January issue of Nucleic Acids Research. Gene expression array analysis tools, such as NetAffx [11], provide web links to a plethora of database entries for each gene in an analysis. This approach, while certainly a key step in the interpretation of high-throughput results, tends to compound the information overload problem rather than solve it. Each analysis may generate hundreds of genes associated with the phenomenon under study; each of these genes may have useful entries in dozens of databases and may have been discussed in thousands of relevant publications from quite distinct biological subspecialties.

The failure of existing approaches to deal with information overload is likely the cause of lost or delayed opportunities for insight in any area in which high-throughput technology has been applied, ranging from developmental defects [12] to aging [13], and including major public health problems such as heart disease [14] and cancer [15]. Innovative computational tools (e.g. knowledge-bases) that facilitate the interpretation of the large and diverse sets of genes and gene products implicated by such instrumentation would be a transformative technology, helping achieve some of the as yet unfulfilled promise of the post-genomic era.

A Nature article [16] introduced the idea of conceptual biology, summarizing:

"Millions of easily retrievable facts are being accumulated in databases, from a variety of sources in seemingly unrelated fields, and from thousands of journals. New knowledge can be generated by 'reviewing' these accumulated results in a concept-driven manner, linking them into testable chains and networks."
A letter to Nature in the following issue [17] noted a significant challenge for conceptual biology:
"What is still needed is a way to control the context of the [conceptual] search, so that terms having different meanings in different contexts can be retrieved appropriately. We also need ways to enable scientists to cross disciplines and search in areas outside their expertise, so that they can extract information critical for new discoveries. Knowledge-based systems will no doubt provide the best opportunity in this regard."

Since the introduction of the Mycin system more than 25 years ago [18], it has been widely hypothesized that extensive, well-represented computer knowledge-bases will facilitate a wide variety of scientific and clinical tasks. However, despite the investment of tens of millions of dollars and hundreds of person-years, general-purpose knowledge-bases such as CYC [21] have yet to see widespread use. In contrast, recently created domain-specific knowledge-bases in genomics and related areas of contemporary biology, such as the Gene Ontology [19], EcoCyc [20] and PharmGKB [22], have begun to be integrated into the laboratory practices of a growing number of molecular biologists. Two complementary explanations of this newfound acceptance seem likely: the growing knowledge-management challenges arising from the proliferation of high-throughput instrumentation, and the relevance, quality and usefulness of the information in these knowledge-bases.

However, these successful molecular biology knowledge-bases (MBKBs) have two drawbacks that impede their more general application to the challenges inherent in the new era of high-throughput molecular biology. First, each is narrowly tailored to a special purpose, either in its domain of applicability (e.g. EcoCyc represents E. coli metabolism; PharmGKB represents interactions between genotypes and drug activity) or in the scope of knowledge represented (e.g. the Gene Ontology is a broad taxonomy of gene functions, but without attributes or non-hierarchical relationships among the functions represented). Second, each of these knowledge-bases was constructed largely on the basis of expensive and scarce human expertise, rather than by primarily automated systems.

Although it is possible that the success of these MBKBs is due to their narrowly tailored purpose and largely manual construction, we propose to test a contrary hypothesis, namely: Current computational technology and existing human-curated knowledge resources are sufficient to build an extensive, high-quality computational knowledge-base of molecular biology. At this phase of our research, we are not proposing to build a comprehensive MBKB, but to create and evaluate automated tools and to quantify the effort necessary to construct such an MBKB. The specific aims of our investigation are to:

Given that human-curated information resources relevant to the interpretation of high-throughput results are becoming increasingly interoperable, both structurally and semantically - some publishers have recently begun providing their data as RDF/OWL [1] [2] (e.g. UniProt) - we are hopeful that the Semantic Web will provide a standardized open architecture for our research.

2. Semantic Web Concerns

Despite our optimism, some aspects of the current Semantic Web architecture do concern us:

2.1 Inference

Is RDF reification sufficient for 'context' in large graphs?
RDF defines one big graph, but does not provide a good way to establish context within that graph. One use case is 'attribution' - e.g. there might be a need to mark all statements derived from a particular RDF document. Another use case is 'trust' in the Semantic Web and asserting the origin/provenance of a statement - trust has not yet been formally addressed by the W3C. We would like to efficiently associate triple statements with contextual information, and we think this will be required as graphs grow and are integrated with other graphs.

RDF reification allows a statement to be used as the subject of another statement, but reification is not a complete solution to the problem. Reification can quickly lead to extreme triple bloat (e.g. when asserting provenance); some estimates place the bloat as high as a tenfold increase. Additionally, the reified statements are often not what the user intended, since reification does not function as a quoting mechanism [7]. One solution is to encode the additional 'context' information within the statement itself, creating 'quads' consisting of an RDF triple plus a URIref, blank node, or ID; this is more efficient than reification. The semantics of the fourth element can vary considerably: it has been used to refer to information sources, model/statement IDs, and, more generally, 'contexts'. Quads are a better solution than reification, but both still suffer from what is called the 'two-stage interpretation process', defined in the RDF Semantics [7] as:

"one has to interpret the reified node - the subject of the triples in the reification - to refer to another triple, then treat that triple as RDF syntax and apply the interpretation mapping again to get to the referent of its subject"

Another solution is to use sets of 'named graphs' to define a collection of RDF graphs, with each graph named by a URIref. Named graphs are functionally similar to quads, but approach the problem differently. A named graph delimits the entire scope of an RDF graph, so the open world assumption is not relevant (as it is with quads). Named graphs function as a quoting mechanism for RDF, and thus avoid the two-stage interpretation inefficiencies associated with reification and quads. Several triple stores support various approaches to the RDF context problem - e.g. Kowari [6] uses named graphs to derive quads, where the fourth element is the group/model with which a triple is associated.
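
To make the bloat concrete, the following minimal sketch uses the Jena API [34] to attach provenance to a single fact by writing out the reification quad explicitly; the URIs and the derivedFrom property are hypothetical illustrations. One domain triple becomes six:

    // A minimal sketch of reification "triple bloat" using Jena [34].
    // The URIs and the derivedFrom property are hypothetical.
    import com.hp.hpl.jena.rdf.model.*;
    import com.hp.hpl.jena.vocabulary.RDF;

    public class ReificationBloat {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            String ns = "http://example.org/mbkb#";

            Resource gene = model.createResource(ns + "Calpain6");
            Property memberOf = model.createProperty(ns, "memberOf");
            Resource family = model.createResource(ns + "CalpainFamily");

            // The domain fact itself: one triple.
            model.add(gene, memberOf, family);
            System.out.println("before: " + model.size()); // 1

            // Asserting provenance via explicit reification: four
            // bookkeeping triples plus the provenance triple itself.
            Resource stmt = model.createResource(ns + "statement1");
            stmt.addProperty(RDF.type, RDF.Statement);
            stmt.addProperty(RDF.subject, gene);
            stmt.addProperty(RDF.predicate, memberOf);
            stmt.addProperty(RDF.object, family);
            stmt.addProperty(model.createProperty(ns, "derivedFrom"),
                             model.createResource(ns + "sourceDocument42"));
            System.out.println("after: " + model.size()); // 6
        }
    }

A quad or named-graph store records the same provenance in a fourth field alongside each triple, avoiding the bookkeeping statements entirely.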

How is 'negation' modeled in the Semantic Web?
Previous work in the lab produced a knowledge-base called Biognosticopoeia [31], covering many aspects of molecular pharmacology, particularly signal transduction (with an emphasis on the MAPK pathways). Biognosticopoeia is currently stored in a LISP frame-based system (the Conceptual Memory (CM) from i/Net). We are evaluating the potential for mapping the Biognosticopoeia knowledge-base into RDF/OWL, but the knowledge-base makes extensive use of negation, and it is not yet clear how to express our use of negation in OWL/RDF.

In many knowledge-bases, default inference is universal, in that any property associated with a concept is also 'inherited' by all children ('kinds-of') of that concept. In the Gene Ontology, this is called the True Path Rule:

"The pathway from a child term all the way up to its top level parent(s) must always be true. Often, annotating a new gene product reveals relationships in an ontology that break this rule, or species specificity becomes a problem. In such cases, the ontology must be restructured by adding more nodes and connecting terms such that any path upwards is true."
However, when an ontology is extended with many additional properties and relationships among entities, it becomes increasingly difficult to avoid instances where an occasional default inference is incorrect, despite its general applicability. A relevant example from molecular biology can be found in trying to represent the calpain family of calcium-dependent cytosolic cysteine proteases. As noted in its GO annotations, the family has the molecular function endopeptidase. However, while the human protein Calpain 6 is ostensibly a member of the family (it is 47% identical with Calpain 5), it does not exhibit any proteolytic activity, due to a lack of a catalytic cysteine at the active site. Without negation, either the endopeptidase activity would have to be removed from the family definition, or Calpain 6 would have to be removed from the family. Since the ‘closed world’ assumption is not applicable in molecular biology, it is important to be able to represent explicit assertions that a fact does not hold.
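
One candidate encoding, sketched below with Jena's ontology API, is to state the exception as membership in an OWL complement class; the class and individual names are hypothetical, and this is only one possible way to express the negative assertion:

    // A sketch of an explicit negative assertion in OWL using an
    // owl:complementOf class; the names are hypothetical.
    import com.hp.hpl.jena.ontology.*;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    public class CalpainNegation {
        public static void main(String[] args) {
            OntModel m = ModelFactory.createOntologyModel();
            String ns = "http://example.org/mbkb#";

            // Things that do exhibit endopeptidase activity.
            OntClass endopeptidase = m.createClass(ns + "Endopeptidase");

            // Its OWL complement: things that provably do not.
            ComplementClass nonEndopeptidase =
                m.createComplementClass(ns + "NonEndopeptidase", endopeptidase);

            // Calpain 6 stays in the calpain family, yet is explicitly
            // asserted NOT to be an endopeptidase.
            m.createIndividual(ns + "Calpain6", nonEndopeptidase);

            m.write(System.out, "N3");
        }
    }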

Are description logics sufficient for establishing meaningful inference?
OWL description logics use 'open world' reasoning, in which negation means that something is provably false. OWL alone will not support all of our inference needs - nor do we think its authors intended OWL to be an all-encompassing solution to inference - but we do see it as a valuable tool in certain situations. In particular, description logic reasoning may be helpful when building and maintaining our models, and for handling potential combinatorial explosions that might arise from pre-coordination. Efforts are under way in the Semantic Web community [33] to develop a rules language that fills the need for a 'closed world' reasoning option - i.e. if something cannot be proven true, then it is considered false. Despite the lack of a formal specification for a W3C rules language, several options are available. We have been investigating the Jess rule engine to fill this void: a mature engine based on the Rete algorithm that supports forward and (some) backward chaining [37]. A combination of various logics will be needed for our applications.
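
As an illustration of the closed-world style we have in mind, the following sketch drives Jess [37] from its Java API; the templates, facts, and rule are hypothetical examples, not part of our knowledge-base:

    // A sketch of closed-world ('negation as failure') defaulting with
    // the Jess rule engine [37]; the templates and facts are hypothetical.
    import jess.Rete;

    public class ClosedWorldSketch {
        public static void main(String[] args) throws jess.JessException {
            Rete engine = new Rete();
            engine.executeCommand("(deftemplate protein (slot name))");
            engine.executeCommand(
                "(deftemplate has-activity (slot protein) (slot activity))");

            // If no endopeptidase activity can be found for a protein,
            // conclude (by default) that it has none.
            engine.executeCommand(
                "(defrule assume-no-activity " +
                "  (protein (name ?p)) " +
                "  (not (has-activity (protein ?p) (activity endopeptidase))) " +
                "  => (printout t ?p \" assumed non-endopeptidase\" crlf))");

            engine.executeCommand("(assert (protein (name Calpain5)))");
            engine.executeCommand("(assert (protein (name Calpain6)))");
            engine.executeCommand(
                "(assert (has-activity (protein Calpain5) (activity endopeptidase)))");
            engine.run(); // fires only for Calpain6
        }
    }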

2.2 Scalability

Preliminary work has investigated benchmarking several Java-based RDF persistence options (Kowari [6], Jena/Joseki [34], Sesame [35]). We are concerned that the current crop of triple stores (providing RDF persistence, query, and inference) may not be up to the task of efficiently handling the large numbers of triples we anticipate.

Perhaps the most basic challenge for a large-scale knowledge-base is the computational capacity required for it. After many years of primarily theoretical dispute, the underlying representational system now most commonly used is a combination of modular frame-based systems and first-order predicate calculus assertions. Although providing a highly expressive language for representation and inference, such systems are challenging to store efficiently. The essence of the problem is that locality of reference is poor; in other words, once a few "links" (e.g. slot/value pairs, or default inferences) are followed, the target is effectively in a random place in memory relative to the source. If the knowledge-base is larger than the available RAM, then speed of access is rapidly reduced to disk access rates, which is impractically slow. Visual interfaces for graph structures present similar scalability challenges: the computational requirements for displaying the layout of 'huge graphs' (10,000+ elements) surpass the capability of common hardware [5]. Since greater amounts of data accumulate every year, optimization strategies for various RDF operations will need to be developed.

There are two types of graphs to consider in a knowledge-base: the actual physical graph data structures, and the conceptual graphs that comprise the various alternate (abstract) views that different users (e.g. biologists) employ when analyzing the data in a knowledge-base. We think many hypothetical paths through an RDF graph will need to be pre-computed through indexing strategies and multiple conceptual graphs to allow for more efficient navigation and query. Index structures are efficient representations for graphs, but when and where to create and modify indexes for complex graphs is not always apparent. Given that many 'common views' of genomic data can already be anticipated ahead of time, the search problem can likely be improved for many recurring graph operations through a series of index lookups mapped from various conceptual graphs.
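
As a minimal sketch of what we mean, the following code pre-computes one anticipated conceptual view over a Jena model, so that a common traversal (gene to pathway) becomes a single hash lookup; the participatesIn property and namespace are hypothetical:

    // Pre-computing one anticipated conceptual view as a hash index;
    // the participatesIn property and namespace are hypothetical.
    import com.hp.hpl.jena.rdf.model.*;
    import java.util.*;

    public class ConceptualIndex {
        public static Map<String, Set<String>> geneToPathways(Model model) {
            String ns = "http://example.org/mbkb#";
            Property participatesIn = model.createProperty(ns, "participatesIn");

            Map<String, Set<String>> index = new HashMap<String, Set<String>>();
            StmtIterator it = model.listStatements(null, participatesIn, (RDFNode) null);
            while (it.hasNext()) {
                Statement s = it.nextStatement();
                String gene = s.getSubject().toString();
                Set<String> pathways = index.get(gene);
                if (pathways == null) {
                    pathways = new HashSet<String>();
                    index.put(gene, pathways);
                }
                pathways.add(s.getObject().toString());
            }
            return index; // rebuilt or incrementally updated as the graph changes
        }
    }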

3. Current Work

3.1 Data Integration

Molecular biology data exists in a chaotic space. As of January 2004, there were 548 molecular biology databases, 162 more than the year before [3].

Although there are already linkages between these resources, primarily in references to genes and/or products, capturing all of the relationships among the entities in these resources in a rich knowledge-representation is a non-trivial task. Two broad classes of issues arise: those related to multidatabase integration generally, and those specific to importing data into a more richly represented knowledge-base.

Integrating information from the plethora of databases containing information relevant to the interpretation of high-throughput results poses problems at several levels. Identifying references to the same entity in multiple databases, although a fundamental requirement, has only recently become resolvable automatically, primarily through the concerted effort of database providers to supply cross-reference information, and, as documented below, there are still substantial problems in this area. Most database integration systems, whether data warehouses or database federations, assume that this problem, often called foreign-key translation, is solved externally; systems such as SOURCE, SRS, GeneCards, and DiscoveryLink provide a uniform query interface, but do not address the semantic data fusion problem. Even a solution to the foreign-key translation problem does not completely address the semantic compatibility of the information represented in the various databases. For data from multiple sources to be effectively integrated in the analysis of high-throughput data, semantic compatibility must exist among the conditions under which the data was gathered, among the terms and schemata used to represent the data, and among the implicit assumptions used. Efforts by database providers, community data-deposition standards such as MAGE-ML/MIAME [28], the adoption of these standards as a publication requirement by key journals in the field, and the growing adoption of ontologies for standardized terms and their meanings show preliminary evidence of success in addressing this issue as well.

Initially, we are attempting semantic data integration across the NCBI, UniProt, PDB, GO, BIND and Mouse Genome Informatics databases. To achieve this goal, the first step is to provide a data integration "plug-in" framework that allows for seamless access to and interaction between the databases. Since these databases run on different platforms and provide different query systems, such a "middleware" layer is required. The integration models for this middleware layer are expressed in OWL/RDF and implemented using Java and XSLT. We are using the NCBI databases for the proof of concept.
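
The transform step of that middleware might look like the following sketch, which uses the standard javax.xml.transform API to apply an XSLT stylesheet to a fetched NCBI XML record; the stylesheet and file names are hypothetical placeholders:

    // A sketch of the Java+XSLT transform step: NCBI XML in, RDF/XML out.
    // The stylesheet and file names are hypothetical placeholders.
    import javax.xml.transform.*;
    import javax.xml.transform.stream.*;
    import java.io.File;

    public class NcbiToRdf {
        public static void main(String[] args) throws TransformerException {
            TransformerFactory factory = TransformerFactory.newInstance();

            // Compile a stylesheet that maps an NCBI record to RDF/XML.
            Transformer transformer = factory.newTransformer(
                    new StreamSource(new File("ncbi-gene-to-rdf.xsl")));

            // Apply it to one record, emitting RDF for the triple store.
            transformer.transform(
                    new StreamSource(new File("entrez-gene-record.xml")),
                    new StreamResult(new File("entrez-gene-record.rdf")));
        }
    }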

3.2 NLP Enrichment Opportunities

The increasing availability of cross-references among molecular biology databases and the progress in semantic compatibility do not, in themselves, solve the problem of creating a broad, deep MBKB. While semantically compatible knowledge sources provide a large amount of information, they are far from complete, even with respect to existing knowledge as embodied in the biomedical literature. Furthermore, identifying references to particular genes or products in the biomedical literature is a highly challenging task [29]. Progress in the capacity of systems for extracting information from text (e.g. [30]; [36]) suggests that in certain restricted contexts, it may be possible for technology to extend the breadth of coverage of existing knowledge sources via information extraction. We contend that we could dramatically improve the comprehensiveness and depth of molecular biology knowledge-bases by extracting textual information present in two particular highly restricted sources: GO definition fields and Entrez GeneRIFs (Gene References Into Function). Currently, we are developing a Protege [32] plug-in to facilitate NLP-related annotation of these texts.

3.2.1 Entrez GeneRIFs

An even greater increase in the computationally manipulable information about molecular function could be obtained by extracting information from the new Entrez-Gene feature, GeneRIFs. GeneRIFs are short textual excerpts from the peer-reviewed literature, each making an assertion about the function of a particular gene or gene product. Since the spring of 2002, all articles cataloged by the National Library of Medicine have been scanned by human indexers for statements about the function(s) of particular genes. Each GeneRIF entry contains a concise phrase describing a function or functions (less than 255 characters in length) and a reference, as a PubMed identifier (PMID), to the published paper describing that function. As of September 2004, there were 80,612 GeneRIFs associated with 18,137 Entrez-Gene entries. In January 2003, there were 17,622 GeneRIFs associated with 6,912 Entrez-Gene entries; the difference amounts to a rate of increase of more than 3,300 GeneRIFs per month, and more than a doubling of the extent of coverage in Entrez-Gene.
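
Harvesting these entries for NLP is straightforward in principle; the sketch below assumes the tab-delimited GeneRIF dump distributed by NCBI, with columns we take here to be tax id, gene id, PMID list, timestamp, and the free-text phrase. The column layout is an assumption to be verified against the current NCBI documentation:

    // A sketch of reading GeneRIF records for downstream NLP.
    // ASSUMPTION: tab-delimited rows of tax id, gene id, PMID(s),
    // timestamp, free-text phrase; verify against the NCBI README.
    import java.io.*;

    public class GeneRifReader {
        public static void main(String[] args) throws IOException {
            BufferedReader in = new BufferedReader(new FileReader("generifs_basic"));
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split("\t");
                if (cols.length < 5) continue;   // skip header/malformed rows
                String geneId = cols[1];
                String pmids  = cols[2];
                String text   = cols[4];         // the <255-character phrase
                System.out.println(geneId + " [" + pmids + "]: " + text);
            }
            in.close();
        }
    }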

3.2.2 Gene Ontology Definitions

In the last four years, the first widely adopted knowledge resource with genomic breadth was created: the Gene Ontology [19]. The goal of the Gene Ontology project is “to produce a dynamic controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.” As evidence of its substantial impact, the initial Gene Ontology papers have been cited more than 200 times according to the ISI Science Citation Index, and 229 articles indexed in Medline contain the phrase “Gene Ontology” in their titles or abstracts.

Due to an extensive effort completed only recently, nearly all GO entries now have a definition field, which contains a short natural language text. We contend that it is possible to bring information in those definitions into computationally accessible form and that doing so would significantly increase the value of the resource.

Although ontologies and knowledge-bases are not always clearly distinguished, in the artificial intelligence literature an ontology is differentiated from a knowledge-base in terms of the types of properties and relationships that are included. An ontology specifies a set of well-defined concepts, generally related in a hierarchical taxonomy (i.e. the inverse relationship pair has-kinds/is-a) and possibly also in a partonomy (i.e. the inverse relationship pair has-parts/part-of).

The existence of GO could help with the ontological problem of knowledge-base creation by specifying a carefully curated list of concepts, and by linking those concepts to specific genes and gene products. However, a knowledge-base extends an ontology to include additional relationships and properties of the represented concepts which are domain-specific, and not generally applicable to all elements of the knowledge-base. For example, in molecular biology, the relationships phosphorylates-substrate and phosphorylated-by could relate kinases to their substrates, but would not be applicable to other entities. The task of explicitly representing the relationships that human beings appreciate among all of those concepts, which is central to the effective computational use of a knowledge-base, still remains. In order to create a knowledge-base with the depth of EcoCyc (which represents dozens of different relationships among a few hundred upper-level concepts) and the breadth of GO (which represents just two types of relationships, but among tens of thousands of upper-level concepts), we propose to use natural language processing techniques and semantic database integration to substantially enrich the GO with many new relationships, both among existing GO concepts, and between gene products and the elaborated knowledge-base.
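
The kinase example might be modeled as follows, again with Jena's ontology API; the class and property names are hypothetical illustrations, not part of GO:

    // A sketch of a domain-specific relationship beyond GO's taxonomy;
    // the class and property names are hypothetical.
    import com.hp.hpl.jena.ontology.*;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    public class KinaseRelations {
        public static void main(String[] args) {
            OntModel m = ModelFactory.createOntologyModel();
            String ns = "http://example.org/mbkb#";

            OntClass kinase = m.createClass(ns + "Kinase");
            OntClass protein = m.createClass(ns + "Protein");

            // phosphorylates-substrate applies only to kinases...
            ObjectProperty phosphorylates =
                m.createObjectProperty(ns + "phosphorylatesSubstrate");
            phosphorylates.addDomain(kinase);
            phosphorylates.addRange(protein);

            // ...and phosphorylated-by is declared as its inverse.
            ObjectProperty phosphorylatedBy =
                m.createObjectProperty(ns + "phosphorylatedBy");
            phosphorylatedBy.addInverseOf(phosphorylates);

            m.write(System.out, "N3");
        }
    }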

4. Outlook

Our investigation and NCBI RDF mappings are still works in progress. Further development of our data integration framework, and the alignment of our efforts with other ontologies and additional data sources, will reveal whether our approach is useful and viable.

References

[1] Eric Miller, Jim Hendler, eds. (2004): Web Ontology Language (OWL). Available at http://www.w3.org/2004/OWL/#specs

[2] Frank Manola, Eric Miller, eds. (2004): Resource Description Framework (RDF). Available at http://www.w3.org/2004/RDF

[3] Michael Y. Galperin, NCBI, Bethesda, MD (2004): The Molecular Biology Database Collection: 2004 update. Available at http://nar.oupjournals.org/cgi/content/full/32/suppl_1/D3

[4] Andy Seaborne, HP Laboratories, Bristol, UK (2003): RDQL - RDF Data Query Language. Available at http://www.hpl.hp.com/semweb/rdql.htm

[5] M.S. Marshall, I. Herman, G. Melançon (2003): An Object Oriented Design for Graph Visualization. Available at http://gvf.sourceforge.net/GVF.pdf

[6] Kowari - RDF Store. Available at http://www.kowari.org/

[7] P. Hayes, W3C (2004): RDF Semantics. Available at http://www.w3.org/TR/rdf-mt/

[8] Diehn M, Sherlock G, Binkley G, Jin H, Matese JC, Hernandez-Boussard T, Rees CA, Cherry JM, Botstein D, Brown PO, Alizadeh AA., Nucleic Acids Res. (2003): SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data.

[9] Tyner SD, Venkatachalam S, Choi J, Jones S, Ghebranious N, Igelmann H, Lu X, Soron G, Cooper B, Brayton C, Hee Park S, Thompson T, Karsenty G, Bradley A, Donehower LA., Nature (2002): p53 mutant mice that display early ageing-associated phenotypes.

[10] Dschietzig T, Bartsch C, Richter C, Laule M, Baumann G, Stangl K., Circ Res. (2003): Relaxin, a pregnancy hormone, is a functional endothelin-1 antagonist: attenuation of endothelin-1-mediated vasoconstriction by stimulation of endothelin type-B receptor expression via ERK-1/2 and nuclear factor-kappaB.

[11] Liu G, Loraine AE, Shigeta R, Cline M, Cheng J, Valmeekam V, Sun S, Kulp D, Siani-Rose MA., Nucleic Acids Res. (2003): NetAffx: Affymetrix probesets and annotations.

[12] Srivastava D., Curr Opin Cardiol. (1999): Developmental and genetic aspects of congenital heart disease.

[13] Kirschner M, Pujol G, Radu A., Biochem Biophys Res Commun. (2002): Oligonucleotide microarray data mining: search for age-dependent gene expression.

[14] Hwang JJ, Dzau VJ, Liew CC., Curr Cardiol Rep. (2001): Genomics and the pathophysiology of heart failure.

[15] Yeatman TJ., Am Surg. (2003): The future of clinical cancer management: one tumor, one chip.

[16] Blagosklonny MV, Pardee AB., Nature (2002): Conceptual biology: unearthing the gems.

[17] Barnes JC., Nature (2002): Conceptual biology: a semantic issue and more.

[18] Wraith SM, Aikins JS, Buchanan BG, Clancey WJ, Davis R, Fagan LM, Hannigan JF, Scott AC, Shortliffe EH, van Melle WJ, Yu VL, Axline SG, Cohen SN., Am J Hosp Pharm. (1976): Computerized consultation system for selection of antimicrobial therapy.

[19] Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G., Nat Genet. (2000): Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

[20] Karp PD, Riley M, Paley SM, Pelligrini-Toole A., Nucleic Acids Res. (1996): EcoCyc: an encyclopedia of Escherichia coli genes and metabolism.

[21] Douglas B. Lenat, Commun. ACM (1995): CYC: A Large-Scale Investment in Knowledge Infrastructure.

[22] Hewett M, Oliver DE, Rubin DL, Easton KL, Stuart JM, Altman RB, Klein TE., Nucleic Acids Res. (2002): PharmGKB: the Pharmacogenetics Knowledge Base.

[23] Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS., Nucleic Acids Res. (2004): UniProt: the Universal Protein knowledgebase.

[24] Wheeler DL, Church DM, Edgar R, Federhen S, Helmberg W, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, Suzek TO, Tatusova TA, Wagner L., Nucleic Acids Res. (2004): Database resources of the National Center for Biotechnology Information: update.

[25] Westbrook J, Feng Z, Chen L, Yang H, Berman HM., Nucleic Acids Res. (2003): The Protein Data Bank and structural genomics.

[26] Bader GD, Betel D, Hogue CW., Nucleic Acids Res. (2003): BIND: the Biomolecular Interaction Network Database.

[27] Blake JA, Richardson JE, Bult CJ, Kadin JA, Eppig JT; Mouse Genome Database Group., Nucleic Acids Res. (2003): MGD: the Mouse Genome Database.

[28] Spellman PT, Miller M, Stewart J, Troup C, Sarkans U, Chervitz S, Bernhart D, Sherlock G, Ball C, Lepage M, Swiatek M, Marks WL, Goncalves J, Markel S, Iordan D, Shojatalab M, Pizarro A, White J, Hubley R, Deutsch E, Senger M, Aronow BJ, Robinson A, Bassett D, Stoeckert CJ Jr, Brazma A., Genome Biol. (2002): Design and implementation of microarray gene expression markup language (MAGE-ML).

[29] Hanisch D, Fluck J, Mevissen HT, Zimmer R., Pac Symp Biocomput. (2003): Playing biology's name game: identifying protein names in scientific text.

[30] Gaizauskas R, Demetriou G, Artymiuk PJ, Willett P., Bioinformatics (2003): Protein structures and information extraction from biological texts: the PASTA system.

[31] George Acquaah-Mensah, et al., pre-press (2004): From Molecules to Behavior: Biognosticopoeia, a Knowledge Base for Pharmacology.

[32] Protege 2000 Editor. Available at http://protege.stanford.edu

[33] SWRL. Available at http://www.w3.org/Submission/SWRL/

[34] Jena - Semantic Web Framework for Java. Available at http://jena.sourceforge.net/

[35] Sesame - RDF database. Available at http://www.openrdf.org/

[36] Libbus B, Rindflesch TC., Proc AMIA Symp. (2002): NLP-based information extraction for managing the molecular biology literature.

[37] Jess - Java Rule Engine. Available at http://herzberg.ca.sandia.gov/jess/