- From: William Bug <William.Bug@DrexelMed.edu>
- Date: Mon, 19 Jun 2006 15:08:49 -0400
- To: "Xiaoshu Wang" <wangxiao@musc.edu>
- Cc: "'Alan Ruttenberg'" <alanruttenberg@gmail.com>, <public-semweb-lifesci@w3.org>
Hi All,

First, I'd like to recommend two articles I believe are very relevant to this discussion and may help provide us a clearer sense of how to proceed here:

1) X. Wang, Robert Gorlitsky, and Jonas S. Almeida, "From XML to RDF: how semantic web technologies will change the design of 'omic' standards" (2005) Nat. Biotech., v23, n9, p1099. (Xiaoshu is first author on this - I know this may be bringing "coals to Newcastle," but if there are some on this list who have not read this article, I'd strongly recommend they do.)

2) G.V. Gkoutos, E.C.J. Green, S. Greenway, A. Blank, A.-M. Mallon, J.M. Hancock, "CRAVE: A database, middleware, and visualization system for phenotype ontologies" (2005) Bioinformatics, v21, n7, p1257.

I'll explain below where I see these fitting into this discussion.

I would maintain the "social" issue we are addressing here is the shared, community view of the lower levels of the semantic graph, which is very much related to what we need the machine algorithms to parse. I would also maintain that the community efforts to produce a "formal form" (of the semantics) in the relevant domains of biomedical knowledge are very much separate from research fields focused on analyzing natural language expressions. A "string of natural language" is an instance of a lexical "view" of a formal form of semantics - the formal semantic graph existing (somewhere) in the author(s)' brain(s) - which may or may not conform to the shared formal semantic frameworks being developed by the community to cover specific knowledge domains in biomedicine. RDF is particularly good at providing a formal way of making explicit the many semantic relations (explicit and implied) in a phrase of natural language, but that doesn't mean that when we talk about semantic representation in RDF, we are always talking about representing natural language expressions. Dealing with natural language is critical when parsing meaning from existing scientific articles, but here I believe we are more focused on coming up with a means by which we can specify/identify the semantic entities related to instances in data repositories, as opposed to only dealing with what is extrapolated from parsing the literature.

It is very important not to confuse the lexicon with an ontology. The way I like to think of it, the lexicon is to an ontology as a SQL VIEW is to your core data model. The "view" contains a subset of the relational content consistent with the more complete abstract model, but the goal of the view is to interface to a particular application requirement, and it thereby makes some compromises that can be very application specific and need not reflect the underlying model assertions. This is not just semantics. ;-)

As several people on this list could expound on much better than I, there is a world of difference between the computational linguistic fields focused on deducing semantic content from natural language strings and the formal ontological efforts to derive a foundational semantic framework for biomedical KR. One would hope the knowledge extraction process performed by the computational linguists can be made to converge with (or be used to re-architect, when necessary) the shared community semantic graph, but the two are certainly not synonymous.
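Just to make the "RDF makes the semantics explicit" point concrete - this is purely an illustration, with made-up namespaces, classes, and properties (no actual community vocabulary implied) - a natural language phrase such as "gene X is expressed in the hippocampus" might be unpacked into explicit, machine-parsable triples along these lines:

@prefix ex:   <http://example.org/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# All URIs below are hypothetical placeholders.
ex:geneX          a ex:Gene ;
                  rdfs:label "gene X" .
ex:hippocampus    a ex:BrainRegion ;
                  rdfs:label "hippocampus" .
ex:expressionObs1 a ex:GeneExpressionObservation ;
                  ex:ofGene      ex:geneX ;
                  ex:inStructure ex:hippocampus .

The lexical string is just one "view" of such a graph; the graph itself is what machine algorithms can traverse and re-combine.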
I would agree with what I believe Alan pointed out - it is a very complex issue to resolve the difference between the semantics associated with a particular data instance (e.g., a somatically recombined sequence in a specific patient that led, through a very complex biochemical & morphogenetic process, to a specific neoplasm) and the related, higher-level shared semantic descriptions (e.g., of the gene in which that mutation took place). I don't know that we can expect to resolve that issue given our current limited scope. I do feel it's a critical issue for the overall goal of using semantic web descriptions of resources (including primary data) to drive new knowledge discovery.

As I mentioned in the phone call, one of the ways these issues can be resolved is via the use of semantically-based mediation technology. As someone else pointed out in the call, it is really untenable to attempt to warehouse all the data needed for field-wide, higher-order data repositories beyond the biomolecular. In many ways, even in the biomolecular domain, we've outgrown warehousing. PDB, SwissProt, GenBank - a great deal of the bioinformatics work that focuses on content in these warehouses is targeting integration across repositories and links to other, newer emerging repositories. In many ways, this is a task semantic web tech is most suited to support (see the paper by Xiaoshu cited above). To make this work, mediation technology requires an alignment of participating repository schemas. This can rarely be done effectively without referring to some shared, community semantic framework. Of course, this also requires, as Xiaoshu has pointed out on this thread, a more explicit statement of the "processing contract" - an extremely thorny issue you cannot avoid when you are actually trying to do something with RDF content.

Semantic data mediation is definitely required in the neuro domain. In the BIRN project, we have a data mediator used to link across the 40+ disparate research lab repositories of primary and reduced data (Luis Marenco, Gordon, Perry Miller, and Kei are also developing a mediator framework at Yale, I believe). There is a BIRN mediator "registration" process each participant lab needs to go through to link their data to the mediator. The goal is that the mediator would resolve queries made to the BIRN portal into sub-queries across the participating "registered" databases. Though initially only minimally tied to a semantic description of the source repositories, it's now clear a critical part of this process is for the source databases to map the entities in their schemas to the appropriate higher-level semantic entities to which they refer, drawing on a shared semantic framework for the domain of neurodegenerative disease being developed by the BIRN Ontology Task Force (I am a member of the BIRN OTF).
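To give a flavor of what that schema-to-ontology "registration" amounts to - and this is only a hypothetical sketch, with invented namespaces, class names, and mapping property, not actual BIRN vocabulary - a field in a participating lab's database might be tied to the shared framework roughly like this:

@prefix lab:    <http://example.org/labA/schema#> .
@prefix shared: <http://example.org/shared-framework#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .

# A column in the lab's local schema (hypothetical) ...
lab:subject_dx a shared:SchemaElement ;
    rdfs:label "subject_dx" ;
    # ... declared to denote instances of a class in the shared framework,
    # so a mediator can rewrite portal queries against the local schema.
    shared:denotesClass shared:NeurodegenerativeDiseaseDiagnosis .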
In the course of developing the BIRN shared semantic framework, we've begun to establish a set of "best practices", at least for what we are doing within BIRN, that appear to be specifically applicable to the topic under discussion:

1) Re-use existing knowledge resources whenever possible. This extends from flat term lists (gene names), to integrated lexical graph indexes such as NeuroNames, on through to formally complete ontologies such as the Foundational Model of Anatomy (FMA). We are rarely able to use these "as is", yet it is clear that by making the effort to examine where we need to adapt these resources, we expend nearly an order of magnitude fewer resources than were we to refashion the resource from first principles ourselves. Often you can still use a given domain ontology's formalism, even when the ontology itself doesn't provide the requisite granularity. Using the same - or a compatible - formalism at least holds out the possibility of later integrating what you create into the community resource.

2) Almost all of the semantic information we expect to expose to the mediator can be reduced to an elemental view - that of measurements made in the course of an investigation meant to specify (quantitatively or qualitatively) phenotypic traits. This is true of spatially-mapped CNS gene or protein expression data (e.g., Allen Brain Atlas, GENSAT, Desmond Smith's "voxelized" microarray data sets), as well as of the assays of behavior and cognition that pervade the human-focused neuroimaging projects within BIRN. With this in mind, we came to the understanding that it is important:

a) to use a shared foundational ontology (we are trying to use the BFO model beginning to be adopted by many biomed ontology efforts - e.g., FuGO, FMA) and a community-shared collection of semantic relations (again - we are converging on the OBO Relations ontology - http://obo.sourceforge.net/relationship/ - another article worth reading);

b) to develop a means of formal phenotypic attribute description that is more flexible and capable of evolving than the current approaches in use by the community - e.g., use of the Mammalian Phenotype Ontology by the GO folks at the Jackson Labs (http://www.informatics.jax.org/menus/vocab_menu.shtml). These "pre-coordinated" views of complex knowledge domains are very useful when you are providing a user interface for human literature curators (as GO & MGI do with MPO), but they don't provide for algorithms re-combining the more elemental semantic aspects represented in these "flattened" views.

Both in the case of disease and of phenotype in general, there is also a need to be more specifically tied to the observations extrapolated from the primary data. This is where the second citation above comes in (CRAVE - an application using PATO). Using PATO with FuGO (once FuGO spreads to cover assays, devices, & reagents outside of gene & protein expression, as it is gradually moving toward), one can build a semantically well-defined description of phenotype that maintains the integrity of the semantic links both to the primary data AND to the shared, higher-level semantic frameworks in the community (see the sketch below).

It's important to note that many of these efforts - use of BFO, FuGO, PATO, etc. - are really quite new. PATO itself is so new, its definition/specification is a bit of a moving target. Having said this, the general approach outlined above takes into account many hard-learned lessons accumulated over the last few decades in the field of biomedical KR. It also appears from our vantage within BIRN to be the best way to go. We are actually proceeding with our in-house, semantically-oriented BIRN efforts with a mind that these standards will be specified as needed in the coming year. Where the semantic graphs are incomplete in the domains we require, we are using what appears to be the emerging formalism and filling out the graph ourselves, with the expectation that these can be incorporated into the community resource as it matures.
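Returning to the post-coordination point under (2b): as a sketch of what such an entity-quality style description might look like in RDF - with all namespaces and identifiers invented for illustration, not actual PATO/FMA/FuGO terms - one observation could be expressed roughly as:

@prefix obs:  <http://example.org/study42/observations#> .
@prefix anat: <http://example.org/anatomy-like#> .
@prefix qual: <http://example.org/quality-like#> .
@prefix ex:   <http://example.org/terms/> .

# One phenotypic observation, post-coordinated from elemental pieces
# rather than flattened into a single pre-coordinated term.
obs:mouse17_hippocampalVolume
    a              ex:PhenotypicObservation ;
    ex:aboutEntity anat:Hippocampus ;      # the anatomical entity observed
    ex:hasQuality  qual:DecreasedVolume ;  # the quality, PATO-style
    ex:hasValue    "12.3"^^<http://www.w3.org/2001/XMLSchema#float> ;
    ex:derivedFrom obs:mriScan_0042 .      # link back to the primary data

The point is that the elemental pieces (entity, quality, value, provenance) stay separately addressable, so algorithms can re-combine them, while the link to the primary data is never lost.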
As I see it, all of this work can draw on semantic web technology for many aspects of the implementation, if it can be used to construct such graphs (which it appears it can).

Cheers,
Bill

On Jun 19, 2006, at 12:33 PM, Xiaoshu Wang wrote:

> Alan,
>
>>> URI http://www.example.com/gene;
>>>
>>> You need to dereference the "gene" variable in order to understand it
>>> and do something meaningful with it.
>>
>> That's one way. You can also publish a paper that describes it, get a
>> bunch of people to agree to use it the same way, supply formal logical
>> definitions, or a subset of them in OWL.
>
> The semantic web is designed for use by machines for automated processing
> of information. Once it touches the social aspect, it is beyond RDF's
> capability, don't you think?
>
> The same analogy is the question of why we need to port a controlled
> vocabulary into RDF/OWL: because in the former the semantics are encoded
> in a string of natural language, whereas in the latter they are encoded in
> a machine language.
>
>>> Answer to (1a): Of course, you can have "variables" that are not
>>> intended to be dereferenced; in JavaScript, the type "undefined" is
>>> similar to a "404". (Please note, a 404 does not mean that the URI does
>>> not exist, it just implies that at the current time it cannot be
>>> dereferenced.) It is not wrong to define an "undefined" variable, it is
>>> just not of much use.
>>> (1b) A URI is just a name that refers to a location on the WEB, so of
>>> course it is a name.
>>
>> It is a name that *sometimes* refers to the web. See my quote from the
>> RFC.
>
> Yes, of course. There are two basic types of resources on the web: the
> information resource (IR) and the non-IR. For the former, the entity's
> manifestation can be retrieved by dereferencing the URI - for instance, a
> web page, a PDF document, an RDF document, etc. For a non-IR, like me the
> person, dereferencing the URI would not give you "me the person"; instead,
> I would offer a description of myself at the URI that represents me, via a
> 303 redirect.
>
>> W3C knows nothing about Biology. They are good for defining standards,
>> but won't help us avoid one person using a gene database entry identifier
>> to refer to a protein in one place and a SwissProt name to refer to what
>> they mean to be the same protein in another place. That's what we have to
>> work out.
>
> Of course, W3C won't mandate what a URI should be. But I don't think there
> should be a "standard" that says whether a URI representing a biological
> entity should be a database entry or not. You can achieve this through a
> clear description of the URI. For instance, if I declare a URI to
> represent a protein "foo", you can say
>
> http://www.example.com/foo a someontology:Protein .
> http://www.example.com/foo http://www.example.com/dbentry (some URI to
> access a database) .
>
> This is semantically clear, right? Why do we need to design a guideline to
> "implicitly" make http://www.example.com/foo refer to a certain type of
> entity? I think one of the important keys to RDF is its explicitness. If
> you add a lot of social guidelines to RDF, the whole point of the SW will
> be lost.
>
> Xiaoshu
>

Bill Bug
Senior Analyst/Ontological Engineer
Laboratory for Bioimaging & Anatomical Informatics
www.neuroterrain.org
Department of Neurobiology & Anatomy
Drexel University College of Medicine
2900 Queen Lane
Philadelphia, PA 19129
215 991 8430 (ph)
610 457 0443 (mobile)
215 843 9367 (fax)

Please Note: I now have a new email - William.Bug@DrexelMed.edu