Re: URIs from William Bug on 2006-06-19 (public-semweb-lifesci@w3.org from June 2006)

From: William Bug <William.Bug@DrexelMed.edu>
Date: Mon, 19 Jun 2006 15:08:49 -0400
To: "Xiaoshu Wang" <wangxiao@musc.edu>
Cc: "'Alan Ruttenberg'" <alanruttenberg@gmail.com>, <public-semweb-lifesci@w3.org>
Message-Id: <20D9369C-D985-4E0D-A130-498FB3258196@DrexelMed.edu>
Hi All,

First, I'd like to recommend two articles I believe are very relevant  
to this discussion and may help provide us a clearer sense of how to  
proceed here:

	1) X. Wang, Robert Gorlitsky, and Jonas S Almeida, From XML to RDF:  
how semantic web technologies will change the design of 'omic'  
standards (2005) Nat. Biotech., v23, n9, p1099
			(Xiaoshu is first author on this - I know this may be bringing  
"coals to Newcastle," but if there are some on this list who have not  
read this article, I'd strongly recommend they do.)

	2) G.V. Gkoutos, E. C. J. Green, S. Greenaway, A. Blake, A.-M.
Mallon, J.M. Hancock, CRAVE: A database, middleware, and
visualization system for phenotype ontologies (2005) Bioinformatics,
v21, n7, p1257.

I'll explain below where I see these fitting in to this discussion.

I would maintain the "social" issue we are addressing here is the  
shared, community view of the lower-levels of the semantic graph,  
which is very much related to what we need to have the machine  
algorithms parse.

I would also maintain the community efforts to produce a "formal  
form" (of the semantics) in the relevant domains of biomedical  
knowledge are very much separate from research fields focused on  
analyzing natural language expressions.  A "string of natural  
language" is an instance of a lexical "view" of A formal form of  
semantics - the formal semantic graph existing (somewhere) in the  
author(s) brain(s) - which may or may not conform to the shared  
formal semantic frameworks being developed by the community to cover  
specific knowledge domains in biomedicine.  RDF is particularly good  
at providing a formal way of making explicit the many semantic  
relations (explicit and implied) in a phrase of natural language, but  
that doesn't mean when we talk about semantic representation in RDF,  
we are always talking about representing natural language  
expressions.  Dealing with natural language is critical when parsing  
meaning from existing scientific articles, but here I believe we are  
more focussed on coming up with a means by which we can specify/ 
identify the semantic entities related to instances in data  
repositories, as opposed to only dealing with that which is  
extrapolated from parsing the literature.

It is very important not to confuse the lexicon with an ontology.   
The way I like to think of it, the lexicon is to an ontology, as a  
SQL VIEW is to your core data model.  The "view" contains a subset of  
the relational content consistent with the more complete abstract  
model, but the goal of the view is to interface to a particular  
application requirement, and thereby makes some compromises that can  
be very application specific, and not necessarily reflect the  
underlying model assertions.
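
To make the VIEW analogy concrete, here is a minimal sketch using
Python's built-in sqlite3 module.  The table names, the view name, and
the three example terms are hypothetical and chosen only for
illustration; they are not drawn from any real ontology or lexicon.

```python
import sqlite3

# Toy, hypothetical schema: table names, view name, and rows are
# illustrative assumptions, not taken from any real resource.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "Ontology" side: entities plus explicit, typed relations.
    CREATE TABLE entity (id TEXT PRIMARY KEY, label TEXT);
    CREATE TABLE relation (subject TEXT, predicate TEXT, object TEXT);

    INSERT INTO entity VALUES
        ('E1', 'neuron'),
        ('E2', 'Purkinje cell'),
        ('E3', 'cerebellum');
    INSERT INTO relation VALUES
        ('E2', 'is_a', 'E1'),
        ('E2', 'part_of', 'E3');

    -- "Lexicon" side: a flat term list cut for one application
    -- (say, a curator's pick-list).  The relations are not exposed.
    CREATE VIEW lexicon AS SELECT label AS term FROM entity;
""")

# What the application sees through the view: terms only.
cursor = conn.execute("SELECT * FROM lexicon")
view_columns = [col[0] for col in cursor.description]
terms = sorted(row[0] for row in cursor)

# What the underlying model still asserts, invisible through the view.
relation_count = conn.execute("SELECT COUNT(*) FROM relation").fetchone()[0]

print(view_columns, terms, relation_count)
```

The view is consistent with the underlying tables, but anyone who only
ever queries the view never learns that a Purkinje cell is a kind of
neuron.  That is the sense in which a lexicon is not an ontology.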

This is not just semantics.  ;-)

As several people on this list could easily expound on much better  
than I, there is a world of difference between the computational  
linguistic fields focussed on deducing semantic content from natural  
language strings and the formal ontological efforts to derive a  
foundational semantic framework for biomedical KR.  One would hope  
the Knowledge Extraction process performed by the computational  
linguists can be made to converge with (or used to re-architect when  
necessary) the shared community semantic graph, but the two are  
certainly not synonymous.

I would agree with what I believe Alan pointed out - it is a very
complex issue to resolve the difference between the semantics
associated with a particular data instance (e.g., a somatically
recombined sequence in a specific patient that led through a very
complex biochemical & morphogenetic process to a specific neoplasm)
and the related, higher-level shared semantic descriptions (e.g., of
the gene in which that mutation took place).  I don't know that we can
expect to resolve that issue given our current limited scope.  I do
feel it's a critical issue to the overall goal of using semantic web
descriptions of resources (including primary data) to drive new
knowledge discovery.

As I mentioned in the phone call, one of the ways these issues can be  
resolved is via the use of semantically-based mediation technology.   
As someone else pointed out in the call, it is really untenable to
attempt to warehouse all the data needed for field-wide, higher-order
data repositories beyond the biomolecular.  In many ways, even in the
biomolecular domain, we've outgrown warehousing.  PDB, SwissProt,
GenBank - a great deal of bioinformatics work that focusses on
content in these warehouses is targeting integration across
repositories and links to other, newer emerging repositories.  In
many ways, this is a task semantic web tech is most suited to
support (see the paper by Xiaoshu cited above).  To make this work,
mediation technology requires an alignment of participating  
repository schemas.  This can rarely be done effectively without  
referring to some shared, community semantic framework.  Of course,  
this also requires, as Xiaoshu has pointed out on this thread, a more  
explicit statement of the "processing contract" - an extremely thorny  
issue you cannot avoid when you are actually trying to do something  
with RDF content.

Semantic data mediation is definitely required in the
neuro-domain.  In the BIRN project, we have a data mediator used to
link across the 40+ disparate research lab repositories of primary  
and reduced data (Luis Marenco, Gordon, Perry Miller, and Kei are  
also developing a mediator framework at Yale, I believe).  There is a  
BIRN mediator "registration" process each participant lab needs to go  
through to link their data to the mediator. The goal is the mediator  
would resolve queries made to the BIRN portal into sub-queries across  
the participating "registered" databases.  Though initially only  
minimally tied to a semantic description of the source repositories,  
it's now clear a critical part of this process is for the source  
databases to map the entities in their schema to the appropriate  
higher-level, semantic entities to which they refer, drawing on a
shared semantic framework for the domain of neurodegenerative disease
being developed by the BIRN Ontology Task Force (I am a member of the
BIRN OTF).
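
The registration-and-mediation idea can be sketched in a few lines of
Python.  This is only a toy under stated assumptions, not the BIRN
mediator: the class names, the "shared:BrainRegion" term, the lab
names, and the rows are all hypothetical.  Each source registers a
mapping from shared semantic terms to its own local field names, and
the mediator turns one query phrased in shared terms into a sub-query
per registered source.

```python
from dataclasses import dataclass

# Hypothetical shared-framework term; not a real ontology identifier.
SHARED_BRAIN_REGION = "shared:BrainRegion"


@dataclass
class Source:
    """A registered lab repository: local rows plus a term mapping."""
    name: str
    mapping: dict  # shared term -> local field name
    rows: list     # list of dicts keyed by local field names

    def subquery(self, term, value):
        """Answer a shared-term query using the local schema."""
        local_field = self.mapping.get(term)
        if local_field is None:
            return []  # this source never mapped the term
        return [r for r in self.rows if r.get(local_field) == value]


class Mediator:
    """Fans one shared-term query out across registered sources."""

    def __init__(self):
        self.sources = []

    def register(self, source):
        self.sources.append(source)

    def query(self, term, value):
        return {s.name: s.subquery(term, value) for s in self.sources}


# Two toy labs that name the same concept differently, and a third
# that has not mapped it at all.
lab_a = Source("lab_a", {SHARED_BRAIN_REGION: "roi"},
               [{"roi": "hippocampus", "subject": "a01"},
                {"roi": "amygdala", "subject": "a02"}])
lab_b = Source("lab_b", {SHARED_BRAIN_REGION: "structure_name"},
               [{"structure_name": "hippocampus", "subject": "b07"}])
lab_c = Source("lab_c", {}, [{"region_code": "HIP", "subject": "c03"}])

mediator = Mediator()
for lab in (lab_a, lab_b, lab_c):
    mediator.register(lab)

hits = mediator.query(SHARED_BRAIN_REGION, "hippocampus")
print(hits)
```

Note that lab_c holds relevant data but returns nothing, because its
schema was never mapped to the shared term.  That is the practical
reason the mapping step turned out to be critical.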

In the course of developing the BIRN shared semantic framework, we've
begun to establish a set of "best practices", at least for what we are
doing within BIRN, that appear to be specifically applicable to the
topic under discussion:
	1) Re-use existing knowledge resources whenever possible.  This  
extends from flat term lists (gene names), to integrated lexical  
graph indexes such as NeuroNames, on through to formally complete  
ontologies such as the Foundational Model of Anatomy (FMA).  We are  
rarely able to use these "as is", yet it is clear by making the  
effort to examine where we need to adapt these resources, we expend  
nearly an order of magnitude less resources than were we to refashion  
the resource from first principles ourselves.  Often you can still
use a given domain ontology's formalism, even when the ontology
itself doesn't provide you with the granularity you require.  Using
the same formalism, or a compatible one, at least holds out the
possibility of later integrating what you create into the
community resource.
	2) Almost all of the semantic information we expect to expose to the
mediator can be reduced to an elemental view - that of measurements
made in the course of an investigation meant to specify
(quantitatively or qualitatively) phenotypic traits.  This is true of
spatially-mapped, CNS gene or protein expression data (e.g., Allen
Brain Atlas, GENSAT, Desmond Smith's "voxelized" microarray data
sets), as well as for assays of behavior and cognition which pervade
the human-focussed, neuroimaging projects within BIRN.  With this in
mind, we came to the understanding that it is important:
		a) to use a shared foundational ontology (we are trying to use the
BFO model beginning to be adopted by many biomed ontology efforts -
e.g., FuGO, FMA) and a community-shared collection of semantic
relations (again - we are converging on the OBO Relations ontology -
http://obo.sourceforge.net/relationship/ - another article worth
reading)
		b) to develop a means of formal phenotypic attribute description  
more flexible and capable of evolving than the current approaches in  
use by the community - e.g., use of the Mammalian Phenotype Ontology
by the GO folks at the Jackson Labs
(http://www.informatics.jax.org/menus/vocab_menu.shtml).  These
"pre-coordinated" views of complex
knowledge domains are very useful when you are providing a user  
interface for human literature curators (as GO & MGI do with MPO),  
but they don't provide for algorithms re-combining the more elemental  
semantic aspects represented in these "flattened" views.  Both in the  
case of disease and phenotype in general there is also a need to be  
more specifically tied to the observations extrapolated from the  
primary data.  This is where the second citation above comes in  
(CRAVE - application using PATO).  Using PATO with FuGO (once FuGO
spreads to cover assays, devices, & reagents outside of gene &
protein expression, as it is gradually moving toward), one can build
a semantically well-defined description of phenotype, maintaining the
integrity of the semantic links both to the primary data AND to the
shared, higher-level, semantic frameworks in the community.
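
As a rough illustration of why the more elemental form matters, here
is a small Python sketch.  It assumes a simple entity-plus-quality
record in the spirit of PATO; the "anat:" and "quality:" identifiers
and the scan labels are hypothetical placeholders, not real FMA or
PATO terms, and this is not how CRAVE or PATO is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phenotype:
    entity: str    # what was observed (an anatomical term)
    quality: str   # how it deviates (a PATO-style quality)
    evidence: str  # pointer back to the primary measurement


# Hypothetical observations; every identifier here is a placeholder.
observations = [
    Phenotype("anat:hippocampus", "quality:decreased_volume", "scan-001"),
    Phenotype("anat:amygdala", "quality:decreased_volume", "scan-002"),
    Phenotype("anat:hippocampus", "quality:increased_signal", "scan-003"),
]


def entities_with(obs, quality):
    """All entities showing a given quality."""
    return sorted({o.entity for o in obs if o.quality == quality})


def qualities_of(obs, entity):
    """All qualities recorded for a given entity."""
    return sorted({o.quality for o in obs if o.entity == entity})


def evidence_for(obs, entity, quality):
    """Primary-data pointers behind one entity/quality pair."""
    return [o.evidence for o in obs
            if o.entity == entity and o.quality == quality]


shrunken = entities_with(observations, "quality:decreased_volume")
hippocampal = qualities_of(observations, "anat:hippocampus")
sources = evidence_for(observations, "anat:hippocampus",
                       "quality:decreased_volume")
print(shrunken, hippocampal, sources)
```

With a pre-coordinated label such as "decreased hippocampus volume"
stored as a single string, each of these three questions would need
string parsing or a hand-built lookup table.  Keeping entity, quality,
and evidence separate lets an algorithm recombine them freely while
the link to the primary data stays intact.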

It's important to note many of these efforts - use of BFO, FuGO,
PATO, etc. - are really quite new.  PATO itself is so new, its
definition/specification is a bit of a moving target.

Having said this, using the general approach outlined above takes
into account many hard-learned lessons accumulated over the last few
decades in the field of biomedical KR.  It also appears from our
vantage within BIRN to be the best way to go.  We are actually
proceeding with our in-house BIRN semantically-oriented efforts with
a mind that these standards will be specified as needed in the coming
year.  Where the semantic graphs are incomplete in the domains we  
require, we are using what appears to be the emerging formalism and  
filling out the graph ourselves with the expectation these can be  
incorporated into the community resource as it matures.

As I see it, all of this work can draw on semantic web technology for  
many aspects of the implementation, if it can be used to construct  
such graphs (which it appears it can).

Cheers,
Bill


On Jun 19, 2006, at 12:33 PM, Xiaoshu Wang wrote:

>
> Alan,
>
>>> URI http://www.example.com/gene;
>>>
>>> You need to dereference the "gene" variable in order to
>> understand it
>>> and do something meaningful about it.
>>
>> That's one way. You can also publish a paper that describes
>> it, get a bunch of people agree to use it the same way,
>> supply formal logical definitions, or a subset of them in OWL.
>
> The objective of semantic web is designed for use by machine for  
> automated
> processing of information.  Once it touches the social aspect, it  
> is beyond
> what the RDF's capability, don't you think?
>
> The same analogy is the question regarding why we need to port  
> controlled
> vocabulary into RDF/OWL. Because in the formal form, the semantic  
> is encoded
> by a string of natural language, whereas the latter is by a machine
> language.
>
>>> Answer to (1a), Of course, you can have "variables" that are not
>>> intended to be dereferenced, in JavaScript, the type
>> "undefined" is
>>> similar to a "404".
>>> (Please note, a 404 does not mean that the URI does not
>> exist, it just
>>> implies that at current time, it cannot be dereferenced.) It is not
>>> wrong to define an "undefined" variable, it is just not much use of
>>> it.
>>> (1b) URI is just the name that refers a location on the
>> WEB, so it of
>>> course is a name.
>>
>> It is a name that *sometimes* refers to the web. See my
>> quote from the RFC.
>
> Yes, of course.  There are two basic types of information in the
> web.  The information-resource (IR) and non-IR.  For the former, the
> entity's manifestation can be retrieved by dereferencing the URI.
> For instance, a web page, a pdf document, an RDF document etc.  For
> the non-IR, like me the person, dereferencing the URI would not give
> you "me the person".  But instead, I shall offer a description about
> myself at the URI that represents me via a 303 redirect.
>
>> W3C knows nothing about Biology. They are good for defining
>> standards, but won't help us avoid one person using a gene database
>> entry identifier to refer to a protein in one place and a swissprot
>> name to refer to what they mean to be the same protein in another
>> place. That's what we have to work out.
>
> Of course, W3C won't mandate what should be a URI.  But I don't think
> there should be a "standard" to say if a URI represents a biological
> entity, it should be a database entry or not.  You can achieve this
> through clear description of URI. For instance, if I declare a URI to
> represent a protein "foo". You can say
>
> http://www.example.com/foo a someontology:Protein .
> http://www.example.com/foo http://www.example.com/dbentry (some URI to
> access a database) .
>
> This is semantically clear, right?  Why do we need to design a
> guideline to "implicitly" make
>
> http://www.example.com/foo refer to certain types of entity?  I
> think one of the important keys to RDF is its explicitness.  If you
> add a lot of social guidelines to the RDF, the whole point of SW will
> be lost.
>
> Xiaoshu
>
>

Bill Bug
Senior Analyst/Ontological Engineer

Laboratory for Bioimaging  & Anatomical Informatics
www.neuroterrain.org
Department of Neurobiology & Anatomy
Drexel University College of Medicine
2900 Queen Lane
Philadelphia, PA    19129
215 991 8430 (ph)
610 457 0443 (mobile)
215 843 9367 (fax)


Please Note: I now have a new email - William.Bug@DrexelMed.edu







Received on Monday, 19 June 2006 19:09:11 UTC