Re: scientific publishing task force update from William Bug on 2006-06-14 (public-semweb-lifesci@w3.org from June 2006)

From: William Bug <William.Bug@DrexelMed.edu>
Date: Wed, 14 Jun 2006 11:02:39 -0400
To: kc28 <kei.cheung@yale.edu>
Cc: John Rumble <jumbleusa@earthlink.net>, Phillip Lord <phillip.lord@newcastle.ac.uk>, Eric Neumann <eneumann@teranode.com>, w3c semweb hcls <public-semweb-lifesci@w3.org>, Jack Park <jack.park@sri.com>
Message-Id: <3877E590-36CA-4AF4-80DF-102A8F277604@DrexelMed.edu>
By all means - versioning is crucial - and all knowledge maps/ 
association files/annotations referencing nodes in an ontology MUST  
include the version number.

For an example of how biomed. ontology curators deal with the issue  
of versioning, see the Gene Ontology Consortium web site pages  
describing their SOP on this issue:

http://www.geneontology.org/GO.usage.shtml#obsoletions
	see "Obsoleting Terms" and "Merges, Splits, Movements"

All of the OBO Foundry ontologies are set up in a source control  
system, have an official "release" policy, and associated mailing- 
lists to request changes/corrections and announce new releases.

Generally, as is the case with evolving software API & format specs,  
new versions are backward compatible - e.g., annotations citing terms/ 
concepts/entities/nodes as they existed in a previous graph, can be  
resolved in the more recent versions.  However, since we are talking  
about "theories of reality" here - and as many have pointed out, our  
descriptions of reality evolve in often non-monotonic ways, the  
mapping across versions from a node in one version to the  
"equivalent" node(s) in other versions may be a far from trivial  
process.  Sometimes the mappings can be given using DL rules; others  
simply require a deterministic look-up table.
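
As a minimal sketch of the look-up-table case, resolving a term cited
against an old release into its equivalent(s) in a later release might
look like the Python below. Everything in it is an assumption made for
illustration: the "EX:" IDs, the version labels, the MIGRATIONS table
and the resolve() function are hypothetical, and are not GO's actual
files, formats or tooling.

```python
# Hypothetical migration tables, one per pair of consecutive releases.
# Each maps a term ID in the older release to its replacement(s) in the
# newer one. The IDs and version labels are invented for this sketch.
MIGRATIONS = {
    ("1.0", "2.0"): {
        "EX:0001": ["EX:0010"],             # merge: folded into another term
        "EX:0002": ["EX:0020", "EX:0021"],  # split: one term became two
        "EX:0003": [],                      # obsoleted with no replacement
    },
    ("2.0", "3.0"): {
        "EX:0010": ["EX:0100"],             # movement: re-identified term
    },
}

# Releases in chronological order (assumed, for illustration).
VERSION_ORDER = ["1.0", "2.0", "3.0"]


def resolve(term_id, from_version, to_version):
    """Return the term ID(s) in to_version equivalent to term_id in from_version."""
    start = VERSION_ORDER.index(from_version)
    end = VERSION_ORDER.index(to_version)
    if start > end:
        raise ValueError("only forward migration is sketched here")
    current = [term_id]
    # Walk release by release, applying each table in turn.
    steps = zip(VERSION_ORDER[start:end], VERSION_ORDER[start + 1:end + 1])
    for old, new in steps:
        table = MIGRATIONS.get((old, new), {})
        migrated = []
        for term in current:
            # A term absent from the table is unchanged between releases.
            for replacement in table.get(term, [term]):
                if replacement not in migrated:
                    migrated.append(replacement)
        current = migrated
    return current
```

Note that resolve() cannot even start without from_version, which is
the practical reason an annotation has to record the ontology version
it was made against. An empty result means the term was obsoleted with
no successor, and that is the case that falls back to a curator.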

The curation process of deciding how to "migrate" nodes as changes/ 
corrections are required can be quite complex, as can be seen when  
you review what GO curators are required to do to keep knowledge maps/ 
association files current when they originally referenced nodes in  
prior versions of GO (see refs above).

I realize this may sound hideously complex, very labor intensive, and  
"fragile", but the process actually works.

Here, too, I think it's important to remember what the original  
requirements are for a given knowledge resource.  In the case of GO,  
the core curation process has focussed on mapping occurrences of  
specific biomolecular and subcellular entities as they occur in the  
literature.  A significant portion of the GO curation process still  
revolves around explicitly tracking entity occurrences in the  
literature.

Of course, a whole slew of powerful tools and valuable research has  
grown around GO - especially as its formal specificity has improved  
over the last 5 years or so, many of which are designed to use GO to  
organize/pool/analyze primary research data, as opposed to focusing  
on its "representation" in the literature.  I think this is where  
ontology practice is most likely to provide the greatest benefit in  
the coming decade - as applied to primary data repositories.  It is  
here, too, where Semantic Web technologies are most likely to be  
relevant and provide a powerful, flexible formalism for representing  
semantic info associated with scientific observations - with explicit  
links to various knowledge resources across the formal semantic  
spectrum (from flat term lists through to thorough, computable and  
relatively complete theories of reality).

The following two threads of activity in biomedical KR are important  
to understand as related, yet distinct:

	1) KR applied to existing descriptions of research data:
		From repositories of primary data such as GENBANK and GEO on  
through the highly reduced representations found in the STM  
literature.  Analysis of the semantic and lexical content of the  
latter has been going on since the 1940s & 1950s (at least in the  
info/library science fields) and more recently (since the 1960s) in  
the converging C.S./Linguistics fields (e.g., Comp. Linguistics and  
Info Retrieval). Only in the last 15 years have ontologies played any  
significant role in these pursuits.  TextPresso - the text mining  
framework recommended by the model organism database consortium  
(http://www.gmod.org/home & http://www.textpresso.org/) is a good  
example of this approach coming from the bioinformatics community,  
but there are other examples using much more powerful Comp.  
Linguistic techniques.

	2) Use in creating NEW descriptions of primary data:
		Here, ontologies along with SW tech and other KR tools (such as the  
Topic Maps Reference Model (TMRM) Jack Park and his colleagues at SRI  
are working on) and C.S. techniques for  federating inter-related  
data repositories can be combined to transform our ability to compute  
across large swaths of data.  In this case, the first digital  
representation of research data derives from a formally sound,  
computable framework.  It is this latter approach, combined with the  
armamentarium of informatics tools accumulated over the last 30 years  
from various fields, that will bring the bulk of biomedical  
researchers forward from the still 19th-century approach of forcing all  
contributions to the evolving biomed knowledge base to pass through a  
human brain for knowledge extraction to one where human cognitive  
capacity is truly being augmented via automation (in the sense  
espoused by Doug Engelbart and Vannevar Bush) and all new scientific  
descriptions can be automatically analyzed in the context of all  
relevant prior knowledge.  I consider this transformation very much  
like the one that has taken place over the last 30 years to augment our  
tools for observation (automated, high throughput sequencing;  
molecular imaging and all forms of microscopy; microarrays; etc.)

I think some of the disagreement/confusion on the topic of the  
accuracy and effectiveness of biomedical ontologies derives from  
collapsing these two approaches to KR, which though highly inter- 
related, bring with them distinct approaches, limits, caveats, and  
capabilities.

Just my $0.02.

Cheers,
Bill

On Jun 13, 2006, at 10:00 PM, kc28 wrote:

>
> This brings up an interesting issue -- how ontological evolution  
> would impact mapping or integration of overlapping ontologies. I  
> believe it's quite a research challenge. We might need to  
> incorporate the notion of versioning into the ontological  
> structure. For example, what versions of the protein classes/ 
> instances can be mapped between two ontologies. Just my two-cent  
> thought.
>
> Cheers,
>
> -Kei
>
> John Rumble wrote:
>
>> An unwritten rule about higher level ontologies is that they  
>> reflect our knowledge today, not tomorrow. As knowledge evolves,  
>> the upper level ontologies, especially, must also evolve. The  
>> example of the concept "protein" is very apropos here. We can view  
>> it from  functional, structural, integrative angles, and I am sure  
>> there are a bunch more. Then think about how our "concept" of a  
>> protein in each of those views has evolved over the last 10 years,  
>> 20 years, 75 years. The problem is evident.
>>  At whatever level an ontology is developed, someone smarter or  
>> with more insight or standing on the shoulders of giants will use  
>> that ontology as a building block for a new and better higher  
>> level view of nature. We have not reached the end of science yet.
>>  In my days of leading similar standards developments, some of the  
>> best progress we made was when we banned discussions of (1) higher- 
>> level ontologies (though we called them something else back in  
>> those old days) and (2) acronyms.
>>  For those of you who have requested more references on my  
>> previous e-mail about experiment description, it will have to wait  
>> a few more days. Unfortunately bioinformatics has not solved my  
>> kidney stone issues, which severely limit my ability to pull the  
>> requested information together.
>>  John
>>  Dr. John Rumble
>> Technical Director
>> Information International Associates
>> Oak Ridge TN
>> www.infointl.com <http://www.infointl.com>
>> jrumble@iiaweb.com <mailto:jrumble@iiaweb.com>
>> jumbleusa@earthlink.net <mailto:jumbleusa@earthlink.net>
>> 301 963 7903 (Home Office)
>> 301 502 5729 (Cell)
>> 865 298 1251 (Oak Ridge Office)
>
>
>

Bill Bug
Senior Analyst/Ontological Engineer

Laboratory for Bioimaging  & Anatomical Informatics
www.neuroterrain.org
Department of Neurobiology & Anatomy
Drexel University College of Medicine
2900 Queen Lane
Philadelphia, PA    19129
215 991 8430 (ph)
610 457 0443 (mobile)
215 843 9367 (fax)


Please Note: I now have a new email - William.Bug@DrexelMed.edu







Received on Wednesday, 14 June 2006 15:03:07 UTC