Re: ontology specs for self-publishing experiment from William Bug on 2006-07-10 (public-semweb-lifesci@w3.org from July 2006)

From: William Bug <William.Bug@DrexelMed.edu>
Date: Mon, 10 Jul 2006 12:29:12 -0400
To: Phillip Lord <phillip.lord@newcastle.ac.uk>
Cc: "w3c semweb hcls" <public-semweb-lifesci@w3.org>
Message-Id: <2FE16280-5DC9-42AF-86CA-7FECE7DA1DE4@DrexelMed.edu>
Dear Phillip,

Thanks again for your thoughtful and candid comments.

I'm glad you mentioned "Sonic Hedgehog."  :-)

I would have to disagree on the point you are making here.

From the point of view of mining the literature, use of language is
remarkably "messy" given the business of science.

As wonderfully meaningful as Sevenless is to those who know something
about Dipteran visual system anatomy and ommatidia, it really is just
one facet (pun intended) - and a rather high-level, complex aspect -
of how the mutation is manifest in the biology.  I prefer the
yeasties' approach to gene/mutant names - "meaningless" strings
and numbers - the same way I anally stick to exclusive use of
non-colliding integers when designing RDBMS primary key attributes.
The main reason I do this is that in using a name, you convolve a
certain implied meaning into the identifier that is:
	a) biased and limited in its expressiveness;
	b) likely to change rather quickly - especially in the domain of  
scientific information.
Neither of these properties is appropriate for an artifact you
intend to use as a unique ID.
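
A minimal sketch of the point above, in Python, assuming a hypothetical
in-memory registry (the `Registry` and `Entity` names are illustrative
and not taken from any existing system): the key is a meaningless,
non-colliding integer, and names are only labels attached to it, which
can be added or revised freely.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Entity:
    """A record whose key is opaque; names are attached, mutable labels."""
    id: int
    labels: list = field(default_factory=list)


class Registry:
    """Hypothetical registry: integer keys, with names kept in a separate index."""

    def __init__(self):
        self._ids = count(1)   # non-colliding integers, handed out once each
        self._by_id = {}       # id -> Entity
        self._by_label = {}    # lower-cased label -> id

    def register(self, label):
        entity = Entity(next(self._ids), [label])
        self._by_id[entity.id] = entity
        self._by_label[label.lower()] = entity.id
        return entity.id

    def add_label(self, entity_id, label):
        # A new or revised name never touches the key.
        self._by_id[entity_id].labels.append(label)
        self._by_label[label.lower()] = entity_id

    def lookup(self, label):
        return self._by_id[self._by_label[label.lower()]]


registry = Registry()
gene = registry.register("sevenless")
registry.add_label(gene, "sev")  # illustrative synonym
assert registry.lookup("sev").id == registry.lookup("Sevenless").id == gene
```

Adding or changing a name updates only the label index; anything that
stored the integer key is unaffected.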

I must say I used to enjoy the playfulness in the mutation/allele/ 
gene naming process, but that was all before I was responsible for  
creating accurate systems to manage and search the information - both  
literature databases as well as RDBMS repositories of primary  
information.

One of the first folks to "prick up my ears" on this topic was my  
neighbor, the developmental biologist Scott Gilbert.  As many know,
Scott authored a wonderful graduate level text on developmental  
biology many years back which he obviously must periodically update  
to keep it relevant.  Each time he must do so, one of THE most  
daunting tasks is reviewing, disambiguating, and bringing order to  
the vast variety of gene/mutant/allele names and naming conventions.   
As you might guess, in trying to cover the broad topic of
developmental biology, he must deal with 1000s of these names and,
given the myriad biological observations made for each, figure out
how to fit all the facts associated with the names into
a coherent story.  He must also look to the work he'd done on this
task of handling the chaotic universe of names in the previous  
edition of the book, adjusting his usage of terms to bring it up to  
date with the results of his current review of the name space.  This  
is a task I'd not wish on my worst enemy.

Sorry to belabor this topic.  I expect most of us have had to deal  
with some aspect of this problem through the years and are quite  
familiar with it.  As we all know, the situation has improved  
substantially given the knowledge resources that have been created in  
the last decade to bring order to this world of names.   
Unfortunately, usage of these resources is still rather limited when  
you consider the broad scope of all literature - and text annotations  
of data.  Even IF they were more broadly adopted, there are still  
many ways in which the context of their use by us fallible and  
imprecise humans - no matter what our level of expertise - would  
still lead to many uncomputable ambiguities and inconsistencies.   
This is a topic on which Bob Futrelle has provided very useful  
references and thoughts here on this list.

More recently, as I've had to adopt ways of formally expressing  
phenotype in a manner that avoids convolving/pre-coordinating  
entities from distinct ontological domains, I've come to realize  
again how problematic it can be to encode so much information in a  
name that is ALSO meant to provide an identifier for a specific class  
of biological entities, be they continuants/endurants or occurrents.
Naming of phenotypic mutants in mouse is a good example.  Look at the  
MPO terms used to describe the mutant phenotype observations made on  
the following genotype:

http://www.informatics.jax.org/searches/accession_report.cgi?id=MGI:3038539

These are excellent terms to encapsulate the complexity of the  
observation made by the researchers who've been phenotyping mouse  
mutants for nearly a century.  Tasks which focus on providing a means  
to capture this work and represent it for human perusal profit  
greatly from term lists like this.  They don't map well into an
ontology of the relevant knowledge domains, however.
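
One way to picture the alternative to pre-coordinated terms, as a rough
sketch only: each observation is composed from terms drawn from
separate domains (here an anatomical entity and a quality), each with
its own opaque ID.  All identifiers and labels below are made-up
placeholders, not real MPO or MGI accessions, and the `Term` and
`Phenotype` classes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    ontology: str  # domain the term belongs to, e.g. "anatomy" or "quality"
    id: str        # opaque identifier (placeholder values below)
    label: str     # human-readable name, free to change


@dataclass(frozen=True)
class Phenotype:
    entity: Term   # what is affected
    quality: Term  # how it is affected

    def describe(self):
        return f"{self.entity.label}: {self.quality.label}"


def affecting(phenotypes, entity_id):
    """Return every phenotype whose affected entity has the given ID."""
    return [p for p in phenotypes if p.entity.id == entity_id]


# Placeholder terms; not real accessions.
retina = Term("anatomy", "ANAT:1", "retina")
cornea = Term("anatomy", "ANAT:2", "cornea")
thin = Term("quality", "QUAL:1", "decreased thickness")
opaque = Term("quality", "QUAL:2", "opaque")

observations = [
    Phenotype(retina, thin),
    Phenotype(cornea, opaque),
    Phenotype(cornea, thin),
]

for p in affecting(observations, "ANAT:2"):
    print(p.describe())
```

Because the entity and the quality stay separate, a query can be made
along either axis without parsing a composite name.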

This points to an obvious way in which characterizing the scientific  
lexicon has been extremely valuable.  I know from my days working at
Biological Abstracts, where I was responsible for creating products
someone was willing to pay money for, that by comprehensively
monitoring the terms used in the literature to describe/label
biological phenomena, you stand a much better chance of creating an
accurate and comprehensive search system (mutually optimal precision
& recall in IR vernacular) for lexical descriptions of biological
phenomena - whether they appear in the literature itself or in free
text descriptions applied to records in data repositories.

The knowledge you accrue in this manner can be exceedingly helpful to  
inform the way you construct your ontology, but the lexicon must not  
be construed as representing the "universals" about biological  
entities.  For instance, the complex collection of entities one might  
classify as being related to "glial biology" today would clearly be  
very different from those one would've classified under this category  
30 years ago, yet the term "glial biology" was in use 30 years ago  
and remains in use today.

Just my $0.02 on this topic.

Cheers,
Bill


On Jul 10, 2006, at 6:42 AM, Phillip Lord wrote:

>
>
>
>>>>>> "AR" == Alan Rector <rector@cs.man.ac.uk> writes:
>
>   AR> All
>
>   AR> Just catching up.
>
>   AR> Could I strongly support the following.  If there is one
>   AR> repeatedly confirmed lesson from the medical community's
>   AR> experience with large terminologies/ontologies, it is to
>   AR> separate the "terms" from the "entities".  There are always
>   AR> linguistic artefacts, and language changes more fluidly in both
>   AR> time and space than the underlying entities.  (In medical
>   AR> informatics this is sometimes quaintly phrased as using
>   AR> "nonsemantic identifiers").
>
>
>
> Not that I wish to disagree with Alan, of course, but it is worth
> mentioning the reason that so many identifiers are semantically
> meaningful in biology; they look better in papers. Moreover, because
> they have some meaning associated with them, they are likely to be
> used correctly in papers as biologists will notice when they have the
> wrong one.
>
> My own feeling is that the fly people got it right years ago. Their
> gene identifiers had meaning, but not too much. So, for example,
> sevenless is a mutant lacking the 7th cell in the eye. Clear,
> straightforward and memorable. And if the world changes under you, the name
> could be left the same because it doesn't really matter that much.
>
> Also, some of the names were quite amusing, although the "sonic
> hedgehog" gag ran out years ago.
>
> Cheers
>
> Phil
>
>

Bill Bug
Senior Analyst/Ontological Engineer

Laboratory for Bioimaging  & Anatomical Informatics
www.neuroterrain.org
Department of Neurobiology & Anatomy
Drexel University College of Medicine
2900 Queen Lane
Philadelphia, PA    19129
215 991 8430 (ph)
610 457 0443 (mobile)
215 843 9367 (fax)


Please Note: I now have a new email - William.Bug@DrexelMed.edu







Received on Monday, 10 July 2006 16:29:30 UTC