- From: William Bug <William.Bug@DrexelMed.edu>
- Date: Mon, 10 Jul 2006 12:29:12 -0400
- To: Phillip Lord <phillip.lord@newcastle.ac.uk>
- Cc: "w3c semweb hcls" <public-semweb-lifesci@w3.org>
- Message-Id: <2FE16280-5DC9-42AF-86CA-7FECE7DA1DE4@DrexelMed.edu>
Dear Phillip,

Thanks again for your thoughtful and candid comments.

I'm glad you mentioned "Sonic Hedgehog." :-)

I would have to disagree on the point you are making here. From the point of view of mining the literature, use of language is remarkably "messy" given the business of science. As wonderfully meaningful as Sevenless is to those who know something about Dipteran visual system anatomy and ommatidia, it really is just one facet (pun intended) - and a rather high-level, complex aspect - of how the mutation is manifest in the biology.

To my mind, I prefer the yeasties approach to gene/mutant names - "meaningless" strings and numbers - the same way I anally stick to exclusive use of non-colliding integers when designing RDBMS primary key attributes. The main reason I do this is that, in using a name, you convolve a certain implied meaning into the identifier that is:
a) biased and limited in its expressiveness;
b) likely to change rather quickly - especially in the domain of scientific information.

Neither of these properties is appropriate for an artifact you intend to use as a unique ID.

I must say I used to enjoy the playfulness in the mutation/allele/gene naming process, but that was all before I was responsible for creating accurate systems to manage and search the information - both literature databases as well as RDBMS repositories of primary information.

One of the first folks to "prick up my ears" on this topic was my neighbor, the developmental biologist Scott Gilbert. As many know, Scott authored a wonderful graduate level text on developmental biology many years back which he obviously must periodically update to keep it relevant. Each time he must do so, one of THE most daunting tasks is reviewing, disambiguating, and bringing order to the vast variety of gene/mutant/allele names and naming conventions.
As you might guess, in trying to cover the broad topic of developmental biology, he must deal with 1000s of these names and, focusing on the myriad biological observations made for each, figure out how to fit all the facts associated with the names into a coherent story. He must also look to the work he'd done on this task of handling the chaotic universe of names in the previous edition of the book, adjusting his usage of terms to bring it up to date with the results of his current review of the name space. This is a task I'd not wish on my worst enemy.

Sorry to belabor this topic. I expect most of us have had to deal with some aspect of this problem through the years and are quite familiar with it.

As we all know, the situation has improved substantially given the knowledge resources that have been created in the last decade to bring order to this world of names. Unfortunately, usage of these resources is still rather limited when you consider the broad scope of all literature - and text annotations of data. Even IF they were more broadly adopted, there are many ways in which the context of their use by us fallible and imprecise humans - no matter what our level of expertise - would still lead to many uncomputable ambiguities and inconsistencies. This is a topic on which Bob Futrelle has provided very useful references and thoughts here on this list.

More recently, as I've had to adopt ways of formally expressing phenotype in a manner that avoids convolving/pre-coordinating entities from distinct ontological domains, I've come to realize again how problematic it can be to encode so much information in a name that is ALSO meant to provide an identifier for a specific class of biological entities, be they continuants/endurants or occurrents.

Naming of phenotypic mutants in mouse is a good example.
Look at the MPO terms used to describe the mutant phenotype observations made on the following genotype:

http://www.informatics.jax.org/searches/accession_report.cgi?id=MGI:3038539

These are excellent terms to encapsulate the complexity of the observations made by the researchers who've been phenotyping mouse mutants for nearly a century. Tasks which focus on providing a means to capture this work and represent it for human perusal profit greatly from term lists like this. They don't map well into an ontology of the relevant knowledge domains, however.

This points to an obvious way in which characterizing the scientific lexicon has been extremely valuable. I know from my days working at Biological Abstracts, where I was responsible for creating products someone was willing to pay money for, that by comprehensively monitoring the terms used in the literature to describe/label biological phenomena, you stand a much better chance of creating an accurate and comprehensive search system (mutually optimal precision & recall in IR vernacular) for lexical descriptions of biological phenomena - whether they appear in the literature itself or in free text descriptions applied to records in data repositories.

The knowledge you accrue in this manner can be exceedingly helpful to inform the way you construct your ontology, but the lexicon must not be construed as representing the "universals" about biological entities. For instance, the complex collection of entities one might classify as being related to "glial biology" today would clearly be very different from those one would've classified under this category 30 years ago, yet the term "glial biology" was in use 30 years ago and remains in use today.

Just my $0.02 on this topic.

Cheers,
Bill

On Jul 10, 2006, at 6:42 AM, Phillip Lord wrote:

>>>>>> "AR" == Alan Rector <rector@cs.man.ac.uk> writes:
>
> AR> All
>
> AR> Just catching up.
>
> AR> Could I strongly support the following.
> AR> If there is one
> AR> repeatedly confirmed lesson from the medical communities
> AR> experience with large terminologies/ontologies/ it is to
> AR> separate the "terms" from the "entities". There are always
> AR> linguistic artefacts, and language changes more fluidly in both
> AR> time and space than the underlying entities. (In medical
> AR> informatics this is sometimes quaintly phrased as using
> AR> "nonsemantic identifiers").
>
> Not that I wish to disagree with Alan, of course, but it is worth
> mentioning the reason that so many identifiers are semantically
> meaningful in biology; they look better in papers. More over, because
> they have some meaning associated with them, they are likely to be
> used correct in papers as biologists will notice when they have the
> wrong one.
>
> My own feeling is that the fly people got it right years ago. Their
> gene identifiers had meaning, but not too much. So, for example,
> sevenless is a mutant lacking the 7th cell in the eye. Clear, straight
> forward and memorable. And if the world changes under you, the name
> could be left the same because it doesn't really matter that much.
>
> Also, some of the names were quite amusing, although the "sonic
> hedgehog" gag ran out years ago.
>
> Cheers
>
> Phil

Bill Bug
Senior Analyst/Ontological Engineer

Laboratory for Bioimaging & Anatomical Informatics
www.neuroterrain.org
Department of Neurobiology & Anatomy
Drexel University College of Medicine
2900 Queen Lane
Philadelphia, PA 19129
215 991 8430 (ph)
610 457 0443 (mobile)
215 843 9367 (fax)

Please Note: I now have a new email - William.Bug@DrexelMed.edu

This email and any accompanying attachments are confidential. This information is intended solely for the use of the individual to whom it is addressed. Any review, disclosure, copying, distribution, or use of this email communication by others is strictly prohibited. If you are not the intended recipient, please notify us immediately by returning this message to the sender and delete all copies. Thank you for your cooperation.
Received on Monday, 10 July 2006 16:29:30 UTC