Re: [BioRDF] Taxonomic Databases Working Group and LSIDs from William Bug on 2006-08-29 (public-semweb-lifesci@w3.org from August 2006)

From: William Bug <William.Bug@DrexelMed.edu>
Date: Tue, 29 Aug 2006 07:33:57 -0500
To: Eric Neumann <eneumann@teranode.com>
Cc: "public-semweb-lifesci hcls" <public-semweb-lifesci@w3.org>
Message-Id: <26FF08DA-2840-41B6-935B-5B3CFB5FAC10@DrexelMed.edu>
Thanks for putting this out there for consideration, Eric.  I  
certainly agree the amount of effort they have invested on the issue  
of using LSIDs as GUIDs for organism taxonomic information makes them  
a very worthy example, and, as their work continues to progress, a  
possible existence proof of the value LSIDs have to offer.

Being able to deal with species in a more systematic and semantically  
granular manner is very important - and will be critical to using  
formal semantically-driven information federation techniques to  
better support translational research - e.g., enabling the creation  
of software capable of placing findings from animal models of disease  
in their proper, fine-grained, semantic context to make them useful  
to clinical treatment of human disease.  It's also critical to  
phylogenetic analyses.  Both of these issues can be handled now with  
sufficient manual effort in a relatively narrow domain, but this is  
not scalable and not the recommended plan for the future.

In general, it is helpful to be as specific as possible when  
specifying the organism taxon, since that brings with it some  
constrained definition of genotype.  So, for instance, for the  
available digital mouse brain atlases, I believe the most specific  
one can be regarding taxon would be "Mus musculus" (ID: 10090 -
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=10090&lvl=3&lin=f&keep=1&srchmode=1&unlock),
though it's possible the more specific subspecies Mus musculus
domesticus would fit
(http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=10092&lvl=3&lin=f&keep=1&srchmode=1&unlock),
as many classical inbred strains are derived from this subspecies.
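
For anyone who wants to generate such links from the taxon ID alone, here is a minimal Python sketch. It reuses the Taxonomy Browser address from the links above and sets only the "mode" and "id" parameters. That the remaining parameters (lvl, lin, keep, srchmode, unlock) can be dropped is my assumption, and the function name taxon_url is just illustrative.

```python
from urllib.parse import urlencode

# Base address copied from the Taxonomy Browser links quoted above.
TAXONOMY_BROWSER = "http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi"


def taxon_url(taxon_id: int) -> str:
    """Build a Taxonomy Browser link for one NCBI taxon ID.

    Assumption: only 'mode' and 'id' are set; the other query
    parameters seen in the original links are left out.
    """
    query = urlencode({"mode": "Info", "id": taxon_id})
    return f"{TAXONOMY_BROWSER}?{query}"


# The two taxon IDs mentioned in this message.
TAXA = {
    "Mus musculus": 10090,
    "Mus musculus domesticus": 10092,
}

for name, taxon_id in TAXA.items():
    print(f"{name}: {taxon_url(taxon_id)}")
```

Keeping the numeric ID as the key, and deriving the link from it, means the annotation on an atlas can be tightened from species to subspecies by changing one number.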

As many on this list are aware, NCBI Taxonomy is in ubiquitous use in  
the biomolecular informatics community and is included in UMLS.  
Having said that, NCBI Taxonomy is NOT the last word - or even the best  
current effort - in formally specifying the extent of our current  
knowledge of organism taxonomy and phylogeny.  In fact, every page on  
the NCBI Taxonomy site such as the ones given above includes the  
following disclaimer at the bottom of the page:
	"Disclaimer: The NCBI taxonomy database is not an authoritative  
source for nomenclature or classification - please consult the  
relevant scientific literature for the most reliable information."

The Zoological Record had previously been, since the mid-1880s, in  
cooperation with The British Museum, THE authority on this topic.   
That situation began to change in the 1990s.  The email clips below  
are from an email I'd sent a few weeks ago to a colleague in response  
to a request for info on the status of defining an agreed upon,  
global, comprehensive formal specification of organism taxonomy.

Cheers,
Bill



EMAIL 1:

To my knowledge, there are three basic projects working on the issue  
of organism taxonomy with a view toward being globally and  
phylogenetically comprehensive:
	* Life science library & info scientists associated with university  
science libraries, scientific field stations (especially for  
agriculture & ecology), life science databases (such as ZR),  
botanical gardens, and natural history museums around the world
		==> Species 2000 (http://www.sp2000.org/)
	* Researchers whose work involves some aspect of studying global  
biodiversity
		==> Global Biodiversity Information Facility (http://www.gbif.org/)
	* Researchers who study the phylogenetics of organism comparative  
anatomy (macro, micro, and biochemical) and behavioral ecology.
		==> Tree of Life (http://tolweb.org/tree/)

They are all authorities in their own right - Species2000, GBIF, and  
ToL - but each from their own vantage point.

In some ways, GBIF and its constituent participants have been  
around the longest, though GBIF the institution possibly hasn't been  
around as long as the shared biodiversity information  
aggregation/integration effort started by several of the participant  
organizations.

GBIF
Homepage
	http://www.gbif.org/
Wiki
	http://wiki.gbif.org/gbif/wikka.php?wakka=HomePage
Portal
	http://www.asia.gbif.net/portal/index.jsp

Darwin Core data element definitions
	http://darwincore.calacademy.org

When you go to the Portal and browse the taxonomy, you see  
attribution to sources for taxonomic names.  This appears to be in  
keeping with the following stated goal:
	"Taxonomic names. GBIF developing an 'Electronic Catalogue of  
Taxonomic Names'. This will provide access to authoritative  
information about both scientific and common names for all organisms,  
and will integrate data from a wide range of different organisations.  
The portal already includes data for over 983,000 scientific names  
and 253,000 common names from the Catalogue of Life Partnership  
Annual Checklist. Some names are listed with the words 'Tentative  
position in taxonomy'. This indicates that the name is only known to  
the portal from specimen/observation records and should not be  
treated as authoritative simply on the basis of being listed here."

Right now they have 176 data providers for taxonomic information  
(http://www.gbif.org/DataProviders/providerslist?sortby=records),  
many of which are linked to the Species2000 Project.

I also know GBIF has been looking to use semantic web technologies in  
a big way and the LSID as a global identification system (resolvable  
URIs for the resources in RDF triples).
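
To make the identifier side concrete: an LSID is a URN of the form urn:lsid:authority:namespace:object, with an optional trailing revision. The Python sketch below splits such a string into its parts. The example identifier (authority "example.org", namespace "taxon") is made up for illustration and is not one issued by GBIF or TDWG, and the sketch does not attempt resolution, only syntax checking.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """The components of a Life Science Identifier."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into parts.

    Raises ValueError if the string does not have that shape.
    """
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    # The 'urn' and 'lsid' prefixes are compared case-insensitively.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"LSID has an empty component: {text!r}")
    # An absent or empty sixth field means no revision was given.
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(authority, namespace, object_id, revision)


# Hypothetical identifier, used only to exercise the parser.
example = parse_lsid("urn:lsid:example.org:taxon:10090")
print(example.authority, example.namespace, example.object_id)
```

The authority component is what a resolver would use to locate the service that returns the data and RDF metadata for the object, which is why it is kept as a separate field here.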

The Tree of Life has always appealed to me as a bottom-up effort of  
current investigators whose research aims include a phylogenetics  
component.  It was the "brain child" of the Maddison brothers  
(http://tolweb.org/tree/home.pages/homepeople.html) back in the mid-90s.  
I've known of ToL since its relatively humble beginnings about a  
decade ago as a collection of phylogeny web pages organized in a  
phylogenetic tree graph.  Back then, there were mostly empty  
nodes in the graph.  Now they have an absolutely immense collection  
of domain expert contributors and an ever-decreasing collection of  
blank nodes (http://tolweb.org/tree/home.pages/participants.html).  
Given the participants involved and their stated objectives  
(http://tolweb.org/tree/home.pages/goals.html), their efforts on this  
task really need to be somehow incorporated into any comprehensive,  
semantically formal expression of organism taxonomy.



EMAIL 2:

I do think this is a critical issue for the medium- to long-term.  My  
sense has been that about 10 years ago NCBI bit off the tractable part  
of this problem, immediately addressing the needs of molecular  
biologists in a manner that has proven exceedingly useful, along the  
lines of the way GO has become a ubiquitous tool for many informatics  
tasks stretching well beyond its original design goals - though in the  
area of microbes, and particularly viruses, there are significant  
problems with NCBI Taxonomy.  Whenever such a thing happens - a tool  
gets pressed into service for tasks not part of its original cornucopia  
of Use Cases - there is a need to step back.  Either you need to  
start recommending the community not use the resource for that "new"  
purpose - as is often the case for UMLS utilization - or considerable  
re-tooling needs to be done.

The biodiversity group includes folks like ZR and the various nat.  
history/bot. gardens organizations throughout the world, etc.  who've  
been working on this issue of organism taxonomy for a very long time  
- some for over a century.  Few have resources you'd want to use "as  
is" if the goal were to construct a well-founded ontology.  I'm  
particularly concerned with the high-level structure of the  
"ontology" the TDWG is proposing (the Darwin Core -  
http://darwincore.calacademy.org/).  However, it is really ill-advised  
to go it alone and ignore this body of work.
	
NCBI Taxonomy - like GO - is in such ubiquitous use in the realm of  
molecular & cellular biology, one can't throw it out either.  Really  
what should be done is those at NCBI who curate NCBI Taxonomy, the GBIF  
folks, AND the Tree of Life folks need to be brought together to work  
on this problem.  Otherwise, splintering of the efforts will cause  
problems for us all in the future.




On Aug 28, 2006, at 1:02 PM, Eric Neumann wrote:

> I would like to point out the Taxonomic Databases Working Group  
> (TDWG) and their work with trying to establish a system of Global  
> Unique Identifiers (GUIDs).
>
> http://wiki.gbif.org/guidwiki/wikka.php?wakka=GUID2Report
>
> At this point in time they are recommending (within their  
> community) the use of LSIDs WITH metadata in the form of RDF.
>
> I would like to propose that we include this on the list of  
> examples for the LSID/URI discussion in BioRDF (just added to  
> http://esw.w3.org/topic/HCLSIG_BioRDF_Subgroup/Tasks/ 
> URI_Best_Practices/LSID_Pros_%26_Cons). I think they have some  
> great global examples of how to use such identifiers.
>
> Eric
>
>
> Eric Neumann, PhD
> co-chair, W3C Healthcare and Life Sciences,
> and Senior Director Product Strategy
> Teranode Corporation
> 83 South King Street, Suite 800
> Seattle, WA 98104
> +1 (781)856-9132
> www.teranode.com
>

Bill Bug
Senior Research Analyst/Ontological Engineer

Laboratory for Bioimaging  & Anatomical Informatics
www.neuroterrain.org
Department of Neurobiology & Anatomy
Drexel University College of Medicine
2900 Queen Lane
Philadelphia, PA    19129
215 991 8430 (ph)
610 457 0443 (mobile)
215 843 9367 (fax)


Please Note: I now have a new email - William.Bug@DrexelMed.edu







Received on Tuesday, 29 August 2006 12:34:32 UTC