Re: Size estimates of current LS space (and Introductions) from Jeremy Zucker on 2006-08-02 (public-semweb-lifesci@w3.org from August 2006)

From: Jeremy Zucker <zucker@research.dfci.harvard.edu>
Date: Wed, 2 Aug 2006 05:28:06 -0400
To: Jeremy Zucker <zucker@research.dfci.harvard.edu>
Cc: "Skinner, Karen ((NIH/NIDA)) [E]" <kskinner@nida.nih.gov>, "Eric Neumann" <eneumann@teranode.com>, "public-semweb-lifesci hcls" <public-semweb-lifesci@w3.org>
Message-Id: <44CB6BEE-46F8-4D43-A668-17E7B730D972@research.dfci.harvard.edu>

Hello folks,

It appears that I forgot to include the URL in my last email about
Pathguide: http://www.pathguide.org

Well, since I've already managed to embarrass myself publicly, I
figure I might as well introduce myself formally.

My name is Jeremy Zucker, and I am a bioinformatics specialist at the  
Dana-Farber Cancer Institute and a research fellow at Harvard Medical  
School in George Church's lab.
I have been working mainly on data integration issues [1] that arise
from automating the metabolic reconstruction of pathway/genome
databases [2] for the purpose of generating flux balance models [3].
I also work with Joanne Luciano and others on a pathway exchange  
format in OWL/RDF called BioPAX.

The semantic web interests me for several reasons.  For one, I  
believe it will be a solid substrate for distributed curation, which  
is a necessary part of the ongoing effort to improve the quality of  
the biological data we use.
As with Wikipedia, we need a way to exploit the wisdom of crowds to
discover, cross-validate, and annotate the biological data that we
are currently using.

Second, the semantic web should make it easier to do distributed
"pathway data mashups", such as overlaying expression data onto
metabolic, signal transduction, and gene regulation pathways, to
understand how the cell controls the production of itself, how
certain disease states form, how to alter metabolic pathways to
remove toxins from the environment, and how to optimize the
metabolic fluxes to produce useful biomolecules.

Third, with semantic web technologies such as description logics and  
rules, it should be possible to infer when two data sets are really  
talking about the same biological object, even if they use different
identifiers for it.
To that end, I have been working with Alan Ruttenberg and others at
York University, UCSD, and SRI to develop an OWL/description-logic-based
method to automate the integration of two E. coli databases.

The first database has an extremely well-developed ontology [2]. The
other has a highly curated data set specifically tuned for flux
balance analysis [3]. By merging them, it should be possible to
automatically generate metabolic flux models for any sequenced
organism [4].

There, now that I have introduced myself and my interests, let's try
to estimate the number of javabeans in the life sciences jar!

Sincerely,

Jeremy

[1] http://www.freebiology.org/wiki/Debugging_the_bug
[2] http://biocyc.org
[3] http://gcrg.ucsd.edu
[4] http://prelude.bu.edu/publications/Segre_etal_OMICS_2003.pdf

On Aug 2, 2006, at 1:17 AM, Jeremy Zucker wrote:

>
> Hi folks,
>
>   One resource that is likely to be of use in the pathway space is  
> the pathguide:
> It has detailed statistics about the size of each database and  
> other  metadata for about 222 biological pathway databases.
> This is the target space for conversion to BioPAX.
>
> Sincerely,
>
> Jeremy
>
>
>
> On Jul 31, 2006, at 6:35 PM, Skinner, Karen ((NIH/NIDA)) [E] wrote:
>
>>
>> These may be helpful resources:
>>
>> The Nucleic Acids Research Public Links Directory
>> See:
>> http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=16845014&query_hl=6&itool=pubmed_docsum
>>
>>
>> And the Nucleic Acids 2006 Molecular Biology Database Collection
>> http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?itool=abstractplus&db=pubmed&cmd=Retrieve&dopt=abstractplus&list_uids=16381871
>>
>> Karen Skinner, Ph.D.
>> Deputy Director for Science and Technology Development
>> Division of Basic Neuroscience and Behavior Research
>> National Institute on Drug Abuse
>> Room 4243
>> 6001 Executive Boulevard
>> Bethesda, Maryland 20892-9651
>> 301-435-0886 or 301-443-1887
>> ks79x@nih.gov
>>
>>
>> -----Original Message-----
>> From: Eric Neumann [mailto:eneumann@teranode.com]
>> Sent: Monday, July 31, 2006 10:07 AM
>> To: public-semweb-lifesci hcls
>> Subject: Size estimates of current LS space
>>
>>
>>
>> As per today's Telcon, does any person with genomics knowledge (that
>> includes you too Carole) have estimates for the following numbers:
>>
>> 1. How many bio-molecular and organism-anatomical-functional entities
>> and records (broad sense) are currently accessible through the web
>> (excluding LIMS entities, such as samples, for now)?
>>
>> 2. Does this number grow substantially when it is allowed to include
>> every variant of protein, gene, etc. per species (i.e., not instances of
>> real molecules or organisms)?
>>
>>
>> I think these would be quite useful for other W3C members to be aware
>> of, since some proposed mechanisms would require their global
>> indexing...
>>
>> Eric
>>
>>
>>
>
>

Received on Wednesday, 2 August 2006 09:28:28 UTC