Re: protein identification problem from Melissa Cline on 2005-07-01 (public-semweb-lifesci@w3.org from July 2005)

From: Melissa Cline <cline@pasteur.fr>
Date: Fri, 01 Jul 2005 09:45:20 +0200
To: public-semweb-lifesci@w3.org
Message-Id: <1120203920.18594.48.camel@cumin.sysbio.pasteur.fr>

Hi all,

The ID resolution problem is indeed a great opportunity for semantic web
technologies since, as both Gary and Ken allude to, we often have
different semantic levels at play simultaneously.  For instance, when we
say that EGF interacts with EGFR, we really mean that a protein product
of the EGF gene interacts with one of the EGFR gene.  In practice, both
EGF and EGFR produce a variety of splice variants, and not all of their
products will interact.  So for the greatest accuracy, the interaction
should be described by protein-level identifiers (if not identifiers of
protein + state).  But when it's time for large-scale analysis of
interaction data,
we tend to use the gene name, as it's easier to interpret (even if less
precise).  We do exactly the same thing in expression analysis, because
it's easier to interpret a statement on the expression level of "EGF"
than "the target of 206254_at".  So there are different semantic levels
at which we describe the data to the end user and in the database.
When we merge databases, what we consider equivalent depends on the
semantic context of our operation.
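
To make the last point concrete, here is a minimal sketch in Python of how
the same two datasets merge differently depending on the semantic level
chosen for equivalence.  Everything in it is hypothetical: the isoform
identifiers (EGF_iso1 etc.) are made-up placeholders rather than real
accessions, and merge_interactions is an illustrative helper, not part of
any existing tool.

```python
from collections import defaultdict

# Hypothetical protein-level records mapped to their gene names.
# The isoform IDs are invented placeholders, not real accessions.
PROTEIN_TO_GENE = {
    "EGF_iso1": "EGF",
    "EGF_iso2": "EGF",
    "EGFR_iso1": "EGFR",
    "EGFR_iso2": "EGFR",
}

# Two toy interaction datasets, both described at the protein level.
DATASET_A = [("EGF_iso1", "EGFR_iso1")]
DATASET_B = [("EGF_iso2", "EGFR_iso1"), ("EGF_iso1", "EGFR_iso1")]


def key_for(protein_id, level):
    """Return the identifier that defines equivalence at the given level."""
    if level == "protein":
        return protein_id
    if level == "gene":
        return PROTEIN_TO_GENE[protein_id]
    raise ValueError(f"unknown semantic level: {level}")


def merge_interactions(datasets, level):
    """Merge interaction lists, unifying partners at the chosen level.

    Returns a dict mapping an unordered pair of keys to the set of
    original protein-level pairs that collapsed onto that pair.
    """
    merged = defaultdict(set)
    for dataset in datasets:
        for a, b in dataset:
            pair = frozenset((key_for(a, level), key_for(b, level)))
            merged[pair].add((a, b))
    return dict(merged)


protein_view = merge_interactions([DATASET_A, DATASET_B], "protein")
gene_view = merge_interactions([DATASET_A, DATASET_B], "gene")
print(len(protein_view), "distinct interactions at the protein level")
print(len(gene_view), "distinct interactions at the gene level")
```

At the protein level the two isoform-specific interactions stay separate;
at the gene level they collapse into a single EGF/EGFR interaction, while
the merged record still remembers which protein-level pairs it came from.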

Melissa


> Hi,
>  I agree with Ken that gene name to database ID resolution is a major 
> issue in bioinformatics.  I'd also like to add that database ID to 
> database ID resolution (alias resolution) is a major issue as well.  I 
> think these issues represent a fantastic opportunity to showcase 
> semantic web technologies and drive their use in bioinformatics and I 
> would argue that if these issues can't be resolved, then the semantic 
> web will not function well for biologists.  We just need someone out 
> there to implement them.
>  To be more specific about use cases, here is an extract of a document I 
> wrote in the context of a pathway database, but the use cases are much 
> more general.  (And in fact, as David States mentioned at the BOF, more 
> links than these simple ones can be contemplated, e.g. sequence 
> similarity links.)
> 
> -----
> 
> Links between database identifiers are required for four specific 
> use cases in a typical pathway database. Most of the database 
> identifiers are for molecules (proteins, small molecules); 
> identifiers for complexes, interactions, pathways, molecular states, 
> etc. are also required but are less important in the short to mid term 
> (e.g. next 1-2 years).
> 
> Use cases:
> 1. Unification during dataset merging: During a merge operation e.g. of 
> two protein-protein interaction datasets from independently created 
> databases, it is vital to recognize that two protein objects, one from 
> each data source, represent the same protein molecule, even if the 
> protein objects don’t share any database accession numbers. Unification 
> requires knowledge of record type e.g. you cannot reliably use a gene ID 
> to unify proteins (mostly because splice variants exist).
> 2. Link out to related references: When presenting information about a 
> protein to a user on a web page, it is useful to display links to 
> related information about the protein, such as further information about 
> the protein sequence and sequence feature annotations (e.g. in 
> SwissProt), Gene Ontology annotations, domain annotations (InterPro), etc.
> 3. Identifier translation: Some analysis methods require specific 
> translations from one set of identifiers to another.  For instance, the 
> ‘activity centers’ analysis requires translation from protein or gene 
> identifiers in a pathway database to Affymetrix probe set identifiers or 
> other gene expression array platform identifiers.
> 4. Searching for a favorite gene name: Preferred gene names used for 
> querying a pathway database should return all genes/proteins with that 
> name, if they exist in the database. Unlike database accession numbers, 
> gene names are not guaranteed to be unique and thus cannot reliably be 
> used for the other use cases.
> 
> Links are available from many sources, but not every source addresses 
> each use case (and none address all use cases).  All services that allow 
> all data to be downloaded can conceivably be used for all use cases with 
> the help of a separate software system that can store different link 
> types (e.g. unification links, link out links), although this also 
> requires recognition of record type (e.g. protein, small molecule, 
> reaction, etc.).
> 
> Mapping services
> 
> AliasServer
> http://cbi.labri.fr/outils/alias/
> Tool for identifier translation using CRC64 hash of the protein sequence 
> as a primary key. Provides unification, linkout and translation services 
> for a handful (~35) species for proteins only.  Supports use cases 2, 3.
> 
> Freely available for download. Regularly updated.
> 
> MD Anderson GeneLink
> http://bioinformatics.mdanderson.org/GeneLink.html
> ID translation and search service for human IDs (10 ID types).  Supports 
> use cases 2, 3.
> 
> EnsMart
> http://www.ensembl.org/Multi/martview
> ID translation services for Ensembl genomes. Supports use case 2.
> 
> MatchMiner
> http://discover.nci.nih.gov/matchminer/html/index.jsp
> ID translation service for mouse and human. Supports use cases 2, 3.
> 
> Ariadne Genomics ID Mapping Service
> http://www.ariadnegenomics.com/services/idmap.html
> Tool for identifier translation. Supports 7 species and maps between 12 
> different ID types mainly for proteins and genes.
> Commercial service, not available for download.
> Supports use case 3.
> 
> GeneLynx
> http://www.genelynx.org/
> Provides linkout services for human, mouse and rat.  Supports use case 2.
> 
> NetAffx
> http://www.affymetrix.com/products/software/specific/netaffx.affx
> Provides ID translation services for Affy probe set IDs.  Supports use 
> case 3.
> 
> http://openbns.sourceforge.net/ - Supports use cases 2, 3.
> 
> Databases
> 
> International Protein Index
> http://www.ebi.ac.uk/IPI/IPIhelp.html
> A cross reference database for proteins in higher eukaryotic organisms 
> (5 species). Provides protein and gene cross references. Supports use 
> case 1.
> 
> Entrez Gene
> Provides detailed information on genes from multiple organisms including 
> gene aliases and links to NCBI related resources. Supports use case 4 
> (and 2 to some degree).
> 
> UniProt (SwissProt, PIR, TrEMBL) provides some information on links to 
> related resources and protein names.
> 
> 
> 
> 
> Ken I Fukuda wrote:
> > Hi all,
> > 
> > In the ISMB Semantic web for Life Science BOF,
> > an issue was raised about the ambiguity of how people
> > refer to a protein in the literature.
> > 
> > For example, let's say, you find a description such as
> > "JNK activates JUN", but actually this "JNK" stands for
> > a bunch of proteins ("concrete entities") and JUN
> > also stands for a set of proteins.
> > 
> > This issue is known as the "generic entity" problem.
> > If you read the literature, you typically encounter these
> > "generic protein" names.
> > And there should be a mechanism that tells you how many
> > proteins you have for each generic name.
> > 
> > An ontology for generic/concrete protein names, called
> > "MoleculeRole Ontology" is available from
> > http://www.inoh.org/ontology-viewer/.
> > Actually, it is a DAG-structured controlled vocabulary (CV).
> > The current version covers about 4400 UniProt IDs, which means
> > that the CV defines generic/concrete protein relations for
> > more than 4400 concrete proteins.
> > 
> > The CV is available in OBO format (Gene Ontology native format).
> > http://www.inoh.org/download.html
> > 
> > PS.
> > There are some OBO->OWL converters, but some argued they didn't
> > fit their needs. It would be nice to know how people like to
> > convert an OBO ontology into an OWL file.
> > 
> > Best,
> > Ken
> > 
> > --------------------------------------------- 
> > Ken Ichiro Fukuda, Ph.D.
> > Computational Biology Research Center (CBRC)
> > National Institute of 
> > Advanced Industrial Science and Technology (AIST)
> > AIST Tokyo Waterfront Bio-IT Research Bldg. 10F
> > 2-42 Aomi, Koutou-ku, Tokyo 135-0064 JAPAN
> > Phone: +81-3-3599-8049  FAX: +81-3-3599-8081
> > fukuda-cbrc@aist.go.jp / fukuda_cbrc@yahoo.co.jp
> >      - http://www.cbrc.jp/~fukuda/index.html
> > - INOH Pathway Database Project -
> >      - Integrating Network Objects with Hierarchies
> >      - http://www.inoh.org
> > 
> > 
> > 
> > 
>
Received on Friday, 1 July 2005 13:23:24 UTC