Re: [BioRDF] Comments from Christoph Grabmuller on BioRDF microarray provenance from mdmiller on 2010-11-09 (public-semweb-lifesci@w3.org from November 2010)

From: mdmiller <mdmiller53@comcast.net>
Date: Tue, 9 Nov 2010 08:45:01 -0800
To: "Christoph Grabmuller" <grabmuel@ebi.ac.uk>
Cc: "HCLS" <public-semweb-lifesci@w3.org>, "Ravi Shankar" <rshankar@stanford.edu>
Message-ID: <133FDC1C3DE34599902F25D4152CF402@mmPC>
hi christoph,

"That looks like a very useful tool. Out of curiosity: how are the
ontologies/vocabularies loaded?"

i did not actually work on that piece so i don't know the details.  i do 
know that it is still a work in progress.  i've cc'd ravi, who is the one 
who wrote that code (the UI for the application is written in adobe's flex 
air).

"Yes, the HGNC Gene Symbols are stable, but what about other species?"

good point, that's where i've seen entrez accessions usually used as the 
most definitive source.

"So entrez accessions are the 'standard' input format for genes?"

i've seen this as more of a de facto best practice for identifying a 
sequence than anything else (but i'm not sure what you mean by a standard 
input format).  it seems to work well for identifying a sequence/gene for 
the purpose of communicating things such as gene lists, and the database is 
well curated.

as we have all noticed, 'what is a gene' is not something that has a 
definitive answer.  i'm currently working on a personal project where the 
pragmatic answer for 'what is a gene' is simply the 'cross referencing set 
of sequence identifiers'.  by this i mean that when i download the 
information from HGNC, i get cross references to a number of other sequence 
databases such as entrez.  when i parse an ADF file i get additional cross 
references from the local identifiers to entrez, genbank, and so on.  the 
danger of this, of course, is a bad cross reference that associates, say, 
two different gene symbols.  it is also important to pick out the cross 
references that point to a broad category (GO terms for instance), since 
those shouldn't be used.  part of the project is to explore this idea and 
its ramifications.
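
A minimal sketch of the 'cross referencing set' idea, assuming identifiers are plain strings with a database prefix. The class name `CrossRefSets`, the prefixes (`hgnc_symbol:`, `entrez:`, `probe:`, `go:`) and the sample identifiers are all hypothetical and only serve the illustration; this is not the project's actual code. It groups identifiers with a union-find structure, skips cross references to broad categories, and reports any set that ends up holding more than one gene symbol, which is the bad cross reference case described above.

```python
from collections import defaultdict


class CrossRefSets:
    """Group identifiers into sets connected by cross references (union-find)."""

    def __init__(self, ignored_prefixes=("go:",)):
        # Prefixes of broad-category identifiers that must not merge sets.
        self._ignored = tuple(ignored_prefixes)
        self._parent = {}

    def _find(self, ident):
        self._parent.setdefault(ident, ident)
        root = ident
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the path at the root.
        while self._parent[ident] != root:
            self._parent[ident], ident = root, self._parent[ident]
        return root

    def add_xref(self, a, b):
        """Record that identifier a cross-references identifier b."""
        if a.startswith(self._ignored) or b.startswith(self._ignored):
            return  # broad category (e.g. a GO term): do not link
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def gene_sets(self):
        """Return the current identifier sets, one per pragmatic 'gene'."""
        groups = defaultdict(set)
        for ident in list(self._parent):
            groups[self._find(ident)].add(ident)
        return list(groups.values())

    def conflicts(self, symbol_prefix="hgnc_symbol:"):
        """Return sets that contain more than one gene symbol."""
        return [
            group
            for group in self.gene_sets()
            if sum(1 for ident in group if ident.startswith(symbol_prefix)) > 1
        ]


if __name__ == "__main__":
    xrefs = CrossRefSets()
    xrefs.add_xref("hgnc_symbol:CFTR", "entrez:1080")    # e.g. from an HGNC download
    xrefs.add_xref("probe:A_1", "entrez:1080")           # e.g. from an ADF file
    xrefs.add_xref("probe:A_2", "hgnc_symbol:STEAP2")
    xrefs.add_xref("probe:A_2", "entrez:1080")           # deliberately bad xref
    for group in xrefs.conflicts():
        print(sorted(group))
```

The design choice here is that a conflict is only reported, not resolved: deciding which cross reference is wrong still needs a curated source.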

this leaves to others more knowledgeable the actual identification work of 
genes and proteins and their relationship.

cheers,
michael

----- Original Message ----- 
From: "Christoph Grabmuller" <grabmuel@ebi.ac.uk>
To: "mdmiller" <mdmiller53@comcast.net>
Cc: "M. Scott Marshall" <mscottmarshall@gmail.com>; "HCLS" 
<public-semweb-lifesci@w3.org>
Sent: Tuesday, November 09, 2010 2:09 AM
Subject: Re: [BioRDF] Comments from Christoph Grabmuller on BioRDF 
microarray provenance


On Mon, Nov 8, 2010 at 4:02 PM, mdmiller <mdmiller53@comcast.net> wrote:
> 2) Many 'things' are represented as strings (e.g. genes), which makes
> it often impossible to run a federated query against another endpoint.
> Gene names might be somewhat consistent for HUGO, but what about other
> species? Also, just the simple variance between 'STEAP2' and 'Steap2'
> makes a (direct) federated query impossible.
>
> * actually, HGNC Gene Symbols and entrez accessions are very stable. for
> ArrayExpress, the ADF file will usually map to one or both of these
> identifiers. in practice, i've not seen this to be a problem but for the
> paper we didn't go far enough.
> --mm

Yes, the HGNC Gene Symbols are stable, but what about other species?
So entrez accessions are the 'standard' input format for genes?

And even with HGNC it's not always that easy. Let's say I want to ask
bio2rdf what the uniprot accession is for the symbol 'CFTR':
http://bio2rdf.org/uniprot:P13569 only contains 'CFTR_HUMAN' and
matching that with 'FILTER regex()' is highly impractical across so
much data.
-cg
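
One way to picture a workaround for the 'STEAP2' versus 'Steap2' variance is to normalize symbols once, when the data is loaded, and then look them up by exact key instead of pattern matching at query time. The sketch below is only an illustration under stated assumptions: the function names, the record layout (NCBI taxon id, symbol, identifier) and the `example:` identifiers are hypothetical, and it says nothing about how bio2rdf or any endpoint actually stores its data. The taxon id is part of the key so that human and mouse symbols differing only in case stay separate.

```python
def normalize_symbol(symbol: str) -> str:
    """Strip surrounding whitespace and fold case so 'Steap2' equals 'STEAP2'."""
    return symbol.strip().casefold()


def build_symbol_index(records):
    """Index identifiers by (taxon id, normalized symbol).

    records: iterable of (taxon_id, symbol, identifier) tuples.
    """
    index = {}
    for taxon_id, symbol, identifier in records:
        key = (taxon_id, normalize_symbol(symbol))
        index.setdefault(key, set()).add(identifier)
    return index


def lookup(index, taxon_id, symbol):
    """Return the identifiers recorded for a symbol, or an empty set."""
    return index.get((taxon_id, normalize_symbol(symbol)), set())


if __name__ == "__main__":
    index = build_symbol_index([
        (9606, "STEAP2", "example:human-steap2"),   # 9606 = Homo sapiens
        (10090, "Steap2", "example:mouse-steap2"),  # 10090 = Mus musculus
    ])
    print(lookup(index, 9606, "Steap2"))
    print(lookup(index, 10090, "STEAP2"))
```

This only addresses case and whitespace variance. It does not help with the 'CFTR' versus 'CFTR_HUMAN' case, where the stored string is a different kind of name altogether.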

> 3) I like the Excel to RDF converter, but it relies on the user
> entering correct namespaces, names and database ids from various
> places in a syntactically correct way. This requires knowledge of the
> correct databases to choose and the 'correct' uri (many variants to
> choose from).
> If people just enter strings we are not all that far away from MAGE-TAB.
>
> * i'm involved in an open source project, Annotare, that seeks to put a
> nice UI on top of creating MAGE-TAB documents for a bench scientist. part
> of that is use of the NCBO tools to make it easy for the creator of the
> document to go fetch the appropriate term from the appropriate
> ontology/vocabulary. version one has support for EFO built-in, one of the
> main goals for version 2 is to make this much easier and much broader.
> --mm

That looks like a very useful tool. Out of curiosity: how are the
ontologies/vocabularies loaded?
-cg
Received on Tuesday, 9 November 2010 16:45:44 UTC