Re: bioGUID from Roderic Page on 2007-03-30 (public-semweb-lifesci@w3.org from March 2007)

From: Roderic Page <r.page@bio.gla.ac.uk>
Date: Fri, 30 Mar 2007 12:00:32 +0100
To: public-semweb-lifesci@w3.org
Message-Id: <98F29396-11A4-48B9-92D0-115348F23037@bio.gla.ac.uk>
Dear Matt,

>
> I was wondering what the rules are for creating the actual identifier
> that bioGUID would end up using to reference this database record.
>

The rules, such as they are, are in my previous post.


>
> I'm not sure. There seem to be various adventures of GO in RDF around
> the place. I think GONG is worth looking at (http://gong.man.ac.uk/).
> But I think I misunderstand something: why do these other resources
> need a format that can be converted to RDF? I thought you were
> interested in just the links between records in different databases
> and were providing an RDF layer over this, not actually trying to also
> represent all the content inside these records also.
>

No, I want content as well. For example, I want bibliographic details
for a paper, latitudes and longitudes for voucher specimens, etc.


> So there are some assumptions you make on the meaning of a link in a
> record? For example how would you handle the link that
> http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=4503913
> has to http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=7669492
> which is given in the XML form of the record as:
>
>  <GBSeq_comment>[WARNING] On Apr 28, 2000 this sequence was replaced
> by gi:7669492.; PROVISIONAL RefSeq: This is a provisional reference
> sequence record that has not yet been subject to human review. The
> final curated reference sequence record may be somewhat different from
> this one.</GBSeq_comment>
>
> and the intended interpretation is 'superseding'? Would I be right in
> thinking your bioGUID database is able to parse records from entrez
> and interpret the meaning of the links so as to write out a useful
> description of the link? Or would you be looking to entrez to supply
> this information in a more informative (at the machine level)
> rdf/rdf-s based form?

Gack. I haven't looked at records like this. So far I look at GenBank  
records and extract the obvious links (say to PubMed and NCBI  
taxonomy). I also look at the reference records to see if I can  
extract enough information to do a DOI lookup for the publication  
(not all GenBank sequences are linked to publications in PubMed). I  
spend a lot of time trying to interpret the mess that is the  
information on the voucher specimen, such as parsing the "isolate",  
"specimen_voucher", and "lat_long" records, trying to see whether  
there is a link to a specimen that has an online representation (a  
good number of voucher specimens in museums have digital records I  
can access). I've still to deal with host association (e.g., I want  
to have a link to the host of a parasite).
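
As an illustration of the kind of parsing involved, here is a minimal
sketch for the "lat_long" value only. It assumes the value already
follows a "<degrees> N|S <degrees> E|W" pattern in decimal degrees; real
records are messier than that, and the function name parse_lat_long is
hypothetical rather than part of bioGUID.

```python
import re

# Assumed input shape: "<lat> N|S <lon> E|W", e.g. "12.5 S 45.25 E".
# Anything that does not match this shape is rejected rather than guessed at.
LAT_LONG = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([NS])\s+(\d+(?:\.\d+)?)\s*([EW])\s*$",
    re.IGNORECASE,
)


def parse_lat_long(value):
    """Return (latitude, longitude) in signed decimal degrees, or None."""
    match = LAT_LONG.match(value)
    if match is None:
        return None
    lat, ns, lon, ew = match.groups()
    # South and West become negative values.
    latitude = float(lat) * (-1 if ns.upper() == "S" else 1)
    longitude = float(lon) * (-1 if ew.upper() == "W" else 1)
    # Discard values outside the valid coordinate range.
    if abs(latitude) > 90 or abs(longitude) > 180:
        return None
    return latitude, longitude
```

Free-text localities and degree/minute/second values fall through to
None here and would need handling of their own.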

I haven't looked at the other links yet, such as links between  
proteins and nucleotides, or the kinds of things you mentioned above.
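
For the "replaced by" comment quoted above, one possible approach is a
pattern match on the comment text. This is only a sketch: it assumes the
wording "replaced by gi:NNN" is stable across records, and the function
names and the "supersededBy" label are hypothetical, not an existing
bioGUID or Entrez vocabulary.

```python
import re

# Matches the wording seen in the quoted GBSeq_comment, including the case
# where a line break falls between "replaced" and "by".
REPLACED_BY = re.compile(r"replaced\s+by\s+gi:(\d+)", re.IGNORECASE)


def superseding_gi(comment):
    """Return the gi number that replaces this record, or None."""
    match = REPLACED_BY.search(comment)
    return match.group(1) if match else None


def supersede_link(old_gi, comment):
    """Build a (subject, predicate, object) triple, or None if no match."""
    new_gi = superseding_gi(comment)
    if new_gi is None:
        return None
    return ("gi:" + old_gi, "supersededBy", "gi:" + new_gi)
```

Applied to the comment from gi:4503913, this would yield a link to
gi:7669492; comments phrased differently would simply produce no link.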



>
> So this would mean that you wouldn't make any decisions yourself that
> one bioGUID record should reference another because you or someone
> else thinks it should, it solely relies on parsing these data sources
> and extracting 'database links'.
>


Not quite, I add links where possible. For example,
http://bioguid.info/gi:90184449 has a link to
doi:10.1206/0003-0090(2006)297[0001:TATOL]2.0.CO;2, which isn't in the
original GenBank record
(http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=90184449).
Likewise, the specimen links are added.


>
> You would want the person to provide you with a set of generated
> bioGUIDs that each resolve and return relevant RDF which include the
> cross references to other bioGUIDs (i.e. they have already been worked
> out by the provider)?

That would be nice, but for all the providers I care about, I've had  
to do a lot of fussing from scratch.

>
> Each database tends to have its own API for inspecting records. Often
> it is necessary to follow a chain of records from different databases
> to end up with a record in the database you want. For example: you
> have a sequence id (protein_gi) for an orthologue and you wanted to
> retrieve an enzyme record, you can look the protein_gi number up
> in the entrez database, locate the record, and follow the link to the
> expasy enzyme record or perhaps directly query the KEGG db. All of
> these steps employ database specific APIs aware of the data format. I
> was merely suggesting your RDF graph would merge the linking into a
> common format which is nicer to handle.


I sort of envisaged that people would follow the chain themselves,
and store the results locally. I would provide neighbours for each
GUID, but would not follow the graph myself (this could potentially
explode).
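
A rough sketch of that division of labour, assuming a hypothetical
in-memory link table: the service side only answers with the direct
neighbours of one GUID, and the client decides how far to walk, with a
depth cap to keep the traversal from exploding. Apart from the gi-to-DOI
link mentioned earlier, every identifier below is a placeholder.

```python
from collections import deque

DOI = "doi:10.1206/0003-0090(2006)297[0001:TATOL]2.0.CO;2"

# Hypothetical link table; the "placeholder" entries are illustrative only.
LINKS = {
    "gi:90184449": [DOI, "specimen:placeholder-1"],
    "specimen:placeholder-1": ["locality:placeholder-2"],
    "locality:placeholder-2": ["gi:90184449"],
}


def neighbours(guid):
    """Service side: return only the direct neighbours of one GUID."""
    return list(LINKS.get(guid, []))


def follow(start, max_depth):
    """Client side: breadth-first walk, at most max_depth links from start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        guid, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth cap reached; do not expand this node
        for other in neighbours(guid):
            if other not in seen:  # also guards against cycles
                seen.add(other)
                queue.append((other, depth + 1))
    return seen
```

The client can store the returned set locally, and the cap on max_depth
is what stops a densely linked record from pulling in the whole graph.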


Regards

Rod

----------------------------------------
Professor Roderic D. M. Page
Editor, Systematic Biology
DEEB, IBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QP
United Kingdom

Phone: +44 141 330 4778
Fax: +44 141 330 2792
email: r.page@bio.gla.ac.uk
web: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
iChat: aim://rodpage1962
reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html

Subscribe to Systematic Biology through the Society of Systematic
Biologists Website: http://systematicbiology.org
Search for taxon names: http://darwin.zoology.gla.ac.uk/~rpage/portal/
Find out what we know about a species: http://ispecies.org
Rod's rants on phyloinformatics: http://iphylo.blogspot.com
Rod's rants on ants: http://semant.blogspot.com
Received on Friday, 30 March 2007 11:01:56 UTC