- From: Roderic Page <r.page@bio.gla.ac.uk>
- Date: Fri, 30 Mar 2007 12:00:32 +0100
- To: public-semweb-lifesci@w3.org
Dear Matt,

> I was wondering what the rules are for creating the actual identifier
> that bioGUID would end up using to reference this database record.

The rules, such as they are, are in my previous post.

> I'm not sure. There seem to be various adventures of GO in RDF around
> the place. I think GONG is worth looking at (http://gong.man.ac.uk/).
> But I think I misunderstand something: why do these other resources
> need a format that can be converted to RDF? I thought you were
> interested in just the links between records in different databases
> and were providing an RDF layer over this, not actually trying to also
> represent all the content inside these records.

No, I want content as well. For example, I want bibliographic details
for a paper, I want latitudes and longitudes for voucher specimens, etc.

> So there are some assumptions you make on the meaning of a link in a
> record? For example how would you handle the link that
> http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=4503913
> has to
> http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=7669492
> which is given in the XML form of the record as:
>
> <GBSeq_comment>[WARNING] On Apr 28, 2000 this sequence was replaced
> by gi:7669492.; PROVISIONAL RefSeq: This is a provisional reference
> sequence record that has not yet been subject to human review. The
> final curated reference sequence record may be somewhat different from
> this one.</GBSeq_comment>
>
> and the intended interpretation is 'superseding'? Would I be right in
> thinking your bioGUID database is able to parse records from entrez
> and interpret the meaning of the links so as to write out a useful
> description of the link? Or would you be looking to entrez to supply
> this information in a more informative (at the machine level)
> rdf/rdf-s based form?

Gack. I haven't looked at records like this. So far I look at GenBank
records and extract the obvious links (say to PubMed and NCBI taxonomy).
I also look at the reference records to see if I can extract enough
information to do a DOI lookup for the publication (not all GenBank
sequences are linked to publications in PubMed). I spend a lot of time
trying to interpret the mess that is the information on the voucher
specimen, such as parsing the "isolate", "specimen_voucher", and
"lat_long" records, trying to see whether there is a link to a specimen
that has an online representation (a good number of voucher specimens
in museums have digital records I can access). I've still to deal with
host association (e.g., I want to have a link to the host of a
parasite). I haven't looked at the other links yet, such as links
between proteins and nucleotides, or the kinds of things you mentioned
above.

> So this would mean that you wouldn't make any decisions yourself that
> one bioGUID record should reference another because you or someone
> else thinks it should, it solely relies on parsing these data sources
> and extracting 'database links'.

Not quite, I add links where possible. For example,
http://bioguid.info/gi:90184449 has a link to
doi:10.1206/0003-0090(2006)297[0001:TATOL]2.0.CO;2, which isn't in the
original GenBank record
(http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=90184449).
Likewise, the specimen links are added.

> You would want the person to provide you with a set of generated
> bioGUIDs that each resolve and return relevant RDF which include the
> cross references to other bioGUIDs (i.e. they have already been worked
> out by the provider)?

That would be nice, but for all the providers I care about, I've had to
do a lot of fussing from scratch.

> Each database tends to have its own API for inspecting records. Often
> it is necessary to follow a chain of records from different databases
> to end up with a record in the database you want.
> For example: you have a sequence id (protein_gi) for an orthologue and
> you want to retrieve an enzyme record. You can look the protein_gi
> number up in the entrez database, locate the record, and follow the
> link to the expasy enzyme record or perhaps directly query the KEGG db.
> All of these steps employ database specific APIs aware of the data
> format. I was merely suggesting your RDF graph would merge the linking
> into a common format which is nicer to handle.

I sort of envisaged that people would follow the chain themselves, and
store the results locally. I would provide neighbours for each GUID, but
would not follow the graph (this could potentially explode).

Regards

Rod

----------------------------------------
Professor Roderic D. M. Page
Editor, Systematic Biology
DEEB, IBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QP
United Kingdom

Phone: +44 141 330 4778
Fax: +44 141 330 2792
email: r.page@bio.gla.ac.uk
web: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
iChat: aim://rodpage1962
reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html

Subscribe to Systematic Biology through the Society of Systematic
Biologists Website: http://systematicbiology.org
Search for taxon names: http://darwin.zoology.gla.ac.uk/~rpage/portal/
Find out what we know about a species: http://ispecies.org
Rod's rants on phyloinformatics: http://iphylo.blogspot.com
Rod's rants on ants: http://semant.blogspot.com
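
Editorial illustration (not part of the original message): the message mentions parsing the latitude/longitude field on GenBank voucher records. The sketch below shows one way such a value might be turned into signed decimal degrees. It assumes the value is shaped like "47.94 N 28.12 W"; the function name parse_lat_lon and the regular expression are hypothetical and are not taken from bioGUID. Real records may well be messier than this.

```python
import re
from typing import Optional, Tuple

# Assumed input shape: "<lat> <N|S> <lon> <E|W>", e.g. "47.94 N 28.12 W".
# Anything that does not fit this shape is rejected rather than guessed at.
_LAT_LON = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([NS])\s+(\d+(?:\.\d+)?)\s*([EW])\s*$",
    re.IGNORECASE,
)


def parse_lat_lon(value: str) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) in signed decimal degrees, or None."""
    match = _LAT_LON.match(value)
    if match is None:
        return None
    lat, ns, lon, ew = match.groups()
    # South and West are represented as negative values.
    latitude = float(lat) * (-1.0 if ns.upper() == "S" else 1.0)
    longitude = float(lon) * (-1.0 if ew.upper() == "W" else 1.0)
    # Reject values outside the valid coordinate range.
    if abs(latitude) > 90 or abs(longitude) > 180:
        return None
    return latitude, longitude
```

Returning None for anything unrecognised, instead of raising, keeps a batch run over many records going and lets the unparsed values be collected for manual inspection.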
Received on Friday, 30 March 2007 11:01:56 UTC