Re: bioGUID

On 3/30/07, Roderic Page <r.page@bio.gla.ac.uk> wrote:
> Dear Matt,
>
> > Do you have any publications that outline the motivation here (except
> > the LSIDs don't work for the semantic web argument you have outlined
> > in your online material)?
>
> No publication as yet on bioGUID, but I'm working on some rough notes.
> In essence the motivation is to get biodiversity informatics using RDF,
> and I don't think LSIDs are the best way to get us there. For something
> related you could look at my article "Taxonomic names, metadata, and
> the Semantic Web"
> (http://jbi.nhm.ku.edu/index.php/jbi/article/view/25).
>
> > What are the rules for generating a URI for a particular database
> > record? For example:
> > http://bioguid.info/rdf/GO:0003674 does not work, and neither does
> > http://bioguid.info/rdf/go:0003674 ... is the gene ontology not
> > referenced yet? or do I have the rule wrong?
>
> bioGUID needs to know how to resolve the identifier,

I was wondering what the rules are for creating the actual identifier
that bioGUID would end up using to reference this database record.

> which in turn
> means that there is some way to get metadata about the identifier
> (although I will resort to having local copies of databases if I have
> to).
>
> To resolve a GO identifier there needs to be a service somewhere that
> takes a GO identifier and returns metadata either in RDF, or in a
> format that can be converted to RDF. Is there any such service? If not,
> I would have to host a copy of GO here, and write something to output a
> GO term in RDF.

I'm not sure. There seem to be various adventures of GO in RDF around
the place. I think GONG is worth looking at (http://gong.man.ac.uk/).
But I think I misunderstand something: why do these other resources
need a format that can be converted to RDF? I thought you were
interested in just the links between records in different databases
and were providing an RDF layer over this, not actually trying to
represent all the content inside these records as well.

>
> > Are you trying to achieve a mechanism for uniquely identifying links
> > to important bioinformatic records by URI?
>
> I guess I'm trying to show the value of having such URIs, because my
> sense is that -- within the biodiversity informatics community at least
> -- people haven't bought the argument yet. It's hard to make the case
> without real demos. Plus my own work depends on having such an
> infrastructure in place.
>
> > Would you imagine this might become a primary conduit for people to
> > locate a database record when they have database record IDs, rdf
> > tools, but no idea how to use these tools to access the record data
> > (I'm not suggesting the record data is itself RDF).
>
> Ideally no, because I would hope data providers would have their own
> URIs that are stable and return RDF. For example, the database record
> that a user has should itself be a URI. However, as an interim step,
> yes, bioGUID can be a way to locate a record in the absence of knowing
> how to do that, and in some cases, it may be the only currently
> existing way to do that, unless you want to write your own code. For
> example, accessing a museum specimen record requires writing an XML
> document and embedding that in a URL (gack).

A stepping stone in the spirit of bioGUID is definitely needed.

>
> >
> > Are you trying to achieve a cross referencing system for database
> > records? And if so, on what basis is a cross-reference made?
>
> The cross reference uses bioGUIDs. For example, if a PubMed record
> contains a DOI, the RDF will have a triple linking the PubMed and the
> DOI using rdfs:sameAs. If a PubMed record lists sequence ids, they are
> converted to bioGUIDs. I use bioGUIDs so that the link can be navigated
> by a Semantic Web browser.
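
To make the conversion step concrete, here is a minimal Python sketch. It
assumes a bioGUID is simply http://bioguid.info/ followed by
<namespace>:<id>, a pattern inferred only from the pmid and gi URIs quoted
in this thread. The names to_bioguid and reference_triples and the
namespace whitelist are hypothetical, not bioGUID's actual code.

```python
from typing import Iterable, List, Tuple

BASE = "http://bioguid.info/"  # base of the URIs quoted in this thread

# Hypothetical whitelist: only the two namespaces that appear in the thread.
KNOWN_NAMESPACES = {"pmid", "gi"}

Triple = Tuple[str, str, str]


def to_bioguid(namespace: str, local_id: str) -> str:
    """Build a bioGUID-style URI from a namespace and a local identifier."""
    ns = namespace.strip().lower()
    if ns not in KNOWN_NAMESPACES:
        # Mirrors the point that the resolver must know the identifier type.
        raise ValueError(f"no resolver known for namespace {namespace!r}")
    return f"{BASE}{ns}:{local_id.strip()}"


def reference_triples(pmid: str, gis: Iterable[str]) -> List[Triple]:
    """Link one PubMed record to each sequence id it lists."""
    subject = to_bioguid("pmid", pmid)
    return [(subject, "dcterms:references", to_bioguid("gi", gi)) for gi in gis]


if __name__ == "__main__":
    for triple in reference_triples("17079492", ["117652796"]):
        print(triple)
```

Because both ends of each triple are resolvable URIs, a browser or script
can follow the link without knowing which database it came from.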

So there are some assumptions you make about the meaning of a link in a
record? For example, how would you handle the link that
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=4503913
has to http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=7669492
which is given in the XML form of the record as:

  <GBSeq_comment>[WARNING] On Apr 28, 2000 this sequence was replaced
by gi:7669492.; PROVISIONAL RefSeq: This is a provisional reference
sequence record that has not yet been subject to human review. The
final curated reference sequence record may be somewhat different from
this one.</GBSeq_comment>

and the intended interpretation is 'superseding'? Would I be right in
thinking your bioGUID database is able to parse records from entrez
and interpret the meaning of the links so as to write out a useful
description of the link? Or would you be looking to entrez to supply
this information in a more informative (at the machine level)
rdf/rdf-s based form?
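
As a rough sketch of the first option (parsing the record and interpreting
the link), the Python below pulls the replacement gi out of a comment like
the one above and writes it as a triple. The regular expression, the
function names, and the choice of dcterms:isReplacedBy as the predicate
are all my assumptions; I don't know how bioGUID actually handles this
case.

```python
import re
from typing import Optional, Tuple

BASE = "http://bioguid.info/"  # URI base as quoted in this thread

# Assumed wording of the comment; \s+ tolerates the line wrap in the record.
REPLACED_BY = re.compile(r"replaced\s+by\s+gi:(\d+)", re.IGNORECASE)


def superseded_by(comment: str) -> Optional[str]:
    """Return the gi number that replaces this record, or None if absent."""
    match = REPLACED_BY.search(comment)
    return match.group(1) if match else None


def supersession_triple(old_gi: str, comment: str) -> Optional[Tuple[str, str, str]]:
    """Express 'old record is replaced by new record' as one triple."""
    new_gi = superseded_by(comment)
    if new_gi is None:
        return None
    # dcterms:isReplacedBy is an assumed choice of predicate.
    return (f"{BASE}gi:{old_gi}", "dcterms:isReplacedBy", f"{BASE}gi:{new_gi}")


if __name__ == "__main__":
    comment = "[WARNING] On Apr 28, 2000 this sequence was replaced\nby gi:7669492."
    print(supersession_triple("4503913", comment))
```

This only works as long as the free-text wording stays stable, which is
the reason for asking whether entrez might supply the relation directly.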


>
> > (for
> > example the record referenced itself back references the record making
> > this reference)
>
> Um, huh? Do you mean, if a PubMed record references a sequence, does
> the sequence reference the PubMed record?

No, but you answered it above. I was asking how you decided that one
bioGUID record should reference another. You have identified that it
is when one external database record contains references to the other
external database record.

So this would mean that you wouldn't decide yourself that one bioGUID
record should reference another because you or someone else thinks it
should; it relies solely on parsing these data sources and extracting
'database links'.

> The answer is it depends. In
> the case of a PubMed record and a sequence, in most cases yes, hence
> for http://bioguid.info/pmid:17079492 there is a triple
>
> <dcterms:references rdf:resource="http://bioguid.info/gi:117652796" />
>
> and for http://bioguid.info/gi:117652796 there is a triple
>
> <dcterms:isReferencedBy
> rdf:resource="http://bioguid.info/pmid:17079492"/>
>
> These are easy because the PubMed and the GenBank records refer to each
> other. In other cases both links don't exist -- for example, a specimen
> has no idea whether a sequence links to it. I could add the reverse
> link in these cases, but I'd sort of assumed that people could use a
> SPARQL query to get these.
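
For the one-way case, the missing reverse links can be derived from the
forward ones. The sketch below does this in plain Python over a list of
triples rather than with a SPARQL query, purely for illustration. The
specimen URI is a made-up placeholder and add_inverse_links is a
hypothetical helper, not part of bioGUID.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]

# The two predicates quoted above are inverses of each other.
INVERSE = {
    "dcterms:references": "dcterms:isReferencedBy",
    "dcterms:isReferencedBy": "dcterms:references",
}


def add_inverse_links(triples: Iterable[Triple]) -> Set[Triple]:
    """Return the input triples plus the inverse of every invertible one."""
    forward = set(triples)
    result = set(forward)
    for subject, predicate, obj in forward:
        inverse = INVERSE.get(predicate)
        if inverse is not None:
            result.add((obj, inverse, subject))
    return result


if __name__ == "__main__":
    # Placeholder specimen URI; only the gi URI comes from this thread.
    one_way = [
        ("http://bioguid.info/gi:117652796", "dcterms:references",
         "http://example.org/specimen/placeholder-1"),
    ]
    for triple in sorted(add_inverse_links(one_way)):
        print(triple)
```

Using a set means that pairs which already point both ways, such as the
PubMed and GenBank example, are not duplicated.
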
>
> >
> > How would people apply to have their databases added?
> >
>
> Basically just ask me.

You would want the person to provide you with a set of generated
bioGUIDs that each resolve and return relevant RDF, including the
cross references to other bioGUIDs (i.e. they have already been worked
out by the provider)?

> So far I'm adding data sources that are directly
> relevant to my own work, but since that includes sequences, that pretty
> much opens up most things in bioinformatics. I'm also looking at adding
> lists of triples (such as citation links) to the underlying triple
> store, so the bioGUID records become richer than just a remote database
> lookup.
>
>
> > The immediate use-case that springs to mind is being able to drop the
> > crop of libraries one needs to interpret records in one database to
> > find accession numbers for another database and so on until you find
> > sort of what you are looking for in the actual database you want.
>
> Not totally sure I understand this.

Each database tends to have its own API for inspecting records. Often
it is necessary to follow a chain of records from different databases
to end up with a record in the database you want. For example, if you
have a sequence id (protein_gi) for an orthologue and you want to
retrieve an enzyme record, you can look the protein_gi number up
in the entrez database, locate the record, and follow the link to the
expasy enzyme record, or perhaps query the KEGG db directly. All of
these steps employ database-specific APIs that are aware of the data
format. I was merely suggesting that your RDF graph would merge the
linking into a common format, which is nicer to handle.

> My own immediate use case is to
> have a script that will fetch a record with a bioGUID, and have that
> script fetch every linked record referred to by that record (i.e., RDF
> spidering). For example, if I start with a PubMed identifier, the
> script would pull out all the sequences in that paper, any specimens
> linked to those sequences, the taxonomy of the organisms, and the
> papers cited by the PubMed paper. I would then have a local triple
> store for this information, and be able to do stuff like plot a
> geographical map of the sequences based on the georeferenced specimen
> records.
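
The spidering use case amounts to a breadth-first walk over linked
records. Here is a minimal Python sketch of that walk. It uses a small
in-memory dictionary in place of real RDF fetches, so fetch_links, the
toy graph, and the specimen and taxon identifiers are all hypothetical
stand-ins. Only the pmid and gi identifiers come from this thread.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List

# Toy stand-in for resolving a record and reading its outgoing links.
TOY_GRAPH: Dict[str, List[str]] = {
    "pmid:17079492": ["gi:117652796"],
    "gi:117652796": ["pmid:17079492", "specimen:example-1", "taxon:example-1"],
    "specimen:example-1": [],
}


def spider(start: str,
           fetch_links: Callable[[str], Iterable[str]],
           max_records: int = 100) -> List[str]:
    """Visit every record reachable from start, breadth first, each once."""
    seen = {start}
    queue = deque([start])
    visited: List[str] = []
    while queue and len(visited) < max_records:
        current = queue.popleft()
        visited.append(current)
        for linked in fetch_links(current):
            if linked not in seen:  # guards against reciprocal links
                seen.add(linked)
                queue.append(linked)
    return visited


if __name__ == "__main__":
    print(spider("pmid:17079492", lambda record: TOY_GRAPH.get(record, [])))
```

The seen set matters because records such as the PubMed and GenBank pair
point at each other, and max_records keeps the crawl bounded when the
graph is large.
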
>
> Regards
>
> Rod

cheers
Matt

Received on Friday, 30 March 2007 10:22:10 UTC