Re: blog: semantic dissonance in uniprot

2009/3/21 eric neumann <ekneumann@gmail.com>:
> Scott,
> Funny, I was just about to send a message on a very similar issue; maybe
> it's what you're referring to, but let me know either way...
> After talking with many folks in industry over the last several months, it
> is becoming quite clear that when dealing with a molecular reference, such
> as uniprot or entrez-gene, we should also be treating it as a form of "proxy
> of the thing" with something akin to transitivity. Why? Because they are the
> best reference we have to a protein entity (exemplar). No wonder real-world
> scientists refer to these records as "the gene" or "the protein". I for one
> see keeping things from becoming unnecessarily complicated as key to
> successfully advancing the semantic web in LS.
> Here are some points we should consider regarding this typing issue:
>
> There is no such thing as a referenceable instance of a specific instantiated
> molecule ("that specific molecule"); all gene, protein, and chemical records
> are about the category or group of exemplar molecules:
> SAME molecular structure, NOT SAME atoms (so we already aren't really things
> in the real world ;-) ); all molecular databases are based on this asserted
> fact.
> Most users of molecular information aren't ignorant about the difference
> between a protein and a record of a protein; it's just that they don't want
> to deal with all the extra CS mechanics (that prevent getting their job
> done). And so an instance of a protein record in a database (or a reference
> to it from another database) is the closest thing to saying: "here's the
> protein".
> Different records exist for the same protein, which indeed has been a
> historic point of complication; but this is really a social issue, not a
> semantic one, and the key data authorities have already for years
> coordinated on this point by supplying cross-references to each
> other. Occasionally, when we realize a gene was incorrectly identified, the
> record is merged or deprecated, and one group fixes things usually before
> the other. It would appear that it's beneficial not to coerce the different
> authorities pre-emptively to point to any other third party über-gene URI;
> each should correct when it has sufficient evidence, and share that change
> so that references from each quarter can be corrected. This is also sound
> from a progression-of-science perspective; the different agencies through
> their interactions will eventually find the "better truth".
> If one creates a new node or URI for "the gene ABL-Human", and links all
> other data records to it,  it is by any definition 'also' a digital record
> (even without a URI); hence if one follows this logic to its formal
> conclusion, we have a system of references about records, that are about
> records, that are about records... and never quite get to the true instance
> of a gene. Voila! we've re-created Russell's Paradox using gene records!
> The body that decides and creates "a higher form of protein record" that
> others must reference is going to be regarded as suspect by all other
> authorities; if it is done by committee, I fear it will add a lot more
> unnecessary confusion; does it get annotated? By whom? How is this
> regulated by the community's experts and authorities? Do we allow open
> season for all annotators, but keep everything sequestered in local SW
> zones? I think this opens an interesting but entangled can of worms...
>
> I believe it's therefore best not to define protein records types separate
> from proteins, at least for general consumption by informaticists. Some day
> this may indeed be easy and useful, but I don't see it being the right thing
> to invest in right now...
> So what should we do for now? When should we think about proteins and when
> about protein records? Well, doesn't that really depend on whether you are a data
> source curator like SIB or a consumer of molecular information?  Using RDF
> typing, both can be asserted at the same time, as long as we don't build in
> any contradictions. EMBL, SIB and NCBI can treat all such records as special
> "curated record classes", but expose them outwardly as "Gene" or "Protein",
> or "micro RNA".
> For most of us who use such online information, this is something that
> really is not so complicated-- however, when writing new tools to handle new
> semantic complexities, one almost invariably experiences unpredicted side
> effects... it's the software that could become confused. I recommend we keep
> it simpler for now, and not add semantic features that end-users cannot
> immediately benefit from and that make things more complicated to use.
> cheers,
> Eric
>

It is always interesting to explore just how dirty the data that
bioinformaticians utilise is. And yet they can reach reasonable
conclusions, even if those conclusions are not perfect or backed by
distinct formal statements.

I don't think that things have gone downhill for science since the
first abstractions were made from situations to the resulting
statistical analytical conclusions. As you say, science doesn't
consider itself rock solid, even when a theorem hasn't changed for
hundreds of years. The idea that we can set up the semantic web on
the first try so that it survives for that long, with consistent URIs
together with stable ontologies and type systems, asks a bit too much.

It will be nice to have next-generation tools supported by interlinked
RDF versions of current databases, but I fear that they will only be
utilising the fact that the databases contain cross-referenced URIs
rather than any rules associated with the rdf:type or other
statements.

If current systems (e.g., SRS) can utilise common database references
for niche applications, then an RDF system designed around the idea of
a common URI reference format will be suited even more so. If one takes
a simple predicate for stating that the heterogeneous references are
all database records with the same informational content (e.g.,
owl:sameAs in the absence of any other well-recognised solution), then
the segments can be reconciled, in my opinion. A given query can then
still specify a particular database, or utilise the databases together
with the same semantics, just with each of the component queries tuned
to the relevant properties exposed by the particular databases. This
keeps the current system where multiple biological databases are
available for most topics, with slightly different focuses and
curation procedures, but with mutual cross-references.
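
As a rough illustration of the owl:sameAs idea above, here is a minimal
sketch in plain Python (no RDF library). It groups records linked by
owl:sameAs and then gathers the statements each database makes about
the same entity, keeping the source URI on every statement. The
database URIs, predicates and values are all made up for the example;
only the owl:sameAs URI itself is real.

```python
from collections import defaultdict

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_groups(triples):
    """Group URIs linked by owl:sameAs into equivalence classes (union-find)."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for s, p, o in triples:
        if p == OWL_SAME_AS:
            parent[find(s)] = find(o)

    groups = defaultdict(set)
    for uri in list(parent):
        groups[find(uri)].add(uri)
    return list(groups.values())


def describe(uri, triples):
    """Collect the statements made about every record equivalent to uri."""
    members = next((g for g in same_as_groups(triples) if uri in g), {uri})
    return sorted(
        (s, p, o) for s, p, o in triples if s in members and p != OWL_SAME_AS
    )


# Hypothetical records for one protein in two made-up databases,
# plus an unrelated record in a third.
TRIPLES = [
    ("http://dba.example.org/rec1", OWL_SAME_AS, "http://dbb.example.org/rec9"),
    ("http://dba.example.org/rec1", "http://dba.example.org/vocab#fullName", "Example kinase"),
    ("http://dbb.example.org/rec9", "http://dbb.example.org/terms/length", "1130"),
    ("http://dbc.example.org/rec5", "http://dbc.example.org/terms/note", "unrelated"),
]

for statement in describe("http://dbb.example.org/rec9", TRIPLES):
    print(statement)
```

Because the subject of each statement is preserved, a query can still
be restricted to one database by filtering on the source URI, or it can
use the whole group.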

Interestingly, with the datasources we have integrated into Bio2RDF so
far, we haven't assumed anything special about each particular
representation, and we haven't insisted on RDF producers using the
Bio2RDF URIs if they already have a system in use, if only because
there is nothing really unique about them. If the original URIs
contain an identifier, and reference in some way the other known URIs
which also contain identifiers, then queries across the datasources can
be made, and the results associated back to the original URIs. If the
identifiers are merged, then the information will filter down into the
semantic web, with the possibility, depending on the database, of
leaving a trail of references that detail why an identifier is
obsolete and whether there is an alternative available.
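
To make the point about identifiers inside URIs concrete, here is a
small sketch, again in plain Python. It pulls a (namespace, identifier)
key out of each provider's own URI, and then attaches query results
keyed that way back onto the original URIs. The URI patterns, namespace
names and result values are hypothetical and are not the layout of any
real provider or of Bio2RDF itself.

```python
import re
from collections import defaultdict

# Hypothetical URI layouts for two made-up providers.
PATTERNS = {
    "dba": re.compile(r"^http://dba\.example\.org/entry/(?P<id>[A-Za-z0-9_]+)$"),
    "dbb": re.compile(r"^http://example\.org/dbb\?acc=(?P<id>[A-Za-z0-9_]+)$"),
}


def to_key(uri):
    """Return (namespace, identifier) for a recognised URI, else None."""
    for namespace, pattern in PATTERNS.items():
        match = pattern.match(uri)
        if match:
            return (namespace, match.group("id"))
    return None


def index_by_key(uris):
    """Map each (namespace, identifier) key to the original URIs carrying it."""
    index = defaultdict(list)
    for uri in uris:
        key = to_key(uri)
        if key is not None:
            index[key].append(uri)
    return index


def attach_results(results_by_key, index):
    """Associate results keyed by (namespace, identifier) with the original URIs."""
    return {
        uri: results_by_key[key]
        for key, uris in index.items()
        if key in results_by_key
        for uri in uris
    }


original_uris = [
    "http://dba.example.org/entry/X1",
    "http://example.org/dbb?acc=Y7",
    "http://other.example.org/unknown/42",  # no known pattern, so it is skipped
]
index = index_by_key(original_uris)
results = {("dba", "X1"): "result for X1", ("dbb", "Y7"): "result for Y7"}
print(attach_results(results, index))
```

The producers keep their own URIs throughout; the key is only an
internal join handle for the cross-datasource query.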

If the datasources do contain equivalent informational content then it
is useful to standardise the data properties, or if possible provide a
transformation so that people don't have to write separate queries for
each datasource if that can be avoided.
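
One possible shape for such a transformation, sketched in plain Python:
a lookup table rewrites each datasource's own predicate to a shared
one, so that a single query works over all of them. The predicate URIs
and the "common" vocabulary here are invented for the example and do
not refer to any existing ontology.

```python
# Hypothetical shared predicate and per-datasource equivalents.
COMMON_LABEL = "http://example.org/common#label"
PREDICATE_MAP = {
    "http://dba.example.org/vocab#fullName": COMMON_LABEL,
    "http://dbb.example.org/terms/title": COMMON_LABEL,
}


def normalise(triples, mapping=PREDICATE_MAP):
    """Rewrite known source-specific predicates; leave all others untouched."""
    return [(s, mapping.get(p, p), o) for s, p, o in triples]


def labels(triples):
    """A single query over the normalised data: subject -> label."""
    return {s: o for s, p, o in triples if p == COMMON_LABEL}


raw = [
    ("http://dba.example.org/rec1", "http://dba.example.org/vocab#fullName", "Example kinase"),
    ("http://dbb.example.org/rec9", "http://dbb.example.org/terms/title", "Example kinase (dbb)"),
    ("http://dbb.example.org/rec9", "http://dbb.example.org/terms/length", "1130"),
]
print(labels(normalise(raw)))
```

Predicates without an entry in the table pass through unchanged, so
properties that only one datasource exposes are still available to
queries tuned to that datasource.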

Cheers,

Peter

Received on Saturday, 21 March 2009 13:02:20 UTC