Re: blog: semantic dissonance in uniprot

2009/3/22 Michel_Dumontier <Michel_Dumontier@carleton.ca>:
> Eric and friends,
>
>
>
>  I’m very sympathetic to the simplifying assumption of not distinguishing
> between a record and the molecular entity it represents, but there are some
> important considerations. First, we need to be cautious in the
> transformation of recorded facts (as they appear in these database records)
> to class restrictions on biomolecules in logic-based (e.g. OWL) ontologies.
> Initially, we might say that a class of biomolecules share a particular
> molecular structure (or biopolymer sequence), but assertions of role,
> function, PTMs, and involvement in biological process (among others) are
> contextual or temporally qualified and as such it may not be appropriate to
>  generalize to all instances. For example, some protein records list all of
> the _known_ PTMs .. hardly the basis to generalize that all instances will
> also have those PTMs at those positions at all (or any!) time. This is
> clearly a major knowledge representation challenge, in which we should
> engage in different approaches to improve our representation. Class-based
> representations are necessary as there is a need to refer to specific real
> world instances, whether they be collections of molecules in a test tube,
> electron micrographs that show individual macromolecular complexes or atomic
> force microscopes that manipulate them. In the meantime,  we’ll probably
> continue to model database records as instances of their corresponding
> entity.

Class-based assertions are useful, but in cases where databases are
very large, it is hard to distinguish between records that
tentatively make a class assertion and those that more fully make a
class assertion. The general rule I have followed is to provide these
evidence statements together with the record, and enable people to
make the assertions based on their level of evidence. Evidence is all
you can say on a large scale in my opinion, as the biggest curated
databases can realistically not do much more than ensure that a given
thing has a publication behind it. That is science. If someone wants
to do a model that relies on their personal level of surety about a
given thing, it is more likely that they will recreate a world for
themselves and import the various entities that they require by
vague reference, i.e., use a relatively nondescript property to tell
people where their evidence comes from, but not actually rely on the
semantics given to the publicly curated entity for their purposes.
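
A minimal sketch of the "evidence travels with the record" rule, in
Python. The record identifiers, claims, and publication references
below are placeholders invented for illustration, and the `Annotation`
and `accepted` names are hypothetical, not taken from any existing
database or tool. Each consumer picks their own threshold rather than
the curators deciding what counts as a class assertion.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    record: str                                    # placeholder record identifier
    claim: str                                     # what the record states
    evidence: list = field(default_factory=list)   # publication references


def accepted(annotations, min_sources):
    """Return the claims backed by at least min_sources publications."""
    return [a for a in annotations if len(a.evidence) >= min_sources]


# Placeholder data, not real curated annotations.
annotations = [
    Annotation("record:1", "phosphorylation at position 10", ["pub:A", "pub:B"]),
    Annotation("record:1", "acetylation at position 42", ["pub:C"]),
    Annotation("record:2", "glycosylation at position 7", []),
]

cautious = accepted(annotations, min_sources=2)
permissive = accepted(annotations, min_sources=1)

print([a.claim for a in cautious])
print([a.claim for a in permissive])
```

Counting publications is only one possible measure of evidence; the
point is that the choice sits with the person building the model.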

If we can at least make it easy for people to let others know where to
find more information about the things they are using inside of their
novel datasources, then we will have some success in bringing together
publicly available extra information about records. Deciding what the
actual class is for an asserted record type, and using two or more
URIs to distinguish this at the world level, won't give either URI
more evidence or data clarity in my opinion.

>  There is no doubt that it is challenging to devise a consistent naming
> scheme – and nearly each member of the steering group has worked out some
> way to do this (e.g. [1][2]). If the sharednames group wants to recommend a
> consensual approach on the _syntax_ of any given name, with appropriate
> rationale, then it’s possible that more people will use it as a guiding
> principle. However, attempts to _control_ the naming process will result in
> an undoubtedly unreceptive audience. Will a registry of names prevent people
> from making similar or identical (literal) names? No. Establishing a
> self-registry of namespaces like bio2rdf [3] or lsrn.org is a more worthy
> goal. I, like several others, am interested to see how the committee will
> “make sure that its URIs … resolve to information that is useful”. I expect
> that it will be challenging to establish utility, particularly in the
> context of a term contained in an expressive ontology.

Useful is good. For both the namespace naming case and the syntax
case, I have put forward arguments in the past: for both lsrn and
bio2rdf as namespace congregation points, and essentially
namespace:identifier and namespace/identifier as the two alternatives
for syntax. Particularly given the ability to make up multiple
namespaces based on any given dataset, there needs to be a way of
either formally telling people they are equal, with owl:sameAs, or
informally telling them if there isn't a one-to-one correspondence
between them and a mapping isn't simple to do in the general case.
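
To make the two syntax alternatives and the formal equivalence
concrete, here is a minimal Python sketch. The base URI
(http://example.org/), the namespace label, and the accession are
placeholders, and the helper names are hypothetical; only the
owl:sameAs property URI is the real one. It builds one URI in each
style and emits a single N-Triples statement linking them.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def colon_uri(base, namespace, identifier):
    """Build a URI in the namespace:identifier style."""
    return f"{base}{namespace}:{identifier}"


def slash_uri(base, namespace, identifier):
    """Build a URI in the namespace/identifier style."""
    return f"{base}{namespace}/{identifier}"


def same_as_triple(subject, obj):
    """Serialise an owl:sameAs statement as one N-Triples line."""
    return f"<{subject}> <{OWL_SAME_AS}> <{obj}> ."


# Placeholder base, namespace and accession, for illustration only.
a = colon_uri("http://example.org/", "uniprot", "P12345")
b = slash_uri("http://example.org/", "uniprot", "P12345")

print(same_as_triple(a, b))
```

owl:sameAs is a strong claim, so it only fits the one-to-one case;
where the correspondence is looser, a weaker, informal link would be
the safer choice.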

>  I applaud efforts to publish data in an open and linked manner. But
> somewhat disconcerting is that I’m (controversially) sure we’ll find
> ourselves in the awkward position that there will be too much meaningless
> linked data, in which we’ll have to filter useful, less useful, to
> identical, useless or worse, misguiding or erroneous. It’s not hard to
> imagine this happening. Applying the correct semantics to create meaningful
> relations is of fundamental importance for answering questions about our
> collective knowledge. Linking concepts or data with clearly defined semantic
> links (e.g. SKOS, RO, OWL) is  indeed useful, and its utility goes beyond
> Linked Data. Eric’s appeal, that we should be careful to (meaningfully) link
> to third party über- URIs, resonates for the same reason that you may want
> to say something about an entity that other people won’t necessarily agree
> with. The truth is that we all have different perceptions of reality, and
> our knowledge about the world is in constant flux. We should be able to
> express our knowledge to our degree of satisfaction. In a competitive,
> distributed environment that is the web, people will choose terms and
> ontologies that best agree with their perception and with their
> requirements. As a nascent scientific community, so early in the game of
> designing accurate, expressive and meaningful ontologies, we should
> encourage new ideas and ensure competition among them.
>

It would be nice to be able to have more than just dbxref as a
relation, but in many cases the database owners do not provide more
semantics that would be applicable over the whole dataset. So
inevitably there will have to be some way of filtering based on the
other fields that have been decided on as characteristic of a
particular class/thing in the context of one's novel ontology. This
segregation inevitably leads to new namespaces (read: URIs) being
created for different classes of knowledge, but hopefully with linked
data they can still be interrelated, if only so that someone
evaluating an advanced ontology can trace the evidence for a
particular part of it and see what other people have done with the
same evidence in the past.

Cheers,

Peter

Received on Saturday, 21 March 2009 21:43:09 UTC