Reviewing the Banff demo ontology infrastructure from samwald@gmx.at on 2007-06-10 (public-semweb-lifesci@w3.org from June 2007)

From: <samwald@gmx.at>
Date: Sun, 10 Jun 2007 22:20:51 +0200
To: public-semweb-lifesci@w3.org
Message-ID: <20070610202051.251080@gmx.net>

Reviewing the Banff demo ontology infrastructure

Alan did a great job of coordinating the previously separate ontologies of several HCLSIG participants into a coherent infrastructure for the Banff demo. As we all agree, we should try to keep the momentum going and maintain the pace of ontology integration and data conversion that was driven by the demo deadline. However, as discussed in the previous BioRDF call, I think we should take a little time to review the ontological constructs that were created to make the demo possible. If we are going to keep extending our ontology infrastructure, we want to make sure that we all understand and agree on the foundations it is built upon, and that it does not contain minor glitches that were overlooked in the heat of demo preparation.

1) What relations do we use to connect a biological entity with artificial entities describing it, e.g. ‘protein records’, ‘sequence records’, ‘Pubmed records’?

In the current ontology, we use relations such as the following:
* ‘Protein_1 has_peptide_sequence_described_by peptide_sequence_record_1’
* ‘Protein_2 is_protein_gene_product_of_dna_described_by gene_record_1’
* ‘Gene_record_1 describes_gene_or_gene_product_mentioned_by journal_article_1’

We use these properties not only to link our classes of biological entities to certain database entries, but also to define our classes in cases where there is no accepted standard ontology (e.g. for proteins). For example, we can partly define the class ‘insulin_protein’ through several necessary&sufficient property restrictions that relate the protein class to some or all of the currently known protein sequence records describing insulin proteins (a very practical approach).
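As a rough illustration of defining a class through its links to records, the following plain-Python sketch treats membership in ‘insulin_protein’ as "is linked to at least one known insulin sequence record". All identifiers, the record set and the function name are hypothetical, and the check is closed-world, so it only approximates what an OWL reasoner would do with a necessary&sufficient restriction.

```python
# Hypothetical triples; the identifiers are illustrative, not taken from the demo data.
TRIPLES = {
    ("Protein_1", "has_peptide_sequence_described_by", "peptide_sequence_record_1"),
    ("Protein_3", "has_peptide_sequence_described_by", "peptide_sequence_record_9"),
}

# Assumed set of sequence records known to describe insulin proteins.
INSULIN_RECORDS = {"peptide_sequence_record_1", "peptide_sequence_record_2"}


def is_insulin_protein(entity, triples=TRIPLES):
    """Return True if the entity is linked to at least one insulin record."""
    return any(
        s == entity
        and p == "has_peptide_sequence_described_by"
        and o in INSULIN_RECORDS
        for s, p, o in triples
    )
```

Here is_insulin_protein("Protein_1") holds because Protein_1 is linked to one of the listed records, while Protein_3 is not classified.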

However, I think the properties we are using right now might be problematic in the long term, because:
*) The properties are somewhat redundant. Since we are using OWL, all of our resources are typed, which means that in a relation like ‘Protein_1 has_peptide_sequence_described_by peptide_sequence_record_1’, we already know that we are dealing with a protein and a peptide sequence record. In most cases there is not much that we need to disambiguate: of course, the peptide sequence record describes the peptide sequence of the protein, and not its shape, colour or smell. The same statement could be made with a generic ‘described_by’ relation without significant ambiguity. Certainly we might encounter database records where the situation is less clear, but these are in the minority. Even then, we could usually still use our generic ‘described_by’ relation; we just could NOT use it to define a class in a ‘necessary&sufficient’ restriction. That is not too bad.

*) When our ontologies are expanded to further fields of biology, the current approach leads to a large collection of properties that are hard to manage and query. If one simply wanted to query all the database entries describing a biological entity, one would need to enumerate the long list of relations in the query. Again, this is a good argument for creating a generic ‘described_by’ relation; if not as a replacement for our current properties, then at least as a superproperty that acts as an umbrella for all the other properties.

*) Clearly, we want to focus our attention on the description of biological reality, and not on the description of the database artefacts that needed to be created in the pre-Semantic Web era. With the current solution, we are moving some of the biological information into the realm of information entities, which runs counter to our intentions. We should try to ground our descriptions in biological reality, as far as possible.

For example, ‘Protein_2 is_protein_gene_product_of_dna_described_by gene_record_1’ would be better expressed as two statements like

‘Protein_2 encoded_by Gene_1’
‘Gene_1 described_by gene_record_1’

This way, we can focus on describing biology, and have better opportunities to refine our statements later on (e.g. making statements about the gene itself). I know that Alan had some reasons why he did not want to introduce a gene class, but this should only serve as a specific example for a general design pattern.
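The two suggestions above (a generic ‘described_by’ superproperty and the split into ‘encoded_by’ plus ‘described_by’) can be sketched in plain Python. This is only a toy model under assumptions: the subproperty table, the triple list and both function names are hypothetical, and a real setup would declare the superproperty in OWL and let a reasoner or the query engine do the expansion.

```python
# Hypothetical subproperty table: each specific relation maps to the generic one.
SUBPROPERTY_OF = {
    "has_peptide_sequence_described_by": "described_by",
}

# Hypothetical triples, including the decomposed form of the gene-record statement.
TRIPLES = [
    ("Protein_1", "has_peptide_sequence_described_by", "peptide_sequence_record_1"),
    ("Protein_2", "encoded_by", "Gene_1"),
    ("Gene_1", "described_by", "gene_record_1"),
]


def records_describing(entity, triples=TRIPLES):
    """All records linked via 'described_by' or any of its listed subproperties."""
    return sorted(
        o
        for s, p, o in triples
        if s == entity
        and (p == "described_by" or SUBPROPERTY_OF.get(p) == "described_by")
    )


def gene_records_for_protein(protein, triples=TRIPLES):
    """Follow 'encoded_by' to the gene, then collect the records describing that gene."""
    genes = [o for s, p, o in triples if s == protein and p == "encoded_by"]
    return sorted(r for g in genes for r in records_describing(g, triples))
```

With the umbrella relation, one lookup covers every specific property, and the old combined property becomes a two-step path (records_describing("Protein_1") and gene_records_for_protein("Protein_2") in this toy data) that still leaves Gene_1 available for further statements.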

2) What is evidence?

In our demo, we are using the ‘evidence codes ontology’ with some small additions. The ‘evidence codes’ are subclasses of ‘report’, which is a subclass of ‘textual_thing’. Examples are ‘immunulogical_cross_reaction’, ‘similar_substrate_specifity’, ‘inferred from genomic analysis’, ‘inferred from bioassay’ etc.
Most of these classes would be better represented as processes, e.g. processes defined in an ontology of biological experimental procedures: the experiments and procedures ‘immunological cross reaction’, ‘comparison of substrate specificities’, ‘genomic analysis’, ‘bioassay’.
Of course, evidence for the existence of a certain biological entity can also be found in journal papers, books or similar things. I guess we should keep our constructs for the description of evidence relatively loose. However, as in the section above, it would again be preferable if we tried to introduce as few abstractions and artefacts as possible, and relied on direct descriptions of experimental procedures (processes) for evidence statements.
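To make the "evidence as process" idea concrete, here is a minimal sketch in which a statement points directly at the experimental procedures that support it, rather than at a textual evidence code. The class names, field names and the example statement are all hypothetical and are not part of the demo ontology.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ExperimentalProcess:
    """A procedure that was carried out, e.g. a bioassay or a genomic analysis."""
    kind: str


@dataclass
class Statement:
    """A subject-predicate-object claim plus the processes offered as evidence."""
    subject: str
    predicate: str
    obj: str
    supported_by: List[ExperimentalProcess] = field(default_factory=list)


# Illustrative only: the link between this claim and a bioassay is made up.
stmt = Statement(
    "Protein_2",
    "encoded_by",
    "Gene_1",
    supported_by=[ExperimentalProcess("bioassay")],
)
```

Journal articles or books could be attached in the same loose way through a second, separate field, which keeps the process-based evidence distinct from textual sources.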

3) How are information resources (e.g. the very abstract ‘database entry’, or the slightly less abstract ‘XML document associated with a database entry’) best represented in BFO-friendly ontologies?

These entities seem to be in conflict with the realism of BFO-friendly ontologies, yet we need to represent them somehow. This is probably a discussion for the BFO Google Group, but I have not managed to get it started so far.
Currently, we are classifying several such entities under bfo:Object, e.g. protein records, MeSH qualifiers, terms, notes and journal articles. I suspect that this might be a problem.

These issues will be discussed in the BioRDF (BioOnt?) teleconference tomorrow.

cheers,
Matthias Samwald

----------

Yale Center for Medical Informatics, New Haven /
Section on Medical Expert and Knowledge-Based Systems, Vienna /
http://neuroscientific.net


Received on Sunday, 10 June 2007 20:21:12 UTC