(Annotea, OWL, LSID)
A Framework for Annotating High-Throughput Genome-wide Screens

Position Paper for W3C Workshop on Semantic Web for Life Sciences, October 2004

Authors:: Josh Moore (DKFZ) <j.moore@dkfz.de>; Stefan Frank (DKFZ) <s.frank@dkfz.de>; Roland Eils (DKFZ) <r.eils@dkfz.de>
This Version:: http://www.dkfz.de/ibios/publications/2004/SW-LS-Annotea-20040910/
Latest Version:: http://www.dkfz.de/ibios/publications/2004/SW-LS-Annotea/

But we're trying. The European Commission's Sixth Framework Programme (http://europa.eu.int/comm/research/fp6/index_en.html) is sponsoring a four-year project, MitoCheck (http://www.mitocheck.org), to examine mitosis in its chemical-biological, proteomic, and genomic aspects. Part of this work includes four genome-wide screens, each about 8 terabytes of 3D movies, using RNA interference (RNAi) and live-cell microscopy to decode the effect of gene knock-outs on cellular mitotic phenotypes.

At the same time, the German National Genomic Research Network (NGFN, http://www.ngfn.de) is funding a supplementary project, SMP-RNAi (http://www.pt-it.de/ngfn/pd/show_details.php?id=63), to take the same data and evaluate the disease-related phenotypes that emerge from these controlled genotypic mutations. Our group is responsible for the bioinformatics infrastructure that automatically classifies phenotypes from these videos.

We have joined the Open Microscopy Environment (OME, http://www.openmicroscopy.org), an effort to develop a database-driven system for the quantitative analysis of biological images. At the core of the OME project is an object model for microscopic images available as XML Schema [OME-XSD Available at: http://www.openmicroscopy.org/api/xml/] with support for Life Science Identifiers (LSIDs, http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02).

We are now extending the existing OME annotation framework to permit the attachment of ontologically defined concepts to LSID entities. Our goal is the long-term, large-scale correlation of phenotypes and genotypes, which, we think, fits nicely with the goal statement of the workshop.

Below we briefly outline the steps leading to such a framework. Some of them we are actively pursuing; for others we are seeking collaboration or would like to see further standardization work.

Steps

Define need

Our primary goal is storing our microscopic data and our knowledge about that data in a meaningfully linked way. Information from image acquisition parameters, log books, automatic and manual evaluation, segmentation, etc. should be made available with explicit semantics. This information should be accessible by both humans and machines with standards-compliant and open-source tools.

Choose technologies

These needs brought us to the Semantic Web. We assume that the reasons for choosing RDF (Resource Description Framework, http://www.w3.org/RDF/), OWL, and LSID, the trio composing our original proposal, are obvious to anyone attending the workshop. However, provenance data, authorship, versioning, trust, and other properties of relationships that would be useful in such a framework would have been difficult to express with these three alone. To overcome this, we chose to link all statements to a container, an annotation, which has as its subject an LSID and as its object an OWL concept.

This basic pattern is described in ["Defining N-ary Relations..."(Pattern 2) Available at: http://www.w3.org/TR/swbp-n-aryRelations/], and is well represented by the Annotea model. Annotea as a method for grouping statements has the added benefits of available servers and defined protocols. This choice of technologies also permits the use of several existing clients including Protege for authoring OWL (and eventually annotations), Amaya, Annozilla, Annogates, Janno, etc. for reading annotations, and the LSID LaunchPad for viewing LSIDs.
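As a rough illustration of the container pattern described above, the following Python sketch builds one annotation as a list of plain (subject, predicate, object) triples. The Annotea, RDF, and Dublin Core namespaces are the published ones; the `EX` namespace, the function name `make_annotation`, and all identifiers in the usage lines are hypothetical and only serve the example. It is a minimal sketch, not the schema we will finally adopt.

```python
# Published namespaces used by Annotea annotations.
A = "http://www.w3.org/2000/10/annotation-ns#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
# Hypothetical namespace standing in for a domain ontology.
EX = "http://example.org/phenotypes#"


def make_annotation(annotation_id, lsid, concept, creator, created,
                    annotation_type=A + "Annotation"):
    """Return the triples of one annotation container.

    The annotation links an LSID (what is annotated) to an OWL concept
    (the body) and carries who said it and when.
    """
    return [
        (annotation_id, RDF + "type", annotation_type),
        (annotation_id, A + "annotates", lsid),
        (annotation_id, A + "body", concept),
        (annotation_id, DC + "creator", creator),
        (annotation_id, A + "created", created),
    ]


# Usage with made-up identifiers (assumptions, not real data).
example = make_annotation(
    "urn:example:annotation:1",
    "urn:lsid:example.org:image:42",
    EX + "MitoticArrest",
    "mailto:curator@example.org",
    "2004-09-10T12:00:00Z",
)
for triple in example:
    print(triple)
```

Because authorship and creation time hang off the annotation rather than off the LSID or the concept, the same LSID can carry several, possibly conflicting, statements from different sources.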

A typical use-case

After running an algorithm over an entire genome-wide screen (~30,000 movies) to detect phenotypic classes, a biologist poses a query: "what genes lead to a certain phenotype (death, mitosis, mutation, etc.)?" or roughly:

      select ?gene where

           (?annotation1, a:annotates, ?image) 
           (?annotation1, rdf:type,    ex:Phenotype) 
           (?annotation1, a:body,      ex:somePhenotype) 

           (?annotation2, a:annotates, ?image) 
           (?annotation2, rdf:type,    ex:KnockedOutGene) 
           (?annotation2, a:body,      ?gene) ;

where ex:Phenotype and ex:KnockedOutGene are subclasses of a:Annotation.
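To make the join in this query concrete, here is a minimal Python sketch that evaluates it over an in-memory list of triples. It is not a query engine and not the way an Annotea server answers queries; the helper names and the sample LSIDs are hypothetical. The two annotation patterns are joined on the shared ?image variable.

```python
def subjects(triples, predicate, obj):
    """All subjects s with a triple (s, predicate, obj)."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def objects(triples, subject, predicate):
    """All objects o with a triple (subject, predicate, o)."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def genes_for_phenotype(triples, phenotype):
    """Genes whose knock-out is annotated on an image showing `phenotype`."""
    # First pattern: phenotype annotations whose body is the phenotype.
    images = set()
    for annotation in subjects(triples, "rdf:type", "ex:Phenotype"):
        if phenotype in objects(triples, annotation, "a:body"):
            images |= objects(triples, annotation, "a:annotates")
    # Second pattern: knocked-out-gene annotations on the same images.
    genes = set()
    for annotation in subjects(triples, "rdf:type", "ex:KnockedOutGene"):
        if objects(triples, annotation, "a:annotates") & images:
            genes |= objects(triples, annotation, "a:body")
    return genes


# Made-up sample data: two images, each with a phenotype and a gene.
SAMPLE = [
    ("ann1", "a:annotates", "urn:lsid:example.org:image:1"),
    ("ann1", "rdf:type", "ex:Phenotype"),
    ("ann1", "a:body", "ex:somePhenotype"),
    ("ann2", "a:annotates", "urn:lsid:example.org:image:1"),
    ("ann2", "rdf:type", "ex:KnockedOutGene"),
    ("ann2", "a:body", "urn:lsid:example.org:gene:A"),
    ("ann3", "a:annotates", "urn:lsid:example.org:image:2"),
    ("ann3", "rdf:type", "ex:Phenotype"),
    ("ann3", "a:body", "ex:otherPhenotype"),
    ("ann4", "a:annotates", "urn:lsid:example.org:image:2"),
    ("ann4", "rdf:type", "ex:KnockedOutGene"),
    ("ann4", "a:body", "urn:lsid:example.org:gene:B"),
]

print(genes_for_phenotype(SAMPLE, "ex:somePhenotype"))
```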

This returns a list of gene LSIDs, which are then viewed in the LSID LaunchPad. Further, while looking at one of these LSIDs, the user adds manual annotations, possibly links to other sources of information, through an Annotea client.

Primary extensions

OWL in Annotea

The core of the framework is a system of Annotea annotation subtypes. We need specific body types which have as their range not just HTML, XML, or text but particular ontological concepts.

We will be working to integrate OWL and Annotea and develop a system of Annotea subclasses with specific meanings for our domain. Collaboration to produce generally useful subclasses is certainly welcomed.

XPointer for LSID

OWL statements alone are not enough. Currently, Annotea defines its "contexts" with XPointer (http://www.w3.org/TR/WD-xptr). To fully use the Annotation schema on LSIDs, we need to point to specific sub-LSIDs within the context of an LSID. This means XPointer functionality over the RDF metadata of an LSID.

For this, further work within the community is needed.

Extending the addressing schema for Annotea

Further, we would like to see a general strengthening of the Annotea addressing schema. In our area, SVG outlines would be of obvious benefit.

Further extensions

Trust in Annotea

One of the main reasons for using containers for the annotations is to know who said what, and when. Trust in Semantic Web systems is widely discussed, but no best practice has yet emerged.

Currently we are simply restricting write access to the annotation database to trusted individuals and enabling user/group based queries. A better solution, however, would be a system of trust annotations. Related efforts include those of WOT (http://xmlns.com/wot/0.1/), Rdf Bookmark (http://web.sfc.keio.ac.jp/~kaz/www2004/slides/ns/), and the former MedCERTAIN (http://www.medcertain.org/), now MedCIRCLE (http://www.medcircle.org/).
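The user/group based queries mentioned above can be pictured with a short Python sketch. It assumes, purely for illustration, that every annotation carries a dc:creator statement and that group membership is kept in a simple dictionary; the names `GROUPS` and `annotations_by_group` are hypothetical and do not describe our server's actual interface.

```python
# Hypothetical group table; in practice this would live in the database.
GROUPS = {
    "curators": {"mailto:alice@example.org", "mailto:bob@example.org"},
    "students": {"mailto:carol@example.org"},
}


def annotations_by_group(triples, group, groups=GROUPS):
    """Keep only the triples of annotations created by members of `group`."""
    members = groups.get(group, set())
    trusted = {s for s, p, o in triples if p == "dc:creator" and o in members}
    return [t for t in triples if t[0] in trusted]


# Made-up annotations from two different creators.
ANNOTATIONS = [
    ("ann1", "dc:creator", "mailto:alice@example.org"),
    ("ann1", "a:body", "ex:somePhenotype"),
    ("ann2", "dc:creator", "mailto:carol@example.org"),
    ("ann2", "a:body", "ex:otherPhenotype"),
]

print(annotations_by_group(ANNOTATIONS, "curators"))
```

A system of trust annotations would replace the fixed group table with statements that are themselves annotated and queryable.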

Retractions in Annotea

Also vital is the ability to retract a statement already made, whether it has been proved false or superseded.

Simply deleting annotations is certainly less than optimal, but obvious fallacies must quickly be taken care of. Retraction should work together with the trust system to eventually allow "proofs". Naturally, this will require considerable community agreement.
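One conceivable alternative to deletion, sketched below in Python, is to record a retraction as a further annotation whose target is the retracted annotation, and to filter at query time. The class name ex:Retraction and the function `active_annotations` are assumptions made for this example only; no such type exists in the Annotea schema, and retractions of retractions are deliberately not handled.

```python
def active_annotations(triples):
    """Annotation ids that are neither retracted nor retractions themselves."""
    retractions = {s for s, p, o in triples
                   if p == "rdf:type" and o == "ex:Retraction"}
    # A retraction points at the annotation it withdraws via a:annotates.
    retracted = {o for s, p, o in triples
                 if s in retractions and p == "a:annotates"}
    annotations = {s for s, p, o in triples if p == "a:annotates"}
    return annotations - retracted - retractions


# Made-up data: ann1 is withdrawn by ret1, ann2 stays valid.
STORE = [
    ("ann1", "a:annotates", "urn:lsid:example.org:image:1"),
    ("ann2", "a:annotates", "urn:lsid:example.org:image:2"),
    ("ret1", "rdf:type", "ex:Retraction"),
    ("ret1", "a:annotates", "ann1"),
]

print(active_annotations(STORE))
```

The retracted statement stays in the store, so who retracted what, and when, remains available to a later trust or proof mechanism.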

New and modified clients

Annotea in Protege

Once OWL classes have been added to Annotea annotations, Protege becomes a possible authoring client. A plugin will need to be created that can log in to the server and download a Protege project with the necessary ontologies imported and forms pre-formatted for ease of use.

LSID in Mozilla

A gap that needs to be filled for the acceptance of LSIDs is a LaunchPad for open-source browsers. We would like to see an LSID plugin developed for Mozilla. Starting in 2005, we will hopefully have two interns dedicated to this work.

Cooperation between Annotea and LSID plugins

Further, it would obviously be beneficial if the plugins for Annotea and LSID knew how to interact. This involves the Annotea client knowing what an "lsidres:urn:lsid:..." URL means as well as how to annotate the metadata that is resolved. In general, this involves treating LSIDs as generic web resources. See also "XPointer for LSID" above.
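As a first step toward such cooperation, a client has to recognize an LSID inside an "lsidres:" URL. The Python sketch below splits an LSID into the parts defined by the LSID specification (authority, namespace, object, optional revision). The type and function names are hypothetical, validation is kept deliberately minimal, and resolving the LSID to its data and metadata is not attempted.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(url):
    """Parse 'urn:lsid:authority:namespace:object[:revision]'.

    An optional leading 'lsidres:' prefix is stripped first.
    Raises ValueError if the string is not shaped like an LSID.
    """
    prefix = "lsidres:"
    if url.lower().startswith(prefix):
        url = url[len(prefix):]
    parts = url.split(":")
    # The 'urn:lsid' prefix is compared case-insensitively.
    if (len(parts) not in (5, 6)
            or parts[0].lower() != "urn"
            or parts[1].lower() != "lsid"):
        raise ValueError("not an LSID: %r" % url)
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


# Usage with a made-up LSID.
print(parse_lsid("lsidres:urn:lsid:example.org:image:42"))
```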

Making the framework available

Even before many of these extensions take place, we will begin the work of manually and automatically annotating the genomic screens. Once complete, the primary data and annotations will be made available via the web and LSID. If we've managed to solve the trust issues by then, the public will also be able to store their annotations on the servers.

The framework itself will also be made available, along with tools for viewing and authoring annotations. We would like to see this domain-neutral framework implemented in other domain areas and would look forward to discussion in the workshop.

The Future

Certainly this and any Semantic Web project will need to keep up with changing technology. As the questions regarding reification and named graphs are resolved, the Annotea scheme should be adapted to take advantage of these capabilities. We will also be keeping an eye on what happens to the Semantic Web Rule Language (SWRL, http://www.w3.org/Submission/2004/SUBM-SWRL-20040521/) submission.

A perhaps far-reaching vision for such a framework is a shared or even global knowledge base for biomedical data with support for versioning, security, and trust. That, of course, will have to wait for another workshop.
