"...and we certainly can't take an existing blueprint and understand how changes in that blueprint affect an organism...in the course of a disease."
-Lewis-Sigler Institute for Integrative Genomics, "Overview"
But we're trying. The European Commission's Sixth Framework Programme (http://europa.eu.int/comm/research/fp6/index_en.html) is sponsoring a four-year project, MitoCheck (http://www.mitocheck.org), to examine mitosis in its chemical-biological, proteomic, and genomic aspects. Part of this work comprises four genome-wide screens, each about 8 terabytes of 3D movies, which use "ribonucleic acid interference" (RNAi) and live-cell microscopy to decode the effect of gene knock-outs on cellular mitotic phenotypes.
At the same time, the German National Genomic Research Network (NGFN, http://www.ngfn.de) is funding a supplementary project–SMP-RNAi (http://www.pt-it.de/ngfn/pd/show_details.php?id=63)–to take the same data and evaluate the disease-related phenotypes which emerge from these controlled genotypic mutations. Our group is responsible for the bioinformatics infrastructure to automatically classify phenotypes from these videos.
We have joined the Open Microscopy Environment (OME, http://www.openmicroscopy.org), an effort to develop a database-driven system for the quantitative analysis of biological images. At the core of the OME project is an object model for microscopic images available as XML Schema [OME-XSD Available at: http://www.openmicroscopy.org/api/xml/] with support for Life Science Identifiers (LSIDs, http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02).
We are now extending the existing OME annotation framework to permit the attachment of ontologically-defined concepts to LSID entities. Our goal is the long-term, large-scale correlation of phenotypes and genotypes, which, we think, fits nicely with the goal statement of the workshop.
We will bring to the workshop and the SW-LS community:
Below we briefly outline some of the steps leading to such a framework, some of which we are actively pursuing, for some of which we are seeking collaboration, and on some of which we would like to see further standardization work.
Our primary goal is storing our microscopic data and our knowledge about that data in a meaningfully linked way. Information from image acquisition parameters, log books, automatic and manual evaluation, segmentation, etc. should be made available with explicit semantics. This information should be accessible by both humans and machines with standards-compliant and open-source tools.
These needs brought us to the Semantic Web. We'll assume that the reasons for choosing RDF (Resource Description Framework, http://www.w3.org/RDF/), OWL, and LSID, the trio composing our original proposal, are obvious to anyone attending the workshop. However, provenance data, authorship, versioning, trust, and other properties of relationships that would be useful in such a framework would have been difficult to express with these three alone. To overcome this, we chose to link all statements to a container, an annotation, which has as its subject an LSID and as its object an OWL concept.
This basic pattern is described in ["Defining N-ary Relations..."(Pattern 2) Available at: http://www.w3.org/TR/swbp-n-aryRelations/], and is well represented by the Annotea model. Annotea as a method for grouping statements has the added benefits of available servers and defined protocols. This choice of technologies also permits the use of several existing clients including Protege for authoring OWL (and eventually annotations), Amaya, Annozilla, Annogates, Janno, etc. for reading annotations, and the LSID LaunchPad for viewing LSIDs.
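The container pattern can be illustrated with a small sketch. It assumes that an annotation is a resource carrying the Annotea properties annotates and body, plus Dublin Core creator and date for provenance. The example ontology namespace, the LSIDs, the annotation identifier, and the helper name make_annotation are all hypothetical and serve only as illustration; triples are plain Python tuples rather than the output of any particular RDF library.

```python
# Namespaces. A is the Annotea schema; EX is a placeholder for a domain ontology.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
A = "http://www.w3.org/2000/10/annotation-ns#"
DC = "http://purl.org/dc/elements/1.1/"
EX = "http://example.org/mitosis#"  # hypothetical ontology namespace


def make_annotation(ann_id, ann_type, target_lsid, concept, creator, date):
    """Return the (subject, predicate, object) triples of one annotation container.

    The annotation links an LSID (what is annotated) to an OWL concept (the body)
    and records who made the statement and when.
    """
    return [
        (ann_id, RDF + "type", ann_type),
        (ann_id, A + "annotates", target_lsid),
        (ann_id, A + "body", concept),
        (ann_id, DC + "creator", creator),
        (ann_id, DC + "date", date),
    ]


# Hypothetical example: one movie is annotated with one phenotype concept.
triples = make_annotation(
    ann_id="http://example.org/annotations/1",
    ann_type=EX + "Phenotype",
    target_lsid="urn:lsid:example.org:image:42",
    concept=EX + "MitoticArrest",
    creator="urn:lsid:example.org:person:7",
    date="2004-09-01",
)

for triple in triples:
    print(triple)
```

Because every statement hangs off the annotation resource, properties such as authorship or a version number attach to the container and not to the image or the concept.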
After running an algorithm over an entire genome-wide screen (~30,000 movies) to detect phenotypic classes, a biologist poses a query: "what genes lead to a certain phenotype (death, mitosis, mutation, etc.)?" or roughly:
    select ?gene
    where (?annotation1, a:annotates, ?image)
          (?annotation1, rdf:type, ex:Phenotype)
          (?annotation1, a:body, ex:somePhenotype)
          (?annotation2, a:annotates, ?image)
          (?annotation2, rdf:type, ex:KnockedOutGene)
          (?annotation2, a:body, ?gene) ;
where ex:Phenotype and ex:KnockedOutGene are subclasses of a:Annotation. This returns a list of gene LSIDs which are then viewed in the LSID LaunchPad. Further, while looking at this LSID, the user adds manual annotations, possibly links to other sources of information, through an Annotea client.
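To make the join in this query concrete, here is a minimal sketch that evaluates it over an in-memory list of triples. It is not how an Annotea server or an RDF query engine would answer the query. It assumes each annotation has exactly one type, one target, and one body, and all identifiers in the sample data are hypothetical.

```python
def genes_for_phenotype(triples, phenotype):
    """Return sorted gene ids whose knock-out image is annotated with `phenotype`.

    Assumes each annotation has a single rdf:type, a:annotates and a:body value.
    """
    types, targets, bodies = {}, {}, {}
    for subject, predicate, obj in triples:
        if predicate == "rdf:type":
            types[subject] = obj
        elif predicate == "a:annotates":
            targets[subject] = obj
        elif predicate == "a:body":
            bodies[subject] = obj

    # Images that carry a Phenotype annotation with the requested body.
    images = {
        targets[ann]
        for ann, ann_type in types.items()
        if ann_type == "ex:Phenotype"
        and bodies.get(ann) == phenotype
        and ann in targets
    }

    # Genes named by KnockedOutGene annotations on those same images.
    genes = {
        bodies[ann]
        for ann, ann_type in types.items()
        if ann_type == "ex:KnockedOutGene"
        and ann in bodies
        and targets.get(ann) in images
    }
    return sorted(genes)


# Hypothetical sample data: two movies, each with a phenotype and a gene annotation.
sample = [
    ("ann1", "rdf:type", "ex:Phenotype"),
    ("ann1", "a:annotates", "urn:lsid:example.org:image:1"),
    ("ann1", "a:body", "ex:MitoticArrest"),
    ("ann2", "rdf:type", "ex:KnockedOutGene"),
    ("ann2", "a:annotates", "urn:lsid:example.org:image:1"),
    ("ann2", "a:body", "urn:lsid:example.org:gene:A"),
    ("ann3", "rdf:type", "ex:Phenotype"),
    ("ann3", "a:annotates", "urn:lsid:example.org:image:2"),
    ("ann3", "a:body", "ex:Death"),
    ("ann4", "rdf:type", "ex:KnockedOutGene"),
    ("ann4", "a:annotates", "urn:lsid:example.org:image:2"),
    ("ann4", "a:body", "urn:lsid:example.org:gene:B"),
]

print(genes_for_phenotype(sample, "ex:MitoticArrest"))
```

The shared ?image variable is what joins the two annotations: the phenotype and the knocked-out gene are never linked directly, only through the movie they both annotate.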
The core of the framework is a system of Annotea annotation subtypes. We need specific body types which have as their range not just HTML, XML, or text but particular ontological concepts.
We will be working to integrate OWL and Annotea and develop a system of Annotea subclasses with specific meanings for our domain. Collaboration to produce generally useful subclasses is certainly welcomed.
OWL statements alone are not enough. Currently, Annotea defines its "contexts" with XPointer (http://www.w3.org/TR/WD-xptr). To fully use the Annotation schema on LSIDs, we need to point to specific sub-LSIDs within the context of an LSID. This means XPointer functionality over the RDF metadata of an LSID.
For this, further work within the community is needed.
Further, we would like to see a general strengthening of the Annotea addressing schema. In our area, SVG outlines would be of obvious benefit.
One of the main reasons for using the containers for the annotations is to know who said what when. It seems there's a lot of talk going on about getting trust into SW systems, but a best practice is still unclear.
Currently we are simply restricting write access to the annotation database to trusted individuals and enabling user/group based queries. A better solution, however, would be a system of trust annotations. Related efforts include those of WOT (http://xmlns.com/wot/0.1/), Rdf Bookmark (http://web.sfc.keio.ac.jp/~kaz/www2004/slides/ns/), and the former MedCERTAIN (http://www.medcertain.org/), now MedCIRCLE (http://www.medcircle.org/).
Also vital is the ability to retract a statement already made, whether it has been proved false or superseded.
Simply deleting annotations is certainly less than optimal, but obvious errors must be dealt with quickly. Retraction should work together with the trust system to eventually allow "proofs". Naturally, this will require considerable agreement within the community.
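One conceivable alternative to deletion, given here only as a hedged sketch and not as an agreed practice, is to record a retraction as a further statement about an annotation and to filter retracted annotations at query time. The record layout, the field names (id, retracts, by, date, reason), and the function name active_annotations are assumptions made for this illustration.

```python
def active_annotations(annotations, retractions):
    """Return the annotations that no retraction record points at.

    Retracted annotations stay in the store, so who said what when is preserved.
    """
    retracted_ids = {r["retracts"] for r in retractions}
    return [a for a in annotations if a["id"] not in retracted_ids]


# Hypothetical store contents.
annotations = [
    {"id": "ann1", "annotates": "urn:lsid:example.org:image:1", "body": "ex:Death"},
    {"id": "ann2", "annotates": "urn:lsid:example.org:image:1", "body": "ex:MitoticArrest"},
]
retractions = [
    {"retracts": "ann1", "by": "urn:lsid:example.org:person:7",
     "date": "2004-10-01", "reason": "superseded by ann2"},
]

print([a["id"] for a in active_annotations(annotations, retractions)])
```

Since a retraction is itself attributed and dated, a trust system could later decide whose retractions to honour.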
Once OWL classes have been added to Annotea annotations, Protege becomes a possible authoring client. A plugin will need to be created which can log in to the server and download a Protege project with the necessary ontologies imported and with pre-formatted forms for ease of use.
A gap that needs to be filled for the acceptance of LSIDs is a LaunchPad for open-source browsers. We would like to see an LSID plugin developed for Mozilla. Starting in 2005, we will hopefully have two interns dedicated to this work.
Further, it would obviously be beneficial if the plugins for Annotea and LSID knew how to interact. This involves the Annotea client knowing what an "lsidres:urn:lsid:..." URL means as well as how to annotate the metadata that is resolved. In general, this involves viewing LSIDs as a generic web resource. See also "XPointer for LSID" above.
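As a first step toward such interaction, a client has to recognise the URL form and split the LSID into its parts. The following minimal sketch assumes the usual LSID layout urn:lsid:authority:namespace:object with an optional trailing revision. It performs no resolution, and the names Lsid and parse_lsid as well as the example identifiers are hypothetical.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(url: str) -> Lsid:
    """Split an LSID, optionally prefixed with "lsidres:", into its components."""
    # Drop the "lsidres:" scheme if the LSID arrives as a browser link.
    if url.lower().startswith("lsidres:"):
        url = url[len("lsidres:"):]

    parts = url.split(":")
    if len(parts) not in (5, 6):
        raise ValueError("not an LSID: %r" % url)
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID: %r" % url)
    if not all(parts[2:5]):
        raise ValueError("LSID has an empty component: %r" % url)

    # The revision is optional; an empty trailing field counts as absent.
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return Lsid(parts[2], parts[3], parts[4], revision)


print(parse_lsid("lsidres:urn:lsid:example.org:image:42"))
print(parse_lsid("urn:lsid:example.org:image:42:2"))
```

An Annotea client could use the parsed form as the key under which annotations on the resolved metadata are stored and retrieved.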
Even before many of these extensions take place, we will begin the work of manually and automatically annotating the genomic screens. Once complete, the primary data and annotations will be made available via the web and LSID. If we've managed to solve the trust issues by then, the public will also be able to store their annotations on the servers.
The framework itself will also be made available, along with tools for viewing and authoring annotations. We would like to see this domain-neutral framework implemented in other domain areas and would look forward to discussion in the workshop.
Certainly this and any Semantic Web project will need to keep up with changing technology. As the questions regarding reification and named graphs are resolved, the Annotea scheme should be adapted to take advantage of these capabilities. We will also be keeping an eye on what happens to the Semantic Web Rule Language (SWRL, http://www.w3.org/Submission/2004/SUBM-SWRL-20040521/) submission.
A perhaps far-reaching vision for such a framework is a shared or even global knowledge base for biomedical data with support for versioning, security, and trust. That, of course, will have to wait for another workshop.