Towards a semantic web for chemistry in lifescience

Peter Murray-Rust, University of Cambridge, UK, pm286@cam.ac.uk
Henry S. Rzepa, Imperial College London, UK, h.rzepa@imperial.ac.uk

Position

Chemistry is a fundamental component of lifescience, both in understanding and application (e.g. new drugs, methodology, etc.). It is noteworthy that many major bioinformatics sites include databases and other resources that rely heavily on chemical information. These include:

molecules of biochemical importance.
biochemical pathways.
chemical reactions.
safety and toxicity data.

Many enzymatic and signalling processes are now explored in silico as well as in vitro. This position paper explores how a chemical semantic web can be constructed to support lifesciences.

Chemical Markup Language - CML

CML was the first scientific markup language¹ and has evolved to become the de facto approach to representation of chemistry and molecules in XML. Its evolution has been heavily informed by the development of W3C technologies and protocols. Currently CML supports:

molecules.
reactions.
spectra.
computational process.
crystals and condensed matter.
physical quantities.

CML is not intended to be a universal language and has been designed to interoperate with other projects such as:

ThermoML² for thermochemistry.
AniML³ for analytical chemistry.

CML also re-uses MathML and SVG.

Many organisations, including publishers and data repositories, have adopted CML. They include the European Patent Office, the US National Cancer Institute and the BioPAX consortium.

We anticipated the importance of the Semantic Web in chemistry⁴,⁵, showing the value of RDF in chemistry, although that work predated the widespread deployment of OWL and RDFS.

Ontological Structure of Chemistry

In many areas chemistry is a well-understood, stable framework, and many concepts of value to the biosciences are over 80 years old. They include:

molecular structure.
quantum mechanics.
reactions and their mechanisms and kinetics.
thermodynamics.
electron transport and photochemistry.

From these foundations, we are currently constructing an extensible framework for ontological support. Among the concepts driving this is a recognition that much physical science requires computation, and that this is best supported in procedural languages. Data representation is still poorly formalised, and here CML is acting as a way of coordinating efforts in standardisation (on which any further ontological development must depend). Although we believe that OWL/RDF offers great benefits to the (bio)chemical community, it is also clear that it is not a universal solution and that a multi-component approach will be required.

We emphasize that ontological application of CML will require extensive software development for each concept. Thus the molecule element has several hundred (Java) methods, including calculation of properties, housekeeping, management of components, etc. Although the idea of representing this in RDF (suggested by TimBL some years ago) is enticing, the infrastructure is not yet present: matrices still need diagonalizing procedurally.
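To make the point about per-concept software concrete, the following is a minimal sketch of two such methods for a molecule, written in Python rather than Java for brevity. The namespace URI, the atomArray/atom/elementType markup and the small table of atomic masses are assumptions made for illustration; the function names are hypothetical and are not part of any CML library.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed CML namespace URI (illustrative; check against the schema in use).
CML_NS = "http://www.xml-cml.org/schema"

# A small CML-style molecule (water), using assumed element/attribute names.
WATER_CML = f"""
<molecule xmlns="{CML_NS}" id="water">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>
"""

# Approximate standard atomic masses for a few elements (illustrative subset).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def element_counts(molecule):
    """Count atoms of each element type in a parsed molecule element."""
    atoms = molecule.iter(f"{{{CML_NS}}}atom")
    return Counter(atom.get("elementType") for atom in atoms)


def molecular_mass(molecule):
    """Sum atomic masses over all atoms; raises KeyError for unknown elements."""
    counts = element_counts(molecule)
    return sum(ATOMIC_MASS[element] * n for element, n in counts.items())


molecule = ET.fromstring(WATER_CML)
print(dict(element_counts(molecule)))
print(round(molecular_mass(molecule), 3))
```

Even these two trivial properties need procedural code and a lookup table; properties that depend on geometry or electronic structure need far more, which is why a purely declarative representation is not yet sufficient.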

There was a highly valuable BOF session on ontologies at the recent UK eScience Allhands meeting. It was clear that the levels of ontological support that disciplines needed and could support were domain-dependent. Informed by that, we are developing the following hierarchy for Chemical Markup Language and related ontologies.

CML Schema. CML is modular and we experimented with a namespaced design for the major disciplines. However, the borderlines between them are so variable that we split the language into individual components (about 100 XML elements and about 100 attributes). See http://wwmm.ch.cam.ac.uk/moin/ChemicalMarkupLanguage for examples (molecule, atom, bond, reaction, etc.). Because of the robustness of chemistry these components can be separated and their ontology is largely context-independent (e.g. the same software can process a molecule independently of the rest of the information with which it co-occurs).

We have therefore created a "build-your-own" schema approach in which a community chooses just those CMLElements (and any local ones) that it needs and assembles a schema on the fly. This schema is then used automatically to generate Java, Python, C++ (and Fortran90) code so that applications can be assembled.
Dictionaries. Much chemistry and physical science in general can be represented by general data structures (scalar, array, matrix) and data types (float, integer) with accompanying error estimates and scientific units. Each concept is then described by a dictionary entry to which a data instance can be linked. Thus "melting point" is represented by a scalar and a link to a namespaced dictionary entry (e.g. cml:mpt). This concept has been tested for enzyme reactions and computational chemistry. The dictionaries, in XML, are being extended to include XML Schema-like mechanisms for on-the-fly validation.
RDF/OWL. This is being developed for:
- maintenance of the system, especially the dictionaries.
- management of metadata and history as a job travels through the workflow.
- CMLRSS - an extension of RSS/RDF to include chemical objects.
- reasoning about fuzzy properties of objects (e.g. should a molecule include hydrogens, stereochemistry, etc.).
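The dictionary layer described above can be illustrated with a minimal sketch: a data instance carries a prefixed reference (here cml:mpt) that is resolved against a dictionary entry, and the entry's declared data type and units are used to validate the instance on the fly. The XML layout below (the dictionary, entry, property and scalar elements, and the dictRef, dataType, units and errorValue attributes) is a simplified assumption for illustration, without namespaces, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified dictionary: each entry declares a type and units.
DICTIONARY_XML = """
<dictionary prefix="cml">
  <entry id="mpt" term="melting point" dataType="float" units="K"/>
  <entry id="natoms" term="number of atoms" dataType="integer" units="none"/>
</dictionary>
"""

# A data instance: a scalar with an error estimate, linked to cml:mpt.
INSTANCE_XML = """
<property dictRef="cml:mpt">
  <scalar dataType="float" units="K" errorValue="0.5">273.15</scalar>
</property>
"""

# Converters for the data types this sketch understands.
PARSERS = {"float": float, "integer": int}


def load_dictionary(xml_text):
    """Map 'prefix:id' references to the attributes of each dictionary entry."""
    root = ET.fromstring(xml_text)
    prefix = root.get("prefix")
    return {f"{prefix}:{e.get('id')}": dict(e.attrib) for e in root.iter("entry")}


def resolve_scalar(xml_text, dictionary):
    """Resolve a property's dictRef and validate its scalar against the entry."""
    prop = ET.fromstring(xml_text)
    entry = dictionary[prop.get("dictRef")]  # KeyError if the concept is unknown
    scalar = prop.find("scalar")
    if scalar.get("dataType") != entry["dataType"]:
        raise ValueError("data type does not match dictionary entry")
    if scalar.get("units") != entry["units"]:
        raise ValueError("units do not match dictionary entry")
    value = PARSERS[entry["dataType"]](scalar.text)
    error = float(scalar.get("errorValue", "0"))
    return entry["term"], value, error, scalar.get("units")


dictionary = load_dictionary(DICTIONARY_XML)
print(resolve_scalar(INSTANCE_XML, dictionary))
```

The design choice this illustrates is that the instance stays generic (a scalar with units and an error estimate) while the meaning lives in the dictionary, so new concepts can be added without changing the schema.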

Summary

We expect the chemical semantic web to be layered, with XMLSchema as its fundamental implementation and delivery tool. Ontology will be added through a mixture of XMLSchema-like dictionaries and RDF/OWL.

¹ Murray-Rust, P. and Rzepa, H. S., J. Chem. Inf. Comp. Sci., 1999, 39, 928; ibid., 2003, 43, 757-772. See http://cml.sourceforge.net/
² See http://trc.nist.gov/ThermoML-supporting-doc.pdf
³ See http://animl.sourceforge.net/
⁴ G. Gkoutos, P. Murray-Rust, H. S. Rzepa and M. Wright, "Chemical Markup, XML and the World-Wide Web. Part III: Towards a signed semantic Chemical Web of Trust", J. Chem. Inf. Comp. Sci., 2001, 41, 1124.
⁵ P. Murray-Rust, H. S. Rzepa, M. J. Williamson and E. L. Willighagen, "Chemical Markup, XML and the Worldwide Web. Part 5. Applications of Chemical Metadata in RSS 1.0 Aggregators", J. Chem. Inf. Comp. Sci., 2004, 44, 462-469.