Towards a semantic web for chemistry in lifescience

Peter Murray-Rust, University of Cambridge, UK, pm286@cam.ac.uk
Henry S. Rzepa, Imperial College London, UK, h.rzepa@imperial.ac.uk

Position

Chemistry is a fundamental component of lifescience, both in understanding and in application (e.g. new drugs and new methodology). Many major bioinformatics sites include databases and other resources that rely heavily on chemical information. These include:

Many enzymatic and signalling processes are now explored in silico as well as in vitro. This position paper explores how a chemical semantic web can be constructed to support lifesciences.

Chemical Markup Language - CML

CML was the first scientific markup language [1] and has evolved to become the de facto approach to representation of chemistry and molecules in XML. Its evolution has been heavily informed by the development of W3C technologies and protocols. Currently CML supports:

CML is not intended to be a universal language and has been designed to interoperate with other projects such as:

CML also re-uses MathML and SVG.

Many organisations, including publishers and data repositories, have adopted CML. They include the European Patent Office, the US National Cancer Institute and the BioPAX consortium.

We anticipated the importance of the Semantic Web in chemistry [4,5] and showed the value of RDF in chemistry, but that work predated the widespread deployment of OWL and RDFS.

Ontological Structure of Chemistry

In many areas chemistry is a well-understood, stable framework, and many concepts of value to the biosciences are over 80 years old. They include:

From these foundations, we are currently constructing an extensible framework for ontological support. Amongst the concepts driving this is a recognition that much physical science requires computation, and that this is best supported in procedural languages. Data representation is still poorly formalised, and here CML is acting as a way of coordinating efforts in standardisation (on which any further ontological development must depend). Although we believe that OWL/RDF offers great benefits to the (bio)chemical community, it is also clear that it is not a universal solution and that a multi-component approach will be required.

We emphasise that ontological application of CML will require extensive software development for each concept. Thus <molecule> has several hundred (Java) methods, including calculation of properties, housekeeping, management of components, etc. Although the idea of representing this in RDF (suggested by Tim Berners-Lee some years ago) is enticing, the infrastructure is not yet present: matrices still need diagonalizing procedurally.
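To make the point about procedural support concrete, the following is a minimal sketch in Python (not the authors' Java library) of one such method: deriving a property, here the molecular mass, from a CML <molecule>. The water fragment, the atom ids, the function names and the small table of rounded atomic masses are illustrative assumptions, not part of the original text.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"  # CML namespace

# Illustrative CML fragment for water; the ids are arbitrary.
WATER_CML = f"""
<molecule xmlns="{CML_NS}" id="water">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>
"""

# Rounded standard atomic masses for a few elements (assumed table).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def element_counts(cml_text):
    """Count the atoms of each element in a CML molecule."""
    root = ET.fromstring(cml_text.strip())
    counts = {}
    for atom in root.iter(f"{{{CML_NS}}}atom"):
        symbol = atom.get("elementType")
        counts[symbol] = counts.get(symbol, 0) + 1
    return counts


def molecular_mass(cml_text):
    """Sum the atomic masses; a property computed, not stored in the markup."""
    counts = element_counts(cml_text)
    return sum(ATOMIC_MASS[symbol] * n for symbol, n in counts.items())


if __name__ == "__main__":
    print(element_counts(WATER_CML))
    print(round(molecular_mass(WATER_CML), 3))
```

Even this trivial property needs code and reference data that live outside the markup; the several hundred methods mentioned above extend the same pattern to far more demanding calculations.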

There was a highly valuable BOF session on ontologies at the recent UK eScience All Hands meeting. It was clear that the level of ontological support that disciplines needed, and could sustain, was domain-dependent. Informed by that, we are developing the following hierarchy for Chemical Markup Language and related ontologies.

Summary

We expect the chemical semantic web to be layered, with XMLSchema as its fundamental implementation and delivery tool. Ontology will be added through a mixture of XMLSchema-like dictionaries and RDF/OWL.
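The layering can be sketched as follows, again in Python and only as an illustration under stated assumptions. A CML <property> carries a dictRef attribute pointing at a dictionary entry; the entry supplies the meaning, and the resolved data can then be expressed as subject-predicate-object triples of the kind RDF would hold. The "ex:" prefix, the dictionary contents, the predicate names and the function to_triples are all hypothetical.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"  # CML namespace

# Hypothetical dictionary layer: the prefix and entries are illustrative only.
DICTIONARY = {
    "ex:mpt": {"term": "melting point", "unitType": "temperature"},
}

# Data layer: a CML property whose meaning is delegated to the dictionary.
PROPERTY_XML = f"""
<property xmlns="{CML_NS}" dictRef="ex:mpt">
  <scalar dataType="xsd:double" units="ex:celsius">0.0</scalar>
</property>
"""


def to_triples(subject, property_xml, dictionary):
    """Resolve dictRef against the dictionary and emit simple triples."""
    prop = ET.fromstring(property_xml.strip())
    ref = prop.get("dictRef")
    entry = dictionary[ref]  # a KeyError here means the term is undefined
    scalar = prop.find(f"{{{CML_NS}}}scalar")
    return [
        (subject, ref, scalar.text.strip()),      # the datum itself
        (ref, "ex:units", scalar.get("units")),   # units declared in the markup
        (ref, "ex:label", entry["term"]),         # meaning from the dictionary
        (ref, "ex:unitType", entry["unitType"]),
    ]


if __name__ == "__main__":
    for triple in to_triples("ex:water", PROPERTY_XML, DICTIONARY):
        print(triple)
```

In this sketch the XML layer carries the data, the dictionary layer carries the definitions, and the triples are the point at which RDF/OWL tooling could take over.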

  1. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 1999, 39, 928; ibid., 2003, 43, 757-772. See http://cml.sourceforge.net/
  2. See http://trc.nist.gov/ThermoML-supporting-doc.pdf
  3. See http://animl.sourceforge.net/
  4. G. Gkoutos, P. Murray-Rust, H. S. Rzepa and M. Wright, "Chemical Markup, XML and the World-Wide Web. Part III: Towards a signed semantic Chemical Web of Trust", J. Chem. Inf. Comp. Sci., 2001, 41, 1124.
  5. P. Murray-Rust, H. S. Rzepa, M. J. Williamson and E. L. Willighagen, "Chemical Markup, XML and the Worldwide Web. Part 5. Applications of Chemical Metadata in RSS 1.0 Aggregators", J. Chem. Inf. Comp. Sci., 2004, 44, 462-469.