Chemistry is a fundamental component of the life sciences, both in understanding and in application (e.g. new drugs and new methodology). It is noteworthy that many major bioinformatics sites include databases and other resources that rely heavily on chemical information. These include:
CML was the first scientific markup language1 and has evolved to become the de facto approach to representation of chemistry and molecules in XML. Its evolution has been heavily informed by the development of W3C technologies and protocols. Currently CML supports:
CML is not intended to be a universal language and has been designed to interoperate with other projects such as:
CML also re-uses MathML and SVG.
Many organisations, including publishers and data repositories, have adopted CML. They include the European Patent Office, the US National Cancer Institute and the BioPAX consortium.
We anticipated the importance of the Semantic Web in chemistry4,5. That work showed the value of RDF in chemistry, but it predated the widespread deployment of OWL and RDFS.
In many areas chemistry is a well-understood, stable framework, and many concepts of value to the biosciences are over 80 years old. They include:
From these foundations, we are currently constructing an extensible framework for ontological support. Amongst the concepts driving this is a recognition that much physical science requires computation, and that this is best supported in procedural languages. Data representation is still poorly formalised, and here CML is acting as a way of coordinating efforts in standardisation (on which any further ontological development must depend). Although we believe that OWL/RDF offers great benefits to the (bio)chemical community, it is also clear that it is not a universal solution and that a multi-component approach will be required.
We emphasize that ontological application of CML will require extensive software development for each concept. Thus <molecule> has several hundred (Java) methods, including calculation of properties, housekeeping, management of components, etc. Although the idea of representing this in RDF (suggested by TimBL some years ago) is enticing, the infrastructure is not yet present: matrices still need diagonalizing procedurally.
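As a small illustration of why such concepts need procedural code, the sketch below computes one simple property (molecular mass) from a <molecule> element. It is written in Python rather than Java for brevity and is not the CML library's own API: the function name molecular_mass, the rounded mass table and the sample water document are assumptions made for this example. The element and attribute names (atomArray, atom, elementType, bondArray, bond, atomRefs2, order) and the namespace URI are taken from CML Schema 2.x as we understand it.

```python
import xml.etree.ElementTree as ET

# Namespace URI assumed to be the one used by CML Schema 2.x documents.
CML_NS = "http://www.xml-cml.org/schema"

# Rounded standard atomic masses for a few elements (illustrative subset only).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

# A hand-written sample molecule, used only for this sketch.
WATER = """
<molecule xmlns="http://www.xml-cml.org/schema" id="water">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>
"""


def molecular_mass(molecule):
    """Sum the atomic masses of every atom element inside one molecule element."""
    total = 0.0
    for atom in molecule.iter(f"{{{CML_NS}}}atom"):
        # A KeyError here means the element is missing from the small table above.
        total += ATOMIC_MASS[atom.get("elementType")]
    return total


molecule = ET.fromstring(WATER.strip())
print(round(molecular_mass(molecule), 3))
```

Even this trivial property needs a lookup table and a loop over the atoms. Properties that depend on connectivity or on numerical linear algebra need correspondingly more procedural code.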
There was a highly valuable BOF session on ontologies at the recent UK eScience All Hands meeting. It was clear that the levels of ontological support that disciplines needed and could sustain were domain-dependent. Informed by that, we are developing the following hierarchy for Chemical Markup Language and related ontologies.
CML Schema. CML is modular and we experimented with a namespaced design for the major disciplines. However, the borderlines between disciplines are so variable that we split the schema into individual components (about 100 XML elements and about 100 attributes). See http://wwmm.ch.cam.ac.uk/moin/ChemicalMarkupLanguage for examples (molecule, atom, bond, reaction, etc.). Because of the robustness of chemistry these components can be separated and their ontology is largely context-independent (e.g. the same software can process a molecule independently of the rest of the information with which it co-occurs).
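The context-independence described above can be sketched as follows. The host vocabulary (the ex: namespace, report, section and note) is invented for this example, as are the function names element_counts and molecules_in. The point is only that code written against the CML namespace can find and process each molecule without understanding anything else in the document.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace URI assumed to be the one used by CML Schema 2.x documents.
CML_NS = "http://www.xml-cml.org/schema"

# A hypothetical host document: the "ex:" vocabulary is invented for this sketch.
DOC = """
<ex:report xmlns:ex="http://example.org/report"
           xmlns:cml="http://www.xml-cml.org/schema">
  <ex:section title="Reagents">
    <cml:molecule id="methane">
      <cml:atomArray>
        <cml:atom id="a1" elementType="C"/>
        <cml:atom id="a2" elementType="H"/>
        <cml:atom id="a3" elementType="H"/>
        <cml:atom id="a4" elementType="H"/>
        <cml:atom id="a5" elementType="H"/>
      </cml:atomArray>
    </cml:molecule>
  </ex:section>
  <ex:note>Text that the molecule code never needs to understand.</ex:note>
  <cml:molecule id="water">
    <cml:atomArray>
      <cml:atom id="a1" elementType="O"/>
      <cml:atom id="a2" elementType="H"/>
      <cml:atom id="a3" elementType="H"/>
    </cml:atomArray>
  </cml:molecule>
</ex:report>
"""


def element_counts(molecule):
    """Count atoms by element symbol within a single molecule element."""
    return Counter(
        atom.get("elementType") for atom in molecule.iter(f"{{{CML_NS}}}atom")
    )


def molecules_in(root):
    """Find every CML molecule at any depth, ignoring all non-CML markup."""
    return {
        mol.get("id"): element_counts(mol)
        for mol in root.iter(f"{{{CML_NS}}}molecule")
    }


counts = molecules_in(ET.fromstring(DOC.strip()))
for mol_id, composition in counts.items():
    print(mol_id, dict(composition))
```

The same two functions would work unchanged if the molecules were embedded in a different host vocabulary, because they match on the namespaced element name alone.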
We have therefore created a "build-your-own" schema approach in which a community chooses just those CML elements (and any local ones) that it needs and assembles a schema on the fly. This schema is then used automatically to generate Java, Python, C++ (and Fortran90) code so that applications can be assembled.
We expect the chemical semantic web to be layered, with XML Schema as its fundamental implementation and delivery tool. Ontology will be added through a mixture of XML Schema-like dictionaries and RDF/OWL.
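A minimal sketch of the assembly step, under stated assumptions, is shown below. The COMPONENTS table, the assemble_schema function and the local labNote element are hypothetical, and the real CML components carry content models and typed attributes that are omitted here. Every attribute is simply declared as xs:string, and the subsequent code-generation step is not shown.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
CML_NS = "http://www.xml-cml.org/schema"  # assumed CML namespace URI
ET.register_namespace("xs", XS)  # serialise schema elements with the xs: prefix

# Hypothetical component library: element name -> attribute names.
# The real library has about 100 elements; this subset is illustrative only.
COMPONENTS = {
    "molecule": ["id", "title"],
    "atom": ["id", "elementType"],
    "bond": ["id", "atomRefs2", "order"],
    "reaction": ["id", "title"],
}


def assemble_schema(chosen, local=None):
    """Build an XML Schema element tree containing only the chosen components.

    `local` optionally supplies community-specific components in the same
    name -> attribute-list form as COMPONENTS.
    """
    library = dict(COMPONENTS)
    library.update(local or {})
    missing = [name for name in chosen if name not in library]
    if missing:
        raise KeyError(f"unknown components: {missing}")

    schema = ET.Element(f"{{{XS}}}schema", {"targetNamespace": CML_NS})
    for name in chosen:
        element = ET.SubElement(schema, f"{{{XS}}}element", {"name": name})
        complex_type = ET.SubElement(element, f"{{{XS}}}complexType")
        for attribute in library[name]:
            # Simplification: every attribute is declared as a plain string.
            ET.SubElement(
                complex_type,
                f"{{{XS}}}attribute",
                {"name": attribute, "type": "xs:string"},
            )
    return schema


# A community that needs molecules, atoms and bonds plus one local element.
schema = assemble_schema(
    ["molecule", "atom", "bond", "labNote"],
    local={"labNote": ["id", "author"]},
)
print(ET.tostring(schema, encoding="unicode"))
```

Components that were not chosen (here, reaction) do not appear in the assembled schema, so any code generated from it covers only what that community asked for.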