Molecules are central to many areas of bioscience including reaction pathways, enzyme mechanisms, effectors, and metabolism. Many bioscientific domains use data on "small" molecules such as metabolites, hormones and effectors. Much endeavour in the healthcare and pharmaceutical communities is aimed at discovering new molecules with potential therapeutic value. The lack of predictive methods for properties such as ADMETox (absorption, distribution, metabolism, excretion and toxicity) is a major cause of failure in the development process. Although over 20 million small molecules have been published in the primary literature, the available Open resources in mainstream chemistry cover only a very small fraction of these. It is noticeable that the life science community has led the way in aggregating Open data (NCI-DTP data, "PubChem", ebichem, KEGG, etc.). This position paper addresses the systematic identification of small molecules in the life sciences and their potential incorporation into a Life Science Identifier (LSID).
Most small molecules of interest to the life sciences are covalent organic molecules, and these are well covered by the approach outlined below. We generally exclude macromolecules such as proteins, polynucleotides and heterogeneous polysaccharides, but can include peptides, oligonucleotides, oligosaccharides and lipids. Almost all the specific substrates in, say, the KEGG or BRENDA databases will be covered.
The normal methods for identifying chemical substances include:
Most covalent molecules are well defined by a connection table, a labelled graph of the atoms (nodes) and the conventional bonds (edges) between them. The atoms are labelled with elementType, and optionally formalCharge and hydrogenCount. The bonds have a formal order (single, double, etc.). 3D structures are not required, but are easily accommodated using CML, an XML-conforming Chemical Markup Language.1
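A connection table of this kind can be sketched as a small data structure. The following is a minimal illustration only: the class and method names (Atom, Bond, ConnectionTable, neighbours, element_counts) are assumptions made for this example and are not taken from CML or from the INChI software. The atom attributes mirror the elementType, formalCharge and hydrogenCount labels described above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    element_type: str        # e.g. "C", "O"
    formal_charge: int = 0   # optional label
    hydrogen_count: int = 0  # optional label: hydrogens attached to this atom


@dataclass(frozen=True)
class Bond:
    atom1: int      # index of the first atom in the atom list
    atom2: int      # index of the second atom in the atom list
    order: int = 1  # 1 = single, 2 = double, ...


@dataclass
class ConnectionTable:
    atoms: list
    bonds: list

    def neighbours(self, index):
        """Return the indices of the atoms bonded to the atom at `index`."""
        result = []
        for bond in self.bonds:
            if bond.atom1 == index:
                result.append(bond.atom2)
            elif bond.atom2 == index:
                result.append(bond.atom1)
        return result

    def element_counts(self):
        """Count atoms per element, including the implicit hydrogens."""
        counts = Counter()
        for atom in self.atoms:
            counts[atom.element_type] += 1
            if atom.hydrogen_count:
                counts["H"] += atom.hydrogen_count
        return dict(counts)


# Ethanol, CH3-CH2-OH, written as a labelled graph of three heavy atoms.
ethanol = ConnectionTable(
    atoms=[
        Atom("C", hydrogen_count=3),
        Atom("C", hydrogen_count=2),
        Atom("O", hydrogen_count=1),
    ],
    bonds=[Bond(0, 1), Bond(1, 2)],
)
```

Here ethanol.element_counts() gives two carbons, six hydrogens and one oxygen, and ethanol.neighbours(1) lists the two atoms bonded to the central carbon. No 3D coordinates are needed at any point.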
For many molecules the position of bonds, hydrogen atoms, formal charges and the values of bond orders are unequivocal (ethanol will always be written CH3-CH2-OH). For other molecules, however, there are a variety of alternatives:
This means that conventional representations have a degree of arbitrariness and cannot act as unique descriptors.
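One source of this arbitrariness, the order in which atoms are numbered, can be shown with a toy example. The sketch below writes ethanol with two different atom numberings and reduces both to a single numbering-independent form by trying every renumbering and keeping the smallest result. This brute-force approach is an assumption chosen for brevity; it is feasible only for a handful of atoms and is not the canonicalisation algorithm used by the INChI program. The function name canonical_form is hypothetical.

```python
from itertools import permutations


def canonical_form(labels, bonds):
    """Return a numbering-independent form of a small labelled graph.

    labels: list of element symbols, one per atom
    bonds:  list of (atom_index, atom_index, order) tuples
    Tries every renumbering of the atoms and keeps the lexicographically
    smallest (labels, bonds) pair. Cost grows as n!, so toy sizes only.
    """
    n = len(labels)
    best = None
    for perm in permutations(range(n)):
        # perm[old] is the new index given to the atom at index `old`.
        new_labels = [None] * n
        for old, new in enumerate(perm):
            new_labels[new] = labels[old]
        new_bonds = sorted(
            (min(perm[a], perm[b]), max(perm[a], perm[b]), order)
            for a, b, order in bonds
        )
        candidate = (tuple(new_labels), tuple(new_bonds))
        if best is None or candidate < best:
            best = candidate
    return best


# The same molecule (ethanol) numbered from either end of the chain.
ethanol_a = (["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
ethanol_b = (["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
# A different molecule with the same atoms (dimethyl ether, C-O-C).
ether = (["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])

same = canonical_form(*ethanol_a) == canonical_form(*ethanol_b)
different = canonical_form(*ethanol_a) != canonical_form(*ether)
```

The two raw tables for ethanol differ, yet their canonical forms agree, while dimethyl ether remains distinct. Atom numbering is only one of the ambiguities; alternative placements of charges, hydrogens and bond orders need additional normalisation that this sketch does not attempt.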
In addition, many molecules have stereoisomers, where the graphs are identical but the geometrical arrangement at atomic or bond centres differs. Tartaric acid, with two stereocentres, famously has three isomers (D, L and meso). There is an added complication that a sample may contain only a single enantiomer (D-tartaric acid or L-tartaric acid) or an equimolar mixture (DL-tartaric acid). For molecules, the exact representation of stereochemistry is now being addressed.
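The count of three isomers for tartaric acid can be reproduced with a short enumeration. The sketch below assumes a two-centre molecule whose two ends are constitutionally identical, so that reading the configuration from either end describes the same compound. It uses R/S descriptors rather than the D/L names, and the function names are hypothetical. It is not a general stereochemistry algorithm.

```python
from itertools import product


def canonical(config, symmetric):
    """Pick one representative for a configuration read from either end."""
    return min(config, config[::-1]) if symmetric else config


def stereoisomers(n_centres, symmetric):
    """Enumerate distinct R/S assignments for a chain of stereocentres."""
    seen = set()
    for config in product("RS", repeat=n_centres):
        seen.add(canonical(config, symmetric))
    return sorted(seen)


def mirror(config):
    """Mirror image: every centre is inverted."""
    return tuple("S" if centre == "R" else "R" for centre in config)


def is_meso(config):
    """True if a symmetric molecule is identical to its mirror image."""
    return canonical(mirror(config), True) == canonical(config, True)


tartaric = stereoisomers(2, symmetric=True)
# Two centres in an unsymmetrical molecule would give all four combinations.
unsymmetrical = stereoisomers(2, symmetric=False)
```

For the symmetric case the enumeration yields three entries: (R, R) and (S, S), which are mirror images of each other, and (R, S), which is its own mirror image and therefore the meso form. The unsymmetrical case yields four. A racemic (DL) sample is a mixture of two of these isomers rather than a fourth isomer, which is why it needs separate treatment in an identifier.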
The lack of an Open system for representing molecules leads to sloppy semantics. Hydrogen atoms are often omitted as "it is obvious where to add them". Stereochemistry is very frequently omitted and often it is impossible to decide what the molecule is.
The objective of the IUPAC/NIST Chemical Identifier Project is to establish a unique label, the IUPAC Chemical Identifier, which would be a non-proprietary identifier for chemical substances that could be used in printed and electronic data sources. Such an identifier would enable easier linking of diverse data compilations.2
INChI is described at: http://www.iupac.org/projects/2000/2000-025-1-800.html. There have been regular meetings of interested parties with the parallel creation of a program implementing the recommendations, with extensive testing from the authors and elsewhere. Over 500,000 molecules and fragments have so far been submitted, with an almost zero software failure rate.
The INChI program carries out the following:
The INChI is a layered identifier and consists of the following components:
The complete identifier for a molecule is the combination of all of these (some of which may be empty). However, for molecules with incomplete information (e.g. without stereochemistry), an INChI can still be constructed that represents our complete knowledge of the species.
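The layered design can be illustrated by splitting an identifier string into its components. The sketch below assumes the string syntax of the current standard InChI ("InChI=" followed by "/"-separated layers, each after the formula starting with a one-letter prefix), which may differ from the syntax in use when this paper was written. The function names are hypothetical and this is not a full parser. The example strings are those of ethanol and of alanine with and without stereo layers.

```python
def split_layers(inchi):
    """Split an InChI string into an ordered list of (prefix, content) pairs.

    The version and the formula carry no one-letter prefix, so they are
    reported under the names "version" and "formula".
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, *segments = inchi[len(prefix):].split("/")
    layers = [("version", version)]
    for position, segment in enumerate(segments):
        if position == 0 and not segment[:1].islower():
            # A formula starts with an element symbol or a multiplier.
            layers.append(("formula", segment))
        else:
            layers.append((segment[:1], segment[1:]))
    return layers


def has_layer(inchi, key):
    """True if the identifier contains a layer with the given prefix."""
    return any(name == key for name, _ in split_layers(inchi))


ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
alanine_no_stereo = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
alanine_one_enantiomer = alanine_no_stereo + "/t2-/m0/s1"

ethanol_layers = split_layers(ethanol)
```

The alanine pair shows the point made above: the identifier without stereochemistry is still valid, and the more fully specified identifier extends it with further layers ("t", "m", "s") rather than replacing it.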
The INChI is ideally suited for indexing molecules in databases. As an illustration of this, we have transformed the National Cancer Institute's DTP database into CML and indexed this by INChI (http://wwmm.ch.cam.ac.uk/moin/HotTopics lists various interactive services, including creating INChIs). To search for a desired molecule the INChI can be constructed from a graphical entry or conversion of a traditional connection table. Lookup is rapid (milliseconds).
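Because the identifier is a plain string, indexing reduces to an exact-match key lookup. The sketch below uses an in-memory dictionary as a stand-in for a database index. The record identifiers and names are invented placeholders rather than entries from the NCI DTP database, and the identifier strings use the current standard InChI syntax, which is an assumption relative to the version described here.

```python
from collections import defaultdict

# Hypothetical records: (record_id, name, identifier string).
records = [
    ("rec-001", "ethanol", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),
    ("rec-002", "methanol", "InChI=1S/CH4O/c1-2/h2H,1H3"),
    ("rec-003", "ethyl alcohol", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),
]


def build_index(rows):
    """Map each identifier string to the ids of the records that carry it."""
    index = defaultdict(list)
    for record_id, _name, inchi in rows:
        index[inchi].append(record_id)
    return dict(index)


def lookup(index, inchi):
    """Exact-match lookup; returns an empty list for unknown identifiers."""
    return index.get(inchi, [])


index = build_index(records)
hits = lookup(index, "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
```

The two ethanol records are stored under different names but share one identifier, so a single lookup retrieves both. A production system would use a database index on the identifier column instead of a dictionary, with the same exact-match semantics.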
The IUPAC/NIST INChI is offered as a component for LSIDs representing small molecules of importance to life sciences. All software is open.
We acknowledge the contributions of many members of the community, including Alan McNaught and Steve Heller. In particular Dmitrii Tchekhovskoi has implemented the complete INChI algorithms in a batch and graphical environment. CML input has been added by Simon Tyrrell.