The INChI as an LSID for molecules in lifescience

Peter Murray-Rust, University of Cambridge, UK, pm286@cam.ac.uk
Henry Rzepa, Imperial College London, UK, h.rzepa@imperial.ac.uk
Steve Stein, National Institute of Standards and Technology, US, steve.stein@nist.gov

Position

Molecules are central to many areas of bioscience including reaction pathways, enzyme mechanisms, effectors, and metabolism. Many bioscientific domains use data on "small" molecules such as metabolites, hormones, effectors, etc. The target of much endeavour in the healthcare and pharmaceutical communities is to discover new molecules which have potential therapeutic value. The lack of predictive methods for e.g. ADMETox is a major cause of failure in the development process. Although over 20 million small molecules have been published in the primary literature, the available Open resources in mainstream chemistry cover only a very small fraction of these. It is noticeable that the lifescience community has led the way in aggregating Open data (NCI-DTP data, "PubChem", ebichem, KEGG, etc.). This position paper addresses the systematic identification of small molecules in lifesciences and their potential incorporation into a Life Science Identifier (LSID).

Identification of molecules

Most small molecules of interest to lifesciences are covalent organic molecules and these are well covered by the approach outlined below. We generally exclude macromolecules such as proteins, polynucleotides and heterogeneous polysaccharides, but can include peptides, oligonucleotides, oligosaccharides and lipids. Almost all the specific substrates in, say, the KEGG or BRENDA databases will be covered.

The normal methods for identifying chemical substances include:

Registry number. Most collections have a semantically void identifier, which requires a naming authority. The best known is the CAS number from Chemical Abstracts but this is not Open. Most molecular collections in lifesciences have created their own local schemes independently and there are few, if any, mappings between them.
Name. There are many ways of creating valid names for a given compound. Thus "isopropanol" can also be "isopropyl alcohol", "2-hydroxy propane" etc. In addition there are many trivial, trade and proprietary names. Most lists of names are not Open. Although in principle the systematic IUPAC (International Union of Pure and Applied Chemistry) name should be unique, it is rarely used outside patent and other regulatory submissions.
Chemical formula, e.g. C₃H₈O. This is valuable, but as almost all molecules have isomers of one form or another, this alone is not sufficient.
Connection table, in which the explicit connectivity between the individual atoms of a molecule is declared. This is the approach described here.
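
As an illustration of why a formula alone is not sufficient, the following minimal Python sketch builds hand-written connection tables for three isomers of C₃H₈O and shows that they share one formula while differing in connectivity. The table layout and the function names (hill_formula, atom_environments) are assumptions made for this example and are not part of any standard.

```python
from collections import Counter

# Each molecule is (element symbols, bonded index pairs), hydrogens explicit.
# These hand-built tables are illustrative only.
PROPAN_1_OL = (
    ["C", "C", "C", "O"] + ["H"] * 8,
    [(0, 1), (1, 2), (2, 3),
     (0, 4), (0, 5), (0, 6), (1, 7), (1, 8), (2, 9), (2, 10), (3, 11)],
)
PROPAN_2_OL = (
    ["C", "C", "C", "O"] + ["H"] * 8,
    [(0, 1), (1, 2), (1, 3),
     (0, 4), (0, 5), (0, 6), (1, 7), (2, 8), (2, 9), (2, 10), (3, 11)],
)
ETHYL_METHYL_ETHER = (
    ["C", "C", "O", "C"] + ["H"] * 8,
    [(0, 1), (1, 2), (2, 3),
     (0, 4), (0, 5), (0, 6), (1, 7), (1, 8), (3, 9), (3, 10), (3, 11)],
)


def hill_formula(elements):
    """Molecular formula in Hill order: C, then H, then the rest alphabetically."""
    counts = Counter(elements)
    order = []
    if "C" in counts:
        order = ["C"] + (["H"] if "H" in counts else [])
    order += sorted(e for e in counts if e not in order)
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in order)


def atom_environments(elements, bonds):
    """Numbering-independent summary: each atom with its sorted neighbour elements."""
    neighbours = {i: [] for i in range(len(elements))}
    for a, b in bonds:
        neighbours[a].append(elements[b])
        neighbours[b].append(elements[a])
    return sorted((elements[i], tuple(sorted(neighbours[i]))) for i in neighbours)


for name, mol in [("propan-1-ol", PROPAN_1_OL),
                  ("propan-2-ol", PROPAN_2_OL),
                  ("ethyl methyl ether", ETHYL_METHYL_ETHER)]:
    print(name, hill_formula(mol[0]))
```

All three tables give the same formula string, whereas atom_environments returns a different result for each, which is the information a connection table adds.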

Molecular connection tables

Most covalent molecules are well defined by a connection table, a labelled graph of the atoms (nodes) and the conventional bonds (edges) between them. The atoms are labelled with elementType, and optionally formalCharge and hydrogenCount. The bonds have a formal order (single, double, etc.). 3D structures are not required, but are easily accommodated using CML, an XML-conforming Chemical Markup Language.¹
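
A connection table of this kind might be sketched in Python as follows. The class names and the snake_case attribute names (element_type, formal_charge, hydrogen_count) are assumptions that mirror the labels mentioned above; this is not the CML schema itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    element_type: str
    formal_charge: int = 0     # optional label, defaults to neutral
    hydrogen_count: int = 0    # hydrogens attached but not listed as nodes


@dataclass(frozen=True)
class Bond:
    atom1: int                 # index into Molecule.atoms
    atom2: int
    order: int = 1             # formal order: 1 single, 2 double, ...


@dataclass
class Molecule:
    atoms: list
    bonds: list

    def neighbours(self, index):
        """Indices of the atoms bonded to the atom at `index`."""
        result = []
        for bond in self.bonds:
            if bond.atom1 == index:
                result.append(bond.atom2)
            elif bond.atom2 == index:
                result.append(bond.atom1)
        return result

    def total_hydrogens(self):
        return sum(atom.hydrogen_count for atom in self.atoms)


# Ethanol, CH3-CH2-OH, as a three-node graph with hydrogen counts.
ethanol = Molecule(
    atoms=[Atom("C", hydrogen_count=3),
           Atom("C", hydrogen_count=2),
           Atom("O", hydrogen_count=1)],
    bonds=[Bond(0, 1), Bond(1, 2)],
)
print(ethanol.neighbours(1), ethanol.total_hydrogens())
```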

For many molecules the position of bonds, hydrogen atoms, formal charges and the values of bond orders are unequivocal (ethanol will always be written CH₃-CH₂-OH). For other molecules, however, there are a variety of alternatives:

The positions of multiple bonds can be arbitrary (e.g. as in valence or resonance isomers, aromatic systems).
Some hydrogen atoms are mobile (tautomerism). Thus aldehydes can often exist as HR-CH(=O) or R=CH(OH).
Formal charges can be redistributed. Guanidinium can be written NH2-C[+](NH2)-NH2 or NH2[+]=C(NH2)-NH2.
Some species (e.g. sugars) can exist as an equilibrium mixture of several forms.

This means that conventional representations have a degree of arbitrariness and cannot act as unique descriptors.
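
The guanidinium case can be made concrete with a short Python sketch. The two tables below encode the two ways of writing the ion given above; they differ as written, but agree once bond orders are dropped and the formal charges are summed over the whole species. The tuple layout and the function reduce_table are assumptions for this example, and the reduction is a deliberate simplification, not the normalization carried out by any particular program.

```python
# Atom = (element, formal_charge, hydrogen_count); bond = (i, j, order).
# NH2-C[+](NH2)-NH2: charge on carbon, three single bonds.
GUANIDINIUM_A = (
    [("N", 0, 2), ("C", +1, 0), ("N", 0, 2), ("N", 0, 2)],
    [(0, 1, 1), (1, 2, 1), (1, 3, 1)],
)
# NH2[+]=C(NH2)-NH2: charge on one nitrogen, one double bond.
GUANIDINIUM_B = (
    [("N", +1, 2), ("C", 0, 0), ("N", 0, 2), ("N", 0, 2)],
    [(0, 1, 2), (1, 2, 1), (1, 3, 1)],
)


def reduce_table(atoms, bonds):
    """Drop bond orders and charge positions; keep skeleton and net charge."""
    skeleton = tuple((element, hydrogens) for element, _charge, hydrogens in atoms)
    links = frozenset(frozenset((i, j)) for i, j, _order in bonds)
    net_charge = sum(charge for _element, charge, _hydrogens in atoms)
    return skeleton, links, net_charge


print(GUANIDINIUM_A == GUANIDINIUM_B)                                # False
print(reduce_table(*GUANIDINIUM_A) == reduce_table(*GUANIDINIUM_B))  # True
```

The sketch keeps the same atom numbering in both tables and leaves hydrogen counts on individual atoms, so it would not reconcile tautomers or differently numbered inputs; those need further treatment.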

In addition many molecules have stereoisomers, where the graphs are identical but the geometrical arrangement at atomic or bond centres differs. Tartaric acid, with 2 stereo centres, famously has 3 isomers (D, L and meso). There is an added complication that a sample may contain only a single enantiomer (D tartaric acid or L tartaric acid) or an equimolar mixture (DL-tartaric acid). For molecules, the exact representation of stereochemistry is now being addressed.
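
The count of three isomers for tartaric acid can be reproduced with a small enumeration. The Python sketch below treats each stereo centre as an abstract R or S label and assumes, as holds for tartaric acid, that the two centres are interchangeable by the symmetry of the molecule. The function names are illustrative, and no assignment of the D and L names to particular label pairs is implied.

```python
from itertools import product


def mirror(config):
    """Mirror image: every centre changes its label."""
    flip = {"R": "S", "S": "R"}
    return tuple(flip[c] for c in config)


def canonical(config):
    """The two centres are interchangeable, so a configuration and its
    reverse describe the same molecule; keep the smaller of the two."""
    return min(tuple(config), tuple(reversed(config)))


def stereoisomers():
    """Distinct configurations for two interchangeable stereo centres."""
    return sorted({canonical(c) for c in product("RS", repeat=2)})


def is_meso(config):
    """A configuration is meso if it is identical to its own mirror image."""
    return canonical(mirror(config)) == canonical(config)


for config in stereoisomers():
    print(config, "meso" if is_meso(config) else "chiral")
```

Of the four label combinations, two collapse into the single meso form, leaving one enantiomeric pair plus the meso isomer.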

The lack of an Open system for representing molecules leads to sloppy semantics. Hydrogen atoms are often omitted as "it is obvious where to add them". Stereochemistry is very frequently omitted and often it is impossible to decide what the molecule is.

The IUPAC/NIST Chemical Identifier, INChI

The objective of the IUPAC/NIST Chemical Identifier Project is to establish a unique label, the IUPAC Chemical Identifier, which would be a non-proprietary identifier for chemical substances that could be used in printed and electronic data sources. Such an identifier would enable easier linking of diverse data compilations.²

INChI is described at: http://www.iupac.org/projects/2000/2000-025-1-800.html. There have been regular meetings of interested parties with the parallel creation of a program implementing the recommendations, with extensive testing from the authors and elsewhere. Over 500,000 molecules and fragments have so far been submitted, with an almost zero software failure rate.

The INChI program carries out the following:

Reads the molecule input, in CML or MDL Molfile format. Both 2-D (chemical structure diagrams) and 3-D (atomic positions in space) inputs are allowed.
Determines the connections between atoms.
Normalizes this connection table.
Assigns a unique numbering to the atoms, independently of the input order (canonicalization). The ordering takes account of stereochemistry, isotopic substitution, etc.
Serializes the result to XML and CML.
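
The idea behind the canonicalization step can be illustrated with a deliberately naive Python sketch: try every renumbering of the atoms and keep the lexicographically smallest description. This brute-force search is an assumption chosen for clarity, has factorial cost, ignores stereochemistry and isotopes, and is not the algorithm used by the INChI program.

```python
from itertools import permutations


def canonical_form(elements, bonds):
    """Smallest (labels, edges) description over all atom renumberings."""
    n = len(elements)
    best = None
    for perm in permutations(range(n)):
        # perm[old_index] is the new index of that atom.
        labels = [None] * n
        for old, new in enumerate(perm):
            labels[new] = elements[old]
        edges = sorted(tuple(sorted((perm[a], perm[b]))) for a, b in bonds)
        candidate = (tuple(labels), tuple(edges))
        if best is None or candidate < best:
            best = candidate
    return best


# Heavy-atom skeleton of propan-2-ol, entered with two different numberings.
input_1 = (["C", "C", "C", "O"], [(0, 1), (1, 2), (1, 3)])
input_2 = (["O", "C", "C", "C"], [(0, 2), (2, 1), (2, 3)])
# Heavy-atom skeleton of propan-1-ol, a different molecule.
input_3 = (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])

print(canonical_form(*input_1) == canonical_form(*input_2))  # True
print(canonical_form(*input_1) == canonical_form(*input_3))  # False
```

Because every numbering is examined, two inputs give the same canonical form exactly when their labelled graphs are identical, whatever order the atoms were entered in.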

The INChI is a layered identifier and consists of the following components:

Disjoint species.
The molecular formula.
The connectivity (formal bond orders are deliberately omitted).
The hydrogen decoration.
Possible tautomerism.
Charge on fragments.
Stereochemistry at double bonds.
Stereochemistry at pyramidal and tetrahedral stereogenic centres.

The complete identifier for a molecule is the combination of all of these (some of which may be empty). However, for molecules with incomplete information (e.g. without stereochemistry) an INChI can still be constructed which represents everything that is known about the species.
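
One consequence of the layered design is that a record with missing layers can still be compared with complete records on the layers it does have. The Python sketch below models this with a hypothetical LayeredId class covering a subset of the layers listed above. The field names, the placeholder layer values and the function consistent_with are assumptions for illustration and do not reproduce the INChI syntax.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class LayeredId:
    formula: str
    connectivity: Optional[str] = None
    hydrogens: Optional[str] = None
    charge: Optional[str] = None
    double_bond_stereo: Optional[str] = None
    tetrahedral_stereo: Optional[str] = None   # None means "not known"


def consistent_with(query, entry):
    """True if every layer present in `query` has the same value in `entry`."""
    for f in fields(LayeredId):
        wanted = getattr(query, f.name)
        if wanted is not None and wanted != getattr(entry, f.name):
            return False
    return True


# Placeholder values; only the formula is real chemistry.
SKELETON = "tartaric-acid-skeleton"
entries = [
    LayeredId("C4H6O6", SKELETON, tetrahedral_stereo="D"),
    LayeredId("C4H6O6", SKELETON, tetrahedral_stereo="L"),
    LayeredId("C4H6O6", SKELETON, tetrahedral_stereo="meso"),
]

no_stereo = LayeredId("C4H6O6", SKELETON)
meso_only = LayeredId("C4H6O6", SKELETON, tetrahedral_stereo="meso")

print(sum(consistent_with(no_stereo, e) for e in entries))  # 3
print(sum(consistent_with(meso_only, e) for e in entries))  # 1
```

A query without stereochemistry is consistent with all three tartaric acid entries, while a query that specifies the meso form is consistent with only one.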

The INChI is ideally suited for indexing molecules in databases. As an illustration of this, we have transformed the National Cancer Institute's DTP database into CML and indexed this by INChI (http://wwmm.ch.cam.ac.uk/moin/HotTopics lists various interactive services, including creating INChIs). To search for a desired molecule, the INChI can be constructed from a graphical entry or by conversion of a traditional connection table. Lookup is rapid (milliseconds).

Summary

The IUPAC/NIST INChI is offered as a component for LSIDs representing small molecules of importance to life sciences. All software is open.

Acknowledgement

We acknowledge the contributions of many members of the community, including Alan McNaught and Steve Heller. In particular Dmitrii Tchekhovskoi has implemented the complete INChI algorithms in a batch and graphical environment. CML input has been added by Simon Tyrrell.

Key Literature Citations

1. Murray-Rust, P. and Rzepa, H. S., J. Chem. Inf. Comp. Sci., 1999, 39, 928; ibid, 2003, 43, 757-772.
2. Stein, Stephen E.; Heller, Stephen R.; Tchekhovskoi, Dmitrii V. "Toward the development of a standard chemical identifier". Abstracts of Papers, 222nd ACS National Meeting, Chicago, IL, United States, August 26-30, 2001, CINF-005.