RE: Chemistry and the Semantic Web from Peter Murray-Rust on 2004-07-01 (public-semweb-lifesci@w3.org from July 2004)

From: Peter Murray-Rust <pm286@cam.ac.uk>
Date: Thu, 01 Jul 2004 08:55:26 +0100
To: <Eric.Neumann@aventis.com>, <Eric.Jain@isb-sib.ch>, <public-semweb-lifesci@w3.org>
Cc: h.rzepa@ic.ac.uk
Message-Id: <5.1.1.6.0.20040701080219.02b00ee0@pop.hermes.cam.ac.uk>
At 14:45 30/06/2004 -0400, Eric.Neumann@aventis.com wrote:



Greetings,

I am excited to see the discussion on Chemistry and the Semantic Web and 
hope that this can lead to some real progress.

When we (Henry Rzepa and I) developed Chemical Markup Language (CML - 
please use this precise acronym) 10 years ago we saw the potential for 
semantic processing. CML was one of the very first non-textual 
content-based markup languages and as such requires its own toolset for 
authoring, editing, transformation, databases and much else. About 10 years 
ago we showed that MIME helper applications could add valuable semantics to 
chemical information and urged the use of SGML (and now XML).

CML is now a mainstream scientific XML language. Last week I was invited to 
present to the NSF/NationalScienceDigitalLibrary meeting in Washington 
which looked to support the interoperability of major multidisciplinary 
languages (primarily physical sciences). Forward-looking publishers now see 
XML (and thereby these markup languages) as the main way forward for rich 
re-usable content.

We have continued to develop CML technology and investigated most of the 
mature W3C technologies and their applicability to chemistry. These include:
- DTD
- XSLT
- XMLSchema
- RDF
- RSS
- XMLSignature
- namespaces
- SVG

We now have components deployed and can create examples of the Chemical 
Semantic Web.  Two examples:
- CML Rss (http://wwmm.ch.cam.ac.uk/moin/CmlRss). This allows 
authors/publishers to create an RSS feed where items include CML. We have 
developed a CML-sensitive RSS client with a complete download kit and we 
urge you all to have a look! This has enormous potential for chemical 
publishing - a publisher can include complete information on a compound - 
2D structure, 3D structure, properties, etc. The client can display, 
aggregate or filter this. We especially thank Timo Hannay and Ben Lund of 
Nature Publishing Group for showing us the proper way of adding RDF. Their 
urchin (http://urchin.sf.net) is capable of aggregating many such feeds. If 
authors adopted this then chemistry could become one of the first real 
semantic web applications.
- WWMM (world wide molecular matrix). http://wwmm.ch.cam.ac.uk/Bob  In this 
vision we tackle the problem of creating a knowledge resource for 
chemistry. Users can donate micro-information (e.g. a single structure), for 
which they get a free high-quality calculation of molecular properties 
using quantum mechanics methods. The only stipulation is that the results 
are then Openly available to the community (in CML). In this way the world 
can grow a semantic resource.
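
As a rough illustration of the first example, here is a minimal Python sketch of 
an RSS 1.0 item that carries a CML molecule, and of the kind of filtering a 
CML-aware client could do on it. It is not the actual CMLRSS code: the element 
layout, the function names and the example.org link are assumptions made for 
illustration. Only the RDF and RSS 1.0 namespace URIs and the usual CML schema 
namespace are taken as given.

```python
import xml.etree.ElementTree as ET

# Standard RDF and RSS 1.0 namespaces, plus the commonly used CML namespace.
# How the CML is nested inside the item is an assumption, not the CMLRSS spec.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS = "http://purl.org/rss/1.0/"
CML = "http://www.xml-cml.org/schema"


def build_item(link, title, atoms):
    """Return an RSS 1.0 <item> with a CML <molecule> as a child element."""
    item = ET.Element(f"{{{RSS}}}item", {f"{{{RDF}}}about": link})
    ET.SubElement(item, f"{{{RSS}}}title").text = title
    ET.SubElement(item, f"{{{RSS}}}link").text = link
    molecule = ET.SubElement(item, f"{{{CML}}}molecule", {"id": "m1"})
    atom_array = ET.SubElement(molecule, f"{{{CML}}}atomArray")
    for atom_id, element in atoms:
        ET.SubElement(atom_array, f"{{{CML}}}atom",
                      {"id": atom_id, "elementType": element})
    return item


def elements_in_item(item):
    """List the element symbols of every CML atom found inside the item."""
    return [atom.get("elementType") for atom in item.iter(f"{{{CML}}}atom")]


# Hypothetical feed entry for water.
item = build_item("http://example.org/compounds/water", "Water",
                  [("a1", "O"), ("a2", "H"), ("a3", "H")])
print(elements_in_item(item))
```

Because the CML lives in its own namespace, an ordinary RSS reader can ignore 
it, while a chemistry-aware client can pick it out to display, aggregate or 
filter compounds.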

These are only two examples of possible semantically enhanced web 
technologies. Both could be deployed within enterprises.

Technically, therefore, the chemical semantic web is already 
possible. Unfortunately there is a major cultural problem, which I hope 
isn't out of scope on this list. (BTW I spent many years working in the 
pharma industry (Glaxo) and shall particularly comment on the potential and 
difficulties there.)

In our opinions the first generation of the semantic web will be built on 
open systems on the public Internet. For rapid development the resources 
must be open. (Current robots give up when asked to register for a site, 
add their names and emails, etc.). In biosciences this openness has been 
spectacular: a robot can access and re-use a wide range of high quality and 
comprehensive data (genomes, sequences, structures, etc.). Much of the 
full-text literature is now being made Openly available and bioscience 
publishers are looking at new publication models. This would allow the 
primary literature to become a primary knowledge base for the semantic web.

In chemistry almost all information is "owned". An author signs over the 
copyright to the publisher which usually forbids them to post the full text 
on the Web, even years after publication. Thus Henry and I cannot send you 
the text of our manuscripts on the Chemical Semantic Web - you will have to 
subscribe to the American Chemical Society. My views on the 
inappropriateness of this are well publicised, but I believe that copyright 
should be honoured. Similarly all data is "owned" by secondary data 
producers such as Chemical Abstracts, Beilstein, Derwent, Thomson, etc. 
In the days of paper this had to be created by manual rekeying, but we now 
have tools that can read and understand primary chemical publications. 
Whether they can be legally deployed and for what purpose is unclear. It 
would seem that a low cost human may extract and key information but it is 
less clear whether a robot can do the same. But this has to be where the 
future lies.

The chemical semantic web requires tools. The markup languages for math, 
geography, etc. are supported by commercial suppliers. Chemical software 
producers have been very slow to adopt XML ("there is no market for it") 
and prefer byzantine proprietary formats as a way of maintaining market 
share. We have therefore had to develop our own on a communal voluntary 
basis - all main Open Source chemical tools promote CML. There are signs 
that the commercial manufacturers are starting to see the value of CML, but 
they do this on a sporadic and uncoordinated basis. For example the OMG 
lifesciences effort had ca 50 members in the biosciences and ca 2.5 in 
chemistry. This effort was based on CML but we are normally expected to 
contribute to this development at our own expense.

The pharmaceutical industry has been slow to see the value of XML as a 
communal infrastructure - I have worked with WHO and FDA who see the 
potential and have been trying to develop this approach. For example the 
dossier that the FDA requires for a new drug runs to millions of page 
equivalents and needs to integrate information from many sources. XML is 
essential and valuable for this, but progress is slow.

The main visionaries for CML include:
- government (Patents, drug regulatory, environment, safety)
- health (drug discovery, safety)
- a few publishers (often outside chemistry)
- generic web technologists (e.g. IT companies)

It is disappointing how little interest there has been from pharma. I don't 
know how widely this list is read, but let me make a plea for a concerted 
effort here. I have worked with several other industries (media, finance, 
energy, aerospace) where the need for a communal information infrastructure 
at a precompetitive level is widely recognised.  XML is the universal 
approach.

IMO the pharma industry has many areas where it desperately needs a 
communal information infrastructure. These include regulatory, drug 
information and safety, basic ADMETox data, etc. I don't even know of a 
site where I can get reliable information on current marketed drugs which I 
can re-use without having to pay or violate copyright.

Minor comments follow:


>Eric,
>
> From my understanding of XPaths 
> (http://www.w3.org/TR/xpath#section-Introduction), they can be used 
> "within" URIs. So mapping an RDF statement to a specific Chem-XML node or 
> group should be doable.


CMLRSS already does this for CML (sic).


>Eric.Neumann@aventis.com wrote:
> > Question: How would one apply RDF for such cases? Would one use CML
> > (chemical markup language) to describe the chemical structure and have
> > an RDF statement refer to part of that doc via XPath/XPointers?

Yes
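
To make this concrete, here is a minimal Python sketch of addressing one atom in 
a CML document with an XPath expression and using that address as the subject 
of an RDF-style statement. This is an illustration under assumptions, not the 
CMLRSS implementation: the document URI, the predicate URI and the function 
names are hypothetical, and the `#xpointer(...)` fragment is shown in 
simplified form (a full XPointer would also need to bind the `cml` prefix).

```python
import xml.etree.ElementTree as ET

# Prefix binding used when evaluating XPath expressions against CML.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

CML_DOC = """<molecule xmlns="http://www.xml-cml.org/schema" id="m1">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
</molecule>"""


def find_atom(doc_text, atom_id):
    """Return the CML atom elements whose id attribute equals atom_id."""
    root = ET.fromstring(doc_text)
    return root.findall(f".//cml:atom[@id='{atom_id}']", CML_NS)


def statement_about_atom(doc_uri, atom_id, predicate, value):
    """Build a (subject, predicate, object) triple whose subject points
    into the CML document via an XPath-based fragment identifier."""
    subject = f"{doc_uri}#xpointer(//cml:atom[@id='{atom_id}'])"
    return (subject, predicate, value)


# Hypothetical URIs, for illustration only.
triple = statement_about_atom("http://example.org/water.cml", "a1",
                              "http://example.org/terms#role", "hydrogen-bond acceptor")
print(triple)
print([atom.get("elementType") for atom in find_atom(CML_DOC, "a1")])
```

The same XPath expression serves both as the address in the statement and as 
the query that retrieves the node it describes.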

>  How
> > about other structural formats like SMILE and CHUCKLES?

These are semantically void in an XML environment. There is no mechanism 
for discovering their semantics and there is virtually no non-commercial 
software for their re-use.

>  Would the
>This is an interesting question, and certainly also relevant to any
>classical bioinformatics data sources that contain more quantitative
>than qualitative data (e.g. 3D structures, 2D gel images and microarray
>data). I don't really have any solutions, just some ideas:

I believe that biosciences can and should take the lead here by compiling 
sources of high quality structure/property databases for molecules of 
interest to bioscience.

>In those cases where it is possible to embed identifiers in the data,
>these could be referenced with identifiers such as
>urn:lsid:foo.org:bar:10. A resolution server can then be set up to
>extract the referenced data when required. Note that the original format
>need not contain full LSIDs.

We have a mechanism for embedding and referencing identifiers. The main 
task is to have public repositories of agreed identifiers.
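
For readers unfamiliar with the LSID form quoted above, here is a minimal Python 
sketch that splits an identifier such as urn:lsid:foo.org:bar:10 into its parts 
and expands a bare local identifier into a full LSID. It assumes the usual 
layout urn:lsid:authority:namespace:object with an optional trailing revision. 
The class and function names are hypothetical, and this is not a resolution 
server or our own identifier mechanism.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split 'urn:lsid:authority:namespace:object[:revision]' into fields."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(not part for part in parts[2:5]):
        raise ValueError(f"empty LSID component in: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


def local_to_lsid(authority, namespace, local_id):
    """Expand a bare local identifier found in a data file into a full LSID."""
    return f"urn:lsid:{authority}:{namespace}:{local_id}"


print(parse_lsid("urn:lsid:foo.org:bar:10"))
print(local_to_lsid("foo.org", "bar", "10"))
```

The second function reflects the point in the quoted text that the original 
data need only carry the local identifier; the full LSID can be assembled when 
the data is referenced.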

P


Peter Murray-Rust
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069
Received on Wednesday, 7 July 2004 14:50:13 UTC