- From: Peter Murray-Rust <pm286@cam.ac.uk>
- Date: Thu, 01 Jul 2004 08:55:26 +0100
- To: <Eric.Neumann@aventis.com>, <Eric.Jain@isb-sib.ch>, <public-semweb-lifesci@w3.org>
- Cc: h.rzepa@ic.ac.uk
At 14:45 30/06/2004 -0400, Eric.Neumann@aventis.com wrote:
Greetings,
I am excited to see the discussion on Chemistry and the Semantic Web and
hope that this can lead to some real progress.
When we (Henry Rzepa and I) developed Chemical Markup Language (CML -
please use this precise acronym) 10 years ago we saw the potential for
semantic processing. CML was one of the very first non-textual
content-based markup languages and as such requires its own toolset for
authoring, editing, transformation, databases and much else. About 10 years
ago we showed that MIME helper applications could add valuable semantics to
chemical information and urged the use of SGML (and now XML).
CML is now a mainstream scientific XML language. Last week I was invited to
present to the NSF/NationalScienceDigitalLibrary meeting in Washington
which looked to support the interoperability of major multidisciplinary
languages (primarily physical sciences). Forward-looking publishers now see
XML (and thereby these markup languages) as the main way forward for rich
re-usable content.
We have continued to develop CML technology and investigated most of the
mature W3C technologies and their applicability to chemistry. These include:
- DTD
- XSLT
- XMLSchema
- RDF
- RSS
- XMLSignature
- namespaces
- SVG
We now have components deployed and can create examples of the Chemical
Semantic Web. Two examples:
- CML Rss (http://wwmm.ch.cam.ac.uk/moin/CmlRss). This allows
authors/publishers to create an RSS feed where items include CML. We have
developed a CML-sensitive RSS client with a complete download kit and we
urge you all to have a look! This has enormous potential for chemical
publishing - a publisher can include complete information on a compound -
2D structure, 3D structure, properties, etc. The client can display,
aggregate or filter this. We especially thank Timo Hannay and Ben Lund of
Nature Publishing Group for showing us the proper way of adding RDF. Their
urchin (http://urchin.sf.net) is capable of aggregating many such feeds. If
authors adopted this then chemistry could become one of the first real
semantic web applications.
- WWMM (World Wide Molecular Matrix, http://wwmm.ch.cam.ac.uk/Bob). In this
vision we tackle the problem of creating a knowledge resource for
chemistry. Users can donate micro-information (e.g. a single structure) for
which they get a free high-quality calculation of molecular properties
using quantum mechanics methods. The only stipulation is that the results
are then Openly available to the community (in CML). In this way the world
can grow a semantic resource.
These are only two examples of possible semantically enhanced web
technologies. Both could be deployed within enterprises.
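As a rough illustration of the CML-in-RSS idea above, here is a minimal
Python sketch that builds an RSS 1.0 item carrying one CML molecule and
then pulls the molecule back out, as a CML-aware client might. It is not
the CMLRSS implementation. The CML namespace URI, the example.org link and
the helper names build_item and molecules_in are assumptions made for
illustration only.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS = "http://purl.org/rss/1.0/"
CML = "http://www.xml-cml.org/schema"  # assumed CML namespace URI

# Readable prefixes in the serialized output.
ET.register_namespace("rdf", RDF)
ET.register_namespace("rss", RSS)
ET.register_namespace("cml", CML)


def build_item(link, title, atoms):
    """Return an RSS 1.0 <item> element that embeds one CML <molecule>.

    atoms is a list of (atom id, element symbol) pairs.
    """
    item = ET.Element(f"{{{RSS}}}item", {f"{{{RDF}}}about": link})
    ET.SubElement(item, f"{{{RSS}}}title").text = title
    ET.SubElement(item, f"{{{RSS}}}link").text = link
    molecule = ET.SubElement(item, f"{{{CML}}}molecule", {"id": "m1"})
    atom_array = ET.SubElement(molecule, f"{{{CML}}}atomArray")
    for atom_id, element in atoms:
        ET.SubElement(
            atom_array,
            f"{{{CML}}}atom",
            {"id": atom_id, "elementType": element},
        )
    return item


def molecules_in(item):
    """Return the CML molecules that are direct children of an RSS item."""
    return item.findall(f"{{{CML}}}molecule")


if __name__ == "__main__":
    # Hypothetical feed entry for water.
    water = build_item(
        "http://example.org/compounds/water",
        "Water",
        [("a1", "O"), ("a2", "H"), ("a3", "H")],
    )
    print(ET.tostring(water, encoding="unicode"))
```

Because the molecule sits in its own namespace, a generic RSS reader can
ignore it while a chemistry-aware client can filter or aggregate on it.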
Technically, therefore, the chemical semantic web is already
possible. Unfortunately there is a major cultural problem, which I hope
isn't out of scope on this list. (BTW I spent many years working in the pharma
industry (Glaxo) and shall particularly comment on the potential and
difficulties there.)
In our opinions the first generation of the semantic web will be built on
open systems on the public Internet. For rapid development the resources
must be open. (Current robots give up when asked to register for a site,
add their names and emails, etc.). In biosciences this has been spectacular
and a robot can access and re-use a wide range of high quality and
comprehensive data (genomes, sequences, structures, etc.). Much of the
full-text literature is now being made Openly available and bioscience
publishers are looking at new publication models. This would allow the
primary literature to become a primary knowledge base for the semantic web.
In chemistry almost all information is "owned". An author signs over the
copyright to the publisher which usually forbids them to post the full text
on the Web, even years after publication. Thus Henry and I cannot send you
the text of our manuscripts on the Chemical Semantic Web - you will have to
subscribe to the American Chemical Society. My views on the
inappropriateness of this are well publicised, but I believe that copyright
should be honoured. Similarly all data is "owned" by secondary data
producers such as Chemical Abstracts, Beilstein, Derwent, Thomson, etc.
In the days of paper this had to be created by manual rekeying, but we now
have tools that can read and understand primary chemical publications.
Whether they can be legally deployed and for what purpose is unclear. It
would seem that a low cost human may extract and key information but it is
less clear whether a robot can do the same. But this has to be where the
future lies.
The chemical semantic web requires tools. The markup languages for math,
geography, etc. are supported by commercial suppliers. Chemical software
producers have been very slow to adopt XML ("there is no market for it")
and prefer byzantine proprietary formats as a way of maintaining market
share. We have therefore had to develop our own on a communal voluntary
basis - all main Open Source chemical tools promote CML. There are signs
that the commercial manufacturers are starting to see the value of CML, but
they do this on a sporadic and uncoordinated basis. For example the OMG
lifesciences effort had ca 50 members in the biosciences and ca 2.5 in
chemistry. This effort was based on CML but we are normally expected to
contribute to this development at our own expense.
The pharmaceutical industry has been slow to see the value of XML as a
communal infrastructure - I have worked with WHO and FDA who see the
potential and have been trying to develop this approach. For example the
dossier that the FDA requires for a new drug runs to millions of page
equivalents and needs to integrate information from many sources. XML is
essential and valuable for this, but progress is slow.
The main visionaries for CML include:
- government (Patents, drug regulatory, environment, safety)
- health (drug discovery, safety)
- a few publishers (often outside chemistry)
- generic web technologists (e.g. IT companies)
It is disappointing how little interest there has been from pharma. I don't
know how widely this list is read, but let me make a plea for a concerted
effort here. I have worked with several other industries (media, finance,
energy, aerospace) where the need for a communal information infrastructure
at a precompetitive level is widely recognised. XML is the universal
approach.
IMO the pharma industry has many areas where it desperately needs a
communal information infrastructure. These include regulatory, drug
information and safety, basic ADMETox data, etc. I don't even know of a
site where I can get reliable information on current marketed drugs which I
can re-use without having to pay or violate copyright.
Minor comments follow:
>Eric,
>
> From my understanding of XPaths
> (http://www.w3.org/TR/xpath#section-Introduction), they can be used
> "within" URIs. So mapping an RDF statement to a specific Chem-XML node or
> group should be doable.
CMLRSS already does this for CML (sic).
>Eric.Neumann@aventis.com wrote:
> > Question: How would one apply RDF for such cases? Would one use CML
> > (chemical markup language) to describe the chemical structure and have
> > an RDF statement refer to part of that doc via XPath/XPointers?
Yes
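A minimal Python sketch of that idea, assuming a small CML document and a
statement whose subject is a path into it. The namespace URI, the
example.org URLs, the property name and the select helper are all
hypothetical. Note that ElementTree implements only a subset of XPath, so
the path is written relative to the root element.

```python
import xml.etree.ElementTree as ET

# Assumed CML namespace URI, bound to the prefix used in the path below.
NS = {"cml": "http://www.xml-cml.org/schema"}

CML_DOC = """\
<molecule xmlns="http://www.xml-cml.org/schema" id="m1">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
</molecule>"""

# A statement about one atom: (document, path into document, property, value).
# All four values are invented for this example.
STATEMENT = (
    "http://example.org/water.cml",
    ".//cml:atom[@id='a1']",
    "http://example.org/terms#role",
    "hydrogen-bond acceptor",
)


def select(root, path):
    """Return the elements addressed by a namespace-aware path expression."""
    return root.findall(path, NS)


if __name__ == "__main__":
    root = ET.fromstring(CML_DOC)
    document, path, prop, value = STATEMENT
    for node in select(root, path):
        print(document, node.get("id"), node.get("elementType"), prop, value)
```

The statement stays outside the CML document; only the path ties it to a
specific atom, so the chemistry file itself needs no changes.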
> How
> > about other structural formats like SMILE and CHUCKLES?
These are semantically void in an XML environment. There is no mechanism
for discovering their semantics and there is virtually no non-commercial
software for their re-use.
> Would the
>This is an interesting question, and certainly also relevant to any
>classical bioinformatics data sources that contain more quantitative
>than qualitative data (e.g. 3D structures, 2D gel images and microarray
>data). I don't really have any solutions, just some ideas:
I believe that biosciences can and should take the lead here by compiling
sources of high quality structure/property databases for molecules of
interest to bioscience.
>In those cases where it is possible to embed identifiers in the data,
>these could be referenced with identifiers such as
>urn:lsid:foo.org:bar:10. A resolution server can then be set up to
>extract the referenced data when required. Note that the original format
>need not contain full LSIDs.
We have a mechanism for embedding and referencing identifiers. The main
task is to have public repositories of agreed identifiers.
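To make the identifier discussion concrete, here is a small Python sketch
that splits an LSID such as the urn:lsid:foo.org:bar:10 quoted above into
its parts and looks it up in a stand-in repository. It is not the CML
mechanism. The Lsid type, parse_lsid, resolve and the dictionary
repository are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    """Parts of an LSID: urn:lsid:authority:namespace:object[:revision]."""

    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(urn: str) -> Lsid:
    """Split an LSID string into its components, or raise ValueError."""
    parts = urn.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {urn!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {urn!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty LSID component: {urn!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(authority, namespace, object_id, revision)


def resolve(urn: str, repository: dict) -> Optional[str]:
    """Look up an LSID in a repository keyed by (authority, namespace, object)."""
    lsid = parse_lsid(urn)
    return repository.get((lsid.authority, lsid.namespace, lsid.object_id))


if __name__ == "__main__":
    # A stand-in for a public repository of agreed identifiers.
    repo = {("foo.org", "bar", "10"): "<molecule id='m10'/>"}
    print(parse_lsid("urn:lsid:foo.org:bar:10"))
    print(resolve("urn:lsid:foo.org:bar:10", repo))
```

The dictionary stands in for the shared, public repository; the parsing
step is the easy part, the agreed registry is the hard one.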
P
Peter Murray-Rust
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069
Received on Wednesday, 7 July 2004 14:50:13 UTC