- From: Peter Murray-Rust <pm286@cam.ac.uk>
- Date: Thu, 01 Jul 2004 08:55:26 +0100
- To: <Eric.Neumann@aventis.com>, <Eric.Jain@isb-sib.ch>, <public-semweb-lifesci@w3.org>
- Cc: h.rzepa@ic.ac.uk
At 14:45 30/06/2004 -0400, Eric.Neumann@aventis.com wrote:

Greetings,

I am excited to see the discussion on Chemistry and the Semantic Web and hope that this can lead to some real progress.

When we (Henry Rzepa and I) developed Chemical Markup Language (CML - please use this precise acronym) 10 years ago we saw the potential for semantic processing. CML was one of the very first non-textual content-based markup languages and as such requires its own toolset for authoring, editing, transformation, databases and much else. About 10 years ago we showed that MIME helper applications could add valuable semantics to chemical information and urged the use of SGML (and now XML).

CML is now a mainstream scientific XML language. Last week I was invited to present to the NSF/National Science Digital Library meeting in Washington, which looked to support the interoperability of major multidisciplinary languages (primarily physical sciences). Forward-looking publishers now see XML (and thereby these markup languages) as the main way forward for rich re-usable content.

We have continued to develop CML technology and investigated most of the mature W3C technologies and their applicability to chemistry. These include:

- DTD
- XSLT
- XMLSchema
- RDF
- RSS
- XMLSignature
- namespaces
- SVG

We now have components deployed and can create examples of the Chemical Semantic Web. Two examples:

- CMLRSS (http://wwmm.ch.cam.ac.uk/moin/CmlRss). This allows authors/publishers to create an RSS feed where items include CML. We have developed a CML-sensitive RSS client with a complete download kit and we urge you all to have a look! This has enormous potential for chemical publishing: a publisher can include complete information on a compound (2D structure, 3D structure, properties, etc.). The client can display, aggregate or filter this. We especially thank Timo Hannay and Ben Lund of Nature Publishing Group for showing us the proper way of adding RDF.
Their urchin (http://urchin.sf.net) is capable of aggregating many such feeds. If authors adopted this then chemistry could become one of the first real semantic web applications.

- WWMM (World Wide Molecular Matrix, http://wwmm.ch.cam.ac.uk/Bob). In this vision we tackle the problem of creating a knowledge resource for chemistry. Users can donate micro-information (e.g. a single structure), for which they get a free high-quality calculation of molecular properties using quantum mechanics methods. The only stipulation is that the results are then Openly available to the community (in CML). In this way the world can grow a semantic resource.

These are only two examples of possible semantically enhanced web technologies. Both could be deployed within enterprises.

Technically, therefore, the chemical semantic web is already possible. Unfortunately there is a major cultural problem, which I hope isn't out of scope on this list. (BTW I spent many years working in the pharma industry (Glaxo) and shall particularly comment on the potential and difficulties there.)

In our opinion the first generation of the semantic web will be built on open systems on the public Internet. For rapid development the resources must be open. (Current robots give up when asked to register for a site, add their names and emails, etc.) In biosciences the results of openness have been spectacular: a robot can access and re-use a wide range of high-quality and comprehensive data (genomes, sequences, structures, etc.). Much of the full-text literature is now being made Openly available and bioscience publishers are looking at new publication models. This would allow the primary literature to become a primary knowledge base for the semantic web.

In chemistry almost all information is "owned". An author signs over the copyright to the publisher, which usually forbids them to post the full text on the Web, even years after publication.
Thus Henry and I cannot send you the text of our manuscripts on the Chemical Semantic Web - you will have to subscribe to the American Chemical Society. My views on the inappropriateness of this are well publicised, but I believe that copyright should be honoured.

Similarly all data is "owned" by secondary data producers such as Chemical Abstracts, Beilstein, Derwent, Thomson, etc. In the days of paper this had to be created by manual rekeying, but we now have tools that can read and understand primary chemical publications. Whether they can be legally deployed and for what purpose is unclear. It would seem that a low-cost human may extract and key information but it is less clear whether a robot can do the same. But this has to be where the future lies.

The chemical semantic web requires tools. The markup languages for math, geography, etc. are supported by commercial suppliers. Chemical software producers have been very slow to adopt XML ("there is no market for it") and prefer byzantine proprietary formats as a way of maintaining market share. We have therefore had to develop our own tools on a communal voluntary basis - all main Open Source chemical tools promote CML. There are signs that the commercial manufacturers are starting to see the value of CML, but they do this on a sporadic and uncoordinated basis. For example the OMG lifesciences effort had ca. 50 members in the biosciences and ca. 2.5 in chemistry. This effort was based on CML but we are normally expected to contribute to this development at our own expense.

The pharmaceutical industry has been slow to see the value of XML as a communal infrastructure - I have worked with WHO and FDA, who see the potential and have been trying to develop this approach. For example the dossier that the FDA requires for a new drug runs to millions of page equivalents and needs to integrate information from many sources. XML is essential and valuable for this, but progress is slow.
The main visionaries for CML include:

- government (patents, drug regulatory, environment, safety)
- health (drug discovery, safety)
- a few publishers (often outside chemistry)
- generic web technologists (e.g. IT companies)

It is disappointing how little interest there has been from pharma. I don't know how widely this list is read, but let me make a plea for a concerted effort here. I have worked with several other industries (media, finance, energy, aerospace) where the need for a communal information infrastructure at a precompetitive level is widely recognised. XML is the universal approach.

IMO the pharma industry has many areas where it desperately needs a communal information infrastructure. These include regulatory, drug information and safety, basic ADMETox data, etc. I don't even know of a site where I can get reliable information on current marketed drugs which I can re-use without having to pay or violate copyright.

Minor comments follow:

>Eric,
>
> From my understanding of XPaths
> (http://www.w3.org/TR/xpath#section-Introduction), they can be used
> "within" URIs. So mapping an RDF statement to a specific Chem-XML node or
> group should be doable.

CMLRSS already does this for CML (sic).

>Eric.Neumann@aventis.com wrote:
> > Question: How would one apply RDF for such cases? Would one use CML
> > (chemical markup language) to describe the chemical structure and have
> > an RDF statement refer to part of that doc via XPath/XPointers?

Yes

> > How
> > about other structural formats like SMILE and CHUCKLES?

These are semantically void in an XML environment. There is no mechanism for discovering their semantics and there is virtually no non-commercial software for their re-use.

> > Would the

>This is an interesting question, and certainly also relevant to any
>classical bioinformatics data sources that contain more quantitative
>than qualitative data (e.g. 3D structures, 2D gel images and microarray
>data).
I don't really have any solutions, just some ideas:

I believe that biosciences can and should take the lead here by compiling sources of high-quality structure/property databases for molecules of interest to bioscience.

>In those cases where it is possible to embed identifiers in the data,
>these could be referenced with identifiers such as
>urn:lsid:foo.org:bar:10. A resolution server can then be set up to
>extract the referenced data when required. Note that the original format
>need not contain full LSIDs.

We have a mechanism for embedding and referencing identifiers. The main task is to have public repositories of agreed identifiers.

P

Peter Murray-Rust
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069
Received on Wednesday, 7 July 2004 14:50:13 UTC