FW: RE: Chemistry and the Semantic Web

[Forwarded to the list on behalf of Peter Murray-Rust who's having
difficulty posting.]

>Date: Thu, 01 Jul 2004 08:55:26 +0100
>To: <Eric.Neumann@aventis.com>,
><Eric.Jain@isb-sib.ch>,<public-semweb-lifesci@w3.org>
>From: Peter Murray-Rust <pm286@cam.ac.uk>
>Subject: RE: Chemistry and the Semantic Web
>Cc: h.rzepa@ic.ac.uk
>
>At 14:45 30/06/2004 -0400, Eric.Neumann@aventis.com wrote:
>
>
>
>Greetings,
>
>I am excited to see the discussion on Chemistry and the Semantic Web and
>hope that this can lead to some real progress.
>
>When we (Henry Rzepa and I) developed Chemical Markup Language (CML -
>please use this precise acronym) 10 years ago we saw the potential for
>semantic processing. CML was one of the very first non-textual
>content-based markup languages and as such requires its own toolset for
>authoring, editing, transformation, databases and much else. About 10
>years ago we showed that MIME helper applications could add valuable
>semantics to chemical information and urged the use of SGML (and now XML).
>
>CML is now a mainstream scientific XML language. Last week I was invited
>to present to the NSF/NationalScienceDigitalLibrary meeting in Washington 
>which looked to support the interoperability of major multidisciplinary 
>languages (primarily physical sciences). Forward-looking publishers now 
>see XML (and thereby these markup languages) as the main way forward for 
>rich re-usable content.
>
>We have continued to develop CML technology and investigated most of the
>mature W3C technologies and their applicability to chemistry. These
>include:
>- DTD
>- XSLT
>- XMLSchema
>- RDF
>- RSS
>- XMLSignature
>- namespaces
>- SVG
>
>We now have components deployed and can create examples of the Chemical
>Semantic Web.  Two examples:
>- CML Rss (http://wwmm.ch.cam.ac.uk/moin/CmlRss). This allows 
>authors/publishers to create an RSS feed where items include CML. We have 
>developed a CML-sensitive RSS client with a complete download kit and we 
>urge you all to have a look! This has enormous potential for chemical 
>publishing - a publisher can include complete information on a compound - 
>2D structure, 3D structure, properties, etc. The client can display, 
>aggregate or filter this. We especially thank Timo Hannay and Ben Lund of 
>Nature Publishing Group for showing us the proper way of adding RDF. Their 
>urchin (http://urchin.sf.net) is capable of aggregating many such feeds. 
>If authors adopted this then chemistry could become one of the first real 
>semantic web applications.
>- WWMM (World Wide Molecular Matrix). http://wwmm.ch.cam.ac.uk/Bob  In 
>this vision we tackle the problem of creating a knowledge resource for 
>chemistry. Users can donate micro-information (e.g. a single structure), for 
>which they get a free high-quality calculation of molecular properties 
>using quantum mechanics methods. The only stipulation is that the results 
>are then Openly available to the community (in CML). In this way the world 
>can grow a semantic resource.
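
As a rough illustration of the CML-in-RSS idea above (not the actual CMLRSS implementation), the sketch below builds a single RSS 1.0 item that carries a small CML molecule, then filters on its content the way a CML-aware client might. The CML namespace URI, the example.org URL, the heavily simplified molecule and the helper names make_item, elements_in and contains_element are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS = "http://purl.org/rss/1.0/"
CML = "http://www.xml-cml.org/schema"  # assumed CML namespace URI


def make_item(about, title, atoms, bonds):
    """Build an RSS 1.0 <item> that embeds a simplified CML <molecule>."""
    item = ET.Element(f"{{{RSS}}}item", {f"{{{RDF}}}about": about})
    ET.SubElement(item, f"{{{RSS}}}title").text = title
    ET.SubElement(item, f"{{{RSS}}}link").text = about

    molecule = ET.SubElement(item, f"{{{CML}}}molecule", {"id": "m1"})
    atom_array = ET.SubElement(molecule, f"{{{CML}}}atomArray")
    for atom_id, element in atoms:
        ET.SubElement(atom_array, f"{{{CML}}}atom",
                      {"id": atom_id, "elementType": element})
    bond_array = ET.SubElement(molecule, f"{{{CML}}}bondArray")
    for first, second, order in bonds:
        ET.SubElement(bond_array, f"{{{CML}}}bond",
                      {"atomRefs2": f"{first} {second}", "order": order})
    return item


def elements_in(item):
    """Return the element symbols of all CML atoms found in the item."""
    return [atom.get("elementType") for atom in item.iter(f"{{{CML}}}atom")]


def contains_element(item, symbol):
    """Filter predicate a client could use to keep only matching items."""
    return symbol in elements_in(item)


# Hypothetical feed item: heavy atoms of methanol only, hydrogens omitted.
item = make_item(
    "http://example.org/compounds/methanol",
    "Methanol",
    atoms=[("a1", "C"), ("a2", "O")],
    bonds=[("a1", "a2", "1")],
)

if __name__ == "__main__":
    print(ET.tostring(item, encoding="unicode"))
    print(contains_element(item, "O"))
```

Because the molecule is ordinary namespaced XML inside the item, an aggregator that knows nothing about chemistry can still pass it through untouched, while a chemistry-aware client can display or filter on it.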
>
>These are only two examples of possible semantically enhanced web
>technologies. Both could be deployed within enterprises.
>
>Technically, therefore, the chemical semantic web is already
>possible. Unfortunately there is a major cultural problem, which I hope 
>isn't out of scope on this list. (BTW I spent many years working in the pharma 
>industry (Glaxo) and shall particularly comment on the potential and 
>difficulties there.)
>
>In our opinion the first generation of the semantic web will be built on
>open systems on the public Internet. For rapid development the resources 
>must be open. (Current robots give up when asked to register for a site, 
>add their names and emails, etc.) In biosciences the results of openness 
>have been spectacular: a robot can access and re-use a wide range of high 
>quality and comprehensive data (genomes, sequences, structures, etc.). Much of the 
>full-text literature is now being made Openly available and bioscience 
>publishers are looking at new publication models. This would allow the 
>primary literature to become a primary knowledge base for the semantic web.
>
>In chemistry almost all information is "owned". An author signs over the
>copyright to the publisher which usually forbids them to post the full 
>text on the Web, even years after publication. Thus Henry and I cannot 
>send you the text of our manuscripts on the Chemical Semantic Web - you 
>will have to subscribe to the American Chemical Society. My views on the 
>inappropriateness of this are well publicised, but I believe that 
>copyright should be honoured. Similarly all data is "owned" by secondary 
>data producers such as Chemical Abstracts, Beilstein, Derwent, Thomson, 
>etc. In the days of paper this had to be created by manual rekeying, but 
>we now have tools that can read and understand primary chemical 
>publications. Whether they can be legally deployed and for what purpose is 
>unclear. It would seem that a low cost human may extract and key 
>information but it is less clear whether a robot can do the same. But this 
>has to be where the future lies.
>
>The chemical semantic web requires tools. The markup languages for math,
>geography, etc. are supported by commercial suppliers. Chemical software 
>producers have been very slow to adopt XML ("there is no market for it") 
>and prefer byzantine proprietary formats as a way of maintaining market 
>share. We have therefore had to develop our own on a communal voluntary 
>basis - all main Open Source chemical tools promote CML. There are signs 
>that the commercial manufacturers are starting to see the value of CML, 
>but they do this on a sporadic and uncoordinated basis. For example the 
>OMG lifesciences effort had ca 50 members in the biosciences and ca 2.5 in 
>chemistry. This effort was based on CML but we are normally expected to 
>contribute to this development at our own expense.
>
>The pharmaceutical industry has been slow to see the value of XML as a
>communal infrastructure - I have worked with WHO and FDA who see the 
>potential and have been trying to develop this approach. For example the 
>dossier that the FDA requires for a new drug runs to millions of page 
>equivalents and needs to integrate information from many sources. XML is 
>essential and valuable for this, but progress is slow.
>
>The main visionaries for CML include:
>- government (Patents, drug regulatory, environment, safety)
>- health (drug discovery, safety)
>- a few publishers (often outside chemistry)
>- generic web technologists (e.g. IT companies)
>
>It is disappointing how little interest there has been from pharma. I
>don't know how widely this list is read, but let me make a plea for a 
>concerted effort here. I have worked with several other industries (media, 
>finance, energy, aerospace) where the need for a communal information 
>infrastructure at a precompetitive level is widely recognised.  XML is the 
>universal approach.
>
>IMO the pharma industry has many areas where it desperately needs a
>communal information infrastructure. These include regulatory, drug 
>information and safety, basic ADMETox data, etc. I don't even know of a 
>site where I can get reliable information on current marketed drugs which 
>I can re-use without having to pay or violate copyright.
>
>Minor comments follow:
>
>
>>Eric,
>>
>> From my understanding of XPaths
>> (http://www.w3.org/TR/xpath#section-Introduction), they can be used 
>> "within" URIs. So mapping an RDF statement to a specific Chem-XML node 
>> or group should be doable.
>
>
>CMLRSS already does this for CML (sic).
>
>
>>Eric.Neumann@aventis.com wrote:
>> > Question: How would one apply RDF for such cases? Would one use CML 
>> > (chemical markup language) to describe the chemical structure and 
>> > have an RDF statement refer to part of that doc via 
>> > XPath/XPointers?
>
>Yes
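
A minimal sketch of what such a reference could look like: an RDF-style statement whose subject is a URI ending in an XPath-like fragment, and a check that the path really selects one node of a CML document. The document URL, the xpointer-style fragment, the predicate URI, the CML namespace URI and the select helper are illustrative assumptions, not what CMLRSS actually emits.

```python
import xml.etree.ElementTree as ET

# Assumed CML namespace URI, bound to the prefix used in the path below.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# A heavily simplified CML document (heavy atoms of methanol only).
CML_DOC = """<molecule xmlns="http://www.xml-cml.org/schema" id="m1">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="O"/>
  </atomArray>
</molecule>"""

# XPath that picks out a single atom by its id.
XPATH = "//cml:atom[@id='a2']"

# Hypothetical RDF statement (subject, predicate, object) about that atom.
triple = (
    "http://example.org/methanol.cml#xpointer(" + XPATH + ")",
    "http://example.org/terms#role",
    "hydrogen-bond acceptor",
)


def select(root, xpath):
    """Resolve an absolute '//...' path with ElementTree's XPath subset."""
    # ElementTree requires paths relative to the element, hence the '.'.
    return root.findall("." + xpath, CML_NS)


root = ET.fromstring(CML_DOC)
matched = select(root, XPATH)

if __name__ == "__main__":
    print(triple[0])
    print([atom.attrib for atom in matched])
```

The statement only stays meaningful while the atom ids in the target document are stable, which is one reason agreed identifiers matter.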
>
>>  How
>> > about other structural formats like SMILE and CHUCKLES?
>
>These are semantically void in an XML environment. There is no mechanism
>for discovering their semantics and there is virtually no non-commercial 
>software for their re-use.
>
>>  Would the
>>This is an interesting question, and certainly also relevant to any 
>>classical bioinformatics data sources that contain more quantitative 
>>than qualitative data (e.g. 3D structures, 2D gel images and 
>>microarray data). I don't really have any solutions, just some ideas:
>
>I believe that biosciences can and should take the lead here by compiling
>sources of high quality structure/property databases for molecules of 
>interest to bioscience.
>
>>In those cases where it is possible to embed identifiers in the data, 
>>these could be referenced with identifiers such as 
>>urn:lsid:foo.org:bar:10. A resolution server can then be set up to 
>>extract the referenced data when required. Note that the original 
>>format need not contain full LSIDs.
>
>We have a mechanism for embedding and referencing identifiers. The main
>task is to have public repositories of agreed identifiers.
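
To make the identifier discussion concrete, here is a small sketch that splits an LSID of the form quoted above (urn:lsid:authority:namespace:object, with an optional revision) into its parts and looks it up in a toy in-memory registry. The registry, its contents and the names Lsid, parse_lsid and resolve are hypothetical stand-ins for a real resolution server.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text):
    """Split 'urn:lsid:authority:namespace:object[:revision]' into fields."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:]):
        raise ValueError(f"empty LSID component in {text!r}")
    return Lsid(*parts[2:])


# Toy stand-in for a public repository of agreed identifiers.
REGISTRY = {
    ("foo.org", "bar", "10"): "<molecule id='m10'/>",
}


def resolve(text):
    """Return the registered data for an LSID, or None if it is unknown."""
    lsid = parse_lsid(text)
    return REGISTRY.get((lsid.authority, lsid.namespace, lsid.object_id))


if __name__ == "__main__":
    print(parse_lsid("urn:lsid:foo.org:bar:10"))
    print(resolve("urn:lsid:foo.org:bar:10"))
```

Parsing is the easy part; as noted above, the real work is agreeing on who runs the registry and which identifiers go into it.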
>
>P
>
>Peter Murray-Rust
>Unilever Centre for Molecular Informatics
>Chemistry Department, Cambridge University
>Lensfield Road, CAMBRIDGE, CB2 1EW, UK
>Tel: +44-1223-763069





Received on Thursday, 1 July 2004 08:16:33 UTC