Re: LODD Telcon from Egon Willighagen on 2009-11-23 (public-semweb-lifesci@w3.org from November 2009)

From: Egon Willighagen <egon.willighagen@gmail.com>
Date: Mon, 23 Nov 2009 21:29:23 +0100
To: Susie Stephens <susie.stephens@gmail.com>
Cc: public-semweb-lifesci hcls <public-semweb-lifesci@w3.org>, Bioclipse-devel ML <bioclipse-devel@lists.sourceforge.net>
Message-ID: <6aeb064b0911231229v76b400d3g502bb7dbe27b531b@mail.gmail.com>
Hi all,

Unfortunately, I cannot participate next Wednesday because of family obligations.

On Mon, Nov 23, 2009 at 5:19 PM, Susie Stephens
<susie.stephens@gmail.com> wrote:
> Here's the reminder for Wednesday's LODD telcon.

I was up for a data update, so I will have to do it like this... my
introduction to this list is ancient, so first a short reintroduction. My background is
cheminformatics and chemometrics (statistics/data analysis on chemical
data). I'm a strong believer in Open Data, Open Source and Open
Standards, and (past) developer of several projects, including
Strigi-chemical (chemistry extension for the KDE desktop search
engine), the Chemistry Development Kit, JChemPaint, Jmol, and
several other ones. Right now, I am a postdoc in a drug discovery group
at Uppsala University (Prof. Wikberg) and developing the
cheminformatics use at the department, which includes the Bioclipse
workbench.

Proteochemometrics is the main statistical method used in our group,
and model validation is clearly important. This is where RDF comes in:
aggregation of data before model building, and for model validation
afterwards. The latter will preferably be data which is related to the
model, and not really of the same type. RDF is clearly one of the few
methods up to this job.

When I first joined the HCLS mailing list and conf calls, I saw a
strong focus on biological and clinical data, but a lack of focus on
the molecular chemistry behind it all, which is actually crucial for
cheminformatics and proteochemometrics.

So, that more or less defines the area where I contribute to the RDF
activities... the border of molecular data and drug-related
properties.

So far, I have developed an extension for Bioclipse to deal with RDF,
and it currently supports an in memory triple store, SPARQL queries on
the in memory stores as well as on remote SPARQL end points. Like
most of Bioclipse2, it is scriptable, which allows easy building of
small programs or workflows to integrate RDF into other Bioclipse
extensions, including the cheminformatics functionality, but also Jmol.
There is also an R interface, to bridge with statistical modeling.
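
To make the idea of an in memory triple store concrete, here is a minimal
Python sketch. It is an illustration only and not the Bioclipse scripting
API: the TripleStore class, its method names, and the "ex:" identifiers are
all hypothetical, and pattern matching with wildcards stands in for a real
SPARQL engine.

```python
# Minimal in-memory triple store sketch (illustrative; NOT Bioclipse's API).
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class TripleStore:
    """Holds (subject, predicate, object) triples in a Python set."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, s: str, p: str, o: str) -> None:
        self._triples.add((s, p, o))

    def match(
        self,
        s: Optional[str] = None,
        p: Optional[str] = None,
        o: Optional[str] = None,
    ) -> Iterator[Triple]:
        """Yield triples matching the pattern; None acts as a wildcard."""
        for triple in sorted(self._triples):  # sorted for a stable order
            if (
                (s is None or triple[0] == s)
                and (p is None or triple[1] == p)
                and (o is None or triple[2] == o)
            ):
                yield triple


# Hypothetical example data using made-up "ex:" identifiers.
store = TripleStore()
store.add("ex:aspirin", "rdf:type", "ex:Molecule")
store.add("ex:aspirin", "ex:smiles", "CC(=O)Oc1ccccc1C(=O)O")
store.add("ex:ethanol", "rdf:type", "ex:Molecule")

# Roughly equivalent to: SELECT ?s WHERE { ?s rdf:type ex:Molecule }
molecules = [s for s, _, _ in store.match(p="rdf:type", o="ex:Molecule")]
```

A real store would add indexing and a full query language; the point here
is only that a query is a pattern matched against a set of triples.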

Last week Friday, I gave a talk about this work at SWAT4LS in
Amsterdam, and my slides are available on my blog [0].

Getting back to the data, I am working on making various unique
molecular property resources available as RDF. This includes the GNU
FDL-licensed NMRShiftDB data, which contains NMR spectra (mostly
carbon-13) used for metabolite identification (think finding
biomarkers). There are also two smaller CC0 data sets, one based on
ChemPedia [1], a new crowd-sourcing endeavor for naming molecules (no
i18n support yet, but requested), and the RDF Open Notebook Science
Solubility project [2], which we described in a chapter in the recent
Beautiful Data book from O'Reilly.

There are other things I am doing, which include an ontology for
molecular (or QSAR) descriptors, and an RDF equivalent for the
cheminformatics data model used by the CDK. This would, though I am
myself not convinced this is really where we want to go, allow
serialization of full molecular structures as RDF data, though parts
of this may very well be rather useful for XHTML+RDFa for scientific
publication of, for example, organic synthesis papers...

I'd very much like to help get these data sets into the LODD network
(particularly the last two, which are easiest because of the CC0
license).

One thing I want to do soon (actually, as part of the SWAT4LS
proceedings paper), is create a data set with CDK-based molecular
similarities. The CDK can calculate various similarity measures, and
this will create a nice sparse matrix. I'm leaning towards doing the
molecules in DBPedia, but am more than open to analysing other Open data sets too
(bearing a proper license, or proper Public Domain statement, like
CC0). I'll put up the final script on MyExperiment.org anyway, for
others to analyze other data sets. No ETA for that, though.
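
As a rough sketch of what such a sparse similarity matrix could look like,
the Python below computes pairwise Tanimoto coefficients and keeps only
pairs above a cutoff. The choice of Tanimoto, the 0.5 threshold, the
function names, and the toy fingerprints (sets of "on" bit positions) are
all assumptions made for illustration; the email does not say which
CDK similarity measures would be used, and this is not CDK code.

```python
# Sparse pairwise similarity sketch (assumptions: Tanimoto, threshold 0.5).
from itertools import combinations
from typing import Dict, FrozenSet, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sparse_similarities(
    fingerprints: Dict[str, FrozenSet[int]], threshold: float = 0.5
) -> Dict[Tuple[str, str], float]:
    """Return only the pairs whose similarity reaches the threshold."""
    sims: Dict[Tuple[str, str], float] = {}
    # Sort by name so each unordered pair gets one canonical key.
    for (n1, f1), (n2, f2) in combinations(sorted(fingerprints.items()), 2):
        score = tanimoto(f1, f2)
        if score >= threshold:
            sims[(n1, n2)] = score
    return sims


# Toy fingerprints; real ones would come from a cheminformatics toolkit.
fingerprints = {
    "mol_a": frozenset({1, 2, 3, 4}),
    "mol_b": frozenset({2, 3, 4, 5}),
    "mol_c": frozenset({7, 8, 9}),
}
matrix = sparse_similarities(fingerprints)
```

Because dissimilar pairs are simply not stored, the result stays small even
for many molecules, which is what makes the matrix sparse.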

An example script downloads molecules from DBPedia and visualizes them
in 2D in a molecule table [3,4].
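
The linked script itself is a Bioclipse script; the Python below is only a
hedged sketch of the download step, using the standard SPARQL protocol (a
GET request with a "query" parameter) and the SPARQL JSON results format.
The query text is an assumption: the dbo:ChemicalCompound class and the
dbo:smiles property are guesses at suitable DBpedia terms and may not match
what the original script used. The helper names are hypothetical.

```python
# Sketch of querying a remote SPARQL endpoint (not the original script).
import json
from typing import Dict, List
from urllib.parse import urlencode
from urllib.request import Request, urlopen

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

# Assumed query: class and property names are illustrative guesses.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?compound ?smiles WHERE {
  ?compound a dbo:ChemicalCompound ;
            dbo:smiles ?smiles .
} LIMIT 10
"""


def build_url(endpoint: str, query: str) -> str:
    """Encode the query as a SPARQL protocol GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


def parse_bindings(results: dict) -> List[Dict[str, str]]:
    """Flatten SPARQL JSON results into one {variable: value} dict per row."""
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in results["results"]["bindings"]
    ]


def run_query(endpoint: str, query: str) -> List[Dict[str, str]]:
    """Send the query and return parsed rows (requires network access)."""
    request = Request(
        build_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request, timeout=30) as response:
        return parse_bindings(json.load(response))
```

Calling run_query(DBPEDIA_ENDPOINT, QUERY) would perform the actual
request; the rows it returns could then be handed to whatever tool draws
the 2D structures.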

I am looking forward to hearing your comments and ideas on this work.

Regards,

Egon

0.http://chem-bla-ics.blogspot.com/2009/11/swat4ls-linking-open-drug-data-to.html
1.http://chem-bla-ics.blogspot.com/2009/11/chempedia-rdf-1-sparql-end-point.html
2.http://chem-bla-ics.blogspot.com/2009/11/open-notebook-science-solubility-sparql.html
3.http://egonw.posterous.com/molecules-in-dbpedia-visualized-with-bioclips
4.http://www.myexperiment.org/workflows/927

-- 
Post-doc @ Uppsala University
Homepage: http://egonw.github.com/
Blog: http://chem-bla-ics.blogspot.com/
PubList: http://www.citeulike.org/user/egonw/tag/papers
Received on Monday, 23 November 2009 20:30:24 UTC