Future of FlyWeb work on Drosophila RDF Data

Dear all,

As I mentioned in the FlyWeb announcement email earlier today [1], the
FlyWeb project is currently winding up. While the current set of
applications and services, including the SPARQL endpoints, will
persist in their current state at least until May 2010, we do not have
continued funding to extend this work, or to track changes to the
underlying data sources.

This email provides some details of the four main RDF datasets
associated with Drosophila (fruit flies) that we have produced. If the
IG sees fit, I would be more than happy for these datasets to be
incorporated and maintained within the HCLS KB. I would also be happy
for any member(s) of the IG to take this up independently.

== FlyBase ==

The largest RDF dataset we have generated is derived from the FlyBase
Drosophila genome database (flybase.org). FlyBase contains a diverse
set of well-curated genome-associated data, and is the primary
resource for Drosophila genomics.

The dataset released in FM3 is based on FlyBase version
FB2009_02. FlyBase release a new version of their underlying database
roughly once per month [2], and the current version is FB2009_04, so
we are already two versions behind. However, afaik the schema hasn't
changed, so the D2RQ maps should still be applicable.

Our FM3 FlyBase dataset is ~175 million triples. Full details of the
dataset, with links to downloads, D2RQ maps used to generate the
dataset, details of URI design, and location of SPARQL endpoint, are
described at:

http://code.google.com/p/openflydata/wiki/FlyBaseMilestone3

The D2RQ maps used to generate the data are divided into a number of
separate mapping files, based around the modular structure of the
Chado schema [3]. These mapping files are currently available from the
openflydata code project:

http://openflydata.googlecode.com/

Specifically, all of the D2RQ maps for FlyBase are under the
trunk/chado svn repository path. They are *not* under the
trunk/flybase repository path -- that is earlier work, now superseded.

Note that similar D2RQ mapping files are also available for GeneDB [4],
which holds genomic data for the Sanger Pathogen Sequencing Unit (37
parasite genomes). GeneDB also has a publicly accessible database
instance based on the Chado schema [5], so I ported the D2RQ maps for
FlyBase to GeneDB as an experiment.

We used D2R server's dump-rdf utility to generate N-TRIPLES dumps from
FlyBase. Note that, at the time of writing, D2R server's dump-rdf
utility has some scalability limitations. I encountered no problems
with the smaller GeneDB on normal desktop machines, but when working
with the larger FlyBase I had to use machines (m1.xlarge ec2
instances) with a lot of RAM to get the transformation to
complete. This is entirely because D2R server doesn't use the JDBC
capability to fetch SQL results incrementally via a cursor, and
instead fetches the whole result set in one go. I recently submitted a
patch to the D2R team that fixes this, enabling any of the mappings to
be run on a much smaller machine; the patch may find its way into the
next D2R release.
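
As a minimal sketch of the cursor-based idea described above (an
illustration only, not the patch submitted to D2R), the following
Python pulls rows from a database in fixed-size batches and emits one
N-Triples line per row, so memory use is bounded by the batch size
rather than by the size of the result set. The toy "feature" table, the
example.org URI pattern, and the function names are all assumptions
made for the example. In JDBC the corresponding knob is
Statement.setFetchSize; whether it actually streams depends on the
driver.

```python
import sqlite3
from typing import Iterator, List, Tuple


def stream_rows(conn: sqlite3.Connection, sql: str,
                batch_size: int = 1000) -> Iterator[Tuple]:
    """Yield rows one by one, fetching at most batch_size rows per call."""
    cur = conn.cursor()
    cur.execute(sql)
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:  # an empty list means the result set is exhausted
            break
        yield from batch


def to_ntriple(feature_id: int, name: str) -> str:
    """Format one row as an N-Triples line (hypothetical URI pattern)."""
    subject = f"<http://example.org/feature/{feature_id}>"
    predicate = "<http://www.w3.org/2000/01/rdf-schema#label>"
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'{subject} {predicate} "{escaped}" .'


def demo() -> List[str]:
    """Build a tiny in-memory table and convert it batch by batch."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO feature VALUES (?, ?)",
                     [(1, "dpp"), (2, "wg"), (3, "hh")])
    sql = "SELECT feature_id, name FROM feature ORDER BY feature_id"
    return [to_ntriple(fid, name)
            for fid, name in stream_rows(conn, sql, batch_size=2)]


if __name__ == "__main__":
    for line in demo():
        print(line)
```

In a real dump the lines would be written straight to a file as they
are produced, instead of being collected in a list as demo() does.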

As a point of interest, some data comparing performance of a
TDB-backed SPARQL endpoint with the FlyBase relational database for
some comparable SQL and SPARQL queries are at:

http://code.google.com/p/openflydata/wiki/FlyBaseBenchmark

== BDGP In Situ Database ==

The Berkeley Drosophila Genome Project (BDGP) (fruitfly.org) maintains
a public database of mRNA in situ hybridisation images in Drosophila
embryos at different stages of embryo development [6]. This is an
extremely valuable source of gene expression data for Drosophila
functional genomics.

Details of our latest release of an RDF dataset derived from the BDGP
in situ database are available at:

http://code.google.com/p/openflydata/wiki/Bdgp

D2RQ maps for this database are available from the openflydata code
project:

http://openflydata.googlecode.com/

See the trunk/bdgp path in the svn repository. Jun Zhao led the work
on BDGP and can answer any further queries regarding this dataset.

The BDGP database changes much less frequently than FlyBase - afaik
the database hasn't changed since 20070309.

== FlyAtlas ==

FlyAtlas (flyatlas.org) is an online database of tissue-specific DNA
microarray data for Drosophila. It is complementary to BDGP, providing
quantitative data on gene expression in a number of adult and larval
tissues. This is also an invaluable gene expression data source for
Drosophila functional genomics.

FlyAtlas provide a spreadsheet download of their data. We (Graham
Klyne) wrote a Python conversion utility that parses the spreadsheet
and outputs a Turtle-format RDF dump. Further details are available
at:

http://code.google.com/p/openflydata/wiki/Flyatlas

The last update to FlyAtlas was in November 2008, when data on 5 new
tissues were added. Our current RDF dataset is from the previous
FlyAtlas release; we haven't updated our scripts to cope with the
newer data.

Note that, to link FlyAtlas data to FlyBase data you need probe
annotation tables from Affymetrix. The tables, also available as a
spreadsheet download, map Drosophila 2 microarray probe identifiers to
FlyBase gene identifiers. We wrote another Python script to convert
that to N-TRIPLES, which we merged with the FlyAtlas data. That script
(Probe2Gene.py) is also available from the same location in the
openflydata code project. Affymetrix periodically release updates to
that table, and our latest dataset is not the most current; see [7]
for the latest.
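
To make the probe-to-gene linking step concrete, here is a rough
sketch of that kind of conversion. It is not the Probe2Gene.py script
from the openflydata project. The column names, the example.org URIs,
the linking predicate, the " /// " multi-value separator, the "---"
missing-value marker, and the identifiers in the usage note are all
assumptions chosen for illustration.

```python
import csv
import io
from typing import Iterable, Iterator, Mapping
from urllib.parse import quote

# Hypothetical URI patterns; the real datasets use their own URI design.
PROBE_BASE = "http://example.org/probe/"
GENE_BASE = "http://example.org/gene/"
MAPS_TO = "http://example.org/vocab#mapsToGene"


def probe2gene_ntriples(rows: Iterable[Mapping[str, str]]) -> Iterator[str]:
    """Yield one N-Triples line per (probe, gene) pair in the table."""
    for row in rows:
        probe = row["probe_id"].strip()
        # Assumed convention: several gene ids in one cell are separated
        # by '///', and '---' marks a probe with no annotated gene.
        for gene in row["flybase_gene"].split("///"):
            gene = gene.strip()
            if not gene or gene == "---":
                continue
            yield (f"<{PROBE_BASE}{quote(probe, safe='')}> "
                   f"<{MAPS_TO}> "
                   f"<{GENE_BASE}{quote(gene, safe='')}> .")


def convert_csv(text: str) -> str:
    """Convert CSV text with a header row into an N-Triples document."""
    reader = csv.DictReader(io.StringIO(text))
    return "".join(line + "\n" for line in probe2gene_ntriples(reader))
```

For example, convert_csv applied to a table with the header
"probe_id,flybase_gene" and the made-up row
"0000002_at,FBgn0000002 /// FBgn0000003" yields two triples, one per
gene, which can then be concatenated with the expression data.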

== FlyTED ==

FlyTED is the Drosophila Testis Gene Expression Database, publishing
images of mRNA in situ hybridisation in Drosophila testes for several
hundred genes. It is a valuable resource for a more specific aspect of
Drosophila developmental biology (spermatogenesis).

FlyTED was developed and is maintained by Jun Zhao, so she is the best
person to contact re details of this database. Information on the RDF
dataset derived from FlyTED is at:

http://code.google.com/p/openflydata/wiki/Flyted

Jun wrote a Java program to harvest metadata from FlyTED via OAI-PMH,
then convert it to Turtle. She is currently handling some final
updates to the database, but after that we expect the database to
remain static.

----

A few last words...

I hope the excellent work of the IG on the integrated knowledge base
continues, and we see a much expanded coverage of linked datasets
across the life science domains, made available via robust and
performant SPARQL endpoints.

I would particularly like to emphasise the central role played by
model organism databases such as FlyBase. I would love to see stable,
well-engineered, and up-to-date RDF conversions available for all the
major model organism databases, which could then act as a hub for
linking the large number of peripheral databases.

One of our biggest challenges in FlyWeb has been dealing with the
vulnerability of open SPARQL endpoints to denial-of-service-type
problems. We explored some ideas for mitigating these problems via the
experimental sparqlite SPARQL protocol implementation, and we have
found the Jena TDB storage and query engine to perform well. However,
we are conscious that we only have partial solutions at best. SPARQL
is compelling because it provides an expressive, open-ended query
protocol, supporting a wide range of requirements. However, if
service-level guarantees cannot be provided for open SPARQL endpoints,
it is hard to make a firm business case for migrating production
systems. I hope we see this resolved in open-source implementations of
the SPARQL protocol in the not-too-distant future. If someone has
solved this already, then I'd love to hear about it!
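
This email does not describe which mitigations sparqlite used, so the
following is only a generic illustration of one simple kind of guard:
a per-query budget on the number of result rows and on elapsed time,
applied while results are streamed to the client. The names bounded
and BudgetExceeded are made up for this sketch. It is a partial
measure at best, since the checks only run between rows and cannot
interrupt a query that is slow to produce its first result.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


class BudgetExceeded(Exception):
    """Raised when a query uses more rows or time than it is allowed."""


def bounded(results: Iterable[T], max_rows: int, max_seconds: float,
            clock: Callable[[], float] = time.monotonic) -> Iterator[T]:
    """Pass rows through until the row or time budget is exhausted."""
    deadline = clock() + max_seconds
    for count, row in enumerate(results):
        if count >= max_rows:
            raise BudgetExceeded(f"more than {max_rows} rows requested")
        if clock() > deadline:
            raise BudgetExceeded(f"query ran longer than {max_seconds}s")
        yield row


if __name__ == "__main__":
    # A stand-in for a result iterator produced by a query engine.
    fake_results = ({"row": i} for i in range(10))
    try:
        for binding in bounded(fake_results, max_rows=5, max_seconds=2.0):
            print(binding)
    except BudgetExceeded as exc:
        print("aborted:", exc)
```

An endpoint using something like this would turn BudgetExceeded into
an error response, so that one expensive query cannot monopolise the
service indefinitely.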

Best wishes,

Alistair

[1] http://lists.w3.org/Archives/Public/public-semweb-lifesci/2009May/0031.html
[2] http://flybase.org/forums/viewtopic.php?f=4&t=110&sid=c234dd240ffdc57ab2db75e5f5408815
[3] http://gmod.org/wiki/Chado
[4] http://www.genedb.org/
[5] http://gmod.org/wiki/Public_Chado_Databases
[6] http://www.fruitfly.org/cgi-bin/ex/insitu.pl
[7] http://www.affymetrix.com/support/technical/byproduct.affx?product=fly-20

-- 
Alistair Miles
Senior Computing Officer
Image Bioinformatics Research Group
Department of Zoology
The Tinbergen Building
University of Oxford
South Parks Road
Oxford
OX1 3PS
United Kingdom
Web: http://purl.org/net/aliman
Email: alistair.miles@zoo.ox.ac.uk
Tel: +44 (0)1865 281993

Received on Tuesday, 12 May 2009 17:29:30 UTC