Re: Future of FlyWeb work on Drosophila RDF Data

Just want to share the following news article:

http://www.medicalnewstoday.com/articles/53437.php

The title of the article is: "A Valuable Fly For Research Into Cancer,
Drug Addiction, Neurodegenerative Diseases, Epilepsy And More"

Cheers,

-Kei

Alistair Miles wrote:
> Dear all,
>
> As I mentioned in the FlyWeb announcement email earlier today [1], the
> FlyWeb project is currently winding up. While the current set of
> applications and services, including the SPARQL endpoints, will
> persist in their current state at least until May 2010, we do not have
> continued funding to extend this work, or to track changes to the
> underlying data sources.
>
> This email provides some details of the four main RDF datasets
> associated with Drosophila (fruit flies) that we have produced. If the
> IG sees fit, I would be more than happy for these datasets to be
> incorporated and maintained within the HCLS KB. I would also be happy
> for any member(s) of the IG to take this up independently.
>
> == FlyBase ==
>
> The largest RDF dataset we have generated is derived from the FlyBase
> Drosophila genome database (flybase.org). FlyBase contains a diverse
> set of well-curated genome-associated data, and is the primary
> resource for Drosophila genomics.
>
> The dataset released in FM3 is based on FlyBase version
> FB2009_02. FlyBase release a new version of their underlying database
> roughly once per month [2], and the current version is FB2009_04, so
> we are already two versions behind. However, afaik the schema hasn't
> changed, so the D2RQ maps should still be applicable.
>
> Our FM3 FlyBase dataset is ~175 million triples. Full details of the
> dataset, with links to downloads, D2RQ maps used to generate the
> dataset, details of URI design, and location of SPARQL endpoint, are
> described at:
>
> http://code.google.com/p/openflydata/wiki/FlyBaseMilestone3
>
> The D2RQ maps used to generate the data are divided into a number of
> separate mapping files, based around the modular structure of the
> Chado schema [3]. These mapping files are currently available from the
> openflydata code project:
>
> http://openflydata.googlecode.com/
>
> Specifically, all of the D2RQ maps for FlyBase are under the
> trunk/chado svn repository path. They are *not* under the
> trunk/flybase repository path -- that is earlier work, now superseded.
>
> Note that similar D2RQ mapping files are also available for GeneDB [4],
> which holds genomic data for the Sanger Pathogen Sequencing Unit (37
> parasite genomes). GeneDB also has a publicly accessible database
> instance based on the Chado schema [5], so I ported the D2RQ maps for
> FlyBase to GeneDB, as an experiment.
>
> We used D2R server's dump-rdf utility to generate N-TRIPLES dumps from
> FlyBase. Note that, at the time of writing, D2R server's dump-rdf
> utility has some scalability limitations. I encountered no problems
> with the smaller GeneDB on normal desktop machines, but when working
> with the larger FlyBase I had to use machines (m1.xlarge ec2
> instances) with a lot of RAM to get the transformation to
> complete. This is entirely due to the fact that D2R server doesn't
> make use of the JDBC capability to fetch SQL results a bit at a time,
> based on a cursor, rather than fetching the whole thing in one
> go. Recently I submitted a patch to the D2R team which fixes this,
> enabling any of the mappings to be run on a much smaller machine;
> the patch may find its way into the next D2R release.
>
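
A minimal sketch of the cursor-based fetching idea described above,
written in Python with the standard-library sqlite3 module rather than
JDBC and D2R. The table, column names, and URIs are hypothetical; the
point is only that rows are pulled in fixed-size batches and written out
as they arrive, instead of loading the whole result set at once.

```python
import io
import sqlite3


def dump_ntriples(conn, out, batch_size=1000):
    """Write one N-Triples line per row, fetching rows in batches."""
    cur = conn.cursor()
    cur.execute("SELECT id, name FROM feature ORDER BY id")
    written = 0
    while True:
        rows = cur.fetchmany(batch_size)  # at most batch_size rows held in memory
        if not rows:
            break
        for feature_id, name in rows:
            subject = f"<http://example.org/feature/{feature_id}>"  # hypothetical URI
            # Escape backslashes, quotes and newlines for an N-Triples literal.
            literal = (
                name.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
            )
            out.write(
                f'{subject} <http://www.w3.org/2000/01/rdf-schema#label> "{literal}" .\n'
            )
            written += 1
    return written


# Tiny in-memory stand-in for a Chado-style feature table (made-up data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feature (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO feature VALUES (?, ?)", [(i, f"gene{i}") for i in range(1, 6)]
)

buf = io.StringIO()
count = dump_ntriples(conn, buf, batch_size=2)
```

In JDBC the analogous control is Statement.setFetchSize; whether and how
the submitted patch uses it is not stated in the email.
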
> As a point of interest, some data comparing performance of a
> TDB-backed SPARQL endpoint with the FlyBase relational database for
> some comparable SQL and SPARQL queries are at:
>
> http://code.google.com/p/openflydata/wiki/FlyBaseBenchmark
>
> == BDGP In Situ Database ==
>
> The Berkeley Drosophila Genome Project (BDGP) (fruitfly.org) maintains
> a public database of mRNA in situ hybridisation images in Drosophila
> embryos at different stages of embryo development [6]. This is an
> extremely valuable source of gene expression data for Drosophila
> functional genomics.
>
> Details of our latest release of an RDF dataset derived from the BDGP
> in situ database are available at:
>
> http://code.google.com/p/openflydata/wiki/Bdgp
>
> D2RQ maps for this database are available from the openflydata code
> project:
>
> http://openflydata.googlecode.com/
>
> See the trunk/bdgp path in the svn repository. Jun Zhao led the work
> on BDGP, so she can answer any further queries regarding this
> dataset.
>
> The BDGP database changes much less frequently than FlyBase - afaik
> the database hasn't changed since 20070309.
>
> == FlyAtlas ==
>
> FlyAtlas (flyatlas.org) is an online database of tissue-specific DNA
> microarray data for Drosophila. It is complementary to BDGP, providing
> quantitative data on gene expression in a number of adult and larval
> tissues. This is also an invaluable gene expression data source for
> Drosophila functional genomics.
>
> FlyAtlas provide a spreadsheet download of their data. We (Graham
> Klyne) wrote a Python conversion utility that parses the spreadsheet
> and outputs a Turtle format RDF dump.  Further details are available
> at:
>
> http://code.google.com/p/openflydata/wiki/Flyatlas
>
> The last update to FlyAtlas was in November last year, when data on 5
> new tissues were added. Our current RDF dataset is from the previous
> FlyAtlas release. We haven't updated our scripts to cope with the
> newer data. 
>
> Note that, to link FlyAtlas data to FlyBase data you need probe
> annotation tables from Affymetrix. The tables, also available as a
> spreadsheet download, map Drosophila 2 microarray probe identifiers to
> FlyBase gene identifiers. We wrote another Python script to convert
> that to N-TRIPLES, which we merged with the FlyAtlas data. That script
> (Probe2Gene.py) is also available from the same location in the
> openflydata code project. Affymetrix do periodically release updates
> to that table, and our latest dataset is not the most current; see [7]
> for the latest.
>
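
A rough sketch of the kind of probe-to-gene conversion described above.
This is not the actual Probe2Gene.py: the column names, URI prefixes,
predicate, and sample identifiers are all made up for illustration, and
the "///" separator and "---" placeholder are assumptions about the
annotation table layout.

```python
import csv
import io

AFFY = "http://example.org/affy/probe/"          # hypothetical URI prefix
FLYBASE = "http://example.org/flybase/gene/"     # hypothetical URI prefix
MAPS_TO = "http://example.org/vocab/mapsToGene"  # hypothetical predicate


def probes_to_ntriples(table, out):
    """Emit one triple per (probe, gene) pair found in a tab-separated table."""
    count = 0
    for row in csv.DictReader(table, delimiter="\t"):
        probe = row["probe_id"].strip()
        # A probe set may list several genes; "///" as separator is an assumption.
        for gene in row["flybase_id"].split("///"):
            gene = gene.strip()
            if not gene or gene == "---":  # skip probes with no gene annotation
                continue
            out.write(f"<{AFFY}{probe}> <{MAPS_TO}> <{FLYBASE}{gene}> .\n")
            count += 1
    return count


# Made-up sample rows standing in for the annotation spreadsheet.
sample = io.StringIO(
    "probe_id\tflybase_id\n"
    "1622892_s_at\tFBgn0000001 /// FBgn0000002\n"
    "1622893_at\t---\n"
    "1622894_at\tFBgn0000003\n"
)
result = io.StringIO()
n = probes_to_ntriples(sample, result)
```

The resulting triples can then be loaded alongside the expression data,
so that a probe-level measurement can be joined to a gene identifier.
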
> == FlyTED ==
>
> FlyTED is the Drosophila Testis Gene Expression Database, publishing
> images of mRNA in situ hybridisation in Drosophila testes for several
> hundred genes. It is a valuable resource for a more specific aspect of
> Drosophila developmental biology (spermatogenesis).
>
> FlyTED was developed and is maintained by Jun Zhao, so she is the best
> person to contact re details of this database. Information on the RDF
> dataset derived from FlyTED is at:
>
> http://code.google.com/p/openflydata/wiki/Flyted
>
> Jun wrote a Java program to harvest metadata from FlyTED via OAI-PMH,
> then convert it to Turtle. She is currently handling some final
> updates to the database, but after that we expect the database to
> remain static.
>
> ----
>
> A few last words...
>
> I hope the excellent work of the IG on the integrated knowledge base
> continues, and we see a much expanded coverage of linked datasets
> across the life science domains, made available via robust and
> performant SPARQL endpoints.
>
> I would particularly like to emphasise the central role played by
> model organism databases such as FlyBase. I would love to see stable,
> well-engineered, and up-to-date RDF conversions available for all the
> major model organism databases, which could then act as a hub for
> linking the large number of peripheral databases.
>
> One of our biggest challenges in FlyWeb has been dealing with the
> vulnerability of open sparql endpoints to denial-of-service-type
> problems. We explored some ideas for mitigating these problems via the
> experimental sparqlite sparql protocol implementation, and we have
> found the Jena TDB storage and query engine to perform well; however,
> we are conscious that we only have partial solutions at best. SPARQL
> is compelling because it provides an expressive, open-ended query
> protocol, supporting a wide range of requirements. However, if
> service-level guarantees cannot be provided for open sparql endpoints,
> it is hard to make a firm business case for migrating production
> systems. I hope we see this resolved in open-source implementations of
> the sparql protocol in the not-too-distant future. If someone has
> solved this already, then I'd love to hear about it!
>
> Best wishes,
>
> Alistair
>
> [1] http://lists.w3.org/Archives/Public/public-semweb-lifesci/2009May/0031.html
> [2] http://flybase.org/forums/viewtopic.php?f=4&t=110&sid=c234dd240ffdc57ab2db75e5f5408815
> [3] http://gmod.org/wiki/Chado
> [4] http://www.genedb.org/
> [5] http://gmod.org/wiki/Public_Chado_Databases
> [6] http://www.fruitfly.org/cgi-bin/ex/insitu.pl
> [7] http://www.affymetrix.com/support/technical/byproduct.affx?product=fly-20
>
>   

Received on Thursday, 14 May 2009 14:53:51 UTC