- From: Alistair Miles <alistair.miles@zoo.ox.ac.uk>
- Date: Tue, 12 May 2009 18:28:46 +0100
- To: public-semweb-lifesci@w3.org
Dear all,

As I mentioned in the FlyWeb announcement email earlier today [1], the FlyWeb project is currently winding up. While the current set of applications and services, including the SPARQL endpoints, will persist in their current state at least until May 2010, we do not have continued funding to extend this work, or to track changes to the underlying data sources.

This email provides some details of the four main RDF datasets associated with Drosophila (fruit flies) that we have produced. If the IG sees fit, I would be more than happy for these datasets to be incorporated and maintained within the HCLS KB. I would also be happy for any member(s) of the IG to take this up independently.

== FlyBase ==

The largest RDF dataset we have generated is derived from the FlyBase Drosophila genome database (flybase.org). FlyBase contains a diverse set of well-curated genome-associated data, and is the primary resource for Drosophila genomics.

The dataset released in FM3 is based on FlyBase version FB2009_02. FlyBase release a new version of their underlying database roughly once per month [2], and the current version is FB2009_04, so we are already two versions behind. However, afaik the schema hasn't changed, so the D2RQ maps should still be applicable.

Our FM3 FlyBase dataset is ~175 million triples. Full details of the dataset, with links to downloads, D2RQ maps used to generate the dataset, details of URI design, and location of SPARQL endpoint, are described at:

http://code.google.com/p/openflydata/wiki/FlyBaseMilestone3

The D2RQ maps used to generate the data are divided into a number of separate mapping files, based around the modular structure of the Chado schema [3]. These mapping files are currently available from the openflydata code project:

http://openflydata.googlecode.com/

Specifically, all of the D2RQ maps for FlyBase are under the trunk/chado svn repository path.
They are *not* under the trunk/flybase repository path -- that is earlier work, now superseded.

Note that similar D2RQ mapping files are also available for GeneDB [4], which holds genomic data for the Sanger Pathogen Sequencing Unit (37 parasite genomes). GeneDB also has a publicly accessible database instance based on the Chado schema [5], so I ported the D2RQ maps for FlyBase to GeneDB, as an experiment.

We used D2R server's dump-rdf utility to generate N-TRIPLES dumps from FlyBase. Note that, at the time of writing, D2R server's dump-rdf utility has some scalability limitations. I encountered no problems with the smaller GeneDB on normal desktop machines, but when working with the larger FlyBase I had to use machines (m1.xlarge ec2 instances) with a lot of RAM to get the transformation to complete. This is entirely due to the fact that D2R server doesn't make use of the JDBC capability to fetch SQL results a bit at a time, based on a cursor, rather than fetching the whole thing in one go. Recently I submitted a patch to the D2R team which fixes this, enabling any of the mappings to be run on a much smaller machine; the patch may find its way into the next D2R release.

As a point of interest, some data comparing performance of a TDB-backed SPARQL endpoint with the FlyBase relational database for some comparable SQL and SPARQL queries are at:

http://code.google.com/p/openflydata/wiki/FlyBaseBenchmark

== BDGP In Situ Database ==

The Berkeley Drosophila Genome Project (BDGP) (fruitfly.org) maintains a public database of mRNA in situ hybridisation images in Drosophila embryos at different stages of embryo development [6]. This is an extremely valuable source of gene expression data for Drosophila functional genomics.
Details of our latest release of an RDF dataset derived from the BDGP in situ database are available at:

http://code.google.com/p/openflydata/wiki/Bdgp

D2RQ maps for this database are available from the openflydata code project:

http://openflydata.googlecode.com/

See the trunk/bdgp path in the svn repository.

Jun Zhao was leading the work on BDGP, and she can answer any further queries regarding this dataset.

The BDGP database changes much less frequently than FlyBase - afaik the database hasn't changed since 20070309.

== FlyAtlas ==

FlyAtlas (flyatlas.org) is an online database of tissue-specific DNA microarray data for Drosophila. It is complementary to BDGP, providing quantitative data on gene expression in a number of adult and larval tissues. This is also an invaluable gene expression data source for Drosophila functional genomics.

FlyAtlas provide a spreadsheet download of their data. We (Graham Klyne) wrote a Python conversion utility that parses the spreadsheet and outputs a Turtle format RDF dump. Further details are available at:

http://code.google.com/p/openflydata/wiki/Flyatlas

The last update to FlyAtlas was in November last year, when data on 5 new tissues were added. Our current RDF dataset is from the previous FlyAtlas release. We haven't updated our scripts to cope with the newer data.

Note that, to link FlyAtlas data to FlyBase data, you need probe annotation tables from Affymetrix. The tables, also available as a spreadsheet download, map Drosophila 2 microarray probe identifiers to FlyBase gene identifiers. We wrote another Python script to convert that to N-TRIPLES, which we merged with the FlyAtlas data. That script (Probe2Gene.py) is also available from the same location in the openflydata code project. Affymetrix do periodically release updates to that table, and our latest dataset is not the most current; see [7] for the latest.

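For anyone picking up the probe-to-gene linking step described above, here is a rough sketch of what such a conversion could look like. This is not the Probe2Gene.py script itself. The column names (probe_id, flybase_id), the example.org URIs, the mapsToGene predicate, the "///" separator, the "---" placeholder for unannotated probes, and the sample identifiers are all assumptions made for illustration; the real table layout and the URI design used in the datasets should be taken from the Affymetrix download and the openflydata wiki.

```python
import csv
import io

# Hypothetical namespaces and predicate, for illustration only.
PROBE_NS = "http://example.org/affymetrix/probe/"
GENE_NS = "http://example.org/flybase/gene/"
MAPS_TO = "http://example.org/vocab/mapsToGene"


def probe_table_to_ntriples(rows):
    """Yield one N-Triples line per (probe, gene) pair in the table."""
    for row in rows:
        probe = row["probe_id"].strip()
        # A probe set may be annotated with several genes; the "///"
        # separator is an assumption about the table layout.
        for gene in row["flybase_id"].split("///"):
            gene = gene.strip()
            # Skip empty cells and the assumed "---" no-annotation marker.
            if not probe or not gene or gene == "---":
                continue
            yield "<%s%s> <%s> <%s%s> ." % (
                PROBE_NS, probe, MAPS_TO, GENE_NS, gene)


# Made-up sample rows standing in for the spreadsheet download.
sample = io.StringIO(
    "probe_id,flybase_id\n"
    "1622892_s_at,FBgn0000001 /// FBgn0000002\n"
    "1622893_at,---\n"
)
triples = list(probe_table_to_ntriples(csv.DictReader(sample)))
for line in triples:
    print(line)
```

Because the function is a generator that emits one triple per line, the output can be written straight to a file and concatenated with other N-TRIPLES dumps, which is what makes merging with the FlyAtlas data straightforward.
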
== FlyTED ==

FlyTED is the Drosophila Testis Gene Expression Database, publishing images of mRNA in situ hybridisation in Drosophila testes for several hundred genes. It is a valuable resource for a more specific aspect of Drosophila developmental biology (spermatogenesis).

FlyTED was developed and is maintained by Jun Zhao, so she is the best person to contact re details of this database. Information on the RDF dataset derived from FlyTED is at:

http://code.google.com/p/openflydata/wiki/Flyted

Jun wrote a Java program to harvest metadata from FlyTED via OAI-PMH, then convert it to Turtle. She is currently handling some final updates to the database, but after that we expect the database to remain static.

----

A few last words...

I hope the excellent work of the IG on the integrated knowledge base continues, and we see a much expanded coverage of linked datasets across the life science domains, made available via robust and performant SPARQL endpoints.

I would particularly like to emphasise the central role played by model organism databases such as FlyBase. I would love to see stable, well-engineered, and up-to-date RDF conversions available for all the major model organism databases, which could then act as a hub for linking the large number of peripheral databases.

One of our biggest challenges in FlyWeb has been dealing with the vulnerability of open SPARQL endpoints to denial-of-service-type problems. We explored some ideas for mitigating these problems via the experimental sparqlite SPARQL protocol implementation, and we have found the Jena TDB storage and query engine to perform well. However, we are conscious that we only have partial solutions at best. SPARQL is compelling because it provides an expressive, open-ended query protocol, supporting a wide range of requirements. However, if service-level guarantees cannot be provided for open SPARQL endpoints, it is hard to make a firm business case for migrating production systems.
I hope we see this resolved in open-source implementations of the SPARQL protocol in the not-too-distant future. If someone has solved this already, then I'd love to hear about it!

Best wishes,

Alistair

[1] http://lists.w3.org/Archives/Public/public-semweb-lifesci/2009May/0031.html
[2] http://flybase.org/forums/viewtopic.php?f=4&t=110&sid=c234dd240ffdc57ab2db75e5f5408815
[3] http://gmod.org/wiki/Chado
[4] http://www.genedb.org/
[5] http://gmod.org/wiki/Public_Chado_Databases
[6] http://www.fruitfly.org/cgi-bin/ex/insitu.pl
[7] http://www.affymetrix.com/support/technical/byproduct.affx?product=fly-20

--
Alistair Miles
Senior Computing Officer
Image Bioinformatics Research Group
Department of Zoology
The Tinbergen Building
University of Oxford
South Parks Road
Oxford OX1 3PS
United Kingdom
Web: http://purl.org/net/aliman
Email: alistair.miles@zoo.ox.ac.uk
Tel: +44 (0)1865 281993
Received on Tuesday, 12 May 2009 17:29:30 UTC