- From: Kei Cheung <kei.cheung@yale.edu>
- Date: Thu, 14 May 2009 10:53:05 -0400
- To: Alistair Miles <alistair.miles@zoo.ox.ac.uk>
- Cc: public-semweb-lifesci@w3.org
Just want to share the following news article:

http://www.medicalnewstoday.com/articles/53437.php

The title of the article is: "A Valuable Fly For Research Into Cancer,
Drug Addiction, Neurodegenerative Diseases, Epilepsy And More"

Cheers,

-Kei

Alistair Miles wrote:
> Dear all,
>
> As I mentioned in the FlyWeb announcement email earlier today [1], the
> FlyWeb project is currently winding up. While the current set of
> applications and services, including the SPARQL endpoints, will
> persist in their current state at least until May 2010, we do not have
> continued funding to extend this work, or to track changes to the
> underlying data sources.
>
> This email provides some details of the four main RDF datasets
> associated with Drosophila (fruit flies) that we have produced. If the
> IG sees fit, I would be more than happy for these datasets to be
> incorporated and maintained within the HCLS KB. I would also be happy
> for any member(s) of the IG to take this up independently.
>
> == FlyBase ==
>
> The largest RDF dataset we have generated is derived from the FlyBase
> Drosophila genome database (flybase.org). FlyBase contains a diverse
> set of well-curated genome-associated data, and is the primary
> resource for Drosophila genomics.
>
> The dataset released in FM3 is based on FlyBase version
> FB2009_02. FlyBase release a new version of their underlying database
> roughly once per month [2], and the current version is FB2009_04, so
> we are already two versions behind. However, afaik the schema hasn't
> changed, so the D2RQ maps should still be applicable.
>
> Our FM3 FlyBase dataset is ~175 million triples.
> Full details of the dataset, with links to downloads, D2RQ maps used
> to generate the dataset, details of URI design, and location of SPARQL
> endpoint, are described at:
>
> http://code.google.com/p/openflydata/wiki/FlyBaseMilestone3
>
> The D2RQ maps used to generate the data are divided into a number of
> separate mapping files, based around the modular structure of the
> Chado schema [3]. These mapping files are currently available from the
> openflydata code project:
>
> http://openflydata.googlecode.com/
>
> Specifically, all of the D2RQ maps for FlyBase are under the
> trunk/chado svn repository path. They are *not* under the
> trunk/flybase repository path -- that is earlier work, now superseded.
>
> Note that similar D2RQ mapping files are also available for GeneDB [4],
> which holds genomic data for the Sanger Pathogen Sequencing Unit (37
> parasite genomes). GeneDB also has a publicly accessible database
> instance based on the Chado schema [5], so I ported the D2RQ maps for
> FlyBase to GeneDB, as an experiment.
>
> We used D2R server's dump-rdf utility to generate N-TRIPLES dumps from
> FlyBase. Note that, at the time of writing, D2R server's dump-rdf
> utility has some scalability limitations. I encountered no problems
> with the smaller GeneDB on normal desktop machines, but when working
> with the larger FlyBase I had to use machines (m1.xlarge EC2
> instances) with a lot of RAM to get the transformation to
> complete. This is entirely due to the fact that D2R server doesn't
> make use of the JDBC capability to fetch SQL results a bit at a time,
> based on a cursor, rather than fetching the whole thing in one
> go. Recently I submitted a patch to the D2R team which fixes this,
> enabling any of the mappings to be run on a much smaller machine. The
> patch may find its way into the next D2R release.
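
The difference between fetching a whole result set in one go and
streaming it through a cursor can be sketched as follows. This is only
an analogy in Python's DB-API, not D2R server's Java code or the patch
mentioned above; the `feature` table, its columns, and the example.org
URIs are hypothetical. In JDBC the corresponding hint is
Statement.setFetchSize.

```python
import sqlite3


def iter_rows(conn, sql, batch_size=1000):
    """Yield rows batch by batch instead of loading them all with fetchall()."""
    cur = conn.cursor()
    cur.execute(sql)
    while True:
        batch = cur.fetchmany(batch_size)  # at most batch_size rows held in memory
        if not batch:
            break
        yield from batch


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_ntriples(rows):
    """Turn (id, name) rows into N-Triples lines; the subject URI scheme is made up."""
    label = "<http://www.w3.org/2000/01/rdf-schema#label>"
    for feature_id, name in rows:
        subject = f"<http://example.org/feature/{feature_id}>"
        yield f'{subject} {label} "{escape_literal(name)}" .'


# Toy in-memory table standing in for a Chado-style relational source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feature (feature_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO feature VALUES (?, ?)",
    [(i, f"gene{i}") for i in range(5)],
)

query = "SELECT feature_id, name FROM feature ORDER BY feature_id"
lines = list(rows_to_ntriples(iter_rows(conn, query, batch_size=2)))
```

Because both functions are generators, each triple can be written out as
soon as its row arrives, so memory use depends on the batch size rather
than on the size of the table.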
>
> As a point of interest, some data comparing performance of a
> TDB-backed SPARQL endpoint with the FlyBase relational database for
> some comparable SQL and SPARQL queries are at:
>
> http://code.google.com/p/openflydata/wiki/FlyBaseBenchmark
>
> == BDGP In Situ Database ==
>
> The Berkeley Drosophila Genome Project (BDGP) (fruitfly.org) maintains
> a public database of mRNA in situ hybridisation images in Drosophila
> embryos at different stages of embryo development [6]. This is an
> extremely valuable source of gene expression data for Drosophila
> functional genomics.
>
> Details of our latest release of an RDF dataset derived from the BDGP
> in situ database are available at:
>
> http://code.google.com/p/openflydata/wiki/Bdgp
>
> D2RQ maps for this database are available from the openflydata code
> project:
>
> http://openflydata.googlecode.com/
>
> See the trunk/bdgp path in the svn repository. Jun Zhao was leading
> the work on BDGP, and she can answer any further queries regarding
> this dataset.
>
> The BDGP database changes much less frequently than FlyBase - afaik
> the database hasn't changed since 20070309.
>
> == FlyAtlas ==
>
> FlyAtlas (flyatlas.org) is an online database of tissue-specific DNA
> microarray data for Drosophila. It is complementary to BDGP, providing
> quantitative data on gene expression in a number of adult and larval
> tissues. This is also an invaluable gene expression data source for
> Drosophila functional genomics.
>
> FlyAtlas provide a spreadsheet download of their data. We (Graham
> Klyne) wrote a Python conversion utility that parses the spreadsheet
> and outputs a Turtle format RDF dump. Further details are available
> at:
>
> http://code.google.com/p/openflydata/wiki/Flyatlas
>
> The last update to FlyAtlas was in November last year, when data on 5
> new tissues were added. Our current RDF dataset is from the previous
> FlyAtlas release. We haven't updated our scripts to cope with the
> newer data.
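
For readers unfamiliar with this kind of conversion, a spreadsheet-to-Turtle
step might look roughly like the sketch below. It is not the openflydata
utility: the tab-separated layout, the column names (probe, tissue, mean),
the probe identifiers, and the ex: vocabulary are all assumptions made up
for illustration.

```python
import csv
import io

# Hypothetical namespace; the real dataset uses its own URI design.
PREFIXES = (
    "@prefix ex: <http://example.org/flyatlas/> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
)


def rows_to_turtle(tsv_text):
    """Convert tab-separated expression rows into a Turtle document (string).

    Assumes probe and tissue values contain no quotes or backslashes.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    blocks = [PREFIXES]
    for i, row in enumerate(reader):
        mean = float(row["mean"])  # fails loudly on a malformed number
        blocks.append(
            f'ex:measurement{i} ex:probe "{row["probe"]}" ;\n'
            f'    ex:tissue "{row["tissue"]}" ;\n'
            f'    ex:meanSignal "{mean}"^^xsd:double .\n'
        )
    return "\n".join(blocks)


# Made-up sample rows, not real FlyAtlas values.
SAMPLE = (
    "probe\ttissue\tmean\n"
    "probe_001\tbrain\t12.5\n"
    "probe_002\ttestis\t3.0\n"
)

turtle = rows_to_turtle(SAMPLE)
```

Writing one small block per spreadsheet row keeps the output easy to diff
against the source when a new release of the spreadsheet appears.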
>
> Note that, to link FlyAtlas data to FlyBase data you need probe
> annotation tables from Affymetrix. The tables, also available as a
> spreadsheet download, map Drosophila 2 microarray probe identifiers to
> FlyBase gene identifiers. We wrote another Python script to convert
> that to N-TRIPLES, which we merged with the FlyAtlas data. That script
> (Probe2Gene.py) is also available from the same location in the
> openflydata code project. Affymetrix do periodically release updates
> to that table, and our latest dataset is not the most current; see [7]
> for the latest.
>
> == FlyTED ==
>
> FlyTED is the Drosophila Testis Gene Expression Database, publishing
> images of mRNA in situ hybridisation in Drosophila testes for several
> hundred genes. It is a valuable resource for a more specific aspect of
> Drosophila developmental biology (spermatogenesis).
>
> FlyTED was developed and is maintained by Jun Zhao, so she is the best
> person to contact re details of this database. Information on the RDF
> dataset derived from FlyTED is at:
>
> http://code.google.com/p/openflydata/wiki/Flyted
>
> Jun wrote a Java program to harvest metadata from FlyTED via OAI-PMH,
> then convert it to Turtle. She is currently handling some final
> updates to the database, but after that we expect the database to
> remain static.
>
> ----
>
> A few last words...
>
> I hope the excellent work of the IG on the integrated knowledge base
> continues, and we see a much expanded coverage of linked datasets
> across the life science domains, made available via robust and
> performant SPARQL endpoints.
>
> I would particularly like to emphasise the central role played by
> model organism databases such as FlyBase. I would love to see stable,
> well-engineered, and up-to-date RDF conversions available for all the
> major model organism databases, which could then act as a hub for
> linking the large number of peripheral databases.
>
> One of our biggest challenges in FlyWeb has been dealing with the
> vulnerability of open SPARQL endpoints to denial-of-service-type
> problems. We explored some ideas for mitigating these problems via the
> experimental sparqlite SPARQL protocol implementation, and we have
> found the Jena TDB storage and query engine to perform well; however,
> we are conscious that we only have partial solutions at best. SPARQL
> is compelling because it provides an expressive, open-ended query
> protocol, supporting a wide range of requirements. However, if
> service-level guarantees cannot be provided for open SPARQL endpoints,
> it is hard to make a firm business case for migrating production
> systems. I hope we see this resolved in open-source implementations of
> the SPARQL protocol in the not-too-distant future. If someone has
> solved this already, then I'd love to hear about it!
>
> Best wishes,
>
> Alistair
>
> [1] http://lists.w3.org/Archives/Public/public-semweb-lifesci/2009May/0031.html
> [2] http://flybase.org/forums/viewtopic.php?f=4&t=110&sid=c234dd240ffdc57ab2db75e5f5408815
> [3] http://gmod.org/wiki/Chado
> [4] http://www.genedb.org/
> [5] http://gmod.org/wiki/Public_Chado_Databases
> [6] http://www.fruitfly.org/cgi-bin/ex/insitu.pl
> [7] http://www.affymetrix.com/support/technical/byproduct.affx?product=fly-20
Received on Thursday, 14 May 2009 14:53:51 UTC