From: Alistair Miles <alistair.miles@zoo.ox.ac.uk>
Date: Wed, 17 Dec 2008 18:03:41 +0000
To: Chris Mungall <cjm@berkeleybop.org>
Cc: public-semweb-lifesci hcls <public-semweb-lifesci@w3.org>, David Sutherland <djs93@gen.cam.ac.uk>
Hi Chris,

A few more comments, to add to Jun's...

On Wed, Nov 05, 2008 at 01:34:46PM -0800, Chris Mungall wrote:

> Hi Alistair
>
> The UI is very nice!

Thanks! Btw any suggestions or ideas very welcome.

> I'm curious that you don't include any ontologies. The source datasets
> are quite ontology-centric (the Chado database in particular). The BDGP
> data includes annotation of each individual image with terms from
> fly_anatomy. This allows you to query for genes expressed in the brain
> (including its parts), or expressed in tissue derived from the
> neurectoderm, for example.

We hope to make good use of the ontology annotations in chado and bdgp in the near future. As Jun mentioned, for now we've used a minimal subset of the data, just what was required to build the applications we have so far. We're working incrementally, trying to focus on delivering functionality.

> A while back I created a D2RQ mapping of both the BDGP InSitu databases
> and Chado. See:
>
> http://www.bioontology.org/wiki/index.php/OBD:SPARQL-InSitu
>
> My approach was slightly different in that I was aiming for an
> ontologically sound representation rather than simply recapitulating the
> schema in RDFS. In retrospect, this was probably a little over-ambitious
> given the limitations of RDF technology and D2RQ in particular. Things
> may change when we have more OWL-centric databases and SQL mapping
> technology.

As Jun mentioned, your original d2r mapping was the starting point for our work; we learned a lot from it.

> In your Chado mapping, you're really just extracting synonym
> information. Is there really a need to define a new ontology here,
> rather than using, say, SKOS? Do you have plans to map more of the
> schema? I'm particularly interested in the representation of genomic
> intervals, and scalable querying.

For databases like bdgp and chado, we are taking an approach where we first develop a systematic, "faithful" mapping from the relational schema to rdf, trying to avoid making any assumptions or interpretations regarding the semantics intended in the relational schema. This gives us a method to work with, and means we can get going with the data asap. At a later stage we can then either adopt or map to existing ontologies, as we develop a deeper understanding of the semantics of the data.

The other possible upside is that, for a chado rdf schema that is a systematic translation of the chado relational schema, SPARQL queries will be predictable and look very similar to comparable sql queries, which may help existing chado users experiment with sparql.
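To make that concrete, here's a rough sketch using Apache Jena (modern package names; in 2008 these lived under com.hp.hpl.jena). The chado property names and the gene identifier below are made up for illustration, stand-ins for the feature table columns rather than terms from our published schema:

    import org.apache.jena.query.*;
    import org.apache.jena.rdf.model.*;

    public class FaithfulMappingDemo {
        public static void main(String[] args) {
            // Load a (hypothetical) local n-triples dump of the gene names data.
            Model model = ModelFactory.createDefaultModel();
            model.read("flybase-genenames.nt");

            // SQL against chado:
            //   SELECT name FROM feature WHERE uniquename = 'FBgn0000001'
            // reads almost identically as SPARQL against a faithful mapping
            // (chado:feature_uniquename / chado:feature_name are made-up
            // properties mirroring the feature table columns):
            String sparql =
                "PREFIX chado: <http://purl.org/net/chado/schema/> " +
                "SELECT ?name WHERE { " +
                "  ?feature chado:feature_uniquename 'FBgn0000001' ; " +
                "           chado:feature_name ?name . " +
                "}";
            try (QueryExecution qe = QueryExecutionFactory.create(sparql, model)) {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    System.out.println(results.next().get("name"));
                }
            }
        }
    }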
For the chado synonym data, we needed to distinguish between symbols, annotation ids, flybase ids, full name synonyms and other synonyms. SKOS doesn't really do enough to help us with that. Plus we needed to link sequence features (genes) to organisms, i.e. we needed a couple of classes.

We do plan to map more of chado into RDF, as much as possible; it's such a fantastic resource. However, as Jun mentioned, we are trying to be disciplined and only do what we need to deliver functionality in our cross-database search tools, so we're unlikely to produce a complete mapping of chado to rdf (as much as we'd like to).

> You provide 3 SPARQL endpoints. It looks like you're doing the mashup in
> the UI. In many ways this is a traditional AJAX architecture, albeit with
> SPARQL endpoints rather than, say, a REST interface to a relational db.

Yes, exactly.

> Did you find the triplestore/SPARQL route had particular advantages (or
> disadvantages)? What can you do that you can't do by simply going
> straight to the relational dbs?

The main upside we find is that SPARQL gives you an AJAX API for free, i.e. in theory you don't need to do any server-side coding, or design and maintain your own REST API. All you have to do is convert your data to RDF, load it into a triplestore that implements the SPARQL protocol, and you're ready to start coding in the browser. You do have to do a little more work in the browser to turn the SPARQL result set into useful objects, but it's no great overhead.
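For example, a request to one of our endpoints is just an HTTP GET with the query in a parameter. Sketched here in Java for convenience, though in the apps it's an XMLHttpRequest from the browser (and of course the endpoint may move or change):

    import java.io.*;
    import java.net.*;

    public class SparqlGetDemo {
        public static void main(String[] args) throws Exception {
            String query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10";
            // Our endpoints accept GET only and require output=json
            // (see the limitations listed in the announcement below).
            URL url = new URL("http://openflydata.org/query/flybase"
                    + "?query=" + URLEncoder.encode(query, "UTF-8")
                    + "&output=json");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), "UTF-8"));
            // The response is the standard SPARQL results JSON:
            //   { "head": { "vars": [...] }, "results": { "bindings": [...] } }
            // Each binding is one row -- this is the "little more work in
            // the browser" to turn bindings into useful objects.
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
            in.close();
        }
    }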
The other upside, from the community perspective, is that SPARQL provides a great deal of flexibility in terms of the questions you can ask of the data, which may be serendipitous for third parties seeking to re-use the data in a variety of possibly unforeseen ways (but that remains to be seen in practice :)

Actually, I should say that we did not presume to use RDF or SPARQL at any stage. For each dataset, we knew we wanted to provide an API to the data, so we asked: what's the quickest and easiest way we can get an API up and running? In all cases, converting to RDF then using a triplestore looked at least as cheap, if not cheaper, than doing our own relational db / REST thing. Where the two approaches looked comparable in effort, we favoured SPARQL because we wanted to contribute to the HCLS community efforts.

For the queries we needed, we found the performance of SPARQL query evaluation (using Jena TDB 0.6) perfectly adequate for AJAX apps. The FlyBase gene names dataset is ~10 million triples; that's our biggest dataset so far.

The downside of SPARQL is offering quality-of-service assurances, see below...

> I'm not sure why you needed to write your own SPARQL protocol on top of
> Jena. Isn't this what Joseki does?

Yes. This is a bit of a long story, I'll try to give the potted version.

We started using Jena RDB over postgres, with Joseki. Jena RDB is the older relational layout, which can be slow for SPARQL queries. To get better query performance, we moved to Jena SDB, which is optimised for SPARQL, with Joseki as the protocol implementation.

SDB provided good performance for our main queries, but quite early on we started hitting Java out-of-memory errors with some of our test queries. The most obvious one is "SELECT * WHERE { ?s ?p ?o }", i.e. get all the triples out of the database. We realised that, even if you use a persistent store at the back end, you still get memory bottlenecks, because JDBC by default does not stream result sets. So the protocol layer ends up building an in-memory representation of the result set, which for queries over a large dataset can be at least as big as the dataset itself. We realised that our platform was vulnerable to, perhaps unintentional, denial-of-service type attacks. This is an issue because of course we would like to make the AJAX apps we build as reliable and available as possible.
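For anyone who hits the same problem: with the postgres JDBC driver you have to opt in to streaming. Here's a minimal sketch, with made-up connection details and a made-up table (SDB's actual relational layout is different):

    import java.sql.*;

    public class StreamingDemo {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/store", "user", "pass");
            // By default the postgres JDBC driver materialises the whole
            // result set in memory. To stream via a server-side cursor you
            // need all three of: autocommit off, a forward-only read-only
            // result set, and a non-zero fetch size.
            conn.setAutoCommit(false);
            Statement stmt = conn.createStatement(
                    ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
            stmt.setFetchSize(1000); // rows per round trip, not total in memory
            ResultSet rs = stmt.executeQuery("SELECT s, p, o FROM triples");
            while (rs.next()) {
                // process one row at a time; memory use stays bounded
            }
            rs.close();
            stmt.close();
            conn.close();
        }
    }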
You can get SDB to stream end-to-end over JDBC, with postgres behind, given some custom configuration. With help from Andy Seaborne at HP we got that working in Joseki. But at the time Joseki introduced a feature whereby each request was wrapped in a transaction. This was for the general case where data might be being updated via another application at the same time as it is queried, and one SPARQL query might be evaluated as several SQL queries, so you'd want a consistent view of the data. The problem was that Joseki uses a single JDBC connection, not connection pooling, so when you use transactions you effectively prevent concurrent requests. Obviously our SPARQL endpoints needed to serve concurrent requests on the same datasource.

So, to work around those issues at the time, I hacked up a SPARQL protocol implementation -- SPARQLite -- designed to work with Jena SDB and postgres, but using database connection pooling, and configured by default to stream end-to-end without using transactions.

That worked fine until we started working with the FlyBase dataset, which at 10 million triples was larger than the datasets we'd used previously (bdgp was ~1m). We got prohibitively slow load times on our own rather underpowered VMware virtual server running on fairly old hardware. For FlyBase the 10 million triples is the tip of the iceberg, so we knew we needed better load times. I contacted Andy about this and he suggested trying the new Jena TDB native triplestore. To prove his point, he reported loading our 10m triples into a TDB store in ~300s on his own 64-bit hardware. So we started working with Jena TDB, and also experimenting with Amazon EC2, to see if we could get load performance that would be adequate for datasets up to hundreds of millions of triples. Btw Jena TDB is excellent, really simple to use and quick.

Joseki does work with TDB, but we decided to bake TDB support into SPARQLite and continue using that for the short term. Probably the main reason we've kept on with SPARQLite for the moment is that we have full control over the internals. This means we've been able to experiment with other quality-of-service features, like configurable query policies.

To explain this point a little: we've also found that, even with TDB running on an EC2 instance, you can still ask hard queries that take a long time (minutes) to evaluate. These are typically queries with FILTERs, where the graph pattern is not very selective, so the query ends up sucking lots of triples out of the store before applying the filters. Queries like OPTIONAL ... FILTER !bound(...) are the main culprits, which is a pain because these are very useful for validating data (looking for missing data). So at the moment we use SPARQLite to implement restrictions on SPARQL queries for public endpoints, as part of an ongoing desire to make our services as robust as possible and to reduce vulnerability to any denial-of-service type problems. At the moment SPARQLite works with either Jena SDB or TDB, but we only use TDB.

Andy has suggested that, rather than restrict the queries, we should just implement a timeout on query evaluation, i.e. if a query takes longer than n seconds, kill it. If we get time, we'll certainly be exploring that.
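To give a flavour of the restrictions, here's a sketch of a query policy check. This is not SPARQLite's actual code, just the general shape of the idea, written against current Apache Jena/ARQ package names:

    import org.apache.jena.query.*;
    import org.apache.jena.sparql.syntax.*;

    public class QueryPolicy {
        static final long MAX_RESULTS = 500;

        /** Throws IllegalArgumentException if the query breaks the policy. */
        public static Query check(String queryString) {
            Query query = QueryFactory.create(queryString);
            // Endpoint limitation: SELECT and ASK only.
            if (!query.isSelectType() && !query.isAskType()) {
                throw new IllegalArgumentException("only SELECT and ASK are supported");
            }
            // Refuse the expensive shapes discussed above (FILTERs over
            // unselective patterns, OPTIONAL ... FILTER !bound(...)).
            ElementWalker.walk(query.getQueryPattern(), new ElementVisitorBase() {
                @Override public void visit(ElementFilter el) {
                    throw new IllegalArgumentException("FILTER not allowed");
                }
                @Override public void visit(ElementOptional el) {
                    throw new IllegalArgumentException("OPTIONAL not allowed");
                }
            });
            // Clamp SELECTs to the advertised 500-result limit.
            if (query.isSelectType()
                    && (!query.hasLimit() || query.getLimit() > MAX_RESULTS)) {
                query.setLimit(MAX_RESULTS);
            }
            return query;
        }
    }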
But SPARQL implementation is not really our main focus; we'd rather spend more time working with the data and building applications. So hopefully that explains the "why SPARQLite" :) Longer term we'll probably go back to something off-the-shelf, as we don't really have the bandwidth to maintain SPARQLite.

Cheers,

Alistair

> Interested to see future developments
>
> Cheers
> Chris
>
> On Nov 5, 2008, at 8:44 AM, Alistair Miles wrote:
>
>> Dear all,
>>
>> This is a summary of work so far by the FlyWeb Project team. We're exploring integration of life science data in support of Drosophila (fruit fly) functional genomics. We'd like to develop credible, robust and genuinely useful tools for the Drosophila research community, and to provide data and services of value to bioinformaticians and Semantic Web / Life Science developers.
>>
>> This is the first time we've announced our work more widely, and we'd very much appreciate thoughts, suggestions, feedback, re-use and testing of the applications, services, software and data described below. Please note however that this is work in progress, and things may break, change, move or disappear without notice.
>>
>> = Search Applications =
>>
>> http://openflydata.org/search/insitus
>>
>> This application allows you to search for images of in situ RNA hybridisation experiments, depicting expression of specific genes in different organs (testes and embryos). It is a mashup of data from the Berkeley Drosophila Genome Project (BDGP) and the Drosophila Testis Gene Expression Database (Fly-TED). It also uses data from FlyBase to disambiguate gene name synonyms.
>>
>> It's a pure AJAX application using SPARQL to access data from each of the three sources on the fly (pardon the pun :).
>>
>> = RDF Data =
>>
>> The following RDF data used in the search application above are available for bulk download:
>>
>> * http://openflydata.org/dump/flybase (latest)
>>   http://openflydata.org/dump/flybase_genenames_20081017 (snapshot)
>>
>>   data on D. melanogaster gene identifiers, symbols and synonyms, derived from flybase.org; approx 8 million triples; gzipped n-triples
>>
>> * http://openflydata.org/dump/bdgp (latest)
>>   http://openflydata.org/dump/bdgp_images_20081030 (snapshot)
>>
>>   metadata on images of embryo in situ gene expression experiments, derived from fruitfly.org; approx 1 million triples; gzipped n-triples
>>
>> * http://openflydata.org/dump/flyted (latest)
>>   http://openflydata.org/dump/flyted_20080626 (snapshot)
>>
>>   metadata on images of testis in situ gene expression experiments, derived from www.fly-ted.org; approx 30,000 triples; gzipped turtle
>>
>> = Data Services =
>>
>> The following SPARQL endpoints are available for queries over the above data. See also limitations below.
>>
>> * http://openflydata.org/query/flybase (latest)
>>   http://openflydata.org/query/flybase_genenames_20081017 (snapshot)
>>
>> * http://openflydata.org/query/bdgp (latest)
>>   http://openflydata.org/query/bdgp_images_20081030 (snapshot)
>>
>> * http://openflydata.org/query/flyted (latest)
>>   http://openflydata.org/query/flyted_20080626 (snapshot)
>>
>> Limitations: only GET requests are supported; only SELECT and ASK queries are supported; only the JSON results format is supported (requests must specify output=json); SELECT queries are limited to max 500 results; no more than 5 requests per second from any one origin.
>>
>> The endpoints are implemented using our own Java SPARQL protocol implementation (SPARQLite, see below) backed by Jena TDB 0.6 stores. The endpoints run inside Tomcat 5.5 behind Apache 2.2 via mod_jk, on a small EC2 instance, with TDB storing data on an attached EBS volume.
>>
>> = Software Downloads & Source Code =
>>
>> * FlyUI
>>   http://flyui.googlecode.com
>>
>>   This is a library of composable javascript widgets, providing a user interface to the above data. These widgets are used to build the search application above. FlyUI is built on YAHOO's javascript user interface library (YUI).
>>
>> * SPARQLite
>>   http://sparqlite.googlecode.com
>>
>>   This is an experimental and incomplete implementation of the SPARQL protocol, designed to work with Jena TDB or SDB stores. We're using this as a platform to explore a number of quality-of-service issues that SPARQL raises.
>>
>> = Ontologies/Schemas =
>>
>> The following OWL schemas are used in the above data:
>>
>> * CHADO OWL Schema
>>   http://purl.org/net/chado/schema/
>>
>>   This is an OWL representation of a subset of the CHADO relational schema used by FlyBase (see http://gmod.org/wiki/Schema).
>>
>> * FlyBase OWL Synonym Types
>>   http://purl.org/net/flybase/synonym-types/
>>
>>   This is a micro-ontology, representing the FlyBase synonym type vocabulary.
>>
>> * BDGP OWL Schema
>>   http://purl.org/net/bdgp/schema/
>>
>>   This is an OWL representation of a subset of the BDGP relational schema.
>>
>> * FlyTED OWL Schemas
>>
>>   These are under revision, to be published shortly.
>>
>> = RDF Data Conversion Utilities =
>>
>> The following utilities were developed to obtain the RDF data described above:
>>
>> * CHADO/FlyBase D2RQ Map
>>   http://code.google.com/p/openflydata/source/browse/trunk/flybase/genenames/d2r-flybase-genenames.ttl
>>
>>   This provides a mapping from the CHADO/FlyBase relational schema to the CHADO/FlyBase OWL ontologies, for basic D. melanogaster gene (feature) data (identifiers, symbols, synonyms, species).
>>
>> * BDGP D2RQ Map
>>   http://code.google.com/p/openflydata/source/browse/trunk/bdgp/imagemapping/d2r-bdgp-insituimages.ttl
>>
>>   This maps the BDGP relational schema to OWL/RDF.
>>
>> See also: http://openflydata.googlecode.com
>>
>> = Future Developments =
>>
>> We're currently working on improving the user interface to the BDGP data (grouping and ordering images by developmental stage) and on integrating expression level data from FlyAtlas.
>>
>> Other suggestions for future developments are warmly welcomed.
>>
>> = Acknowledgments =
>>
>> Thanks especially to Helen White-Cooper and Andy Seaborne for all their help.
>>
>> The FlyWeb Project is funded by the UK Joint Information Systems Committee (JISC).
>>
>> = Further Information =
>>
>> The FlyWeb project website is at:
>>
>> http://imageweb.zoo.ox.ac.uk/wiki/index.php/FlyWeb_project
>>
>> Graham will be presenting this work at the UK SWIG meeting next week.
>>
>> Or send us an email :)
>>
>> Kind regards,
>>
>> Alistair Miles
>> Jun Zhao
>> Graham Klyne
>> David Shotton

--
Alistair Miles
Senior Computing Officer
Image Bioinformatics Research Group
Department of Zoology
The Tinbergen Building
University of Oxford
South Parks Road
Oxford
OX1 3PS
United Kingdom
Web: http://purl.org/net/aliman
Email: alistair.miles@zoo.ox.ac.uk
Tel: +44 (0)1865 281993