Re: [ANN] News from the FlyWeb Project

Hi Chris,

A few more comments, to add to Jun's...

On Wed, Nov 05, 2008 at 01:34:46PM -0800, Chris Mungall wrote:
>
> Hi Alistair
>
> The UI is very nice!

Thanks! Btw, any suggestions or ideas are very welcome.

> I'm curious that you don't include any ontologies. The source datasets  
> are quite ontology-centric (the Chado database in particular). The BDGP 
> data includes annotation of each individual image with terms from  
> fly_anatomy. This allows you to query for genes expressed in the brain  
> (including its parts), or expressed in tissue derived from the  
> neurectoderm for example.

We hope to make good use of the ontology annotations in chado and bdgp
in the near future. As Jun mentioned, for now we've used a minimal
subset of the data required to build the applications we have so
far. We're working incrementally, trying to focus on delivering
functionality. 

> A while back I created a D2RQ mapping of both the BDGP InSitu databases 
> and Chado. See:
>
> 	http://www.bioontology.org/wiki/index.php/OBD:SPARQL-InSitu
>
> My approach was slightly different in that I was aiming for an  
> ontologically sound representation rather than simply recapitulating the 
> schema in RDFS. In retrospect, this was probably a little over ambitious 
> given the limitations of RDF technology and D2RQ in particular. Things 
> may change when we have more OWL-centric databases and SQL mapping 
> technology.

As Jun mentioned, your original D2RQ mapping was the starting point
for our work; we learned a lot from it.

> In your Chado mapping, you're really just extracting synonym  
> information. Is there really a need to define a new ontology here,  
> rather than using, say, SKOS? Do you have plans to map more of the  
> schema? I'm particularly interested in the representation of genomic  
> intervals, and scalable querying.

For databases like bdgp and chado, we are taking an approach where we
first develop a systematic, "faithful" mapping from the relational
schema to RDF, trying to avoid making any assumptions or
interpretations about the semantics intended in the relational
schema. This gives us a method to work with, and means we can get
going with the data asap. At a later stage we can then either adopt
or map to existing ontologies, as we develop a deeper understanding
of the semantics of the data.

The other possible upside is that, for a chado RDF schema that is a
systematic translation of the chado relational schema, SPARQL queries
will be predictable and will look very similar to the comparable SQL
queries, which may help existing chado users experiment with SPARQL.
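For instance, with a faithful mapping the correspondence is quite
direct. The property names below are purely illustrative (not the
actual terms from our chado OWL schema), but the shape is the point:

```sparql
# Roughly the SQL:
#   SELECT s.name
#   FROM feature f
#     JOIN feature_synonym fs USING (feature_id)
#     JOIN synonym s USING (synonym_id)
#   WHERE f.uniquename = 'FBgn0000490';
PREFIX chado: <http://purl.org/net/chado/schema/>
SELECT ?name WHERE {
  ?f chado:uniquename "FBgn0000490" .
  ?fs chado:feature ?f ;
      chado:synonym ?s .
  ?s chado:name ?name .
}
```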

For the chado synonym data, we needed to distinguish between symbols,
annotation IDs, FlyBase IDs, full name synonyms and other
synonyms. SKOS doesn't really do enough to help us with that. Plus we
needed to link sequence features (genes) to organisms, i.e. we needed
a couple of classes.
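As a sketch of why SKOS alone falls short: a query like the following
needs a class for the synonym type (the names here are illustrative,
not necessarily the published terms):

```sparql
PREFIX chado: <http://purl.org/net/chado/schema/>
PREFIX fbst:  <http://purl.org/net/flybase/synonym-types/>
# Find only the symbol-type synonyms of each gene, ignoring full
# names, annotation ids and other synonym types.
SELECT ?gene ?name WHERE {
  ?fs chado:feature ?gene ;
      chado:synonym ?s .
  ?s a fbst:Symbol ;
     chado:name ?name .
}
```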

We do plan to map more of chado into RDF, as much as possible -- it's
such a fantastic resource. However, as Jun mentioned, we are trying to
be disciplined and only do what we need to deliver functionality in
our cross-database search tools, so we're unlikely to produce a
complete mapping from chado to RDF (as much as we'd like to).

> You provide 3 SPARQL endpoints. It looks like you're doing the mashup in 
> the UI. In many ways this is a traditional AJAX architecture, albeit with 
> SPARQL endpoints rather than, say, a REST interface to a relational db. 

Yes, exactly.

> Did you find the triplestore/SPARQL route had particular advantages (or 
> disadvantages)? What can you do that you can't do by simply going 
> straight to the relational dbs?

The main upside we've found is that SPARQL gives you an AJAX API for
free. I.e. in theory you don't need to do any server-side coding, or
design and maintain your own REST API. All you have to do is convert
your data to RDF, load it into a triplestore that implements the
SPARQL protocol, and you're ready to start coding in the browser. You
do have to do a little more work in the browser to turn the SPARQL
result set into useful objects, but it's no great overhead.
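To illustrate that last point, flattening the SPARQL JSON results
format into plain objects only takes a few lines of javascript. This
is just a sketch with a hand-made result set, not code from FlyUI:

```javascript
// Flatten a SPARQL JSON result set (as returned by an endpoint with
// output=json) into an array of plain objects, keeping only the
// bound values and dropping the type/datatype information.
function bindingsToObjects(resultSet) {
  return resultSet.results.bindings.map(function (binding) {
    var obj = {};
    for (var v in binding) {
      if (binding.hasOwnProperty(v)) {
        obj[v] = binding[v].value;
      }
    }
    return obj;
  });
}

// Example with a hand-made result set (illustrative values only):
var sample = {
  head: { vars: ["gene", "symbol"] },
  results: { bindings: [
    { gene: { type: "uri", value: "http://example.org/FBgn0000490" },
      symbol: { type: "literal", value: "dpp" } }
  ]}
};
var objects = bindingsToObjects(sample);
// objects[0] is { gene: "http://example.org/FBgn0000490",
//                 symbol: "dpp" }
```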

The other upside from the community perspective is that SPARQL
provides a great deal of flexibility in terms of the questions you can
ask of the data, which may be serendipitous for third parties seeking
to re-use data in a variety of possibly unforeseen ways (but that
remains to be seen in practice :)

Actually, I should say that we did not presume from the outset that
we would use RDF or SPARQL. For each dataset, we knew we wanted to
provide an API to the data, so we asked: what's the quickest and
easiest way we can get an API up and running? In all cases,
converting to RDF then using a
triplestore looked at least as cheap, if not cheaper, than doing our
own relational db / REST thing. Where the two approaches looked
comparable in effort, we favoured SPARQL because we wanted to
contribute to the HCLS community efforts.

For the queries we needed, we found the performance of SPARQL query
evaluation (using Jena TDB 0.6) perfectly adequate for AJAX apps. The
FlyBase gene names dataset, at ~10 million triples, is our biggest
dataset so far.

The downside of SPARQL is offering quality-of-service assurances; see
below...

> I'm not sure why you needed to write your own SPARQL protocol on top of 
> Jena. Isn't this what Joseki does?

Yes. This is a bit of a long story; I'll try to give the potted
version.

We started out using Jena RDB over Postgres, with Joseki. Jena RDB is
the older relational layout, which can be slow for SPARQL queries. To
get better query performance, we moved to Jena SDB, which is
optimised for SPARQL, with Joseki as the protocol implementation.

SDB provided good performance for our main queries, but quite early
on we started hitting Java out-of-memory errors with some of our test
queries. The most obvious one is "SELECT * WHERE { ?s ?p ?o }" --
i.e. get all the triples out of the database. We realised that, even
if you use a persistent store at the back end, you still get memory
bottlenecks, because JDBC by default does not stream result sets. So
the protocol layer ends up building an in-memory representation of
the result set, which for queries over a large dataset can be at
least as big as the dataset itself.

We realised that our platform was vulnerable to, perhaps
unintentional, denial-of-service type attacks. This is an issue
because, of course, we would like to make the AJAX apps we build as
reliable and available as possible.

You can get SDB to stream end-to-end over JDBC, with Postgres behind
it, with some custom configuration. With help from Andy Seaborne at
HP we got that working in Joseki. But at the time, Joseki introduced
a feature whereby each request was wrapped in a transaction. This was
for the general case where data might be updated via another
application at the same time as it is queried, and one SPARQL query
might be evaluated as several SQL queries, so you'd want a consistent
view of the data. The problem was that Joseki uses a single JDBC
connection, not connection pooling, so when you use transactions you
effectively prevent concurrent requests. Obviously our SPARQL
endpoints needed to serve concurrent requests on the same datasource.

So, to work around those issues at the time, I hacked up a SPARQL
protocol implementation -- SPARQLite -- designed to work with Jena
SDB and Postgres, but using database connection pooling, and
configured by default to stream end-to-end without using
transactions.

That worked fine until we started working with the flybase dataset,
which at 10 million triples was larger than the datasets we'd used
previously (bdgp was ~1m). We got prohibitively slow load times on our
own, rather underpowered, VMware virtual server running on fairly old
hardware. For flybase the 10 million triples is the tip of the
iceberg, so we knew we needed better load times. I contacted Andy
about this and he suggested trying the new Jena TDB native
triplestore. To prove his point, he reported loading our 10m triples
into a TDB store in ~300s on his own 64 bit hardware.

So we started working with Jena TDB, and also experimenting with
Amazon EC2, to see if we could get load performance that would be
adequate for datasets up to hundreds of millions of triples. 

Btw Jena TDB is excellent, really simple to use and quick.

Joseki does work with TDB, but we decided to bake TDB support into
SPARQLite and continue using that for the short term. Probably the
main reason we've kept on with SPARQLite for the moment is that we
have full control over the internals. This means we've been able to
experiment with other quality-of-service features, like configurable
query policies.

To explain this point a little: we've also found that, even with TDB
running on an EC2 instance, you can still ask hard queries that take
a long time (minutes) to evaluate. These are typically queries with
FILTERs, where the graph pattern is not very selective, so the query
ends up pulling lots of triples out of the store before applying the
filters. Queries like OPTIONAL ... FILTER !bound(...) are the main
culprits, which is a pain because these are very useful for
validating data (looking for missing data).
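To give a concrete (and purely illustrative) example of the kind of
validation query I mean -- find all features with no synonym recorded
-- the property names are made up, but the shape is typical:

```sparql
PREFIX chado: <http://purl.org/net/chado/schema/>
SELECT ?f WHERE {
  ?f a chado:Feature .
  OPTIONAL { ?fs chado:feature ?f . }
  FILTER ( !bound(?fs) )
}
# The pattern { ?f a chado:Feature } matches every feature, so the
# store hands back all of them before the filter can discard any --
# hence the cost.
```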

So at the moment we use SPARQLite to implement restrictions on SPARQL
queries for public endpoints, as part of an ongoing effort to make
our services as robust as possible and to reduce our vulnerability to
any denial-of-service type problems. SPARQLite works with either Jena
SDB or TDB, but we only use TDB.

Andy has suggested that, rather than restrict the queries, we should
just implement a timeout on query evaluation, i.e. if a query takes
longer than n seconds, kill it. If we get time, we'll certainly
explore that. But SPARQL implementation is not really our main focus;
we'd rather spend the time working with the data and building
applications.

So hopefully that explains the "why SPARQLite" :) Longer term we'll
probably go back to something off-the-shelf; we don't really have the
bandwidth to maintain SPARQLite.

Cheers,

Alistair

>
> Interested to see future developments
>
> Cheers
> Chris
>
> On Nov 5, 2008, at 8:44 AM, Alistair Miles wrote:
>
>>
>> Dear all,
>>
>> This is a summary of work so far by the FlyWeb Project team. We're
>> exploring integration of life science data in support of Drosophila
>> (fruit fly) functional genomics. We'd like to develop credible, robust
>> and genuinely useful tools for the Drosophila research community; and
>> to provide data and services of value to bioinformaticians and
>> Semantic Web / Life Science developers.
>>
>> This is the first time we've announced our work more widely, and we'd
>> very much appreciate thoughts, suggestions, feedback, re-use and
>> testing of the applications, services, software and data described
>> below. Please note however that this is work in progress, and things
>> may break, change, move or disappear without notice.
>>
>>
>> = Search Applications =
>>
>> http://openflydata.org/search/insitus
>>
>> This application allows you to search for images of in situ RNA
>> hybridisation experiments, depicting expression of specific genes in
>> different organs (testes and embryos). It is a mashup of data from the
>> Berkeley Drosophila Genome Project (BDGP) and the Drosophila Testis
>> Gene Expression Database (Fly-TED). It also uses data from FlyBase to
>> disambiguate gene name synonyms.
>>
>> It's a pure AJAX application using SPARQL to access data from each of
>> the three sources on the fly (pardon the pun :).
>>
>>
>> = RDF Data =
>>
>> The following RDF data used in the search application above are
>> available for bulk download:
>>
>> * http://openflydata.org/dump/flybase (latest)
>>  http://openflydata.org/dump/flybase_genenames_20081017 (snapshot)
>>
>>  data on D. melanogaster gene identifiers, symbols and synonyms,
>>  derived from flybase.org; approx 8 million triples; gzipped
>>  n-triples
>>
>> * http://openflydata.org/dump/bdgp (latest)
>>  http://openflydata.org/dump/bdgp_images_20081030 (snapshot)
>>
>>  metadata on images of embryo in situ gene expression experiments,
>>  derived from fruitfly.org; approx 1 million triples; gzipped
>>  n-triples
>>
>> * http://openflydata.org/dump/flyted (latest)
>>  http://openflydata.org/dump/flyted_20080626 (snapshot)
>>
>>  metadata on images of testis in situ gene expression experiments,
>>  derived from www.fly-ted.org; approx 30,000 triples; gzipped turtle
>>
>>
>> = Data Services =
>>
>> The following SPARQL endpoints are available for queries over the
>> above data. See also limitations below.
>>
>> * http://openflydata.org/query/flybase (latest)
>>  http://openflydata.org/query/flybase_genenames_20081017 (snapshot)
>>
>> * http://openflydata.org/query/bdgp (latest)
>>  http://openflydata.org/query/bdgp_images_20081030 (snapshot)
>>
>> * http://openflydata.org/query/flyted (latest)
>>  http://openflydata.org/query/flyted_20080626 (snapshot)
>>
>> Limitations: only GET requests are supported; only SELECT and ASK
>> queries are supported; only JSON results format is supported (request
>> must specify output=json); SELECT queries are limited to max 500
>> results; no more than 5 requests per second from any one origin
>>
>> The endpoints are implemented using our own Java SPARQL protocol
>> implementation (SPARQLite, see below) backed by Jena TDB 0.6
>> stores. The endpoints run inside Tomcat 5.5 behind Apache 2.2 via
>> mod_jk, on a small EC2 instance, with TDB storing data on an attached
>> EBS volume.
>>
>>
>> = Software Downloads & Source Code =
>>
>> * FlyUI
>>  http://flyui.googlecode.com
>>
>> This is a library of composable javascript widgets, providing a
>> user-interface to above data. These widgets are used to build the
>> search application above. FlyUI is built on YAHOO's javascript user
>> interface library (YUI).
>>
>> * SPARQLite
>>  http://sparqlite.googlecode.com
>>
>> This is an experimental and incomplete implementation of the SPARQL
>> protocol, designed to work with Jena TDB or SDB stores. We're using
>> this as a platform to explore a number of quality of service issues
>> that SPARQL raises.
>>
>>
>> = Ontologies/Schemas =
>>
>> The following OWL schemas are used in the above data:
>>
>> * CHADO OWL Schema
>>  http://purl.org/net/chado/schema/
>>
>> This is an OWL representation of a subset of the CHADO relational
>> schema used by FlyBase (see http://gmod.org/wiki/Schema).
>>
>> * FlyBase OWL Synonym Types
>>  http://purl.org/net/flybase/synonym-types/
>>
>> This is a micro-ontology, representing the FlyBase synonym type
>> vocabulary.
>>
>> * BDGP OWL Schema
>>  http://purl.org/net/bdgp/schema/
>>
>> This is an OWL representation of a subset of the BDGP relational
>> schema.
>>
>> * FlyTED OWL Schemas
>>
>> These are under revision, to be published shortly.
>>
>>
>> = RDF Data Conversion Utilities =
>>
>> The following utilities were developed to obtain the RDF data
>> described above:
>>
>> * CHADO/FlyBase D2RQ Map
>>  http://code.google.com/p/openflydata/source/browse/trunk/flybase/genenames/d2r-flybase-genenames.ttl
>>
>> This provides a mapping from the CHADO/FlyBase relational schema to
>> the CHADO/FlyBase OWL ontologies, for basic D. melanogaster gene
>> (feature) data (identifiers, symbols, synonyms, species).
>>
>> * BDGP D2RQ Map
>>  http://code.google.com/p/openflydata/source/browse/trunk/bdgp/imagemapping/d2r-bdgp-insituimages.ttl
>>
>> This maps the BDGP relational schema to OWL/RDF.
>>
>> See also: http://openflydata.googlecode.com
>>
>>
>> = Future Developments =
>>
>> We're currently working on improving the user interface to the BDGP
>> data (grouping and ordering images by developmental stage) and on
>> integrating expression level data from FlyAtlas.
>>
>> Other suggestions for future developments are warmly welcomed.
>>
>>
>> = Acknowledgments =
>>
>> Thanks especially to Helen White-Cooper and Andy Seaborne for all
>> their help.
>>
>> The FlyWeb Project is funded by the UK Joint Information Systems
>> Committee (JISC).
>>
>>
>> = Further Information =
>>
>> The FlyWeb project website is at:
>>
>> http://imageweb.zoo.ox.ac.uk/wiki/index.php/FlyWeb_project
>>
>> Graham will be presenting this work at the UK SWIG meeting next week.
>>
>> Or send us an email :)
>>
>> Kind regards,
>>
>> Alistair Miles
>> Jun Zhao
>> Graham Klyne
>> David Shotton
>>
>>
>> -- 
>> Alistair Miles
>> Senior Computing Officer
>> Image Bioinformatics Research Group
>> Department of Zoology
>> The Tinbergen Building
>> University of Oxford
>> South Parks Road
>> Oxford
>> OX1 3PS
>> United Kingdom
>> Web: http://purl.org/net/aliman
>> Email: alistair.miles@zoo.ox.ac.uk
>> Tel: +44 (0)1865 281993
>>
>>
>>
>
>
>

-- 
Alistair Miles
Senior Computing Officer
Image Bioinformatics Research Group
Department of Zoology
The Tinbergen Building
University of Oxford
South Parks Road
Oxford
OX1 3PS
United Kingdom
Web: http://purl.org/net/aliman
Email: alistair.miles@zoo.ox.ac.uk
Tel: +44 (0)1865 281993

Received on Wednesday, 17 December 2008 18:04:19 UTC