Database versioning and maintenance from Peter Ansell on 2011-08-03 (public-semweb-lifesci@w3.org from August 2011)

From: Peter Ansell <ansell.peter@gmail.com>
Date: Wed, 3 Aug 2011 21:02:05 +1000
To: w3c semweb hcls <public-semweb-lifesci@w3.org>
Message-ID: <CAGYFOCRCtRhfB4=S0zEFc6mtZ84ZpQEWBrnwjJX6LkZ--FWOzg@mail.gmail.com>
Hi all,

Earlier today I received this enquiry from Joerg Kurt Wegner on the
Bio2RDF support tracker:

https://sourceforge.net/tracker/?func=detail&atid=814190&aid=3385405&group_id=142631

"I have a maintenance and development timeline question.

Can you please list all integrated data sources with their current
date and version, especially when data sources were updated the last
time.
I found a comment on the W3C mailing list, still this is not
sufficient for getting a "maintenance feeling".
http://lists.w3.org/Archives/Public/public-semweb-lifesci/2011Jul/0000.html
What is the general strategy with respect to the Chembl data set, I
appreciate the various versions out there and @Egon, can you ensure
regular update cycles and will Bio2RDF ensure that, too?
Would it not be better that the group of John Overington is publishing
regular updates as part of their regular releases?
Can someone knowledgeable please point me to an actual Wiki page with
all the data sources, versions, and normalizations/alignments?

Next question, what is the general development timeline for data
sources and applications, is there somewhere a clean overview? Thanks.

Best regards,
Joerg Kurt Wegner
http://www.joergkurtwegner.eu"


I think it requires some general discussion so I am posting the
response here, as he referenced a recent comment on this list in his
question.

Bio2RDF have always encouraged providers to publish RDF themselves,
although in some cases we duplicate the datasets to normalise them
with respect to blank nodes that are not resolvable using URIs. If
they provide SPARQL endpoints then we can use them directly with
dynamic rewriting of URLs to make it possible for scientists to use
the different Bio2RDF integrated services directly, for example,
http://bio2rdf.org/linksns/targetnamespace/namespace:identifier. As
far as I know there are no other Linked Data Providers that have
integrated datasets in this way, without pulling them all into a
single huge database.
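
As a rough illustration of the URL pattern above, here is a minimal
Python sketch that builds and splits such link URLs. The function
names and the example namespaces are made up for illustration; only
the linksns/targetnamespace/namespace:identifier shape comes from the
pattern quoted above, and nothing here is taken from the Bio2RDF
server code.

```python
BASE = "http://bio2rdf.org"


def links_url(target_namespace, namespace, identifier, base=BASE):
    """Build a URL of the form base/linksns/targetnamespace/namespace:identifier."""
    return f"{base}/linksns/{target_namespace}/{namespace}:{identifier}"


def parse_links_url(url, base=BASE):
    """Split a links URL back into (target_namespace, namespace, identifier)."""
    prefix = f"{base}/linksns/"
    if not url.startswith(prefix):
        raise ValueError(f"not a links URL: {url}")
    target_namespace, _, rest = url[len(prefix):].partition("/")
    # Split on the first colon only, so identifiers may contain colons.
    namespace, sep, identifier = rest.partition(":")
    if not (target_namespace and namespace and sep and identifier):
        raise ValueError(f"malformed links URL: {url}")
    return target_namespace, namespace, identifier


# Hypothetical example values, chosen only to show the shape of the URL.
example = links_url("uniprot", "geneid", "4129")
```

Whether a particular combination of namespaces resolves depends on
what the server has been configured to know about.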

However, that is a little bit of an aside. The reality is that Bio2RDF
relies on sporadic grant funding, as with most other scientific
knowledge providers, so there is only one person assigned to
RDFisation, and they have a PhD thesis and a different job to work on
at the same time. I, on the other hand, have only been working on the
server software (which is separate from the RDF generation and
maintenance). I have also started a new job, at around the same time
that my PhD has come close to ending. Regular update cycles need reliable
long-term funding, which in most cases is not available, as grant
funding bodies favour innovation over maintenance. See the recent
effort by KEGG to keep itself alive using subscriptions, even though it
is one of the bigger and more mature dataset providers. Bio2RDF is much,
much smaller in terms of funding, so there can be no guarantees, although
we will continue to make a best effort to keep the datasets available.

I don't have any answers about why there is no immediate information
available for Bio2RDF datasets. There was one effort to do this, at
http://release.bio2rdf.org/sparql , but it was not completed, so the
database is virtually empty.

On your question about Chembl in Bio2RDF, we currently use Egon's
SPARQL endpoint directly to provide access to it, but we can easily
switch, thanks to the way the server can be configured. If John
Overington's group is publishing RDF (preferably using a SPARQL
endpoint, with scripts so that others can regenerate the RDF from the
raw data if they need to), then we should be able to transparently
switch Bio2RDF to using that dataset, barring unresolvable changes in
the dataset structure and identifiers.
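
To make the idea of a configurable switch concrete, here is a small
Python sketch of a namespace-to-endpoint table. This is not the actual
Bio2RDF server configuration format; the table layout, the function
names and both example.org URLs are assumptions made purely for
illustration.

```python
# Hypothetical configuration: which SPARQL endpoint serves which namespace.
ENDPOINTS = {
    "chembl": "http://example.org/community-chembl/sparql",
}


def endpoint_for(namespace, endpoints):
    """Look up the endpoint currently configured for a namespace."""
    try:
        return endpoints[namespace]
    except KeyError:
        raise KeyError(f"no endpoint configured for namespace {namespace!r}")


def switch_endpoint(endpoints, namespace, new_url):
    """Return a new configuration with one namespace pointed elsewhere."""
    updated = dict(endpoints)
    updated[namespace] = new_url
    return updated


# Pointing the namespace at a different (hypothetical) provider leaves
# the public namespace name, and so the URLs users see, unchanged.
switched = switch_endpoint(ENDPOINTS, "chembl", "http://example.org/official-chembl/sparql")
```

The point of the design is that only the table changes; anything that
refers to the namespace by name keeps working as long as the dataset
structure and identifiers stay compatible.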

Hopefully in the future, SPARQL 1.1 Service Descriptions will be
widely deployed, with database authors providing integrated database
provenance, giving up to date access to all of this information
without people having to maintain it in a centralised list. You would
then be able to generate lists etc. within minutes, assuming these are
static files and not slow, automatically generated on-the-fly
descriptions.
The only centralised list that I know of is ckan.org, but even it
would be no match for individual publishing at the SPARQL endpoint
level.
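
For anyone who wants to try this against an endpoint, the SPARQL 1.1
Service Description specification says that a service should return
its description when the endpoint URL is dereferenced with a plain
HTTP GET and no query parameters. A minimal Python sketch of that
request follows; the function names, the choice of Turtle as the
requested format and the example.org URL are my assumptions, and many
endpoints may not support this at all.

```python
import urllib.request


def service_description_request(endpoint_url, accept="text/turtle"):
    """Build a GET request for the endpoint URL itself, with no query string."""
    return urllib.request.Request(
        endpoint_url, headers={"Accept": accept}, method="GET"
    )


def fetch_service_description(endpoint_url, timeout=30):
    """Fetch the service description as text (assumes a UTF-8 response)."""
    request = service_description_request(endpoint_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Building the request does not touch the network; the URL is a placeholder.
request = service_description_request("http://example.org/sparql")
```

Running fetch_service_description over a list of endpoint URLs would
be one way to collect whatever version and provenance statements each
provider chooses to publish.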

On a slight side note, the registry functions provided by other
distributed data providers, specifically, BioMart, could easily be
extended to provide both RDF (as they provide the basic elements with
their SPARQL/XML-only query endpoints) and Service Descriptions for
their endpoints, as the basic mechanism is already there. However, the
maintenance schedules and updates will always rely on long term static
data hosting funding being available.

Cheers,

Peter Ansell
Received on Wednesday, 3 August 2011 11:02:36 UTC