- From: Peter Ansell <ansell.peter@gmail.com>
- Date: Wed, 3 Aug 2011 21:02:05 +1000
- To: w3c semweb hcls <public-semweb-lifesci@w3.org>
Hi all,

Earlier today I received this enquiry from Joerg Kurt Wegner on the Bio2RDF support tracker:

https://sourceforge.net/tracker/?func=detail&atid=814190&aid=3385405&group_id=142631

"I have a maintenance and development timeline question. Can you please list all integrated data sources with their current date and version, especially when data sources were updated the last time. I found a comment on the W3C mailing list, still this is not sufficient for getting a "maintenance feeling". http://lists.w3.org/Archives/Public/public-semweb-lifesci/2011Jul/0000.html

What is the general strategy with respect to the Chembl data set, I appreciate the various versions out there and @Egon, can you ensure regular update cycles and will Bio2RDF ensure that, too? Would it not be better that the group of John Overington is publishing regular updates as part of their regular releases?

Can someone knowledgeable please point me to an actual Wiki page with all the data sources, versions, and normalizations/alignments? Next question, what is the general development timeline for data sources and applications, is there somewhere a clean overview?

Thanks. Best regards, Joerg Kurt Wegner http://www.joergkurtwegner.eu"

I think it requires some general discussion, so I am posting the response here, as he referenced a recent comment on this list in his question.

Bio2RDF has always encouraged providers to publish RDF themselves, although in some cases we duplicate the datasets to normalise them with respect to blank nodes that are not resolvable using URIs. If providers offer SPARQL endpoints, we can use them directly, with dynamic rewriting of URLs so that scientists can use the different Bio2RDF integrated services directly, for example, http://bio2rdf.org/linksns/targetnamespace/namespace:identifier. As far as I know, no other Linked Data providers have integrated datasets in this way without pulling them all into a single huge database.
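To make the URL pattern above concrete, here is a minimal Python sketch that only assembles a URL of the form linksns/targetnamespace/namespace:identifier. The function name links_url and the example values ("chembl", "geneid", "12345") are illustrative placeholders, not a documented Bio2RDF API, and the sketch does not contact the server.

```python
from urllib.parse import quote

# Base host taken from the example URL in this message.
BIO2RDF_BASE = "http://bio2rdf.org"


def links_url(target_namespace, namespace, identifier, base=BIO2RDF_BASE):
    """Build a linksns URL following the pattern
    <base>/linksns/<targetnamespace>/<namespace>:<identifier>.

    Hypothetical helper: it only formats the string and percent-encodes
    each part; it makes no claim about what the server returns.
    """
    return "{}/linksns/{}/{}:{}".format(
        base,
        quote(target_namespace, safe=""),
        quote(namespace, safe=""),
        quote(identifier, safe=""),
    )


if __name__ == "__main__":
    # Placeholder namespaces and identifier, chosen only for illustration.
    print(links_url("chembl", "geneid", "12345"))
```

Whether a given namespace pair resolves to anything depends on how the server is configured at the time.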
However, that is a little bit of an aside. The reality is that Bio2RDF relies on sporadic grant funding, as do most other scientific knowledge providers, so there is only one person assigned to RDFisation, and they have a PhD thesis and a different job to work on at the same time. I, on the other hand, have only been working on the server software (which is separate from the RDF generation and maintenance). I have also started a new job, at around the same time as my PhD has come close to ending.

Regular update cycles need reliable long-term funding, which in most cases is not available, as grant funding bodies favour innovation over maintenance. See the recent effort by KEGG to keep itself alive using subscriptions, and it was one of the bigger and more mature dataset providers. Bio2RDF is much smaller in terms of funding, so there will be no guarantees, although we will make a best effort to keep the datasets available.

I don't have any answers about why there is no immediate information available for Bio2RDF datasets. There was one effort to do this, at http://release.bio2rdf.org/sparql , but it was not completed, so the database is virtually empty.

On your question about Chembl in Bio2RDF, we currently use Egon's SPARQL endpoint directly to provide access to it, but we can easily switch, thanks to the way the server can be configured. If John Overington is publishing RDF (preferably using a SPARQL endpoint and scripts so that others can regenerate the RDF from the raw data if they need to), then we should be able to transparently switch Bio2RDF to using that dataset, barring unresolvable changes in the dataset structure and identifiers.
Hopefully in the future, SPARQL 1.1 Service Descriptions will be widely deployed, with database authors providing integrated database provenance, so that up-to-date access to all of this information is available without people having to maintain it in a centralised list. You will then be able to generate lists etc. within minutes, assuming the descriptions are static files and not slow, automatically generated on-the-fly descriptions. The only centralised list that I know of is ckan.org, but even it would be no match for individual publishing at the SPARQL endpoint level.

On a slight side note, the registry functions provided by other distributed data providers, specifically BioMart, could easily be extended to provide both RDF (as they provide the basic elements with their SPARQL/XML-only query endpoints) and Service Descriptions for their endpoints, as the basic mechanism is already there. However, the maintenance schedules and updates will always rely on long-term static data hosting funding being available.

Cheers,

Peter Ansell
Received on Wednesday, 3 August 2011 11:02:36 UTC