Scripts and refresh rates for LODD data + key provenance issues for federation

I have two simple questions that tie in to the discussion of federation below:

What are the refresh rates for DrugBank and LinkedCT?
Also, where can I find the scripts used to create the current RDF
renderings for DrugBank and LinkedCT?

Anja, Oktie: Can you help me with the above questions?
I couldn't find the answers by looking at http://esw.w3.org/HCLSIG/LODD/Data .

------------------------------------------------------------------

[CC'ing a sampling of people interested in federation and provenance.]

The Linked Open Drug Data task force has made important biomedical
contributions to the Linked Open Data cloud. Many of us, including
those in the BioRDF task force, have run into recurring questions
while creating applications that access these data, both the SPARQL
endpoints and the RDF graphs behind them. Considering only the most
basic provenance (a query sketch follows this list):

* Is it an ontology, or data that has been mapped to RDF (possibly a
populated ontology)?
* When was the data last refreshed?
* Where is the original source data?
* What method(s) were used to produce it?
* Where is the software located (if a script or code was used)?
* Who made this resource (and the script)?
* Is it expressed in OWL, SKOS, or some other RDF vocabulary?
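
To make those questions concrete, here is a rough sketch (an
illustration, not a proposal) of how they could be asked of a dataset
description in a single SPARQL query, assuming the publisher has used
a mix of existing vocabularies such as Dublin Core terms and VoID.
The graph URI and the particular properties are placeholders:

  PREFIX dcterms: <http://purl.org/dc/terms/>
  PREFIX void:    <http://rdfs.org/ns/void#>

  # Probe one (hypothetical) graph for the basic provenance listed above.
  SELECT ?dataset ?modified ?source ?creator ?dump
  WHERE {
    GRAPH <http://example.org/graphs/drugbank> {
      ?dataset a void:Dataset .
      OPTIONAL { ?dataset dcterms:modified ?modified }  # when was it last refreshed?
      OPTIONAL { ?dataset dcterms:source   ?source }    # where is the original source data?
      OPTIONAL { ?dataset dcterms:creator  ?creator }   # who made this resource (and the script)?
      OPTIONAL { ?dataset void:dataDump    ?dump }      # where does the RDF rendering itself live?
    }
  }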

In a Semantic Web (a federation of resources), we will eventually have
to choose among many sources, depending on our needs. I know, for
example, of several different versions of DrugBank in RDF being
offered from multiple locations, as well as SNOMED in both SKOS
and OWL.

Ultimately, we would like to automate the selection of sources to
federate, so that it can occur dynamically, based on a selection
policy expressed in a Semantic Web language. However, the advantages
of dynamic federation will be lost if we must consult wiki pages or
people in order to select data sources and formulate our queries. So,
it is crucial to harmonize approaches to expressing such provenance in
RDF and to make that information available in RDF whenever linked
data is published. Making such breadcrumbs available in RDF will
enable people to carry out the whole process without having to leave
SPARQL.
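
For instance, a very simple selection policy ('among the DrugBank
renderings on offer, take the one most recently refreshed that exposes
a SPARQL endpoint') might boil down to a query like the one below.
Again, the source URI and the property choices are only assumptions
about how the provenance might be published:

  PREFIX dcterms: <http://purl.org/dc/terms/>
  PREFIX void:    <http://rdfs.org/ns/void#>
  PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>

  # Pick the freshest DrugBank rendering that we can actually query remotely.
  SELECT ?dataset ?endpoint ?modified
  WHERE {
    ?dataset a void:Dataset ;
             dcterms:source      <http://www.drugbank.ca/> ;  # assumed source URI
             void:sparqlEndpoint ?endpoint ;
             dcterms:modified    ?modified .
    FILTER ( ?modified >= "2010-01-01"^^xsd:date )            # freshness criterion
  }
  ORDER BY DESC(?modified)
  LIMIT 1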

Hopefully, what I've written above doesn't seem controversial. Where
we need to build consensus is on how best to represent such information
so that the many federations and linked data clouds now being built
can interoperate (i.e. so that we can eventually query them to get
basic information about their contents and origins). We also need
consensus on where such information should reside.

I like the idea of giving each repository/graph/context in a
triplestore its own URI (as done for NCBO's SPARQL endpoint) so that
it can include its own 'metadata' or provenance in the graph itself.
David Wood, CC'd, suggested that this is a popular approach. You could
eventually aggregate such information from all graphs behind an
endpoint (e.g. with a 'crawler') and provide it from an information
service (i.e. a type of SPARQL endpoint provenance index *graph*).
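
As a rough illustration: if each graph followed the convention of
describing itself using its own URI as the subject (an assumption on
my part, not something the LODD sources do today), the 'crawler' could
harvest the per-graph provenance from an endpoint with a query like
this and republish the results as the index graph:

  # Harvest self-describing metadata from every named graph behind one
  # endpoint; in practice you would probably restrict ?property to an
  # agreed-upon set of provenance terms.
  SELECT ?g ?property ?value
  WHERE {
    GRAPH ?g {
      ?g ?property ?value .
    }
  }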

I understand that the SPARQL 1.1 WG is considering methods of
describing what is behind a SPARQL endpoint, for incorporation into
SPARQL.

So, in an ideal scenario, finding a particular graph, starting from
the SPARQL endpoint, might go something like this (ROUGH OUTLINE; a
sketch of the corresponding queries follows the outline):

1) Query the SPARQL 1.1 endpoint to find out where its 'provenance
index' is located. [based on whatever the SPARQL WG comes up with]

2) Query the 'provenance index' to find graphs that meet your
provenance criteria.

3) Possibly query a selected graph for more provenance information
that wasn't included in the index.

4) Query the selected graph.
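
Sketched as SPARQL, the outline might look something like the four
separate queries below (prefixes shown only once for brevity). Every
ex: term and every example.org URI is invented for illustration, since
step 1 in particular depends on whatever the SPARQL WG specifies:

  PREFIX ex:      <http://example.org/vocab#>   # hypothetical vocabulary
  PREFIX dcterms: <http://purl.org/dc/terms/>
  PREFIX void:    <http://rdfs.org/ns/void#>
  PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>

  # 1) Ask the endpoint's own description where its 'provenance index' lives.
  SELECT ?index WHERE { ?service ex:provenanceIndex ?index }

  # 2) Against the provenance index: find graphs that meet our criteria,
  #    e.g. renderings of SNOMED that use SKOS rather than OWL.
  SELECT ?g WHERE {
    ?g dcterms:source ex:SNOMED ;   # placeholder for the original source
       void:vocabulary skos: .
  }

  # 3) Possibly ask a candidate graph for provenance the index left out.
  SELECT ?p ?o WHERE {
    GRAPH <http://example.org/graphs/snomed-skos> {
      <http://example.org/graphs/snomed-skos> ?p ?o .
    }
  }

  # 4) Finally, query the selected graph for the data we actually wanted.
  SELECT ?concept ?label WHERE {
    GRAPH <http://example.org/graphs/snomed-skos> {
      ?concept skos:prefLabel ?label .
    }
  }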

-Scott

Received on Thursday, 7 October 2010 13:18:07 UTC