On provenance access and web architecture

I've been thinking a bit about our discussions on using the provenance XG 
final report, specifically section 6 
(http://www.w3.org/2005/Incubator/prov/XGR-prov-20101214/#Provenance_in_Web_Architecture). 


In this message, I hope to stand back a little and sketch an initial approach to 
how provenance access can be addressed within the web architecture.  I don't 
claim to offer a complete solution, but it is one that I suspect will be 
sufficient in a large number of cases.  Rather than starting from the provenance 
XG final report, I start from this working group's charter and the published 
architecture of the world wide web.  This is not necessarily incompatible with 
the direction suggested by the provenance XG final report, but I think the emphasis 
and perspective may be rather different.

...

The charter for this WG says:

"... specifies how provenance can be accessed or queried in embedded documents 
and from remote services. Specifically, it defines how to access provenance 
embedded in an HTML document using RDFa, how to access provenance from a service 
by means of HTTP, and how to query provenance through a SPARQL endpoint."
-- http://www.w3.org/2011/01/prov-wg-charter

To my mind, the starting point for accessing provenance information on the web 
should be simple:  just use HTTP.  The remaining issues, then, are (a) how to 
know that provenance information is available, and (b) what URI to use to 
retrieve it.  And POWDER seems to address these concerns.  The charter suggests 
some variations/extensions of this idea, but I'd like to focus first on the 
simple case.

I take http://www.w3.org/TR/webarch/ (AWWW) as my starting point for a 
description of web architecture.  Right at the start (section 2), this document 
addresses identification, and asserts "To benefit from and increase the value of 
the World Wide Web, agents should provide URIs as identifiers for resources."

...

So I think that one of the first questions to ask concerning how provenance 
access works within the web architecture is:

"What resources do we recognize and identify with URIs?"

My answer would start with:
(1) resources about which we wish to assert provenance information
(2) resources that are (contain?) provenance information about other resources 
(to be useful, we would generally assume these are dereferenceable on the web).

The charter also suggests "provenance ... in embedded documents ... specifically 
... RDFa".  I think this needs clarification, but suggests:
(3) resources that contain both textual information and provenance information

Thinking about embedded provenance also suggests:
(4) resources that contain a resource (state) representation *and* provenance 
about that resource

The discussion that follows makes no assumptions about the data format used 
for resource or provenance data (cf. AWWW section 5.1).

...

Next question: "How can provenance information be accessed?"

Having identified resources and URIs, I think the initial mechanism for 
retrieving information given the corresponding URI is simple:  just use web 
retrieval mechanisms.   This (and more) is discussed in AWWW section 3.

Specifically, given the URI of a provenance resource, a simple mechanism is to 
use HTTP to retrieve a representation of that provenance.
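
As a rough illustration of the simple case, here is a minimal Python sketch of 
retrieving a provenance resource with a plain HTTP GET.  The function names, the 
example URI and the Accept value are my own assumptions for illustration; nothing 
here prescribes a provenance data format.

```python
import urllib.request


def build_provenance_request(provenance_uri, accept="*/*"):
    """Build a plain HTTP GET request for a provenance resource.

    The Accept value is an assumption: no particular provenance
    format is implied, so the default accepts any representation.
    """
    return urllib.request.Request(
        provenance_uri,
        headers={"Accept": accept},
        method="GET",
    )


def fetch_provenance(provenance_uri, accept="*/*", timeout=10):
    """Dereference the provenance URI and return (content type, body bytes)."""
    request = build_provenance_request(provenance_uri, accept)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type"), response.read()
```

A client would call fetch_provenance() with the provenance URI it has been 
given, and use the returned content type to decide how to interpret the body.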

...

The next question I then see is:  "given some resource URI, how do I discover if 
there is provenance information associated with the resource, and what URI can I 
use to retrieve that provenance information?"

Looking for existing, established web protocols, we can see that POWDER 
(http://www.w3.org/TR/powder-dr/#assoc-linking) proposes a number of possible 
mechanisms.  Many of these mechanisms are format-dependent, so may not be 
applicable in all cases, but the HTTP Link header could be used for any 
resource for which we have a dereferenceable URI.  Registering a link-rel type 
for provenance would provide a way to signal the availability and URI of an 
associated provenance resource.
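
To make the discovery step concrete, here is a minimal Python sketch of how a 
client might pick a provenance URI out of an HTTP Link header value.  The 
relation name "provenance" is hypothetical (no such link-rel type is registered 
as far as this message is concerned), the function name and example URIs are 
mine, and the parsing is a simplified approximation that only covers the common 
forms of the header.

```python
import re
from urllib.parse import urljoin

# Hypothetical relation name; this message only proposes registering one.
PROVENANCE_REL = "provenance"

# One link-value: <target> followed by zero or more ";name=value" parameters.
_LINK = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w*-]+(?:\s*=\s*(?:"[^"]*"|[^\s";,]+))?)*)'
)
# One parameter: name, optionally followed by a quoted or unquoted value.
_PARAM = re.compile(r';\s*([\w*-]+)\s*(?:=\s*(?:"([^"]*)"|([^\s";,]+)))?')


def provenance_uris(link_header, base_uri):
    """Return absolute URIs of links whose rel includes PROVENANCE_REL.

    link_header is the value of an HTTP Link header; base_uri is the URI
    of the resource that was requested, used to resolve relative targets.
    """
    found = []
    for link in _LINK.finditer(link_header):
        target, params = link.group(1), link.group(2)
        for param in _PARAM.finditer(params):
            name = param.group(1).lower()
            quoted, bare = param.group(2), param.group(3)
            value = quoted if quoted is not None else (bare or "")
            # rel may carry several space-separated relation types.
            if name == "rel" and PROVENANCE_REL in value.lower().split():
                found.append(urljoin(base_uri, target))
                break
    return found
```

A client could issue a HEAD or GET request for the original resource, pass any 
Link header value in the response to provenance_uris(), and then retrieve each 
returned URI using ordinary HTTP.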

(Another solution based on existing specifications might use WebDAV, but this is 
a less obvious fit, and requires a greater degree of server- and client-side 
support to deploy.  There may be other existing standards that could be used: 
ideas and suggestions are welcome.)

...

The above discussion suggests a minimal set of mechanisms for provenance 
discovery and retrieval that are firmly rooted in Web architecture and existing 
standards. It is easy to imagine further situations for which these are 
insufficient, but to my mind they represent a (hopefully) non-controversial 
starting point. I think anything beyond this needs to be in response to a 
clearly articulated problem statement that cannot be adequately addressed using 
these basic mechanisms.

#g
--

Received on Thursday, 19 May 2011 16:45:48 UTC