Re: A proposed provenance wg draft charter from James Cheney on 2010-10-25 (public-xg-prov@w3.org from October 2010)

From: James Cheney <jcheney@inf.ed.ac.uk>
Date: Mon, 25 Oct 2010 17:03:59 +0100
To: Olaf Hartig <hartig@informatik.hu-berlin.de>
Cc: public-xg-prov@w3.org
Message-Id: <E9E4BAEE-6C3A-4B07-AC7D-E1C049A0AB48@inf.ed.ac.uk>

On Oct 25, 2010, at 2:53 PM, Olaf Hartig wrote:

> On Monday 25 October 2010 13:32:10 Paolo Missier wrote:
>> Hi,
>>  a couple of further comments on this thread:
>>
>> On 25/10/2010 07:41, Olaf Hartig wrote:
>>> Hey,
>>>
>>> On Sunday 24 October 2010 15:50:28 Paul Groth wrote:
>>>> Hi Olaf,
>>>>
>>>> Thanks for the comments. Really good. Some replies in-line
>>>> [...]
>>>> * You speak about "provenance of any web-resource". I still struggle to
>>>> see how Web resources, in general, have provenance. To me provenance is
>>>> associated primarily with specific representations of Web resources that
>>>> we retrieve from the Web.
>>
>> why wouldn't resources have provenance?
>
> The problem is that a Web resource may change; it may have a different state
> at a different point in time. What would the provenance of such a changing
> thing be?
> A specific representation of a Web resource cannot change. That's why I find
> it much easier to talk about the provenance of such representations rather
> than the Web resource itself.
> That's probably also why artifacts in OPM are immutable pieces of state.


Hi Olaf and others,

This seems like an important point.  Some of the work on provenance  
for data/databases (e.g. by me and Irini and others) is really about  
recording the relationships between past versions (which I'd call  
"dynamic" provenance), not just about describing process-step  
derivation relationships between immutable artifacts ("static"  
provenance).

By analogy, real-world artifacts (e.g. the Mona Lisa) can have  
provenance (ownership, modification or preservation history) even  
though they change - in fact, you can't stop physical artifacts from  
changing (think radioactive decay), and the fact that such artifacts  
can change over time in inessential ways while retaining "identity" is  
part of what makes provenance information so important for  
establishing authenticity.  Knowing it's the same canvas painted by da  
Vinci, and not a well-executed copy, is part of what makes it  
valuable: we can learn things about da Vinci that we can't learn from  
a copy.

The situation is complicated further by the fact that digital  
artifacts can be copied "exactly".  Thus, there may be many minor  
variants of a data item floating around the Web, each having been  
derived from an original source by a complicated, and currently  
invisible, process.  So the analogy with physical objects breaks down  
a bit.  But many Web resources (such as databases) have enough of the  
attributes of physical stateful things that the analogy can still make  
some sense.

I can imagine wanting to know the (dynamic) provenance, or history, of  
a record in a database as part of understanding the static provenance  
of a result obtained from the database at a given moment in time.  In  
particular, a long-running process might have accessed different  
versions of a database that was updated during the run, leading to a  
result that uses inconsistent data from the different versions.

I view the WG proposal as encouraging focus and standardization on the
static case, where there are several mature and broadly similar
proposals such as OPM, PML, Provenir, and others.  There is currently
no broad consensus for representing fine-grained, dynamic
provenance/version information AFAIK (or for propagating provenance
through database queries and updates), nor are there mature systems that
do this.  This still seems to me like a research issue that would be
premature to try to standardize, and this discussion thread suggests
there may still be disagreement or confusion about basic concepts.

So one suggestion I was going to make was that in addition to
recommending a WG to focus on standardizing a consensus exchange
format, we might propose an interest group or low-maintenance activity
to facilitate discussion and convergence on broader provenance issues,
such as dynamic provenance in databases or RDF stores, provenance
querying, etc., and to revisit the case for standards as these areas
mature (I'm not sure how one makes a case for this though).

This got a bit long-winded.  Thoughts?

--James

Received on Monday, 25 October 2010 16:05:08 UTC