Re: PROV-ISSUE-1 (define-resource): Definition for concept 'Resource' [Provenance Terminology] from Luc Moreau on 2011-05-25 (public-prov-wg@w3.org from May 2011)

From: Luc Moreau <L.Moreau@ecs.soton.ac.uk>
Date: Wed, 25 May 2011 16:13:12 +0100
To: Graham Klyne <GK@ninebynine.org>
CC: public-prov-wg@w3.org
Message-ID: <EMEW3|f93ca2aa002ffb1b3b0398b40dc59661n4OGDM08L.Moreau|ecs.soton.ac.uk|4DDD1C88>
On 05/25/2011 02:54 PM, Graham Klyne wrote:
> Luc Moreau wrote:
>> Nothing in the example is restricted to rdf or triple stores.
>> It also applies to a table in a relational database (and its xml 
>> serialization),
>> or an excel spreadsheet (and a csv representation).
>
> Luc,
>
> You're right.  When I made my previous comments, I was referring to 
> your illustration inspired by the example (repeated at the end of this 
> email).
>
> But I did go back and re-check the example at 
> http://www.w3.org/2011/prov/wiki/ProvenanceExample and in light of our 
> discussion I think I see a problem there (tangentially related to our 
> discussion of "containers").
>
> I repeat here the first steps of the example for ease of reference:
> [[
> government (gov) converts data (d1) to RDF (f1) at time (t1)
> government (gov) generates provenance information (prov) regarding RDF 
> (f1)
> government (gov) publishes RDF data (f1) along with its provenance 
> (prov) on a portal with a license (li1); the rdf data is now available 
> as a Web resource (r1)
> analyst (alice) downloads a turtle serialization (lcp1) of the 
> resource (r1) from government portal
> ]]
>
> Based on your comments, I think that "f1" is intended to be a local 
> (non-published) copy of the RDF data.  As such, I'm not sure it makes 
> sense to generate and subsequently publish provenance "prov" about 
> "f1", because when "f1" is copied to a publication location and made 
> available as "r1", "prov" is still about the unpublished "f1".  The 
> process of publication is part of the provenance of "r1", which is 
> absent from the provenance of "f1".
>
> And while this may seem like a discourse about the cardinality of a 
> set of angels dancing on a pinhead, I think there are some potentially 
> serious implications:
>
> Suppose there are two routes to publication that can be employed by 
> (gov) - e.g. two different employees who might handle the publication 
> process.  And suppose one uses a PC and the other uses a Mac computer 
> to perform the publication process.  Under certain circumstances, the 
> line endings of text files processed may be handled differently by 
> these different systems, possibly resulting in different published 
> content (r1).  Here the outcome is likely benign.  But suppose that it 
> is later discovered that the PC contains Malware that randomly 
> corrupts data that is being processed.  Now it can become important to 
> know what systems were used to perform the publication, as that 
> affects the reliability of the published result.  Surely, this MUST be 
> reflected in a complete provenance record, for any useful definition 
> of "complete"?
>
> The point is that (prov) calculated from (f1) is NOT the provenance of 
> (r1), but as stated the example publishes (prov) as if it IS the 
> provenance of (r1).
>

You're right. The example needs to be fixed.
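
To make the distinction concrete, here is a minimal sketch of the point
above. It is only an illustration: the Step and Entity classes, the
derive() helper and the "system" field are hypothetical names invented
for this sketch, not part of the wiki example or of any agreed model.
It shows that a provenance record captured for (f1) does not mention
the publication step, so it cannot answer which system produced (r1).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Step:
    """One recorded process step (hypothetical structure)."""
    agent: str
    action: str
    system: str = "unknown"


@dataclass
class Entity:
    """A thing that carries its own provenance chain (hypothetical)."""
    name: str
    provenance: list = field(default_factory=list)


def derive(source: Entity, name: str, step: Step) -> Entity:
    """Return a new entity whose provenance extends the source's."""
    return Entity(name, source.provenance + [step])


def systems_used(steps) -> set:
    """Collect the systems named in a list of steps."""
    return {s.system for s in steps}


d1 = Entity("d1")
f1 = derive(d1, "f1", Step("gov", "convert-to-rdf"))

# "prov" as generated in the example: a record about f1 only.
prov_f1 = list(f1.provenance)

# Two possible publication routes, as in the PC/Mac scenario.
r1_pc = derive(f1, "r1", Step("gov", "publish", system="pc"))
r1_mac = derive(f1, "r1", Step("gov", "publish", system="mac"))

# The record about f1 is silent on the publishing system;
# only the provenance of r1 itself records it.
print("pc" in systems_used(prov_f1))           # False
print("pc" in systems_used(r1_pc.provenance))  # True
```

Under these assumptions, publishing prov_f1 as if it were the
provenance of r1 loses exactly the information needed in the malware
scenario.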



I think that the step "gov publishes prov" is the remit of the 
provenance access/query task force. It will have to decide how to do that.


Luc

> I have a hunch that once we get this bit right, handling of dynamic 
> resources may not need to appeal to the notion of a "container".
>

> (FWIW, where you appealed to l-values and r-values, I would look 
> towards a functional programming model where there are just values to 
> consider, and where each such value has a provenance.  But such values 
> are not simply extensionally defined, but must in some sense take 
> account of the context in which they occur - as the above example 
> about (f1) and (r1) - as well as their specific content.  I can 
> imagine that it is this notion of context which you see the container 
> supplying.  But I think that to do so conflates the notions of context 
> and dynamic update.)
>
> #g
> -- 
>
>
>>>> Illustration inspired by the example.
>>>>
>>>> - government (gov) converts data (d1) to RDF file (f1) at time (t1) 
>>>> using an XSLT transform
>>>> - government (gov) uploads RDF data (f1) into a triple store, 
>>>> exposed as  Web resource (r1)
>>>> - analyst (alice) downloads a turtle serialization (lcp1) of the 
>>>> resource (r1) from government portal
>>>>
>>>> Illustrations:
>>>> - r1: is a resource: it's the triple store, it's a container, its 
>>>> content can vary over time
>>>> - lcp1: is an r-text (turtle serialization) of a given snapshot 
>>>> (created by, or available at the time of, download)
>>>> - f1 is a local file: it can be seen as a stateless anonymous 
>>>> resource, with a single r-text.
>>>>
>>>> If in addition:
>>>> - analyst (alice) downloads a rdf/xml serialization (lcp2) of the 
>>>> resource (r1)
>>>>
>>>> If the content of r1 has not changed, then lcp2 and lcp1 are both 
>>>> r-texts of the same r-snapshot.
>>>>
>>>> Note that this is not limited to RDF (as Graham mentioned)
>>>>
>>>> - newspaper (news), uses a CMS to publish the incidence map (map1), 
>>>> chart (c1) and
>>>>   the image (img1) within a document (art1) written by (joe) using
>>>>   license (li2)
>>>> - newspaper (news), updates art1, adding a correction following a 
>>>> complaint from a reader
>>>>
>>>> Illustrations:
>>>> - art1 is also a resource, with two r-snapshots (before and after 
>>>> correction)
>>>> - with content negotiation, an http client can download  html and 
>>>> xhtml representations (i.e., r-texts) of the article
>
>

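The resource / r-snapshot / r-text illustration quoted above can be
sketched roughly as follows. This is a toy model under stated
assumptions: the class names, the download() helper and the
serialize() method are hypothetical, and the serialization is a
placeholder string, not real Turtle or RDF/XML.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An r-snapshot: the immutable state of a resource at one time."""
    triples: frozenset

    def serialize(self, fmt: str) -> str:
        # Placeholder serialization; NOT real Turtle or RDF/XML.
        body = "\n".join(sorted(" ".join(t) for t in self.triples))
        return f"# format: {fmt}\n{body}"


@dataclass
class Resource:
    """A resource seen as a container whose content varies over time."""
    name: str
    snapshots: list = field(default_factory=list)

    def update(self, triples) -> None:
        self.snapshots.append(Snapshot(frozenset(triples)))

    def current(self) -> Snapshot:
        return self.snapshots[-1]


@dataclass(frozen=True)
class RText:
    """An r-text: one serialization of one r-snapshot."""
    snapshot: Snapshot
    fmt: str
    text: str


def download(resource: Resource, fmt: str) -> RText:
    snap = resource.current()
    return RText(snap, fmt, snap.serialize(fmt))


r1 = Resource("r1")
r1.update({("ex:d1", "ex:value", "42")})

lcp1 = download(r1, "turtle")
lcp2 = download(r1, "rdf/xml")

# Same r-snapshot, two different r-texts.
print(lcp1.snapshot is lcp2.snapshot)  # True
print(lcp1.text == lcp2.text)          # False

# After an update, a new download refers to a new r-snapshot.
r1.update({("ex:d1", "ex:value", "43")})
lcp3 = download(r1, "turtle")
print(lcp3.snapshot is lcp1.snapshot)  # False
```

In this reading, art1 before and after the correction would be one
Resource with two entries in its snapshots list.
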
-- 
Professor Luc Moreau
Electronics and Computer Science   tel:   +44 23 8059 4487
University of Southampton          fax:   +44 23 8059 2865
Southampton SO17 1BJ               email: l.moreau@ecs.soton.ac.uk
United Kingdom                     http://www.ecs.soton.ac.uk/~lavm
Received on Wednesday, 25 May 2011 15:13:52 UTC