Re: PROV-ISSUE-1 (define-resource): Definition for concept 'Resource' [Provenance Terminology] from Graham Klyne on 2011-05-25 (public-prov-wg@w3.org from May 2011)

From: Graham Klyne <GK@ninebynine.org>
Date: Wed, 25 May 2011 14:54:51 +0100
To: Luc Moreau <L.Moreau@ecs.soton.ac.uk>
CC: public-prov-wg@w3.org
Message-ID: <4DDD0A2B.60003@ninebynine.org>

Luc Moreau wrote:
> Nothing in the example is restricted to rdf or triple stores.
> It also applies to a table in a relational database (and its xml 
> serialization),
> or an excel spreadsheet (and a csv representation).

Luc,

You're right.  When I made my previous comments, I was referring to your 
illustration inspired by the example (repeated at the end of this email).

But I did go back and re-check the example at 
http://www.w3.org/2011/prov/wiki/ProvenanceExample and in light of our 
discussion I think I see a problem there (tangentially related to our discussion 
of "containers").

I repeat here the first steps of the example for ease of reference:
[[
government (gov) converts data (d1) to RDF (f1) at time (t1)
government (gov) generates provenance information (prov) regarding RDF (f1)
government (gov) publishes RDF data (f1) along with its provenance (prov) on a 
portal with a license (li1); the rdf data is now available as a Web resource (r1)
analyst (alice) downloads a turtle serialization (lcp1) of the resource (r1) 
from government portal
]]

Based on your comments, I think that "f1" is intended to be a local 
(non-published) copy of the RDF data.  As such, I'm not sure it makes sense to 
generate and subsequently publish provenance "prov" about "f1", because when 
"f1" is copied to a publication location and made available as "r1", "prov" is 
still about the unpublished "f1".  The process of publication is part of the 
provenance of "r1", but it is absent from the provenance of "f1".

And while this may seem like a discourse about the cardinality of a set of 
angels dancing on a pinhead, I think there are some potentially serious 
implications:

Suppose there are two routes to publication that can be employed by (gov) - e.g. 
two different employees who might handle the publication process.  And suppose 
one uses a PC and the other uses a Mac computer to perform the publication 
process.  Under certain circumstances, the line endings of text files processed 
may be handled differently by these different systems, possibly resulting in 
different published content (r1).  Here the outcome is likely benign.  But 
suppose that it is later discovered that the PC contains malware that randomly 
corrupts data that is being processed.  Now it can become important to know what 
systems were used to perform the publication, as that affects the reliability of 
the published result.  Surely, this MUST be reflected in a complete provenance 
record, for any useful definition of "complete"?
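
A minimal sketch of the line-ending scenario, assuming a hypothetical
publish() step that rewrites the file with the publishing system's line
ending.  The function name, the sample triples and the use of SHA-256 are
illustrative assumptions, not part of the example on the wiki:

```python
import hashlib


def publish(text: str, newline: str) -> bytes:
    """Hypothetical publication step: re-emit text with the given line ending."""
    lines = text.splitlines()
    return (newline.join(lines) + newline).encode("utf-8")


def digest(data: bytes) -> str:
    """Content fingerprint, used here only to compare published results."""
    return hashlib.sha256(data).hexdigest()


# f1: the local, unpublished RDF data (sample content, assumed).
f1 = "<a> <b> <c> .\n<d> <e> <f> .\n"

# Two routes to publication that differ only in the system used.
r1_via_unix_style = publish(f1, "\n")
r1_via_windows_style = publish(f1, "\r\n")

# Same source, different published bytes.
print(digest(r1_via_unix_style) == digest(r1_via_windows_style))
```

The two published results differ byte for byte even though both derive from
the same (f1), which is why the route to publication belongs in the record.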

The point is that (prov) calculated from (f1) is NOT the provenance of (r1), but 
as stated the example publishes (prov) as if it IS the provenance of (r1).
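
One way to picture the distinction, as a sketch only: the Step and Provenance
types and the publish() helper below are hypothetical names invented for
illustration, not a proposed model.  The provenance of (r1) is derived from
that of (f1) by appending the publication step, including the system used:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Step:
    agent: str
    action: str
    system: Optional[str] = None  # e.g. the machine that performed the step


@dataclass(frozen=True)
class Provenance:
    about: str                # the thing this record describes
    steps: Tuple[Step, ...]   # ordered history of that thing


def publish(source: Provenance, target: str, agent: str, system: str) -> Provenance:
    """Return a new record about `target`; the source record is left unchanged."""
    step = Step(agent=agent, action=f"publish {source.about} as {target}", system=system)
    return Provenance(about=target, steps=source.steps + (step,))


# (prov) as generated in the example: it describes f1 only.
prov_f1 = Provenance(about="f1", steps=(Step("gov", "convert d1 to RDF"),))

# The provenance of r1 additionally records how publication happened.
prov_r1 = publish(prov_f1, target="r1", agent="gov", system="PC")

print(prov_f1.about, len(prov_f1.steps))
print(prov_r1.about, len(prov_r1.steps), prov_r1.steps[-1].system)
```

Under these assumptions, serving prov_f1 alongside (r1) would omit exactly
the step that a later malware investigation would need.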

I have a hunch that once we get this bit right, handling of dynamic resources 
may not need to appeal to the notion of a "container".

(FWIW, where you appealed to l-values and r-values, I would look towards a 
functional programming model where there are just values to consider, and where 
each such value has a provenance.  Such values are not simply extensionally 
defined, however: they must in some sense take account of the context in which 
they occur, as in the above example about (f1) and (r1), as well as their 
specific content.  I can imagine that it is this notion of context which you 
see the container supplying.  But I think that to do so conflates the notions 
of context and dynamic update.)
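
A rough sketch of one possible reading of that values-only view.  The Value
type, its context labels and the same_content() helper are assumptions made
for illustration.  Identity depends on content plus context, and an update
yields a new value instead of mutating a container:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Value:
    content: bytes
    context: str  # hypothetical label for where the value occurs


def same_content(a: Value, b: Value) -> bool:
    """Purely extensional comparison: looks at content only."""
    return a.content == b.content


f1 = Value(content=b"<a> <b> <c> .\n", context="gov local file")
r1 = Value(content=b"<a> <b> <c> .\n", context="gov portal")

# Identical content, yet distinct values because the context differs.
print(same_content(f1, r1), f1 == r1)

# Dynamic update modelled as a new value in the same context;
# the earlier value r1 is untouched.
r1_updated = replace(r1, content=b"<a> <b> <c> .\n<d> <e> <f> .\n")
print(r1_updated.context == r1.context, r1_updated == r1)
```

Here context and update are separate concerns: context is a field of every
value, while update is simply the production of another value.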

#g
--


>>> Illustration inspired by the example.
>>>
>>> - government (gov) converts data (d1) to RDF file (f1) at time (t1) 
>>> using an XSLT transform
>>> - government (gov) uploads RDF data (f1) into a triple store, exposed 
>>> as  Web resource (r1)
>>> - analyst (alice) downloads a turtle serialization (lcp1) of the 
>>> resource (r1) from government portal
>>>
>>> Illustrations:
>>> - r1: is a resource: it's the triple store, it's a container, its 
>>> content can vary over time
>>> - lcp1: is a r-text (turtle serialization) of a given snapshot 
>>> (created by, or available at the time of, download)
>>> - f1 is a local file: it can be seen as a stateless anonymous 
>>> resource, with a single r-text.
>>>
>>> If in addition:
>>> - analyst (alice) downloads a rdf/xml serialization (lcp2) of the 
>>> resource (r1)
>>>
>>> If the content of r1 has not changed, then lcp2 and lcp1 are both 
>>> r-texts of the same r-snapshot.
>>>
>>> Note that this is not limited to RDF (as Graham mentioned)
>>>
>>> - newspaper (news), uses a CMS to publish the incidence map (map1), 
>>> chart (c1) and
>>>   the image (img1) within a document (art1) written by (joe) using
>>>   license (li2)
>>> - newspaper (news), updates art1, adding a correction following a 
>>> complaint from a reader
>>>
>>> Illustrations:
>>> - art1 is also a resource, with two r-snapshots (before and after 
>>> correction)
>>> - with language negotiation, an http client can download  html and 
>>> xhtml representations (i.e., r-texts) of the article
Received on Wednesday, 25 May 2011 14:52:19 UTC