Re: PROV-ISSUE-7 (define-derivation): Definition for Concept 'Derivation' [Provenance Terminology] from Simon Miles on 2011-06-05 (public-prov-wg@w3.org from June 2011)

From: Simon Miles <simon.miles@kcl.ac.uk>
Date: Sun, 5 Jun 2011 21:54:58 +0100
To: Provenance Working Group WG <public-prov-wg@w3.org>
Message-ID: <BANLkTikouZqerwRHexF67JX4UAaz9duWTw@mail.gmail.com>
I think "invariant" is good too.

I was unclear, regarding the proposal to focus on "values/things that
are immutable according to some perspective or viewpoint", whether it
is the latter "values" for which we determine provenance or state
derivation relationships, or whether the "values" are properties of
the entities which have provenance and there are other mutable (variant)
values?

If only for my own understanding, I tried looking across the different
threads on this list. Here's my interpretation of what has been
implied in terms of definitions (but I might well be misinterpreting).

An entity is something identifiable.
An account is a record of something that has occurred from a
particular perspective.
An invariant property of an entity is a property of that entity which
is invariant according to a particular perspective.
An abstraction of an entity is another entity with a subset of its
invariant properties, according to a particular perspective.
B derives from A if some of B's invariant properties are due to A's
invariant properties.
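
As a rough sketch of how I read these definitions (Python; every class
and function name here is my own illustrative assumption, not anything
agreed on the list, and "due to" is simply asserted as a set of
property pairs rather than inferred):

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Property:
    """A property of an entity, e.g. ("identifier", "GOVID-12345")."""
    name: str
    value: str


@dataclass
class Entity:
    """Something identifiable; invariants are keyed by account (perspective)."""
    ident: str
    invariants: dict[str, frozenset[Property]] = field(default_factory=dict)

    def invariant_in(self, account: str) -> frozenset[Property]:
        # Properties that do not vary from the perspective of this account.
        return self.invariants.get(account, frozenset())

    def all_invariants(self) -> frozenset[Property]:
        # Union of invariant properties over every account.
        return frozenset().union(*self.invariants.values())


def abstracts(a: Entity, b: Entity, account: str) -> bool:
    """True if a's invariants in `account` are a non-empty subset of b's."""
    pa = a.invariant_in(account)
    return a is not b and bool(pa) and pa <= b.invariant_in(account)


def derives_from(b: Entity, a: Entity,
                 due_to: set[tuple[Property, Property]]) -> bool:
    """True if some invariant of b is asserted to be due to an invariant of a."""
    pb, pa = b.all_invariants(), a.all_invariants()
    return any(x in pb and y in pa for x, y in due_to)


gov_id = Property("identifier", "GOVID-12345")
old_value = Property("row 2012", "£7,500")
new_value = Property("row 2012", "£9,000")

e1 = Entity("E1", {"A1": frozenset({gov_id})})
e2 = Entity("E2", {"A1": frozenset({gov_id}), "A2": frozenset({old_value})})
e3 = Entity("E3", {"A1": frozenset({gov_id}), "A2": frozenset({new_value})})

print(abstracts(e1, e3, "A1"))                          # True
print(derives_from(e3, e2, {(new_value, old_value)}))   # True
```

Note that a plain subset test within a single account treats two
entities with identical invariants as abstracting each other; whether
that is desirable is exactly the kind of thing I am unsure about.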

An example trying to capture all the above:

Entities:
 - E1: A government data set with UK government identifier GOVID-12345
 - E2: The data set with a data value for row 2012 being £7,500
 - E3: The corrected data set with the value for row 2012 being £9,000
 - E4: An Excel 2010 spreadsheet containing the corrected data set

Accounts:
 - A1: An account from a perspective in which any government data set
will always retain the same UK government identifier (a new identifier
means a new data set)
 - A2: An account from a perspective in which any change of value in a
data set means it is a new version of that data set
 - A3: An account from a perspective in which any changes to a file by
writing create a new data set, while any changes due to reading do not

Invariant properties:
 - P1: Identifier GOVID-12345 is invariant for E1, E2, E3, E4 with
respect to account A1
 - P2: All the data values (including £7,500 for 2012) are invariant
for E2 with respect to account A2
 - P3: All the data values (including £9,000 for 2012) are invariant
for E3, E4 with respect to account A2
 - P4: All bytes of the spreadsheet are invariant for E4 except those
changed on reading (e.g. Excel saves the current open worksheet,
cursor position etc. even without editing) with respect to account A3
 - P5: The data set (E1) having existed is invariant for E1, E2, E3,
E4 with respect to any account
 - P6: The first version of the data set (E2) having existed is
invariant for E2 with respect to any account
 - P7: The corrected version of the data set (E3) having existed is
invariant for E3, E4 with respect to any account
 - P8: The Excel data set (E4) having existed is invariant for E4 with
respect to any account

Abstractions:
 - E1 abstracts E2, E3, E4
 - E3 abstracts E4
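
The abstractions listed above can be recomputed mechanically from P1-P8.
This is only a sketch: it treats each P label as an opaque token,
ignores which account the invariance is relative to, and reads
"abstraction" as a proper subset of invariant properties. The names
INVARIANT_FOR, properties_of and abstractions are hypothetical.

```python
from itertools import permutations

# Which entities each invariant property (P1..P8) is stated to hold for.
INVARIANT_FOR = {
    "P1": {"E1", "E2", "E3", "E4"},
    "P2": {"E2"},
    "P3": {"E3", "E4"},
    "P4": {"E4"},
    "P5": {"E1", "E2", "E3", "E4"},
    "P6": {"E2"},
    "P7": {"E3", "E4"},
    "P8": {"E4"},
}


def properties_of(entity):
    """Labels of all invariant properties stated for the given entity."""
    return {p for p, entities in INVARIANT_FOR.items() if entity in entities}


def abstractions(entities):
    """Pairs (a, b) where a's invariants are a proper subset of b's."""
    return sorted(
        (a, b)
        for a, b in permutations(entities, 2)
        if properties_of(a) < properties_of(b)
    )


print(abstractions(["E1", "E2", "E3", "E4"]))
# [('E1', 'E2'), ('E1', 'E3'), ('E1', 'E4'), ('E3', 'E4')]
```

Under those assumptions the result matches the list above, and E2 is
not an abstraction of E3 because P2 and P6 do not hold for E3.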

Derivation:
 - E3 derives from E2 because, aside from the corrected value, all
other values are copied directly from it (P3 is partly due to P2)
 - E3 also derives from the correction made to the data set, changing
£7,500 to £9,000 (could be called E5, omitted above for brevity)

We could then say that the provenance of an entity is/includes a
record of how that entity came to have its invariant properties.

Provenance:
 - Provenance of E1 is how it came to be generated (P5) and came to
have its ID (P1)
 - Provenance of E2 is how it came to be generated (P5, P6), given its
ID (P1), and populated with the data it has (P2)
 - Provenance of E3 is how it came to be generated (P5, P7), given its
ID (P1), and populated with the data it has (P3)
 - Provenance of E4 is how it came to be generated (P5, P7, P8), given
its ID (P1), populated with the data it has (P3), and serialised to
its given bytes (P4)
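
A quick consistency check of that reading, again only a sketch with
hypothetical names (INVARIANT_FOR, PROVENANCE_CITES, mismatch): if
provenance is a record of how an entity came to have its invariant
properties, then the properties cited in each provenance statement
should be exactly the ones P1-P8 assign to that entity.

```python
# Which entities each invariant property (P1..P8) is stated to hold for.
INVARIANT_FOR = {
    "P1": {"E1", "E2", "E3", "E4"},
    "P2": {"E2"},
    "P3": {"E3", "E4"},
    "P4": {"E4"},
    "P5": {"E1", "E2", "E3", "E4"},
    "P6": {"E2"},
    "P7": {"E3", "E4"},
    "P8": {"E4"},
}

# Properties cited in each provenance statement above.
PROVENANCE_CITES = {
    "E1": {"P5", "P1"},
    "E2": {"P5", "P6", "P1", "P2"},
    "E3": {"P5", "P7", "P1", "P3"},
    "E4": {"P5", "P7", "P8", "P1", "P3", "P4"},
}


def mismatch(entity):
    """Properties held but not cited, or cited but not held, for an entity."""
    held = {p for p, entities in INVARIANT_FOR.items() if entity in entities}
    return held ^ PROVENANCE_CITES[entity]  # symmetric difference


for entity in sorted(PROVENANCE_CITES):
    print(entity, sorted(mismatch(entity)))
```

An empty list for every entity means the provenance statements cover
each entity's invariant properties and nothing else.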

It would be good to know if others are interpreting the consensus in
the same way!

Thanks,
Simon

On 3 June 2011 21:36, Luc Moreau <L.Moreau@ecs.soton.ac.uk> wrote:
> I think I am also comfortable with using the term "invariant", if it helps gain consensus.
>
>
>
> Professor Luc Moreau
> Electronics and Computer Science
> University of Southampton
> Southampton SO17 1BJ
> United Kingdom
>
> On 3 Jun 2011, at 15:06, "Graham Klyne" <GK@ninebynine.org> wrote:
>
>> Luc,
>> Jim,
>> Khalid,
>>
>> I'm responding to all of you at once.
>>
>> Short answer: what Luc says.
>>
>> I find myself preferring the term "invariant" to "immutable" for just this reason.
>>
>> ...
>>
>> Longer answer:  there's not a specific thing I want to capture through derivation of mutable resources.  I'm just concerned that insisting on immutability may prevent useful expression.
>>
>> I'll illustrate with an example from a completely different field.  For some years, I have been involved peripherally in definition and registration of URI schemes, and remain IANA's designated reviewer for new URI schemes.  Several years ago, there was much discussion about registering new URI schemes vs registering new URN namespaces [2] vs using http URIs for everything.  A specific example is the info: URI scheme [3].  I argued at the time that this could equally be served by a URN namespace.  But the original definition of URN requirements [4] made some apparently strong assertions about persistence and permanence of URNs which the community behind info felt were too constraining, so we ended up with an arguably unnecessary new URI scheme. Some further history at [5].
>>
>> Looking back, I now think the original language in [4] was over-interpreted, and many people didn't fully recognize that permanence of identity didn't constrain the identified thing itself possibly changing or going away. There was an expectation of immutability, not even explicitly stated, but also not dispelled.
>>
>> This is the kind of concern I have with insisting on immutability in subjects of provenance at the outset.
>>
>> [1] http://www.ietf.org/rfc/rfc2141.txt
>> [2] http://www.ietf.org/rfc/rfc2611.txt, http://tools.ietf.org/html/rfc3406
>> [3] http://www.ietf.org/rfc/rfc4452.txt
>> [4] http://tools.ietf.org/html/rfc1737
>> [5] http://www.w3.org/TR/uri-clarification/
>>
>> #g
>> --
>>
>> Luc Moreau wrote:
>>> Hi Jim, Graham, Klyne,
>>> Following yesterday's call, and seeing this thread, it seems that  "Immutable value" is too restrictive because too absolute.
>>> What about saying we focus on "/values/things that are immutable according to some perspective or viewpoint/"?
>>> It seems to offer the necessary trade-off and flexibility, with
>>> - a stable property required for provenance
>>> - change being allowed according to other viewpoints.
>>> Cheers,
>>> Luc
>>> On 06/03/2011 02:03 AM, Myers, Jim wrote:
>>>> What do you want to capture with derivation of mutable resources? Simply that one mutable resource can be used in a process and produce another different mutable resource? If so, I'd ask why we should consider this case any different than immutable? (Does the fact that most of what we want to call immutable resources are undergoing constant change (bits getting refresh charges, files moving about in memory caches, etc.) cause any issue with the basic OPM-style model?) I think all of these cases are handled just fine by OPM-style constructs and I'd argue further that the key concept about artifacts was not complete immutability with respect to any process we can think of but immutability with respect to the processes involved in the provenance. (Eggs used in cake baking do not come out as modified eggs (they become a new cake), but an egg in the fridge and the warmer egg waiting to be mixed are considered the same egg only because we don't want to discuss/report on the warming process that occurred. The fact that an egg has mutability in its temperature doesn't make it a bad artifact in OPM or cause trouble in reporting a baking process...)
>>>> The mutable case that presents a question is: should we provide a second mechanism to allow one to describe a process that changes the state of a mutable resource, i.e. to say that the egg with temperature cold is the same egg with temperature warm after a heating process? I suspect that we can't avoid this use case completely but we might not have to create a separate mechanism: if we allow a resource egg to be associated with cold-egg and warm-egg resources, we can use the OPM-like mechanism (cold-egg <-- heating <-- warm-egg) while adding that cold-egg and warm-egg are 'aspects of' the same mutable egg which 'participates' in a heating process. I think this is general and minimally disruptive. One could say that an egg participated in heating without creating other resources, but one could not directly describe the temperature of the egg before and after heating without creating the cold and warm egg artifacts.   I think this also covers what we want from agents and sources - we want to convey that they participate in a process and, while their state changes as they do so, we don't want to document their state changes. But as Simon says we may still want to treat them (e.g. the Royal Society) as resources and talk about their creation so it would be valuable if they could just be artifacts in the context of creation/founding type events. Today, we have agents and sources as different types than artifact so there is no way to talk about their founding, etc.
>>>> --  Jim
>>>>
>>>> ________________________________
>>>>
>>>> From: public-prov-wg-request@w3.org on behalf of Graham Klyne
>>>> Sent: Thu 6/2/2011 3:45 PM
>>>> To: Khalid Belhajjame
>>>> Cc: Luc Moreau; public-prov-wg@w3.org
>>>> Subject: Re: PROV-ISSUE-7 (define-derivation): Definition for Concept 'Derivation' [Provenance Terminology]
>>>>
>>>>
>>>>
>>>> Khalid Belhajjame wrote:
>>>>
>>>>> Hi Graham,
>>>>>
>>>>> > I agree that many of the examples of derivation we have raised relate
>>>>> > to resource states.  But if, as has been suggested by myself and others,
>>>>> > resource states are themselves resources (especially when named for the
>>>>> > purposes of expressing a derivation), then such derivations can equally
>>>>> > be regarded as relating resources.  I think that's more a difference of
>>>>> > terminology than fundamental.
>>>>>
>>>>> Would it be fair then to say that in that view resources are immutable
>>>>> resources?
>>>>>
>>>> In the case of resources representing a snapshot of state, yes.
>>>>
>>>>
>>>>> Which brings me to the question: do we want to express derivations
>>>>> between mutable resources, or is that just something that we should
>>>>> avoid at this point?
>>>>>
>>>> (I'm finishing this email after today's telecon, so it's a bit of a re-run.)
>>>>
>>>> I think that many of our use-cases are based on invariant values, and the
>>>> near-term goal is to find expression for these.  So we definitely do want to
>>>> express derivations between non-varying values.  But in so doing, it's not clear
>>>> to me (yet) that we need to exclude mutable resources, so I say let's keep our
>>>> options open and not close off any possibilities that we don't have to.
>>>>
>>>> So my answer to avoiding mutable resources is: "yes and no".
>>>>
>>>> #g
>>>> --
>>>>
>>>>
>>>>
>>>>> Thanks, khalid
>>>>>
>>>>>
>>>>>> Where I think I may diverge from what you say is that I would not
>>>>>> limit such expressions of derivation to resources that happen to be a
>>>>>> state (or snapshot of state) of some resource.  I think defining that
>>>>>> distinction in a hard-and-fast way, that also aligns with various
>>>>>> intuitions we may have about derivation, may prove difficult to
>>>>>> achieve (e.g. as I think is suggested by Jim Myers in
>>>>>> http://lists.w3.org/Archives/Public/public-prov-wg/2011Jun/0015.html
>>>>>> (*)).
>>>>>>
>>>>>> #g
>>>>>> --
>>>>>>
>>>>>> (*) I just love the W3C mailing list archives - so easy to find links
>>>>>> to messages, and thus capture provenance!
>>>>>>
>>>>>> Khalid Belhajjame wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> From the discussion so far on derivation it seems that most people
>>>>>>> tend to define derivation between resource states or resource state
>>>>>>> representations, but not for resources.
>>>>>>>
>>>>>>> My take on this is that in a context where a resource is mutable,
>>>>>>> derivations will mainly be used to associate resource states and
>>>>>>> resource state representations.
>>>>>>>
>>>>>>> That said, based on derivations connecting resource states and
>>>>>>> resource state representations, one can infer new derivations
>>>>>>> between resources. For example, consider the resource r_1 and the
>>>>>>> associated resource state r_1_s, and consider that r_1_s was used to
>>>>>>> construct a new resource state r_2_s, actually the first state, of
>>>>>>> the resource r_2. We can state that r_2_s is derived from r_1_s, i.e.,
>>>>>>> r_1_s -> r_2_s. We can also state that the resource r_2 is derived
>>>>>>> from the resource r_1, i.e., r_1 -> r_2.
>>>>>>>
>>>>>>> PS: I added a definition of derivation along these lines to the wiki:
>>>>>>> http://www.w3.org/2011/prov/wiki/ConceptDerivation
>>>>>>>
>>>>>>> Thanks, khalid
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 01/06/2011 07:49, Luc Moreau wrote:
>>>>>>>
>>>>>>>> Hi Graham,
>>>>>>>>
>>>>>>>> Isn't it that you used the duri scheme to name the two resource
>>>>>>>> states that exist in
>>>>>>>> this scenario?
>>>>>>>>
>>>>>>>> In your view of the web, is there a notion of stateful resource?
>>>>>>>> Does it apply here?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Luc
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 31/05/11 23:57, Graham Klyne wrote:
>>>>>>>>
>>>>>>>>> Luc Moreau wrote:
>>>>>>>>>
>>>>>>>>>> Graham,
>>>>>>>>>>
>>>>>>>>>> In my example, I really mean for the two versions of the chart to
>>>>>>>>>> be available at
>>>>>>>>>> the same URI. (So, definitely, an uncool URI!)
>>>>>>>>>>
>>>>>>>>>> In that case, there is a *single* resource, but it is stateful.
>>>>>>>>>> Hence, there
>>>>>>>>>> are two *resource states*, one generated using (stats2), and the
>>>>>>>>>> other using (stats3).
>>>>>>>>>>
>>>>>>>>> Luc,
>>>>>>>>>
>>>>>>>>> I had interpreted your scenario as using a common URI as you explain.
>>>>>>>>>
>>>>>>>>> But there are still several resources here, but they are not all
>>>>>>>>> exposed on the web or assigned URIs.  I'm appealing here to
>>>>>>>>> anything that *might* be identified as opposed to things that
>>>>>>>>> actually are assigned URIs.   (For example, the proposed duri:
>>>>>>>>> scheme might be used -
>>>>>>>>> http://tools.ietf.org/id/draft-masinter-dated-uri-07.html)
>>>>>>>>>
>>>>>>>>> (And the URI is perfectly "cool" if it is specifically intended to
>>>>>>>>> denote a dynamic resource.  A URI used to access the current
>>>>>>>>> weather in London can be stable if properly managed.)
>>>>>>>>>
>>>>>>>>> (I think this is all entirely consistent with my earlier stated
>>>>>>>>> positions.)
>>>>>>>>>
>>>>>>>>> #g
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Of course, if blogger had used cool uris, then, c2s2 and c2s3
>>>>>>>>>> would be different resources.
>>>>>>>>>>
>>>>>>>>>> Luc
>>>>>>>>>>
>>>>>>>>>> On 05/31/2011 02:25 PM, Graham Klyne wrote:
>>>>>>>>>>
>>>>>>>>>>> I see (at least) two resources associated with (c2):  one
>>>>>>>>>>> generated using (stats2), and other using (stats3).  We might
>>>>>>>>>>> call these (c2s2) and (c2s3).
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>> --
>>> Professor Luc Moreau
>>> Electronics and Computer Science   tel:   +44 23 8059 4487
>>> University of Southampton          fax:   +44 23 8059 2865
>>> Southampton SO17 1BJ               email: l.moreau@ecs.soton.ac.uk
>>> United Kingdom                     http://www.ecs.soton.ac.uk/~lavm
>>
>



-- 
Dr Simon Miles
Lecturer, Department of Informatics
King's College London, WC2R 2LS, UK
+44 (0)20 7848 1166
Received on Sunday, 5 June 2011 20:55:26 UTC