Re: PROV-ISSUE-67 (single-execution): Why is there a difference in what is represented by one vs multiple executions? [Conceptual Model]

Hi Luc,

OK. I believe that the current definitions do not fully capture what
I've understood from your mails, so if I were clarifying the document
based on my current understanding, I would start by refining the
definitions (and rearranging the existing text to fit):

"That characterized thing B _is derived from_ another characterized
thing A means that B is transformed from, created from, or affected by
A. In particular, this means that the values of some attributes of B
are at least partially determined by the values of some attributes of
A.

xxx (B, A) represents that B is derived from A, and if P is the
process execution generating B by the account in which the derivation
is asserted, then P is the execution which used A and derived B from
it.

yyy (B, A) represents that B is derived from A, by any means whether
direct or convoluted, and regardless of any other assertion made.

For the account in which yyy (B, A) is asserted to be consistent,
it is implied that, within that account, either xxx (B, A) also holds
or there are multiple process executions ultimately using A and
generating B through a chain of use and generation relations."

xxx is currently called isDerivedFrom and yyy is called
isDerivedFromInMultipleSteps.
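
To illustrate the consistency condition, here is a minimal Python
sketch (not part of the draft). It assumes an account can be encoded
as a generated_by mapping plus a set of used pairs; those names, and
the functions xxx_holds and yyy_consistent, are hypothetical. The data
is the report2 example quoted further down in this thread.

```python
# Hypothetical encoding of a single account; none of these names come from the draft.
generated_by = {  # entity -> the one process execution that generated it
    "unpublished2": "workflow1.1",
    "report2": "workflow1.2",
}
used = {  # (process execution, entity it used)
    ("workflow1.1", "data2"),
    ("workflow1.2", "unpublished2"),
}


def xxx_holds(derived, source):
    """xxx: the execution that generated `derived` in this account also used `source`."""
    pe = generated_by.get(derived)
    return pe is not None and (pe, source) in used


def yyy_consistent(derived, source):
    """yyy is consistent if a chain of generation/use leads from `derived` back to `source`.

    A chain of length one is exactly the xxx case.
    """
    seen = set()
    frontier = [derived]
    while frontier:
        entity = frontier.pop()
        if entity in seen:
            continue
        seen.add(entity)
        pe = generated_by.get(entity)
        if pe is None:
            continue  # no generation asserted, so the chain stops here
        for user, input_entity in used:
            if user == pe:
                if input_entity == source:
                    return True
                frontier.append(input_entity)
    return False
```

With this data, xxx_holds("report2", "data2") is false while
yyy_consistent("report2", "data2") is true, which is the distinction
the two relations are meant to capture.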

I fear that xxx is impossible to understand properly without including
accounts, and consistency within accounts, in the model. Once we
introduce accounts, it then makes sense.

Assuming we don't want to introduce accounts into the current draft, I
propose something like the following:

 - isDerivedFromInMultipleSteps (yyy) is renamed isEventuallyDerivedFrom
 - isEventuallyDerivedFrom is defined as for yyy above, omitting the
final paragraph about account consistency until accounts are introduced
 - isDerivedFrom (xxx) is excluded from the model until accounts are introduced
 - isDerivedFrom+ is also excluded until accounts are introduced, as
it depends on isDerivedFrom
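
The dependency in the last item is that isDerivedFrom+ is computed
purely from isDerivedFrom assertions, as their transitive closure. A
minimal Python sketch, assuming derivations are held as (derived,
source) pairs; the function name derived_from_plus is hypothetical:

```python
def derived_from_plus(is_derived_from):
    """Return the transitive closure of a set of (derived, source) pairs."""
    closure = set(is_derived_from)
    changed = True
    while changed:
        changed = False
        for c, b in list(closure):
            for b2, a in list(closure):
                # c derived from b, and b derived from a, so c derived from a
                if b == b2 and (c, a) not in closure:
                    closure.add((c, a))
                    changed = True
    return closure


# One-step derivations taken from the report2 example in this thread.
one_step = {("report2", "unpublished2"), ("unpublished2", "data2")}
closure = derived_from_plus(one_step)
```

Here closure contains ("report2", "data2") in addition to the two
asserted pairs. Without isDerivedFrom there is nothing to close over.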

I don't like the proposal as it removes isDerivedFrom, but I can't see
how we can define isDerivedFrom in a way which reflects your intention
without introducing accounts. Otherwise, the implication that will be
drawn (and has been by several people in discussing this issue) is
that there is some implied notion of "atomic process executions".

Thanks,
Simon

On 3 August 2011 22:56, Luc Moreau <L.Moreau@ecs.soton.ac.uk> wrote:
> Hi Simon,
>
> It's good to see that we understand each other's definition of derivation.
>
> Given what you say about your notion of derivation, isn't it similar to
> isDerivedFromInMultipleSteps?
>
> I wonder whether we should find a better terminology for these relations.
>
> Luc
>
> On 03/08/11 16:59, Simon Miles wrote:
>> Hi Luc,
>>
>> Sorry, just catching up with these mails. Your explanation helps a
>> lot. In particular, I think the critical point which clarifies my
>> confusion is the following:
>>
>>
>>> Asserting that
>>>   isDerivedFrom(report2, data2)
>>> would be very different. It would mean that the process execution that
>>> generated report2 also used data2.
>>>
>> I have always understood isDerivedFrom (A, B) as saying that "A was
>> derived from B, regardless of any other assertion I make", which could
>> be expressed as "there is a conceivably assertable process execution
>> which used B and generated A".
>>
>> You are instead saying isDerivedFrom (A, B) means "A was derived from
>> B, and if I assert A as being generated by a process execution, that
>> was the execution which used B and led to A being derived from it".
>>
>> I agree these are semantically different. You are taking
>> "use+generate" as fundamental, where "derived" implies that a process
>> which uses B and generates A takes place, so consistency within an
>> account requires that the process which generates A is the same one
>> that is implied by the derivation.
>>
>> I interpreted "derived" as fundamental itself and an independent
>> assertion, so consistency in an account is given by this independence,
>> i.e. by saying "derived" you are not implying a process in the same
>> account anyway. And the independence of the assertion means that it
>> does not even make sense to consider it in conjunction with the
>> "generates" assertion (if it exists).
>>
>> thanks,
>> Simon
>>
>> On 1 August 2011 23:59, Luc Moreau <L.Moreau@ecs.soton.ac.uk> wrote:
>>
>>> Hi Simon,
>>>
>>> That's a good example, thanks!
>>>
>>> Let me try to explain how I see it:
>>>
>>> With
>>>
>>> isDerivedFrom (report1, data1)
>>>
>>> the asserter has a deep knowledge of the process execution that underpins
>>> this derivation. In particular, it is PE workflow1 that generates report1
>>> and uses data1. Hence, both the generation event for report1 and the use
>>> event for data1 occur during workflow1.
>>>
>>> In the provenance challenge, when you were using slicing techniques to
>>> extract derivations from process definitions, I would argue you were
>>> generating similar derivations.
>>>
>>> With
>>>
>>> isDerivedFromInMultipleSteps (report2, data2)
>>>
>>> the asserter is much less precise, and does not state whether a single
>>> process is involved for generation/use, or which interval they occur in.
>>>
>>> Furthermore, in this example, with the provenance given, one cannot
>>> ascertain whether 'unpublished2' is in the derivation path between
>>> report2 and data2.
>>>
>>> A stronger provenance would have been
>>>
>>> isDerivedFrom (report2, unpublished2)
>>>
>>> isDerivedFrom(unpublished2, data2)
>>>
>>>
>>> from which we can infer by transitive closure
>>>
>>> isDerivedFrom+ (report2, data2)
>>>
>>>
>>> So, to me,
>>> 1. isDerivedFrom is fundamental in the model, and requires deep/precise
>>>      knowledge of process executions.
>>> 2. isDerivedFrom+ is a useful inference (transitive closure).
>>> 3. isDerivedFromInMultipleSteps is a convenience assertion, but not
>>>       as precise as 1&2.
>>>
>>> We could drop 3, but then, you wouldn't be able to express your second
>>> example.
>>>
>>> Asserting that
>>>   isDerivedFrom(report2, data2)
>>> would be very different. It would mean that the process execution that
>>> generated report2 also used data2.
>>>
>>> So,
>>>
>>> used (workflow1.2, data2, r) for some role r.
>>>
>>> But that's not the intent.
>>>
>>> What do you think?
>>> Regards,
>>> Luc
>>>
>>>
>>>
>>>
>>> On 01/08/11 16:53, Simon Miles wrote:
>>>
>>>> Hi Luc,
>>>>
>>>> OK. Here's my stab at a motivating example.
>>>>
>>>> An organisation, Org, wants to use the WG standard to record and
>>>> provide access to provenance data on the documents it makes available
>>>> online to its clients. It has storage limits on the provenance it can
>>>> maintain.
>>>>
>>>> Alice regularly receives government data sets and for each, creates a
>>>> report which is published online. Looking for a minimal way to express
>>>> this using PIL, Org decides on one BOB for each data set, one for each
>>>> report, one process representing the create-and-publish workflow, and
>>>> a derivation link to show that the report is based on the data set. A
>>>> given instance of this, for one data set, is:
>>>>
>>>>     bob (data1, [ type: "File", location: "/shared/crime1.data" ])
>>>>     bob (report1, [ type: "File",
>>>>                     location: "http://example.com/report1.pdf",
>>>>                     creator: "Alice" ])
>>>>     processExecution (workflow1, create-and-publish, t)
>>>>     isGeneratedBy (report1, workflow1, out)
>>>>     used (workflow1, data1, in)
>>>>     isDerivedFrom (report1, data1)
>>>>
>>>> A client, Clive, finds a mistake in report1 and looks at the provenance;
>>>> as Alice is listed as "creator", she gets the blame. However, the error is
>>>> actually due to Bob, who published Alice's report, messing up the axes
>>>> on a graph. To avoid Alice's anger, Org agrees to refine what is
>>>> modelled to a finer granularity: create, then publish. As they have
>>>> storage constraints, they will make available only one granularity of
>>>> provenance information, and use this finer granularity only for
>>>> subsequent reports. A given instance would be:
>>>>
>>>>     bob (data2, [ type: "File", location: "/shared/crime2.data" ])
>>>>     bob (unpublished2, [ type: "File",
>>>>                          location: "/shared/report2.pdf",
>>>>                          creator: "Alice" ])
>>>>     bob (report2, [ type: "File",
>>>>                     location: "http://example.com/report2.pdf",
>>>>                     creator: "Alice", publisher: "Bob" ])
>>>>     processExecution (workflow1.1, create, t)
>>>>     processExecution (workflow1.2, publish, t+4)
>>>>     isGeneratedBy (unpublished2, workflow1.1, out)
>>>>     isGeneratedBy (report2, workflow1.2, out)
>>>>     used (workflow1.1, data2, in)
>>>>     used (workflow1.2, unpublished2, in)
>>>>     isDerivedFromInMultipleSteps (report2, data2)
>>>>
>>>> Clive queries to find out what data sets the available reports are
>>>> derived from. He finds that while report1 is derived from data1 in one
>>>> step (isDerivedFrom), report2 is derived from data2 in multiple steps
>>>> (isDerivedFromInMultipleSteps). He (like me) does not understand how
>>>> he should interpret the distinction between the two. There is
>>>> apparently something different in the way that report2 is related to
>>>> data2 compared to how report1 is derived from data1, and possibly he
>>>> should trust report2 less because of this indirect link to its source
>>>> data. But Org is adamant that nothing has changed in their procedures,
>>>> and there is no distinction.
>>>>
>>>> Thanks,
>>>> Simon
>>>>
>>>> On 1 August 2011 12:15, Luc Moreau <L.Moreau@ecs.soton.ac.uk> wrote:
>>>>
>>>>
>>>>> Hi Simon,
>>>>>
>>>>> Sorry, but I don't understand.  Your initial example was not valid
>>>>> because you had two PEs generating a single BOB.
>>>>>
>>>>> If they are different ways of describing something happening in the
>>>>> world, I assume that you will identify different activities, and hence
>>>>> multiple process executions will be asserted.
>>>>>
>>>>> Can you reformulate an example that illustrates your concern?
>>>>>
>>>>> Luc
>>>>>
>>>>> On 08/01/2011 12:02 PM, Simon Miles wrote:
>>>>>
>>>>>
>>>>>> Hi Luc,
>>>>>>
>>>>>> I follow your argument, but it seems tangential to my point. The
>>>>>> following argument still seems inevitably true to me:
>>>>>>
>>>>>> Activity in the world that uses one BOB and generates another *can* be
>>>>>> described in PIL as multiple process executions or a single process
>>>>>> execution (regardless of whether it actually is described in these
>>>>>> different ways or not, or whether accounts are required or not).
>>>>>>
>>>>>> Therefore, what one process execution denotes is not distinct from
>>>>>> what multiple process executions denote; we have just provided more
>>>>>> detail in the latter description (and this detail is, in any case,
>>>>>> removed when saying "is derived from").
>>>>>>
>>>>>> Therefore, isDerivedFrom and isDerivedFromInMultipleSteps as defined
>>>>>> do not describe anything different in the world, so we have two terms
>>>>>> for representing the same thing.
>>>>>>
>>>>>> I know that we've debated this or similar before, but it is still not
>>>>>> clear to me where the fault lies in my argument, or what
>>>>>> isDerivedFromInMultipleSteps really represents. If it's only me that's
>>>>>> confused, I understand there are more urgent concerns (though I'd
>>>>>> still like to understand).
>>>>>>
>>>>>> Thanks,
>>>>>> Simon
>>>>>>
>>>>>> On 1 August 2011 09:25, Luc Moreau <L.Moreau@ecs.soton.ac.uk> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Hi Simon,
>>>>>>>
>>>>>>> If I understand you correctly, you are suggesting that the following two
>>>>>>> assertions hold together.
>>>>>>>
>>>>>>> isGeneratedBy(e5,pe5,out)
>>>>>>> isGeneratedBy(e5,pe4,out)
>>>>>>>
>>>>>>> But this is not legal, since it is stated that one BOB is generated by
>>>>>>> at most one process execution.
>>>>>>>
>>>>>>> What you are suggesting should be encoded in a separate account (though
>>>>>>> we have not defined this yet!). A one-step derivation then expands to
>>>>>>> one process execution in a given account. In a separate account, there
>>>>>>> may be a multi-step derivation between the same two BOBs, and it would
>>>>>>> expand into multiple process executions.
>>>>>>>
>>>>>>> Does it make sense?
>>>>>>> Regards,
>>>>>>>
>>>>>>> Luc
>>>>>>>
>>>>>>>
>>>>>>> On 07/29/2011 05:52 PM, Provenance Working Group Issue Tracker wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> PROV-ISSUE-67 (single-execution): Why is there a difference in what is represented by one vs multiple executions? [Conceptual Model]
>>>>>>>>
>>>>>>>> http://www.w3.org/2011/prov/track/issues/67
>>>>>>>>
>>>>>>>> Raised by: Simon Miles
>>>>>>>> On product: Conceptual Model
>>>>>>>>
>>>>>>>> By the definition, "a process execution represents an identifiable activity". This does not seem to preclude one process execution assertion denoting, at a coarse granularity, the same events in the world denoted by multiple process executions in other assertions.
>>>>>>>>
>>>>>>>> If so, then in the File Scenario example, I could add a coarse-grained process execution representing the whole e1-to-e5 activity:
>>>>>>>>       processExecution(pe5,collaboratively-edit,t)
>>>>>>>>       uses(pe5,e1,in)
>>>>>>>>       isGeneratedBy(e5,pe5,out)
>>>>>>>>
>>>>>>>> But then Section 5.5.2 distinguishes between "a single process execution" and "one or more process executions". Following the argument above, these could represent exactly the same occurrences in the world.
>>>>>>>>
>>>>>>>> So there is no difference between what is denoted by one and multiple process executions, and so no difference between isDerivedFrom and isDerivedFromInMultipleSteps as described. Whether e5 was derived from e1 appears to me to be entirely independent of how many process executions were involved.
>>>>>>>>
>>>>>>>>
>>>>>>> --
>>>>>>> Professor Luc Moreau
>>>>>>> Electronics and Computer Science   tel:   +44 23 8059 4487
>>>>>>> University of Southampton          fax:   +44 23 8059 2865
>>>>>>> Southampton SO17 1BJ               email: l.moreau@ecs.soton.ac.uk
>>>>>>> United Kingdom                     http://www.ecs.soton.ac.uk/~lavm
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ______________________________________________________________________
>>>>>>> This email has been scanned by the MessageLabs Email Security System.
>>>>>>> For more information please visit http://www.messagelabs.com/email
>>>>>>> ______________________________________________________________________
>>>
>



-- 
Dr Simon Miles
Lecturer, Department of Informatics
King's College London, WC2R 2LS, UK
+44 (0)20 7848 1166

Received on Thursday, 4 August 2011 17:13:20 UTC