Re: PROV-ISSUE-67 (single-execution): Why is there a difference in what is represented by one vs multiple executions? [Conceptual Model]

Hi Luc,

OK. Here's my stab at a motivating example.

An organisation, Org, wants to use the WG standard to record and
provide access to provenance data on the documents it makes available
online to its clients. It has storage limits on the provenance it can
maintain.

Alice regularly receives government data sets and for each, creates a
report which is published online. Looking for a minimal way to express
this using PIL, Org decides on one BOB for each data set, one for each
report, one process representing the create-and-publish workflow, and
a derivation link to show that the report is based on the data set. A
given instance of this, for one data set, is:

  bob (data1, [ type: "File", location: "/shared/crime1.data" ])
  bob (report1, [ type: "File",
       location: "http://example.com/report1.pdf", creator: "Alice" ])
  processExecution (workflow1, create-and-publish, t)
  isGeneratedBy (report1, workflow1, out)
  used (workflow1, data1, in)
  isDerivedFrom (report1, data1)

A client, Clive, finds a mistake in report1 and looks at the provenance.
As the recorded "creator", Alice gets the blame. However, the error is
actually due to Bob, who published Alice's report, messing up the axes
on a graph. To avoid Alice's anger, Org agrees to refine what is
modelled to a finer granularity: create, then publish. As they have
storage constraints, they will make available only one granularity of
provenance information, and use this finer granularity only for
subsequent reports. A given instance would be:

  bob (data2, [ type: "File", location: "/shared/crime2.data" ])
  bob (unpublished2, [ type: "File", location: "/shared/report2.pdf",
       creator: "Alice" ])
  bob (report2, [ type: "File",
       location: "http://example.com/report2.pdf", creator: "Alice",
       publisher: "Bob" ])
  processExecution (workflow1.1, create, t)
  processExecution (workflow1.2, publish, t+4)
  isGeneratedBy (unpublished2, workflow1.1, out)
  isGeneratedBy (report2, workflow1.2, out)
  used (workflow1.1, data2, in)
  used (workflow1.2, unpublished2, in)
  isDerivedFromInMultipleSteps (report2, data2)
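
Unlike my initial example, this instance respects the rule you cited,
that one BOB is generated by at most one process execution. A rough
sketch of that check in Python follows. This is not PIL syntax: the
(bob, process execution) tuple encoding and the function name
bobs_with_multiple_generators are my own assumptions, for illustration
only.

```python
from collections import defaultdict

# Hypothetical encoding: one (bob, process_execution) pair per
# isGeneratedBy assertion in the two instances above.
generated_by = [
    ("report1", "workflow1"),
    ("unpublished2", "workflow1.1"),
    ("report2", "workflow1.2"),
]


def bobs_with_multiple_generators(assertions):
    """Return the BOBs asserted as generated by more than one process execution."""
    generators = defaultdict(set)
    for bob, process_execution in assertions:
        generators[bob].add(process_execution)
    return {bob: pes for bob, pes in generators.items() if len(pes) > 1}


# Both instances pass: no BOB has more than one generator.
print(bobs_with_multiple_generators(generated_by))  # prints {}
```

Adding the pair of assertions isGeneratedBy(e5,pe5,out) and
isGeneratedBy(e5,pe4,out) from the earlier exchange would make e5 show
up in the result.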

Clive queries to find out which data sets the available reports are
derived from. He finds that while report1 is derived from data1 in one
step (isDerivedFrom), report2 is derived from data2 in multiple steps
(isDerivedFromInMultipleSteps). He (like me) does not understand how
he should interpret the distinction between the two. There is
apparently something different in the way that report2 is related to
data2 compared to how report1 is derived from data1, and possibly he
should trust report2 less because of this indirect link to its source
data. But Org is adamant that nothing has changed in their procedures,
and there is no distinction.
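
To make Clive's query concrete, here is a minimal sketch in Python that
ignores the two derivation terms and simply follows the used and
isGeneratedBy assertions back from each report. The dictionary encoding
and the function name sources are assumptions of mine, not part of PIL,
and the sketch assumes the assertions contain no cycles.

```python
# Hypothetical encoding of the assertions from both instances above.
# used: process execution -> BOBs it used
used = {
    "workflow1": ["data1"],
    "workflow1.1": ["data2"],
    "workflow1.2": ["unpublished2"],
}
# generated_by: BOB -> the single process execution that generated it
generated_by = {
    "report1": "workflow1",
    "unpublished2": "workflow1.1",
    "report2": "workflow1.2",
}


def sources(bob):
    """Return (source BOB, number of process executions traversed) pairs.

    A source is a BOB reachable from `bob` that has no recorded generator.
    """
    found = []
    stack = [(bob, 0)]
    while stack:
        current, steps = stack.pop()
        process_execution = generated_by.get(current)
        if process_execution is None:
            # No generator recorded: this is an original input.
            if steps > 0:
                found.append((current, steps))
            continue
        for used_bob in used.get(process_execution, []):
            stack.append((used_bob, steps + 1))
    return found


print(sources("report1"))  # [('data1', 1)]
print(sources("report2"))  # [('data2', 2)]
```

Both reports trace back to exactly one data set. The only thing that
differs is the step count, which reflects the granularity Org chose to
record rather than anything about how the reports were produced.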

Thanks,
Simon

On 1 August 2011 12:15, Luc Moreau <L.Moreau@ecs.soton.ac.uk> wrote:
> Hi Simon,
>
> Sorry, but I don't understand.  Your initial example was not valid
> because you had
> two PEs generating a single BOB.
>
> If they are different ways of describing something happening in the
> world, I
> assume that you will identify different activities, and hence multiple
> process executions
> will be asserted.
>
> Can you reformulate an example that illustrates your concern?
>
> Luc
>
> On 08/01/2011 12:02 PM, Simon Miles wrote:
>> Hi Luc,
>>
>> I follow your argument, but it seems tangential to my point. The
>> following argument still seems inevitably true to me:
>>
>> Activity in the world that uses one BOB and generates another *can* be
>> described in PIL as multiple process executions or a single process
>> execution (regardless of whether it actually is described in these
>> different ways or not, or whether accounts are required or not).
>>
>> Therefore, what one process execution denotes is not distinct from
>> what multiple process executions denote; we have just provided more
>> detail in the latter description (and this detail is, in any case,
>> removed when saying "is derived from").
>>
>> Therefore, isDerivedFrom and isDerivedFromInMultipleSteps as defined
>> do not describe anything different in the world, so we have two terms
>> for representing the same thing.
>>
>> I know that we've debated this or similar before, but it is still not
>> clear to me where the fault lies in my argument, or what
>> isDerivedFromInMultipleSteps really represents. If it's only me that's
>> confused, I understand there are more urgent concerns (though I'd
>> still like to understand).
>>
>> Thanks,
>> Simon
>>
>> On 1 August 2011 09:25, Luc Moreau <L.Moreau@ecs.soton.ac.uk> wrote:
>>
>>> Hi Simon,
>>>
>>> If I understand you correctly, you are suggesting that the following two
>>> assertions hold together.
>>>
>>> isGeneratedBy(e5,pe5,out)
>>> isGeneratedBy(e5,pe4,out)
>>>
>>> But this is not legal, since it is stated that one BOB is generated by
>>> at most one process execution.
>>>
>>> What you are suggesting should be encoded in a separate account (though
>>> we have not defined this yet!).
>>> A one-step derivation then expands to one process execution in a given
>>> account.
>>> In a separate account, there may be a multi-step derivation between the
>>> same two BOBs and it would
>>> expand into multiple process executions.
>>>
>>> Does it make sense?
>>> Regards,
>>>
>>> Luc
>>>
>>>
>>> On 07/29/2011 05:52 PM, Provenance Working Group Issue Tracker wrote:
>>>
>>>> PROV-ISSUE-67 (single-execution): Why is there a difference in what is represented by one vs multiple executions? [Conceptual Model]
>>>>
>>>> http://www.w3.org/2011/prov/track/issues/67
>>>>
>>>> Raised by: Simon Miles
>>>> On product: Conceptual Model
>>>>
>>>> By the definition, "a process execution represents an identifiable activity". This does not seem to preclude one process execution assertion denoting, at a coarse granularity, the same events in the world denoted by multiple process executions in other assertions.
>>>>
>>>> If so, then in the File Scenario example, I could add a coarse-grained process execution representing the whole e1-to-e5 activity:
>>>>     processExecution(pe5,collaboratively-edit,t)
>>>>     uses(pe5,e1,in)
>>>>     isGeneratedBy(e5,pe5,out)
>>>>
>>>> But then Section 5.5.2 distinguishes between "a single process execution" and "one or more process executions". Following the argument above, these could represent exactly the same occurrences in the world.
>>>>
>>>> So there is no difference between what is denoted by one and multiple process executions, and so no difference between isDerivedFrom and isDerivedFromInMultipleSteps as described. Whether e5 was derived from e1 appears to me to be entirely independent of how many process executions were involved.
>>>>
>>> --
>>> Professor Luc Moreau
>>> Electronics and Computer Science   tel:   +44 23 8059 4487
>>> University of Southampton          fax:   +44 23 8059 2865
>>> Southampton SO17 1BJ               email: l.moreau@ecs.soton.ac.uk
>>> United Kingdom                     http://www.ecs.soton.ac.uk/~lavm
>>
>>
>>
>



-- 
Dr Simon Miles
Lecturer, Department of Informatics
King's College London, WC2R 2LS, UK
+44 (0)20 7848 1166

Received on Monday, 1 August 2011 15:53:58 UTC