Re: Activity composition from Stian Soiland-Reyes on 2012-05-10 (public-prov-comments@w3.org from May 2012)

From: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
Date: Thu, 10 May 2012 10:03:16 +0100
To: Paolo Ncl <Paolo.Missier@ncl.ac.uk>
Cc: Davide Ceolin <davide.ceolin@gmail.com>, "public-prov-comments@w3.org" <public-prov-comments@w3.org>
Message-ID: <CAPRnXtkQLOZq4TaEvgKf=7R82wii+XStmHBLaSGo=ZbeYLBsNQ@mail.gmail.com>

I would also prefer a way to talk about activity composition and
entity composition.

With Daniel and Khalid I earlier tried to reconcile how we could use
PROV to trace executions of nested scientific workflows. Let's say we
have a trace of the master workflow:

wasGeneratedBy(value1, service1)
used(service2, value1)
wasGeneratedBy(value2, service2)
used(service3, value1)
used(service3, value2)
wasGeneratedBy(value3, service3)


service2 is a nested workflow, so while service1 and service3 are black
boxes, we also know the details of the 'inner workings' of service2:

wasStartedByActivity(service2a, service2)
wasStartedByActivity(service2b, service2)
used(service2a, value1)
wasGeneratedBy(internalValue, service2a)
used(service2b, value1)
used(service2b, internalValue)

The additional usage of value1 should be fine, but does not convey
that it was given to service2b by service2.


However, we cannot also state:

  wasGeneratedBy(value2, service2b)

This is due to the functional constraint on generation (an entity has
only one generating activity) - this would make service2b == service2
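
A minimal sketch of that constraint in Python, assuming the trace is
held as plain (entity, activity) pairs, one per wasGeneratedBy
statement. The helper name generation_conflicts and the tuple encoding
are illustrative assumptions, not part of PROV or any PROV library.

```python
from collections import defaultdict


def generation_conflicts(generations):
    """Return {entity: sorted activities} for entities with several generators.

    `generations` is an iterable of (entity, activity) pairs, one per
    wasGeneratedBy statement. Hypothetical helper, not a PROV API.
    """
    by_entity = defaultdict(set)
    for entity, activity in generations:
        by_entity[entity].add(activity)
    return {e: sorted(a) for e, a in by_entity.items() if len(a) > 1}


# Master workflow trace plus the inner workings of service2.
trace = [
    ("value1", "service1"),
    ("value2", "service2"),
    ("value3", "service3"),
    ("internalValue", "service2a"),
    ("value2", "service2b"),  # the extra statement we would like to add
]

print(generation_conflicts(trace))
# -> {'value2': ['service2', 'service2b']}
```

Without the last pair the function returns an empty dict, which is the
situation the workarounds below try to preserve.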



Some current workarounds:

a) Two entities, alternateOf

wasGeneratedBy(value2Inside, service2b)
alternateOf(value2, value2Inside)
wasDerivedFrom(value2, value2Inside)

I believe this is the cleanest solution. Here the derivation can be
thought of as "Moving value2 from inside to outside". I added the
derivation so that the existential link from value2Inside to value2 is
stated.

To 'close' value2Inside we can add:

wasInvalidatedBy(value2Inside, service2)



b) Two entities, common specializationOf super-entity

wasGeneratedBy(value2Outside, service2)
wasGeneratedBy(value2Inside, service2b)
specializationOf(value2Inside, value2)
specializationOf(value2Outside, value2)
wasDerivedFrom(value2Outside, value2Inside)

The specialization here is basically 'Being inside' and 'Being
outside' - think of it as the entity being in a door opening or coming
out of a pipe. It would allow you to break down the 'transfer' as
well:

specializationOf(value2InTransit, value2)
wasDerivedFrom(value2Outside, value2InTransit)

"value2" here is the "actual", pure Platonic value, which does not
easily have a wasGeneratedBy. For computer internals it can be thought
of in terms of the abstract "The number 14" and "The bytes [20, 65,
66, 67]" - for real world examples it is "The concept of the thing".



c) Use different accounts

Each account can have a different view of how value2 was created.
However, if you have many activities, iterations etc., you will get a
whole lot of accounts, with growing query and representational issues.
Merging these accounts will be more of a challenge, as you would
have to use one of the other solutions suggested here.

We also don't have a way to say "This account shows the inner workings
of this activity". (Or can we use PROV-AQ for that?
  :activity1 prov:hasProvenance <activity1-provenance> )


d) Drop outer wasGeneratedBy

Removing
  wasGeneratedBy(value2, service2)

But then you have not just opened the lid of service2, you have
removed the casing. This approach implies that service2 had nothing
to do with value2.



If we are unhappy about these kinds of approaches, then I think a good
solution would be to have a construct for service composition. Then we
can relax the wasGeneratedBy functional requirement, and say that the
activities are the same, or that one of the activities contains the
other, which can be expressed as some kind of "partOf" relation
stronger than wasStartedBy (without implying any tokens).
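
A rough sketch of what such a relaxed check could look like, in Python.
It assumes a hypothetical "partOf" relation stored as a child -> parent
dict (so each activity has at most one parent); none of these names
exist in PROV. Several generators for one entity are accepted only if
every pair of them is related by containment.

```python
def is_part_of(activity, ancestor, part_of):
    """True if `activity` is `ancestor` or reaches it by following partOf links."""
    seen = set()
    while activity is not None and activity not in seen:
        if activity == ancestor:
            return True
        seen.add(activity)  # guard against accidental cycles
        activity = part_of.get(activity)
    return False


def compatible_generators(activities, part_of):
    """True if every pair of activities is related by (transitive) partOf."""
    acts = list(activities)
    return all(
        is_part_of(a, b, part_of) or is_part_of(b, a, part_of)
        for i, a in enumerate(acts)
        for b in acts[i + 1:]
    )


# Hypothetical composition statements for the nested workflow.
part_of = {"service2a": "service2", "service2b": "service2"}

# value2 generated by both service2 and service2b: same activity at two scales.
print(compatible_generators(["service2", "service2b"], part_of))   # True
# Two sibling sub-activities do not contain each other.
print(compatible_generators(["service2a", "service2b"], part_of))  # False
```

Under this reading the constraint becomes "one generator per entity, up
to composition" rather than being dropped altogether.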

This will add complications, for instance if you have (e=entity,
a=activity, ->= generated/used):

a1 -> e1 -> a2 -> e2

and you also decompose a1 to:

e0 -> a1a -> ex -> a1b -> ey -> a1c -> e1


Now the question is where e0 came from - by composition, was it not
also used by a1? Can e0 also 'be part of a1' - an embedded entity,
like a part of the machine performing a1?

(I think the opposite case is OK: a1 consumes e0, but e0 is not seen
inside. It could just have been used for coordination purposes by
a1).



However, I believe service composition is still easier to deal with
than a set of slightly unrelated 'mirror' entities at different
granularities; it's just a more detailed path of the same trace.

I guess one question is whether it is up to the asserter or the consumer
of the provenance trace to determine the granularity. The beauty of this
approach is that the consumer can mix and match: he can go into detail
for a2, but use the shortcut for a1. The asserter just says everything
he knows, including the inner workings where they are known, and outer
abstractions where they make sense.
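
One way a consumer could do this mixing, sketched in Python under the
same assumption of a hypothetical child -> parent "partOf" dict.
Statements are encoded here as (relation, activity, entity) tuples for
uniformity, which differs from the PROV argument order for
wasGeneratedBy; the function name collapse is likewise made up.

```python
def collapse(statements, part_of, collapsed):
    """Return a view where sub-activities of `collapsed` activities are folded in.

    statements: list of (relation, activity, entity) tuples.
    part_of:    hypothetical child -> parent activity mapping.
    collapsed:  set of parent activities the consumer wants as black boxes.
    """
    def outer(activity):
        # Climb partOf links for as long as the parent is a collapsed activity.
        while part_of.get(activity) in collapsed:
            activity = part_of[activity]
        return activity

    rewritten = [(r, outer(a), e) for r, a, e in statements]

    generated = {e: a for r, a, e in rewritten if r == "wasGeneratedBy"}
    users = {}
    for r, a, e in rewritten:
        if r == "used":
            users.setdefault(e, set()).add(a)

    # Entities generated and used only by the same collapsed activity are hidden.
    internal = {e for e, a in generated.items()
                if a in collapsed and users.get(e) == {a}}

    view = []
    for s in rewritten:
        if s[2] not in internal and s not in view:  # drop hidden and duplicates
            view.append(s)
    return view


part_of = {"service2a": "service2", "service2b": "service2"}
detailed = [
    ("used", "service2a", "value1"),
    ("wasGeneratedBy", "service2a", "internalValue"),
    ("used", "service2b", "value1"),
    ("used", "service2b", "internalValue"),
    ("wasGeneratedBy", "service2b", "value2"),
    ("used", "service3", "value2"),
]

for statement in collapse(detailed, part_of, {"service2"}):
    print(statement)
# ('used', 'service2', 'value1')
# ('wasGeneratedBy', 'service2', 'value2')
# ('used', 'service3', 'value2')
```

Passing an empty set for collapsed returns the detailed trace unchanged,
so the asserter only needs to publish the detailed statements plus the
composition links.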



A different solution would be to have a stronger kind of alternateOf
that includes the derivation and 'passing' nature rather than any kind
of 'change' derivation. Thus we use two entities, but have a
PROV-specific way to say 'This is the same thing, but as generated by
a different activity at a different scale'.


I believe that for almost all the examples we have, the activities
could also be expressed at a more granular level. For instance,
filling-petrol could be decomposed into opening-fuel-cap,
using-petrol-pump, closing-fuel-cap, paying.

Is our stance that such decomposition must always be done through a
separate provenance account/graph?


On Wed, May 9, 2012 at 10:47 PM, Paolo Ncl <Paolo.Missier@ncl.ac.uk> wrote:
> Davide
>
> I guess it depends on how you define "part of" in this setting. You can specify that an activity has started another, which makes, informally, the former a "parent" of the latter. You can use this to model forking, for example. This is about the observed behavior of a process and is within scope. But there is no way to express structural containment, or composition, because describing process models and structure (for instance, the structure of a program, a workflow, a script etc.) is not within the PROV scope.
> I hope others in the group concur with this interpretation
>
> Regards,
>
> P.Missier - paolo.missier@ncl.ac.uk
>
> On 7 May 2012, at 21:44, Davide Ceolin <davide.ceolin@gmail.com> wrote:
>
>> Hello,
>>
>> I am a PhD student of the VU University Amsterdam, and I would have a question about the composition of activities in PROV. I noticed that it is not possible to explicitly state that an activity is actually part of another one.
>>
>> Suppose that a given entity is the result of an activity and, in turn, this activity is part of a larger one.
>>
>> I can represent this scenario with two separate graphs stating that each of the two activities generated the entity, and from them (and their execution times, etc.) I may infer that one is part of the other one, but I can't explicitly state that.
>>
>> Is there a specific reason for such a limitation?
>>
>> Thanks,
>>
>> Davide
>>
>> Davide Ceolin MSc.
>> PhD student
>> The Network Institute
>> VU University Amsterdam
>> d.ceolin@vu.nl
>> http://www.few.vu.nl/~dceolin/
>>
>>
>>
>



-- 
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester
Received on Thursday, 10 May 2012 09:04:12 UTC