Re: PROV-ISSUE-447: subactivity relation [prov-dm]

> I don't think this example makes much sense:
>
> activity(a1,2011-11-16T00:00:00,2011-11-17T00:00:00) // in 2011
> activity(a2,2012-11-16T00:00:00,2012-11-17T00:00:00)  // in 2012
> wasSubactivity(a1,a2)

I agree this would look stupid, but we have said before that the exact
timestamps don't have any meaning in PROV-Constraints.

In particular for subactivities, it could very well happen that the
times are recorded by different mechanisms. Perhaps a difference of a
year is a glaring error, but being a few seconds off might be
acceptable. (For instance, a shell script that does an SSH to a server
that then does a wget to a web service gives three different
timestamps that are not quite synchronized.) Obviously this can easily
be isolated using different accounts/bundles, but as has been
discussed with workflow provenance, we often came to the conclusion
that we don't want to split every subactivity into a new bundle, as it
would mean hundreds of different standalone bundles, which would make
any kind of reasoning trickier.
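
To make the "a few seconds off" point concrete, here is a minimal
Python sketch of a containment check with a clock-skew tolerance. The
function name within_superactivity and the tolerance parameter are my
own illustration, not anything defined in PROV-Constraints, and the
script/ssh timestamps are invented for the example.

```python
from datetime import datetime, timedelta

def within_superactivity(sub, sup, tolerance=timedelta(0)):
    """Return True if interval `sub` lies inside interval `sup`, allowing
    each boundary to be off by at most `tolerance` (hypothetical parameter)."""
    sub_start, sub_end = sub
    sup_start, sup_end = sup
    return (sub_start >= sup_start - tolerance
            and sub_end <= sup_end + tolerance)

# Timestamps of a1 and a2 from the quoted example
a1 = (datetime(2011, 11, 16), datetime(2011, 11, 17))
a2 = (datetime(2012, 11, 16), datetime(2012, 11, 17))

# Invented case: an SSH step whose clock runs a few seconds ahead
script = (datetime(2012, 9, 6, 10, 0, 0), datetime(2012, 9, 6, 10, 5, 0))
ssh = (datetime(2012, 9, 6, 10, 0, 2), datetime(2012, 9, 6, 10, 5, 3))

skew = timedelta(seconds=5)
year_off = within_superactivity(a1, a2, skew)          # False: a year apart
skewed_strict = within_superactivity(ssh, script)      # False: ends 3s late
skewed_tolerant = within_superactivity(ssh, script, skew)  # True
```

How large a tolerance is reasonable would depend entirely on how the
timestamps were recorded.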


> As indicated previously, it's a whole complete new design that
> we have to undertake, for which we don't have enough experience.

It seems that a wasSubActivity should have many of the characteristics
of specializationOf, but it raises lots of discussion points for
inferences:

* the subactivity must be fully contained within the duration of the
superactivity (This is the easy one!)
* If wasAssociatedWith(subAct, ag), then wasAssociatedWith(act, ag)?  Vice versa?
* If wasGeneratedBy(e, subAct), then wasGeneratedBy(e, act)?  Vice versa?
* If used(subAct, e), then used(act, e)?  Vice versa?
* Must subactivities be 'isolated', or are they allowed to communicate
with activities which also communicate with the superactivity?
(Imposes a theory of execution!)
* Can the superactivity communicate with the subactivity? Does it always?
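
For discussion, here is what one reading of the "upward" direction of
these inferences could look like, as a Python sketch over plain
tuples. The tuple layout and the function propagate_upward are
hypothetical, and usage is assumed to propagate as usage; whether any
of these inferences should hold at all is exactly the open question.

```python
def propagate_upward(statements, sub_of):
    """Copy generation, usage and association statements from each
    subactivity to its direct superactivity (one hypothetical reading).

    statements: set of (relation, first_arg, second_arg) tuples,
                with arguments in PROV-DM order
    sub_of:     dict mapping subactivity -> superactivity
    """
    inferred = set(statements)
    for relation, x, y in statements:
        if relation == "wasGeneratedBy" and y in sub_of:
            # wasGeneratedBy(entity, activity)
            inferred.add((relation, x, sub_of[y]))
        elif relation in ("used", "wasAssociatedWith") and x in sub_of:
            # used(activity, entity), wasAssociatedWith(activity, agent)
            inferred.add((relation, sub_of[x], y))
    return inferred

stmts = {
    ("wasGeneratedBy", "e1", "subAct"),
    ("used", "subAct", "e0"),
    ("wasAssociatedWith", "subAct", "ag"),
}
result = propagate_upward(stmts, {"subAct": "act"})
```

Note that the generation case immediately leaves e1 with two
generating activities, subAct and act.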

So I agree it is a big can of worms. This was difficult enough to
settle for entities; now we would have to think not only about
activity-to-activity, but also about the implications for the other
relations.


However the arguments we used for adding prov:specializationOf and
prov:alternateOf would very easily also apply to activities:
* Equivalent activities can be expressed at different granularities
(prov:wasSubActivityOf ?)
* Equivalent activities can be expressed using alternate
interpretations (prov:alternateActivity ? )

So given this, why do we allow nesting and alternatives for entities,
and not for activities?



I strongly recognize the need for the expression of subactivities -
but I am very afraid of all of these questions, and it is not like our
model is not getting complex enough already.

I would prefer to simply introduce it as a dcterms:hasPart (please,
don't use dc !) kind of notion with no particular interpretation
attached - it is simply a guide to the reader, like prov:alternateOf.
Perhaps prov:partOfActivity to avoid the implications of "sub"? (i.e.
are you allowed to be part of multiple activities? I think we should
not restrict that.)



It still raises the question about entities generated by both
activities and the generation-uniqueness constraint.

One way around it, as I've approached it for Taverna's workflow PROV,
is to use prov:alternateOf between two entities, one per
generation/invalidation. You can picture these entities as
representing "The value at output gate X" and "The value at output
gate Y" - almost like the old prov:EntityInRole. This is the same
reasoning by which a washed car coming out of the last-stage
activity(polishing), and thereby completing the activity(carWashing),
can be seen as generated twice, once as "polishedCar" and once as
"washedCar" - even though there is nothing happening between the two
activities and the two entities are equivalent.
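
As a rough illustration of this twin-entity pattern, the sketch below
writes the car wash example as plain Python tuples and checks that
each entity still has a single generating activity. The tuple layout,
the "car" identifier and the helper generation_is_unique are made up
for the example, and the check is a simplified reading of
generation-uniqueness.

```python
from collections import defaultdict

# Car wash example: one entity per generation, linked by alternateOf
statements = [
    ("wasGeneratedBy", "polishedCar", "polishing"),
    ("wasGeneratedBy", "washedCar", "carWashing"),
    ("alternateOf", "washedCar", "polishedCar"),
]

def generation_is_unique(stmts):
    """True if no entity is generated by more than one activity."""
    generators = defaultdict(set)
    for relation, entity, activity in stmts:
        if relation == "wasGeneratedBy":
            generators[entity].add(activity)
    return all(len(acts) == 1 for acts in generators.values())

twinned_ok = generation_is_unique(statements)  # True

# Without the twin, a single "car" entity is generated twice
untwinned = [
    ("wasGeneratedBy", "car", "polishing"),
    ("wasGeneratedBy", "car", "carWashing"),
]
untwinned_ok = generation_is_unique(untwinned)  # False
```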

If this is the recommended approach, then it would be good to have a
property to clarify that this is not just any odd alternate; say
prov:alternateInSubActivity (as a property on the prov:Entity or a
subproperty of prov:alternateOf). Otherwise it gets tricky to query
across the provenance, as we don't want to follow every odd alternate
up and down the trace. The strange thing here is that you don't *need*
to do the prov:alternateOf wrapping for usage or association. The
question then also becomes to what extent the subactivities should
always twin the entities.

I don't particularly like that "work around" approach for
subactivities, as it ends up making a verbose "twin world" with
alternate identifiers (which you have to mint) - effectively making an
inline bundle without clear boundaries.



The second way, much simpler and my preference, is to allow multiple
generation, but only as long as one activity is subactivity of the
other. I guess we can't infer which one is the sub and which one is
the super - so it would be a constraint rather than an inference, but
this gets tricky with the open world assumption and the use of OR/NOT.
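
A minimal Python sketch of that constraint, under assumptions of my
own: the function multiple_generation_allowed and the "painting"
activity are hypothetical, and the subactivity pairs are assumed to be
transitively closed already. The check is closed-world, in that it
treats a missing subactivity statement as a violation, which is
precisely the difficulty with the open world assumption.

```python
from itertools import combinations

def multiple_generation_allowed(generators, sub_of):
    """generators: activities that all generated the same entity.
    sub_of: set of (sub, super) pairs, assumed transitively closed.
    Returns True if every pair of generators is related by subactivity
    in one direction or the other."""
    return all(
        (a, b) in sub_of or (b, a) in sub_of
        for a, b in combinations(generators, 2)
    )

sub_of = {("polishing", "carWashing")}
nested = multiple_generation_allowed(["polishing", "carWashing"], sub_of)   # True
unrelated = multiple_generation_allowed(["polishing", "painting"], sub_of)  # False
```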

(This can be solved by adding a prov:alternateActivityFor as a
symmetric superproperty of prov:wasSubActivityOf; then, instead of
the constraint, we can simply infer prov:alternateActivityFor on
multiple generations. The semantics of prov:alternateActivityFor
would be particularly weak, similar to prov:alternateOf.)
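
The inference variant could be sketched in Python like this; the
function infer_alternate_activities and the input layout are
hypothetical, and the "receipt"/"payment" pair is invented to show an
entity with a single generation.

```python
from collections import defaultdict
from itertools import permutations

def infer_alternate_activities(generations):
    """generations: iterable of (entity, activity) pairs.
    Returns symmetric (activity, activity) pairs for distinct
    activities that generated the same entity."""
    by_entity = defaultdict(set)
    for entity, activity in generations:
        by_entity[entity].add(activity)
    inferred = set()
    for activities in by_entity.values():
        # Both orderings are emitted, since the relation is symmetric
        inferred.update(permutations(sorted(activities), 2))
    return inferred

pairs = infer_alternate_activities([
    ("car", "polishing"),
    ("car", "carWashing"),
    ("receipt", "payment"),
])
```

Nothing here says which activity is the sub and which is the super;
it only records that the two are related.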


This is indeed the approach we have taken for Wf4Ever's 'simplified'
workflow provenance model wfprov -
http://wf4ever.github.com/ro/#wfprov

Here wfprov:wasPartOfWorkflowRun is the workflow equivalent of
wasSubActivityOf, and the same artifact (i.e. entity) is allowed to be
wfprov:wasOutputFrom both the part and the enclosing workflow run.
Because of this, we currently can't make wfprov:wasOutputFrom a
subproperty of prov:wasGeneratedBy without violating PROV-Constraints.
As we don't want to make the model too verbose, we are trying to avoid
adding the equivalent of the prov:alternateOf workaround I sketched
above.


-- 
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester

Received on Thursday, 6 September 2012 10:12:30 UTC