- From: Graham Klyne <GK@ninebynine.org>
- Date: Fri, 05 Aug 2011 16:32:26 +0100
- To: "Myers, Jim" <MYERSJ4@rpi.edu>
- CC: Luc Moreau <L.Moreau@ecs.soton.ac.uk>, public-prov-wg@w3.org
Intuitively, I agree(?) that a derivation relation should stand independently of any particular account of a process execution.

...

What follows is probably completely irrelevant; feel free to ignore.

Is there a kind of dual to consider here? Accounts handle (roughly, as I understand them) process execution descriptions at different levels of granularity, and how they interact with entities. But I can also imagine different levels of granularity for entities. To take an extreme case:

  (the universe yesterday) isDerivedFrom (the big bang)
  (the universe today) isDerivedFrom (the universe yesterday)

which might be taken to entail:

  (the universe today) isDerivedFrom (the big bang)

Within such sweeping statements may be buried lesser truths:

  (me) isDerivedFrom (myFather), (myMother)

etc.

The question I'm stumbling toward is: if we need accounts to analyze process executions at different granularities, why do we not need something similar for entities?

(Maybe it's an endurant/perdurant thing?)

#g
--

Myers, Jim wrote:
> An account-dependent definition of derivation would be much less useful. Let me try to restate what I've been saying as a proposal that I think will capture our collective sense of derivation and that will lead to some clear consequences for the model.
>
> Derivation means that, independent of the granularity of the description of provenance, there exists (assuming the description is complete) a chain of used-PE-generated relationships between the two entities. That is, B isDerivedFrom A if there is a direct, directed, linear path between them (no breaks, no temporal zig-zags).
>
> A <---PE.1 <---X <---PE.2<---Y<---PE.3<----B  - derivation
>
> A<--------PE.1
>          /
>         X
>        /
>   PE.2<---B  - no derivation, though in some account A<---PE<---B (with used occurring after generation)
>
> A<---PE1<-----X.1
>                \
>                 PE2
>                  \
>                   X.2<--PE3<---B  - derivation
>
> A <----PE1 <--- X.1
>
>                 X.2<---PE2<----B  - no derivation, though in some account A<---PE1<---X<---PE2<---B (with no processes internal to X such that one part influences the other)
>
> The idea of this is consistent with our discussions - if A was used after B was generated, there is no such path between them, and an account with finer granularity of the PE could show it to be multiple subprocesses with A used by one that occurs after the one that generates B (there's no path, or the path has one or more links in the opposite temporal direction). The part/attribute cases also fit - going to finer granularity would show there's no such path.
>
> The consequences of this definition are that, without additional information (which we may want to add in the model - see below), derivation cannot be derived from a used-PE-generated structure in a given account, and derivation is not transitive.
>
> Since we have groups and use cases where either or both of these would be useful, we can ask what else is needed to allow them:
>
> To allow derivation to be inferable, we need to know the connectivity inside a PE. For a PE with m inputs and n outputs, this is an n x m matrix if we want full detail, but it may be sufficient for most cases to simply label the PE as fully connected or not. I would be inclined to make the default 'fully connected' and to allow a PE to be tagged as 'less-than-fully-connected'/'composite'/'decomposable' to stop inferencing about derivation. I think this default would be a good 80-90% solution while also allowing users/asserters to rigorously indicate when inference is not appropriate.
> Asserters would always be able to directly assert derivation and/or decompose partially connected PEs to allow the inferences that are valid while still indicating which ones are not.
>
> A similar mechanism would apply for Bobs and allow transitivity for derivation - a Bob that is fully connected (it is either one thing, or the parts of the thing interact: there are internal processes in B that, if expanded, would show that there is a path from C to A) is sufficient to support transitivity. Again, I would argue that the default might be 'fully-connected' since that is likely to be the most popular case, and one could label a Bob as 'composite'/'not well mixed'/'not fully integrated'/'decomposable' to stop transitivity.
>
> The benefits of this proposal are that I think it really captures the essence of what we all think of as derivation in a rigorous way that is not asserter/account dependent. It also focuses directly on the provenance graph versus trying to relate attributes of Bobs or the nature (as described in a recipe) of a PE.
>
> Choosing the defaults to be fully connected would open the door for incorrect assumptions given an open world - if you're missing the 'composite' label on a PE or Bob, you may infer derivation where there isn't any - these defaults would return the largest potential set of derivations given current knowledge. I suspect that this is actually the right way to 'err' - I'd rather have false positives than to flip the situation and be unable to find everything that something truly was derived from. If one inferred a derivation that looked odd, one would simply walk the chain of Bobs and PEs to see if there's evidence that one or more of them might not be fully connected (i.e. one could look in other accounts) along with checking to see if one or more of the provenance statements is simply wrong (A was not an input!).
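
The inference rule proposed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Account` class, its field names, and the single `composite` tag set are hypothetical and not part of any draft of the model, and the sketch ignores the temporal ordering of use and generation inside a PE. PEs and Bobs are treated as fully connected by default; tagging a PE as composite stops one-hop inference through it, and tagging a Bob as composite stops transitivity through it.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    """One account: used/generated edges plus hypothetical 'composite' tags."""
    used: set = field(default_factory=set)       # (pe, entity): pe used entity
    generated: set = field(default_factory=set)  # (entity, pe): entity was generated by pe
    composite: set = field(default_factory=set)  # PEs or Bobs tagged as not fully connected

    def one_hop(self, b):
        """Entities b is inferred to derive from through one fully connected PE."""
        sources = set()
        for entity, pe in self.generated:
            if entity == b and pe not in self.composite:
                sources |= {a for p, a in self.used if p == pe}
        return sources

    def derived_from(self, b):
        """All inferred sources of b; a composite Bob stops transitivity."""
        found, frontier = set(), [b]
        while frontier:
            current = frontier.pop()
            for a in self.one_hop(current):
                if a not in found:
                    found.add(a)
                    # Only continue through Bobs that are (by default) fully connected.
                    if a not in self.composite:
                        frontier.append(a)
        return found


# The first diagram above: A <-PE.1 <-X <-PE.2 <-Y <-PE.3 <-B
chain = Account(
    used={("PE.1", "A"), ("PE.2", "X"), ("PE.3", "Y")},
    generated={("X", "PE.1"), ("Y", "PE.2"), ("B", "PE.3")},
)
print(chain.derived_from("B"))  # contains A, X and Y; set printing order varies
```

With the defaults, B is inferred to derive from Y, X and A. Adding "X" to `composite` keeps X itself as a source but no longer reaches A, which matches the open-world caveat: a missing tag yields the largest candidate set of derivations rather than a smaller one.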
>
> I think one can define a useful equivalent for the distinction between isDerivedFrom and isDerivedFromMultipleSteps as we're talking about them today. I would suggest an account-dependent isDirectlyDerivedFrom(x, y, account z) relationship that would be limited to one hop in a specified account. This would really just be a convenience for scoping a query then and it would have a clear relationship to isDerivedFrom. isDirectlyDerivedFrom could be asserted and/or inferred from a single used-PE-generated structure (same definition as above). isDerivedFrom is always inferable from isDirectlyDerivedFrom. (Both the assertion and inference of isDirectlyDerivedFrom are account dependent, but the dependency is just about the "Direct" part - it would always be true in any account that if you have an isDirectlyDerivedFrom relationship in one account, there is an isDerivedFrom relationship between the same entities that is account-independent.) A consequence is that isDerivedFrom would just be the superset of isDirectlyDerivedFrom and the additional relationships one gets from transitivity. If the asserter's view of granularity fits yours, isDirectlyDerivedFrom would give you a useful subset of the overall derivation graph. If not, you just look directly at isDerivedFrom.
>
> Cheers,
> Jim
>
>> -----Original Message-----
>> From: public-prov-wg-request@w3.org [mailto:public-prov-wg-request@w3.org] On Behalf Of Luc Moreau
>> Sent: Friday, August 05, 2011 2:52 AM
>> To: public-prov-wg@w3.org
>> Subject: [Spam:****** SpamScore] Re: PROV-ISSUE-67 (single-execution): Why is there a difference in what is represented by one vs multiple executions? [Conceptual Model]
>>
>> Hi Simon,
>>
>> Your proposal is broadly in line with what I am currently drafting.
>>
>> Thanks for the name suggestion, which I will shamelessly borrow ;-)
>>
>> It is unclear to me at this stage, whether the first definition of derivation is dependent on account or not, but I made an explicit note about it in the draft document. This will have to be discussed again in the next iteration.
>>
>> Luc
>>
>>
>> On 04/08/11 18:12, Simon Miles wrote:
>>> Hi Luc,
>>>
>>> OK. I believe that the current definitions do not fully capture what I've understood from your mails, so if I was clarifying the document based on my current understanding, I would start by refining the definitions (and rearranging the existing text to fit):
>>>
>>> "That characterized thing B _is derived from_ another characterized thing A means that B is transformed from, created from, or affected by A. In particular, this means that the values of some attributes of B are at least partially determined by the values of some attributes of A.
>>>
>>> xxx (B, A) represents that B is derived from A, and if P is the process execution generating B by the account in which the derivation is asserted, then P is the execution which used A and derived B from it.
>>>
>>> yyy (B, A) represents that B is derived from A, by any means whether direct or convoluted, and regardless of any other assertion made.
>>>
>>> For the account in which yyy (A, B) is asserted to be consistent then, within that account, it is implied that either xxx (A, B) also holds or there are multiple process executions ultimately using B and generating A through a chain of use and generation relations."
>>>
>>> xxx is currently called isDerivedFrom and yyy is called isDerivedFromInMultipleSteps.
>>>
>>> I fear that xxx is impossible to understand properly without including accounts, and consistency within accounts, in the model. Once we introduce accounts, it then makes sense.
>>>
>>> Assuming we don't want to introduce accounts into the current draft, I propose something like the following:
>>>
>>>  - isDerivedFromInMultipleSteps (yyy) is renamed isEventuallyDerivedFrom
>>>  - isEventuallyDerivedFrom is defined as for yyy above, removing the paragraph below mentioning accounts until accounts are introduced
>>>  - isDerivedFrom (xxx) is excluded from the model until accounts are introduced
>>>  - isDerivedFrom+ is also excluded until accounts are introduced, as it depends on isDerivedFrom
>>>
>>> I don't like the proposal as it removes isDerivedFrom, but I can't see how we can define isDerivedFrom in a way which reflects your intention without introducing accounts. Otherwise, the implication that will be drawn (and has been by several people in discussing this issue) is that there is some implied notion of "atomic process executions".
>>>
>>> Thanks,
>>> Simon
>>>
>>> On 3 August 2011 22:56, Luc Moreau <L.Moreau@ecs.soton.ac.uk> wrote:
>>>
>>>> Hi Simon,
>>>>
>>>> It's good to see that we understand each other's definition of derivation. Given what you say about your notion of derivation, isn't it similar to isDerivedFromInMultipleSteps?
>>>>
>>>> I wonder whether we should find a better terminology for these relations.
>>>> Luc
>>>>
>>>> On 03/08/11 16:59, Simon Miles wrote:
>>>>
>>>>> Hi Luc,
>>>>>
>>>>> Sorry, just catching up with these mails. Your explanation helps a lot. In particular, I think the critical point which clarifies my confusion is the following:
>>>>>
>>>>>> Asserting that
>>>>>>   isDerivedFrom(report2, data2)
>>>>>> would be very different. It would mean that the process execution that generated report2 also used data2.
>>>>>
>>>>> I have always understood isDerivedFrom (A, B) as saying that "A was derived from B, regardless of any other assertion I make", which could be expressed as "there is a conceivably assertable process execution which used B and generated A".
>>>>>
>>>>> You are instead saying isDerivedFrom (A, B) means "A was derived from B, and if I assert A as being generated by a process execution, that was the execution which used B and led to A being derived from it". I agree these are semantically different. You are taking "use+generate" as fundamental, where "derived" implies a process which uses B and generates A takes place, so consistency within an account requires that the process which generates A is the same that is implied by derivation.
>>>>>
>>>>> I interpreted "derived" as fundamental itself and an independent assertion, so consistency in an account is given by this independence, i.e. by saying "derived" you are not implying a process in the same account anyway. And the independence of the assertion means that it does not even make sense to consider it in conjunction with the "generates" assertion (if it exists).
>>>>>
>>>>> thanks,
>>>>> Simon
>>>>>
>>>>> On 1 August 2011 23:59, Luc Moreau <L.Moreau@ecs.soton.ac.uk> wrote:
>>>>>
>>>>>> Hi Simon,
>>>>>>
>>>>>> That's a good example, thanks!
>>>>>>
>>>>>> Let me try and explain how I see it:
>>>>>>
>>>>>> With
>>>>>>
>>>>>>   isDerivedFrom (report1, data1)
>>>>>>
>>>>>> the asserter has a deep knowledge of the process execution that underpins this derivation. In particular, it is PE workflow1 that generates report1, and uses data1. Hence, both the generation event for report1 and the use event for data1 occur during workflow1.
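
The one-step reading described here can be written as a small consistency check over a single account. This is a rough sketch only: the function names and the pair-based encoding are hypothetical, and the data is the report1/report2 example quoted further down in this thread. It assumes, as stated elsewhere in the thread, that a BOB is generated by at most one process execution within an account.

```python
def generating_pe(generated, entity):
    """Return the single PE that generated entity in this account, or None."""
    pes = {pe for e, pe in generated if e == entity}
    if len(pes) > 1:
        # The model allows at most one generating PE per BOB in an account.
        raise ValueError(f"{entity} is generated by more than one PE: {sorted(pes)}")
    return next(iter(pes), None)


def one_step_consistent(generated, used, b, a):
    """Check the reading 'the PE that generated b is the one that used a'.

    If no generation of b is asserted, the derivation is vacuously consistent.
    """
    pe = generating_pe(generated, b)
    return pe is None or (pe, a) in used


# Assertions from the report1 / report2 example, with roles and times omitted.
generated = {
    ("report1", "workflow1"),
    ("unpublished2", "workflow1.1"),
    ("report2", "workflow1.2"),
}
used = {
    ("workflow1", "data1"),
    ("workflow1.1", "data2"),
    ("workflow1.2", "unpublished2"),
}

print(one_step_consistent(generated, used, "report1", "data1"))  # True
print(one_step_consistent(generated, used, "report2", "data2"))  # False
```

Under this reading isDerivedFrom (report2, data2) fails the check because workflow1.2, which generated report2, did not use data2, whereas isDerivedFrom (report2, unpublished2) passes.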
>>>>>>
>>>>>> In the provenance challenge, when you were using slicing techniques to extract derivations from process definitions, I would argue you were generating similar derivations.
>>>>>>
>>>>>> With
>>>>>>
>>>>>>   isDerivedFromInMultipleSteps (report2, data2)
>>>>>>
>>>>>> the asserter is much less precise, and does not state whether a single process is involved for generation/use, and which interval they occur in.
>>>>>>
>>>>>> Furthermore, in this example, with the provenance given, one cannot ascertain whether 'unpublished2' is in the derivation path between report2 and data2.
>>>>>>
>>>>>> A stronger provenance would have been
>>>>>>
>>>>>>   isDerivedFrom (report2, unpublished2)
>>>>>>   isDerivedFrom (unpublished2, data2)
>>>>>>
>>>>>> from which we can infer by transitive closure
>>>>>>
>>>>>>   isDerivedFrom+ (report2, data2)
>>>>>>
>>>>>> So, to me,
>>>>>> 1. isDerivedFrom is fundamental in the model, and requires deep/precise knowledge of process executions.
>>>>>> 2. isDerivedFrom+ is useful inference (transitive closure).
>>>>>> 3. isDerivedFromInMultipleSteps is convenience assertion, but not as precise as 1&2.
>>>>>>
>>>>>> We could drop 3, but then, you wouldn't be able to express your second example.
>>>>>>
>>>>>> Asserting that
>>>>>>   isDerivedFrom(report2, data2)
>>>>>> would be very different. It would mean that the process execution that generated report2 also used data2.
>>>>>>
>>>>>> So,
>>>>>>
>>>>>>   used (workflow1.2, data2, r) for some role r.
>>>>>>
>>>>>> But that's not the intent.
>>>>>>
>>>>>> What do you think?
>>>>>> Regards,
>>>>>> Luc
>>>>>>
>>>>>> On 01/08/11 16:53, Simon Miles wrote:
>>>>>>
>>>>>>> Hi Luc,
>>>>>>>
>>>>>>> OK. Here's my stab at a motivating example.
>>>>>>>
>>>>>>> An organisation, Org, wants to use the WG standard to record and provide access to provenance data on the documents it makes available online to its clients. It has storage limits on the provenance it can maintain.
>>>>>>>
>>>>>>> Alice regularly receives government data sets and for each, creates a report which is published online. Looking for a minimal way to express this using PIL, Org decides on one BOB for each data set, one for each report, one process representing the create-and-publish workflow, and a derivation link to show that the report is based on the data set. A given instance of this, for one data set, is:
>>>>>>>
>>>>>>>   bob (data1, [ type: "File", location: "/shared/crime1.data" ])
>>>>>>>   bob (report1, [ type: "File", location: "http://example.com/report1.pdf", creator: "Alice" ])
>>>>>>>   processExecution (workflow1, create-and-publish, t)
>>>>>>>   isGeneratedBy (report1, workflow1, out)
>>>>>>>   used (workflow1, data1, in)
>>>>>>>   isDerivedFrom (report1, data1)
>>>>>>>
>>>>>>> A client, Clive, finds a mistake in report1, looks at the provenance and, being "creator", Alice gets the blame. However, the error is actually due to Bob, who published Alice's report, messing up the axes on a graph. To avoid Alice's anger, Org agrees to refine what is modelled to a finer granularity: create, then publish. As they have storage constraints, they will make available only one granularity of provenance information, and use this finer granularity only for subsequent reports.
>>>>>>> A given instance would be:
>>>>>>>
>>>>>>>   bob (data2, [ type: "File", location: "/shared/crime2.data" ])
>>>>>>>   bob (unpublished2, [ type: "File", location: "/shared/report2.pdf", creator: "Alice" ])
>>>>>>>   bob (report2, [ type: "File", location: "http://example.com/report2.pdf", creator: "Alice", publisher: "Bob" ])
>>>>>>>   processExecution (workflow1.1, create, t)
>>>>>>>   processExecution (workflow1.2, publish, t+4)
>>>>>>>   isGeneratedBy (unpublished2, workflow1.1, out)
>>>>>>>   isGeneratedBy (report2, workflow1.2, out)
>>>>>>>   used (workflow1.1, data2, in)
>>>>>>>   used (workflow1.2, unpublished2, in)
>>>>>>>   isDerivedFromInMultipleSteps (report2, data2)
>>>>>>>
>>>>>>> Clive queries to find out what data sets the reports available are derived from. He finds that while report1 is derived from data1 in one step (isDerivedFrom), report2 is derived from data2 in multiple steps (isDerivedFromInMultipleSteps). He (like me) does not understand how he should interpret the distinction between the two. There is apparently something different in the way that report2 is related to data2 compared to how report1 is derived from data1, and possibly he should trust report2 less because of this indirect link to its source data. But Org is adamant that nothing has changed in their procedures, and there is no distinction.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Simon
>>>>>>>
>>>>>>> On 1 August 2011 12:15, Luc Moreau <L.Moreau@ecs.soton.ac.uk> wrote:
>>>>>>>
>>>>>>>> Hi Simon,
>>>>>>>>
>>>>>>>> Sorry, but I don't understand. Your initial example was not valid because you had two PEs generating a single BOB.
>>>>>>>>
>>>>>>>> If they are different ways of describing something happening in the world, I assume that you will identify different activities, and hence multiple process executions will be asserted.
>>>>>>>>
>>>>>>>> Can you reformulate an example that illustrates your concern?
>>>>>>>>
>>>>>>>> Luc
>>>>>>>>
>>>>>>>> On 08/01/2011 12:02 PM, Simon Miles wrote:
>>>>>>>>
>>>>>>>>> Hi Luc,
>>>>>>>>>
>>>>>>>>> I follow your argument, but it seems tangential to my point. The following argument still seems inevitably true to me:
>>>>>>>>>
>>>>>>>>> Activity in the world that uses one BOB and generates another *can* be described in PIL as multiple process executions or a single process execution (regardless of whether it actually is described in these different ways or not, or whether accounts are required or not).
>>>>>>>>>
>>>>>>>>> Therefore, what one process execution denotes is not distinct from what multiple process executions denote; we have just provided more detail in the latter description (and this detail is, in any case, removed when saying "is derived from").
>>>>>>>>>
>>>>>>>>> Therefore, isDerivedFrom and isDerivedFromInMultipleSteps as defined do not describe anything different in the world, so we have two terms for representing the same thing.
>>>>>>>>>
>>>>>>>>> I know that we've debated this or similar before, but it is still not clear to me where the fault lies in my argument, or what isDerivedFromInMultipleSteps really represents. If it's only me that's confused, I understand there are more urgent concerns (though I'd still like to understand).
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Simon
>>>>>>>>>
>>>>>>>>> On 1 August 2011 09:25, Luc Moreau <L.Moreau@ecs.soton.ac.uk> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Simon,
>>>>>>>>>>
>>>>>>>>>> If I understand you correctly, you are suggesting that the following two assertions hold together.
>>>>>>>>>>
>>>>>>>>>>   isGeneratedBy(e5,pe5,out)
>>>>>>>>>>   isGeneratedBy(e5,pe4,out)
>>>>>>>>>>
>>>>>>>>>> But this is not legal, since it is stated that one BOB is generated by at most one process execution.
>>>>>>>>>>
>>>>>>>>>> What you are suggesting should be encoded in a separate account (though we have not defined this yet!). A one-step derivation then expands to one process execution in a given account. In a separate account, there may be a multi-step derivation between the same two BOBs and it would expand into multiple process executions.
>>>>>>>>>>
>>>>>>>>>> Does it make sense?
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Luc
>>>>>>>>>>
>>>>>>>>>> On 07/29/2011 05:52 PM, Provenance Working Group Issue Tracker wrote:
>>>>>>>>>>
>>>>>>>>>>> PROV-ISSUE-67 (single-execution): Why is there a difference in what is represented by one vs multiple executions? [Conceptual Model]
>>>>>>>>>>>
>>>>>>>>>>> http://www.w3.org/2011/prov/track/issues/67
>>>>>>>>>>>
>>>>>>>>>>> Raised by: Simon Miles
>>>>>>>>>>> On product: Conceptual Model
>>>>>>>>>>>
>>>>>>>>>>> By the definition, "a process execution represents an identifiable activity". This does not seem to preclude one process execution assertion denoting, at a coarse granularity, the same events in the world denoted by multiple process executions in other assertions.
>>>>>>>>>>>
>>>>>>>>>>> If so, then in the File Scenario example, I could add a coarse-grained process execution representing the whole e1-to-e5 activity:
>>>>>>>>>>>
>>>>>>>>>>>   processExecution(pe5,collaboratively-edit,t)
>>>>>>>>>>>   uses(pe5,e1,in)
>>>>>>>>>>>   isGeneratedBy(e5,pe5,out)
>>>>>>>>>>>
>>>>>>>>>>> But then Section 5.5.2 distinguishes between "a single process execution" and "one or more process executions".
>>>>>>>>>>> Following the argument above, these could represent exactly the same occurrences in the world.
>>>>>>>>>>>
>>>>>>>>>>> So there is no difference between what is denoted by one and multiple process executions, and so no difference between isDerivedFrom and isDerivedFromInMultipleSteps as described. Whether e5 was derived from e1 appears to me to be entirely independent of how many process executions were involved.
>>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Professor Luc Moreau
>>>>>>>>>> Electronics and Computer Science   tel:   +44 23 8059 4487
>>>>>>>>>> University of Southampton          fax:   +44 23 8059 2865
>>>>>>>>>> Southampton SO17 1BJ               email: l.moreau@ecs.soton.ac.uk
>>>>>>>>>> United Kingdom                     http://www.ecs.soton.ac.uk/~lavm
>>>>>>>>>>
>>>>>>>>>> ______________________________________________________________________
>>>>>>>>>> This email has been scanned by the MessageLabs Email Security System.
>>>>>>>>>> For more information please visit http://www.messagelabs.com/email
>>>>>>>>>> ______________________________________________________________________
Received on Friday, 5 August 2011 15:54:25 UTC