Re: PROV-ISSUE-26 (uses and generates questions): How can one figure out the provenance of a given entity? from Luc Moreau on 2011-08-05 (public-prov-wg@w3.org from August 2011)

From: Luc Moreau <L.Moreau@ecs.soton.ac.uk>
Date: Fri, 05 Aug 2011 07:50:03 +0100
To: public-prov-wg@w3.org
Message-ID: <EMEW3|068c2b2089391eeefb0e3d19b088a2c1n747o908L.Moreau|ecs.soton.ac.uk|4E3B929B>
Hi Paulo,

Many of these issues are being discussed in PROV-ISSUE-67.

In particular, Simon raised the issue of account. You need to
check the revised version of the document on Monday, which
will contain a revised presentation of derivation.

It is unclear to me at this stage whether the definition of derivation
is dependent on account or not, but I made an explicit note about
it in the draft document.

There seems to be a desire for "short-cuts" for derivation.
Somebody may want to elaborate a proposal!

I can see a shortcoming in your option A: a given input may be the cause
of several outputs, and several input-output pairs may correspond to
different derivations. So, it is not clear to me how you would encode all
that with roles.
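
To make the concern with option A concrete, here is a minimal Python
sketch (an illustration only, not something from the draft). It assumes a
second, hypothetical output 'd' and that both inputs carry the 'derive'
role; the helper name derivations_from_roles is likewise made up for the
example.

```python
from itertools import product

# Option A statements: a shared 'derive' role replaces isDerivedFrom.
# The second output 'd' is a hypothetical addition for illustration.
uses = [("pe", "a", "derive"), ("pe", "b", "derive")]
is_generated_by = [("c", "pe", "derive"), ("d", "pe", "derive")]

def derivations_from_roles(uses, is_generated_by):
    """Return every (output, input) pair the 'derive' role alone permits."""
    ins = [(pe, e) for pe, e, role in uses if role == "derive"]
    outs = [(e, pe) for e, pe, role in is_generated_by if role == "derive"]
    return {
        (out, inp)
        for (out, pe_out), (pe_in, inp) in product(outs, ins)
        if pe_out == pe_in
    }

# Suppose the derivations we actually meant were c from a, and d from b.
intended = {("c", "a"), ("d", "b")}
candidates = derivations_from_roles(uses, is_generated_by)
spurious = candidates - intended
print(sorted(spurious))
```

With only the shared role to go on, the pairs (c, b) and (d, a) cannot be
told apart from the intended ones.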

Annotating PE seems more promising (option B). But we need to think about
the cardinality of inputs/outputs. Does this mean that each output is
derived from each input?
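
A minimal Python sketch of the cardinality question follows. It is only an
illustration, under the reading that in option B every 'uses' paired with
every 'isGeneratedBy' on the same pe implies a derivation. The extra
input 'x', the extra output 'd', and the function name
implied_derivations are hypothetical.

```python
# Option B statements: 'uses' and 'isGeneratedBy' imply derivation,
# while 'annotates' records a parameter that plays no part in it.
# The input 'x' and the output 'd' are hypothetical additions.
uses = [("pe", "a"), ("pe", "x")]
annotates = [("pe", "b", "r_b")]
is_generated_by = [("c", "pe"), ("d", "pe")]

def implied_derivations(uses, is_generated_by):
    """Pair each output of a pe with each input used by that same pe."""
    return sorted(
        (out, inp)
        for out, pe_out in is_generated_by
        for pe_in, inp in uses
        if pe_in == pe_out
    )

derived = implied_derivations(uses, is_generated_by)
print(derived)
```

Under this reading, two inputs and two outputs give four derivations, and
'b' appears in none of them because it is only annotated.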

Best regards,
Luc

On 05/08/11 06:30, Paulo Pinheiro da Silva wrote:
> Hi Luc,
>
> Please see my comments in-line below:
>> - I assume you mean can we infer that c was derived by the process
>> execution
>>
>>      Yes, this is explained in the document, and further refined in the
>> soon-to-be-released new version.
>>       Only one pe can generate c (in one account).
>>       And from a derivation from c to a, one can infer the existence
>> of a pe which generated c and used a.
>
> Yes, this explains a lot!
>
> I understand that the model must be able to represent that a 
> derivation from 'a' to 'c' occurred through a process execution and 
> that the process execution was indeed the one called 'pe'. The fact 
> that the document explains the inference above appears to support the 
> need for such description.
>
> From your message, I see that one cannot derive that 'pe' was the 
> process execution that derived 'c' without the use of accounts -- and 
> I do not recall any group discussion of what is an account. So, this 
> suggests that we are not following the proper concept dependencies to 
> discuss these provenance concepts in a logical way -- can you see my 
> point?
>
> I further understand that the model not only relies on accounts but
> also relies on the restriction that "an entity can only be generated
> by one process execution" to be able to infer in our example that
> 'pe' was the process execution that derived c. I would strongly favor
> the adoption of constructs that are explicitly capable of stating
> relationships between data derivations and process executions.
>
> Going back to the example (I numbered the statements to facilitate the 
> conversation):
>
> 1. uses(pe, a, r_a)
> 2. uses(pe, b, r_b)
> 3. isGeneratedBy(c,pe,r_c)
> 4. isDerivedFrom(c,a)
>
>
> I understand that most of this conversation is in support of the need
> to represent that 'pe' has an input parameter 'b' that is not used
> to derive 'c' (and I am using the closed-world assumption to infer that
> 'c' was not derived from 'b' -- is this correct?). Do we really need
> all this added complexity for every single derivation encoding to
> say that 'pe' has this additional parameter that does not affect the
> final product of the process execution? I would further claim that
> most process execution inputs and outputs in real life would not
> include entities that are not involved in derivations. There are many
> things that we can do to simplify this model:
>
> Option A: To formalize a 'derive' role that can be used both in 'uses' 
> and 'isGeneratedBy' and to drop (4)
>
> uses (pe, a, derive)
> uses (pe, b, r_b)
> isGeneratedBy(c, pe, derive)
>
> Option B: To assume that 'uses' and 'isGeneratedBy' implies derivation 
> and to add a new relationship to explicitly annotate processes 
> including the use of roles
>
> uses (pe, a)
> annotates (pe, b, r_b)
> isGeneratedBy(c, pe)
>
> In this case, we could swap the positions of 'pe' and b in case 'b' 
> was an output of 'pe'.
>
> Both options would significantly reduce most of the diagrams we have
> built so far, which means less work for the specification of
> provenance, without losing a single bit of information. Moreover,
> our definitions of 'uses' and 'isGeneratedBy' would stand on their own 
> without the need of accounts or the enforcement of restrictions such 
> as that 'c' can only be generated by 'pe' (I also have lots of things 
> to discuss in terms of this restriction in case we decide to keep the 
> current approach).
>
> I am not saying that we only have options A and B (or even that 
> options A and B are correct). We may have other options and I am just 
> proposing A and B to demonstrate that there are other ways of
> representing provenance that may be more beneficial than the current 
> approach.
>
> Many thanks,
> Paulo.
>
>> I hope it helps,
>> Cheers,
>> Luc
>>
>> On 07/07/11 15:50, Provenance Working Group Issue Tracker wrote:
>> >  PROV-ISSUE-26 (uses and generates questions): How can one figure
>> >  out the provenance of a given entity?
>> >
>> >  http://www.w3.org/2011/prov/track/issues/26
>> >
>> >  Raised by: Paulo Pinheiro da Silva
>> >  On product:
>> >
>> >  Context:
>> >  1. P uses A
>> >  2. P uses B
>> >  3. P generates C
>> >  4. C derived from A
>> >
>> >  If the provenance of C is the concern of a user of C (as opposed
>> >  to the provenance of a process that generates C), one may have the
>> >  following questions:
>> >
>> >  1) What are the “uses” and “generates” relationships adding to
>> >  one’s understanding of C if something is wrong with C?
>> >  2) Can we infer that A was derived by the execution of process P?
>> >  How?
>> >
>
Received on Friday, 5 August 2011 06:50:38 UTC