Derivation (again) from Graham Klyne on 2012-03-15 (public-prov-wg@w3.org from March 2012)

From: Graham Klyne <graham.klyne@zoo.ox.ac.uk>
Date: Thu, 15 Mar 2012 17:20:25 +0000
To: W3C provenance WG <public-prov-wg@w3.org>
Message-ID: <4F6224D9.80109@zoo.ox.ac.uk>

I've been thinking, on and off, about our discussions about derivation last week.

I now understand that there is an assertion:

   wasDerivedFrom(generated, used)

that cannot be inferred from assertions of the form:

   wasGeneratedBy(_, generated, activity, _,_ )
   used(_, activity, used, _)

The current approach is to introduce a more detailed form of derivation:

   wasDerivedFrom(generated, used, activity, generationevent, usageevent)

It seems to me that this more detailed notion of derivation, which AFAICT 
overlaps a lot of existing expressive capability, is required to make up for a 
lack of sufficiently fine-grained description of the activities.

For a concrete example, consider an audio-visual conversion process with two 
inputs and two outputs:

activity(avconversion)
entity(audio_in)
entity(video_in)
entity(audio_out)
entity(video_out)
used(_,avconversion,audio_in,[port="ain"])
used(_,avconversion,video_in,[port="vin"])
wasGeneratedBy(_,audio_out,avconversion,[port="aout"])
wasGeneratedBy(_,video_out,avconversion,[port="vout"])

Our problem is that we can't tell if video_out is derived from (or in any way 
affected by) audio_in.  If the conversion includes speech-to-text overlay on the 
video, it might be.  But if it's a simple AV amplifier, it probably isn't.
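
To make the problem concrete, here is a rough Python sketch (the tuple layout 
and the function name naive_derivations are my own assumptions for 
illustration, not part of the data model).  Pairing every output of an activity 
with every input of the same activity is all that the used and wasGeneratedBy 
statements above support, and it cannot separate the audio and video paths:

```python
# Hypothetical encoding of the example statements above.
# used: (activity, entity, attributes)
used = [
    ("avconversion", "audio_in", {"port": "ain"}),
    ("avconversion", "video_in", {"port": "vin"}),
]
# wasGeneratedBy: (entity, activity, attributes)
was_generated_by = [
    ("audio_out", "avconversion", {"port": "aout"}),
    ("video_out", "avconversion", {"port": "vout"}),
]


def naive_derivations(used, was_generated_by):
    """Pair every output of an activity with every input of that activity."""
    return sorted(
        (out, inp)
        for out, gen_act, _ in was_generated_by
        for act, inp, _ in used
        if gen_act == act
    )


candidates = naive_derivations(used, was_generated_by)
for generated, consumed in candidates:
    print(f"possibly wasDerivedFrom({generated}, {consumed})")
```

All four output/input pairs come out as candidates, including 
(video_out, audio_in), which may or may not be a real derivation.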

I think what we really want is a way to describe the avconversion process in 
finer detail.  What options do we have?

1. There's the current proposal, the extended wasDerivedFrom relationship. But 
that only tells us about a particular set of inputs and outputs, and has to be 
repeated for each invocation of the process "plan".  This is consistent with the 
provenance goals in isolation, but for workflow analysis it could be useful to 
be able to query the workflow to find more.  And it somewhat complicates the 
direct notion of derivation, and overlaps significantly with other statements 
like used and wasGeneratedBy.

2. Maybe a way to link generation and usage events.  The example might be 
expanded thus:

   wasGeneratedBy(video_generation, video_out, avconversion, [port="vout"])
   used(video_usage, avconversion, video_in, [port="vin"])
   wasInfluencedBy(video_generation, video_usage)

3. A way to connect generation and usage events through attribute values:

   wasInfluencedBy(avconversion, [port="vout"], [port="vin"])

saying that an event associated with process avconversion with [port="vin"] can 
be considered to influence any other event with [port="vout"].  This is a 
slightly more indirect variant of the previous option.

4. A way to describe the internal workings of the plan that drives the activity.  
This requires us to invoke an activity association:

   wasAssociatedWith(_, avconversion, _, avdesign, _)
   planConnects(avdesign, [port="vout"], [port="vin"])

which is saying that any activity based on the plan 'avdesign' connects 
generation events having [port="vout"] with usage events having [port="vin"].
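
As a rough sketch of how options 3 and 4 could reduce to option 2, the 
following Python matches events against the plan's port connection and emits 
the event-level influence pairs.  Everything here is an assumption made for 
illustration: the tuple layouts, the helper names, the plan_of mapping standing 
in for wasAssociatedWith, and the audio_generation and audio_usage identifiers.

```python
# Generation events: (event_id, entity, activity, attributes)
generations = [
    ("audio_generation", "audio_out", "avconversion", {"port": "aout"}),
    ("video_generation", "video_out", "avconversion", {"port": "vout"}),
]
# Usage events: (event_id, activity, entity, attributes)
usages = [
    ("audio_usage", "avconversion", "audio_in", {"port": "ain"}),
    ("video_usage", "avconversion", "video_in", {"port": "vin"}),
]
# wasAssociatedWith(_, avconversion, _, avdesign, _), reduced to activity -> plan
plan_of = {"avconversion": "avdesign"}
# planConnects(avdesign, [port="vout"], [port="vin"])
plan_connects = [("avdesign", {"port": "vout"}, {"port": "vin"})]


def matches(attrs, pattern):
    """True if every attribute in the pattern has the same value in attrs."""
    return all(attrs.get(key) == value for key, value in pattern.items())


def influences_from_plan(generations, usages, plan_of, plan_connects):
    """Return (generation_event, usage_event) pairs implied by the plan."""
    result = set()
    for plan, gen_pattern, use_pattern in plan_connects:
        for g_id, _, g_act, g_attrs in generations:
            if plan_of.get(g_act) != plan or not matches(g_attrs, gen_pattern):
                continue
            for u_id, u_act, _, u_attrs in usages:
                if u_act == g_act and matches(u_attrs, use_pattern):
                    result.add((g_id, u_id))
    return result


influences = influences_from_plan(generations, usages, plan_of, plan_connects)
for g_id, u_id in sorted(influences):
    print(f"wasInfluencedBy({g_id}, {u_id})")
```

Only the video generation and usage events are linked; the audio events are 
left unconnected because the plan says nothing about their ports.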

??? why does an association have an identifier?

...

Discussion

Intuitively, for me, case 4 is the underlying information that is being 
expressed by the extended derivation assertion.  But it does introduce a lot of 
conceptual complexity that is not otherwise needed.

I think the simplest solution is case 2, which provides a direct connection 
between usage and generation events (and which might be inferred from 
information about the plan and the event attributes.)

I claim that the three primitive assertions in case 2 could be sufficient to infer:

   wasDerivedFrom(video_out, video_in)
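
The inference I have in mind can be sketched in Python as follows.  This is 
only an illustration under my own assumptions: the tuple layouts, the function 
name infer_derivations, and the extra audio_usage event (added to show that an 
unlinked input yields no derivation) are not part of any agreed model.

```python
# Generation events: (event_id, entity, activity)
generations = [("video_generation", "video_out", "avconversion")]
# Usage events: (event_id, activity, entity)
usages = [
    ("video_usage", "avconversion", "video_in"),
    ("audio_usage", "avconversion", "audio_in"),  # hypothetical extra event
]
# wasInfluencedBy(generation_event, usage_event)
influences = {("video_generation", "video_usage")}


def infer_derivations(generations, usages, influences):
    """Infer (generated, used) pairs whose events share an activity and are
    explicitly linked by an influence between the two events."""
    derived = set()
    for g_id, generated, g_act in generations:
        for u_id, u_act, consumed in usages:
            if g_act == u_act and (g_id, u_id) in influences:
                derived.add((generated, consumed))
    return derived


derived = infer_derivations(generations, usages, influences)
for generated, consumed in sorted(derived):
    print(f"wasDerivedFrom({generated}, {consumed})")
```

With the three case 2 assertions as input, the only pair produced is 
(video_out, video_in); audio_in is used by the same activity but is not linked 
to the video generation event, so no derivation from it is inferred.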

#g
--
Received on Thursday, 15 March 2012 17:21:56 UTC