RE: playing with pil ontology

Collections - as a mechanism to support integration of provenance describing different levels of entity granularity - seem to me to be very distinct from an account/provenance container - a mechanism to reify a set of provenance statements (to assert that the set of statements should be considered a pil:entity) and enable their provenance to be described.

For the former (generic collection/aggregation), I basically want an ability to relate an aggregate entity (a collection) to its constituent entities - if A is a list generated by one PE and B is one number from that list that is used in another PE, I want to be able to say B is a partOf A so I can follow the overall history.
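As a rough illustration of following history across a partOf link (Python; the dictionaries and the function name history are hypothetical and not PIL terms), a traversal could climb from the part B to its enclosing collection A to find the PE that generated it:

```python
# Hypothetical sketch: relation and entity names are illustrative, not PIL terms.
part_of = {"B": "A"}          # B (one number) is partOf A (the list)
generated_by = {"A": "PE1"}   # A was generated by process execution PE1
used = {"PE2": ["B"]}         # PE2 used B


def history(process_execution):
    """Return the process executions upstream of the given one.

    For every entity the execution used, look at the entity itself and then
    at each collection it is part of, collecting whatever generated them.
    """
    upstream = []
    for entity in used.get(process_execution, []):
        current = entity
        while current is not None:
            producer = generated_by.get(current)
            if producer is not None and producer not in upstream:
                upstream.append(producer)
                # Continue further back from the producer as well.
                upstream.extend(
                    p for p in history(producer) if p not in upstream
                )
            # Climb from the part to its enclosing collection, if any.
            current = part_of.get(current)
    return upstream


print(history("PE2"))
```

Without the partOf entry, nothing connects B to PE1 and the history of PE2 would come back empty.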

For the latter (account/provenance container), I am not interested in being able to make such a connection between one provenance assertion and a set of them. Rather, as you say, I want to associate an asserter with a set of provenance statements and perhaps describe the process by which that set was created, transmitted, signed and verified, inferred from other information, etc. An entity-to-entity 'partOf' relationship plays no role here. I want a URI for the set of provenance statements and to be able to assert that it was generated/used by PEs and derived from other entities. Further, if cryptographic signatures are going to work, we need to define a canonical byte-level serialization of an account - something that makes little sense for a generic aggregation mechanism. (As being discussed in other emails, this sounds a lot like a NamedGraph, which I think is clearly quite different from a general rdf/owl collection concept...)
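To make the signature point concrete, here is a minimal sketch (Python) of why a canonical byte-level form matters: the same set of statements must always produce the same bytes before it can be hashed or signed. The triple format, the ex: names, and the sort-the-lines approach are assumptions for illustration only; a real RDF canonicalization would also have to deal with blank nodes, literal forms, and so on.

```python
import hashlib


def canonical_bytes(statements):
    """Serialize (subject, predicate, object) triples deterministically.

    Sorting the rendered lines is a stand-in for a real canonicalization
    scheme; it only removes dependence on statement order.
    """
    lines = sorted(f"{s} {p} {o} ." for s, p, o in statements)
    return "\n".join(lines).encode("utf-8")


def account_digest(statements):
    """Hash the canonical form; a signature would be computed over this."""
    return hashlib.sha256(canonical_bytes(statements)).hexdigest()


# Hypothetical account: two provenance statements with made-up identifiers.
account = [
    ("ex:B", "ex:partOf", "ex:A"),
    ("ex:A", "ex:generatedBy", "ex:PE1"),
]
reordered = list(reversed(account))

# Same statements in a different order give the same digest.
print(account_digest(account) == account_digest(reordered))
```

A generic collection has no need for this: its members are arbitrary entities, not statements whose exact bytes must be reproducible.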

Jim






From: Satya Sahoo [mailto:satya.sahoo@case.edu]
Sent: Tuesday, August 16, 2011 1:25 PM
To: Myers, Jim
Cc: Deus, Helena; Khalid Belhajjame; public-prov-wg@w3.org
Subject: Re: playing with pil ontology

Hi Jim,
> I don't think we've distinguished provenance container and account at this point - they are an entity which contains provenance statements and are used to enable you to talk about how the provenance was created (what processes and inputs caused those statements to be), but collection has been discussed as a general aggregate entity/container - a bag of marbles is an entity and saying a process execution used it is shorthand for talking about the individual marbles. A file is a collection of bytes and a process execution may only use some of the bytes, etc.

Agreed, and I would add that the distinction between container/collection and account is also not very clear to me. The conceptual document states that an account "should" have an associated asserter, but given that PIL is an assertion language, it is implicit that one or more provenance assertions have asserter(s).

If the information about the asserter is present, it helps in understanding the provenance assertions better (judging them by the quality of the asserter - trusted/untrusted etc.). But I think we can collapse all three concepts into a single concept of collection and leave it to individual applications to explicitly associate an asserter with the collection (in many cases there may not be enough information to uniquely identify an asserter, but it is still a provenance view/perspective/account).

Thoughts?

> Re: roles - I would argue that you should use something quite specific for the role of your temperature parameter, e.g. "processingtemperaturesetpoint" rather than a generic "input" or "inputParameter" role (parameter might still be a supertype of processingtemperaturesetpoint).
Agreed, it is up to the application to define/model the level of detail it requires.

Thanks.

Best,
Satya



On Mon, Aug 15, 2011 at 11:57 AM, Myers, Jim <MYERSJ4@rpi.edu> wrote:
A couple quick comments: I don't think we've distinguished provenance container and account at this point - they are an entity which contains provenance statements and are used to enable you to talk about how the provenance was created (what processes and inputs caused those statements to be), but collection has been discussed as a general aggregate entity/container - a bag of marbles is an entity and saying a process execution used it is shorthand for talking about the individual marbles. A file is a collection of bytes and a process execution may only use some of the bytes, etc.

Re: roles - I would argue that you should use something quite specific for the role of your temperature parameter, e.g. "processingtemperaturesetpoint" rather than a generic "input" or "inputParameter" role (parameter might still be a supertype of processingtemperaturesetpoint). This would be necessary if, for example, your process execution had a reaction temperature and a storage temperature as inputs - now you have two numbers/two temperatures and you have to use each in the correct role for the provenance to be correct. In many cases, you could potentially describe the type of the entity itself well enough to make the provenance clear, but putting the information into the entity typing rather than into the role it has relative to the process execution causes trouble if you use the entity in multiple processes (if I make an entity that is of type "processingtemperaturesetpoint" and I have a second process that displays a "printablenumber" that uses it as input, the same entity can't also be of type "printablenumber" - better to make the entity have type number and play a "processingtemperaturesetpoint" role in one process and the "printablenumber" role in the other).
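A small sketch of this modelling choice (Python; the class names Entity and Usage, the "storagetemperature" role, the process names, and the values are all hypothetical): the entity keeps a generic type, and the role lives on the usage that links it to a process execution, so one entity can play different roles in different processes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    identifier: str
    entity_type: str  # kept generic, e.g. "number"
    value: float


@dataclass(frozen=True)
class Usage:
    process_execution: str
    entity: Entity
    role: str  # specific to this entity's part in this process execution


# Two plain numbers; nothing in their type says how they are used.
setpoint = Entity("e1", "number", 15.0)
storage = Entity("e2", "number", 4.0)

usages = [
    Usage("reaction", setpoint, "processingtemperaturesetpoint"),
    Usage("reaction", storage, "storagetemperature"),
    Usage("display", setpoint, "printablenumber"),
]


def roles_of(entity, usages):
    """Map each process execution to the role the entity played in it."""
    return {u.process_execution: u.role for u in usages if u.entity == entity}


print(roles_of(setpoint, usages))
```

The two temperatures used by the same process stay distinguishable by role, and the setpoint can be reused by the display process without being retyped.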

Jim

From: public-prov-wg-request@w3.org [mailto:public-prov-wg-request@w3.org] On Behalf Of Satya Sahoo
Sent: Monday, August 15, 2011 11:02 AM
To: Deus, Helena
Cc: Khalid Belhajjame; public-prov-wg@w3.org
Subject: Re: playing with pil ontology

Hi Lena,
Thanks again for trying to use the ontology for the microarray use case!

My comments are inline:

> I am not questioning whether agent should be mapped to agents defined elsewhere, which seems to be obvious - only wondering whether agent "label" and "description" are things we want to standardize in our model or not. We can "suggest" rdfs:label and rdfs:comment without enforcing it as such - having those included in the model will likely result in much less heterogeneity when it comes to reporting provenance (particularly since we are defining it necessarily "open" and highly granular to fit any particular domain).

I am not sure I understand your point. The rdfs:label and rdfs:comment are two of the nine annotation properties that are part of the OWL2 syntax. So, the provenance ontology encoded in OWL includes them by default.



> What was its intended purpose/role in the description of provenance?

Provenance container, account, and collection are related concepts for modeling a collection of provenance assertions. E.g. the provenance of an Affymetrix gene chip will be a collection of provenance assertions (date of manufacture, location of manufacturer, production series etc.) that can be stored in a single file, and the file will be a provenance container.



> Example: a list of height measurements is an "untransformed" entity (a dataset); the average of that list is the "transformed" entity (another dataset, although a very simple one).
> I am dealing with much more complex workflows (e.g. files containing the outcome of a microarray experiment as the untransformed dataset and a list of differentially expressed genes as the transformed dataset), so please take the example above as just illustrative.

I am not sure I see the granularity/expressivity issue in the above example (from your first mail). Both the "untransformed" and "transformed" entities map to the input and output data of a process execution - we can create subclasses of Entity for this purpose.



> An investigator (agent) performs an experiment. That experiment has several input parameters, some of which are entities (e.g. samples), others are not (e.g. temperature). Resulting from the experiment are several output parameters (entities).

I am confused by the above scenario. Why is temperature not an entity? Both inputs (sample and temperature) are special types (subclasses) of entities - (a) InputData and (b) InputParameter, etc.


> So if I understand what you are saying correctly, "temperature" would be an entity of type "input", which in turn would be subclass of "role". An instance of "input" could then have a certain value (e.g. 15C) in one of its properties?
> In that case, does it make sense to include "input" and "output" classes in the model as subclasses of "role"? Or is this something that me and Stephan exemplify in the primer document under "usage of agent" (or something of the sort)?

I agree with Khalid's example where Role allows us to model more complex scenarios. For example, X is an instance of class HumanBeing (perhaps as subclass of entity) and X has multiple roles - researcher, parent, soccer player etc. To model these "functions" we will use the Role class. I believe in the microarray scenario (in your first mail) Roles are not needed.


> In that case, does it make sense to include "input" and "output" classes in the model as subclasses of "role"? Or is this something that me and Stephan exemplify in the primer document under "usage of agent" (or something of the sort)?

Sorry I did not understand this. Role can be used by any entity, why only "usage of agent"?

Thanks.

Best,
Satya

On Mon, Aug 15, 2011 at 7:01 AM, Deus, Helena <helena.deus@deri.org> wrote:
Hi Khalid,
Please see comments inline

From: Khalid Belhajjame [mailto:Khalid.Belhajjame@cs.man.ac.uk]
Sent: 12 August 2011 10:22
To: Deus, Helena
Cc: public-prov-wg@w3.org
Subject: Re: playing with pil ontology


Hi Helena,

Thanks for this. I think that this is a good exercise, and some of the points you mentioned relate to the conceptual model, not only the formal model.

On 11/08/2011 18:52, Deus, Helena wrote:
Hi all,

Reiterating a bit on what was addressed today in the telecon, I downloaded the ontology from mercurial and tried to use it with my use case.
I am using the use cases published in [1] and demoed with SPARQL at http://biordfmicroarray.googlecode.com/hg/sparql_endpoint.html

Here is my input so far:


Agent could have dataProperty "label" and "description"; it would help the implementer describe what type of agent he/she intends to describe. Is the ontology here being confused with the query model?
I think that there was previously a long thread discussion on agent and agent types, and whether the model should be prescriptive in this respect. One of the solutions that I think many people were happy with is to let users choose their favorite model (ontology) for agent, which means that the agent class defined in the ontology acts as a placeholder that can be specialized to include description, types, and whatever the application needs.

I am not questioning whether agent should be mapped to agents defined elsewhere, which seems to be obvious - only wondering whether agent "label" and "description" are things we want to standardize in our model or not. We can "suggest" rdfs:label and rdfs:comment without enforcing it as such - having those included in the model will likely result in much less heterogeneity when it comes to reporting provenance (particularly since we are defining it necessarily "open" and highly granular to fit any particular domain).


ProvenanceContainer is not useful, or its description is not clear; what should be an instance of provenanceContainer?

At this stage, the description of this concept is not yet stable in the conceptual model as far as I know.

What was its intended purpose/role in the description of provenance?


I want to create an instance of a "untransformed" entity (in my case, a dataset) and a "transformed" entity. Is the model going to give me that granularity/expressivity or do we expect each implementer to come up with their own way of defining these?
Could you please clarify what you mean by transformed and untransformed entity?
Example: a list of height measurements is an "untransformed" entity (a dataset); the average of that list is the "transformed" entity (another dataset, although a very simple one).

I am dealing with much more complex workflows (e.g. files containing the outcome of a microarray experiment as the untransformed dataset and a list of differentially expressed genes as the transformed dataset), so please take the example above as just illustrative.


ProcessExecution needs more expressivity, I think. Not sure how to solve this in a domain independent way, but here's my problem:

An investigator (agent) performs an experiment

That experiment has several input parameters, some of which are entities (e.g. samples), others are not (e.g. temperature).

Resulting from the experiment are several output parameters (entities)

I think that the current model caters for the above need. If you are specifically trying to differentiate between different kinds of inputs (samples as opposed to temperature), then the notion of role can be helpful in this respect.

So if I understand what you are saying correctly, "temperature" would be an entity of type "input", which in turn would be subclass of "role". An instance of "input" could then have a certain value (e.g. 15C) in one of its properties?
In that case, does it make sense to include "input" and "output" classes in the model as subclasses of "role"? Or is this something that me and Stephan exemplify in the primer document under "usage of agent" (or something of the sort)?



Thanks, khalid

Have not completed my "experiment" yet, but will provide more feedback soon :)

Best Regards,
Helena F. Deus
Post-doctoral Researcher
Digital Enterprise Research Institute
National University of Ireland, Galway
http://lenadeus.info

Received on Tuesday, 16 August 2011 18:35:50 UTC