Re: actions related to collections

Hi Tim,

Your position in favour of prov:dictionary is really clear.

Two questions:

1. Is prov:dictionary an essential feature of prov-dm, and should it stay in the prov-dm document?

2. What about Jun/Satya's request for a simple membership property? Should it be added to prov-dm?

Professor Luc Moreau
Electronics and Computer Science
University of Southampton
Southampton SO17 1BJ
United Kingdom

On 18 Apr 2012, at 23:08, "Timothy Lebo" wrote:


On Apr 18, 2012, at 4:19 PM, Luc Moreau wrote:

Dear all,

I just wanted to throw a few ideas/questions to defend collections as they currently are.

1. prov:Collection is similar to rdfs:Container [1] :
the properties rdf:_1, rdf:_2, ...[2]  map naturally to keys in prov:Collection.

I don't see how these map.
In prov:Collection, keys have values chosen by the user -- rdfs:Container imposes the rdf:_N "value" for the "key".
rdfs:Container doesn't support keys.
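
As a rough illustration of the difference, here is a minimal Python sketch. Plain dicts stand in for both constructs; `as_container`, the entity names `e1`/`e2`, and the keys `temperature`/`pressure` are hypothetical and do not come from PROV-DM or any RDF library.

```python
# Hypothetical stand-ins: plain Python dicts, not a PROV or RDF API.

def as_container(members):
    """rdfs:Container style: the "key" is imposed as rdf:_1, rdf:_2, ..."""
    return {f"rdf:_{i}": member for i, member in enumerate(members, start=1)}

# prov:Collection style: the user chooses each key.
collection = {"temperature": "e1", "pressure": "e2"}

# Container style: the same two entities, but the keys are not user-chosen.
container = as_container(["e1", "e2"])
```

In the second case the only thing a "key" can express is a position, which is why the two do not seem to map onto each other.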

I think there is consensus that prov:Collection as it stands is _more_ than set membership.
I argue that this more expressive construct is incredibly useful but misleadingly named.

2. RDF collections [3] can also be described by prov:Collection, using rdf:first and rdf:rest
    as keys for a collection of two elements, and allowing nesting of collections.

Although it's true that one can reproduce an rdf:List using the current definition of prov:Collection,
I'm not sure this provides "nesting" in any useful form.
It also shows how prov:Collection is a more general construct than rdf:List.
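
A minimal Python sketch of that reproduction, assuming each list cell is a two-entry dictionary keyed by rdf:first and rdf:rest. The helper names `to_rdf_list` and `from_rdf_list` and the entity names are hypothetical, used only for illustration.

```python
# Hypothetical sketch: an rdf:List encoded as nested two-key dictionaries.
NIL = "rdf:nil"

def to_rdf_list(items):
    """Build nested cells from the last item backwards."""
    node = NIL
    for item in reversed(items):
        node = {"rdf:first": item, "rdf:rest": node}
    return node

def from_rdf_list(node):
    """Walk the rdf:rest chain and collect each rdf:first value."""
    items = []
    while node != NIL:
        items.append(node["rdf:first"])
        node = node["rdf:rest"]
    return items

nested = to_rdf_list(["e1", "e2", "e3"])
```

Each cell is an ordinary dictionary with two fixed keys, so the list is one particular use of the more general key-value construct.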

So a few questions:

1. Is it being suggested that rdfs:Container and rdf:List are not appropriate, and we
    should look at other forms of "collections"?

I'm suggesting we rename "collection" to "dictionary". The confusion arises when people read the prov:Collection definitions as if they describe set membership, which the construct is not optimized for.
The capabilities that it _is_ optimized for are very useful, should stay, will be used heavily, but should be renamed to something less misleading.

2. Has the prov-o ontology encoded prov-dm collections in a way that is lightweight enough?
    Could we for instance restrict the keys to be mapped to  properties such as rdf:_1, rdf:_2?

I'm not sure why we want to contort the eloquence of the Dictionary into something that is less expressive (rdfs:Container), and which has been disregarded for practical uses during the decade that it has been available.

I however acknowledge that prov:Collection is not "natural" to model a set.


I suppose that
like  "rdf:Bag class is used conventionally to indicate to a human reader that the container is intended to be unordered",
we would need a similar notion for expressing sets with prov:Collection.

We should leave modeling sets to SIOC and RDFS and focus on giving the community something that it doesn't have -- a construct that lets us encode the provenance of function calls with multiple inputs and multiple outputs.

We don't have a set membership construct and we shouldn't encourage people to misuse a dictionary to model a set.




On 18/04/12 19:39, Stephan Zednik wrote:

On Apr 18, 2012, at 12:24 PM, Timothy Lebo wrote:

I've had similar concerns that the definitions for collections are "too heavyweight" to manage the membership of sets.

But ignoring its name and looking at the modeling construct it provides, it's clear that this construct will be very useful in many real provenance problems (for example, the very ubiquitous need for provenance of function calls with their argument names and bindings).
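
A minimal Python sketch of the function-call case, assuming the inputs are recorded as a dictionary keyed by argument name. `record_call`, `scale`, and the record layout (`activity`, `inputs`, `outputs`) are hypothetical and are not PROV-DM terms; only `inspect.signature` is a real standard-library call.

```python
import inspect

def record_call(func, *args, **kwargs):
    """Run func and record its argument bindings as key-value pairs."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()  # also record arguments left at their defaults
    result = func(*args, **kwargs)
    return {
        "activity": func.__name__,
        "inputs": dict(bound.arguments),  # argument name -> bound value
        "outputs": {"return": result},
    }

def scale(value, factor=2):
    return value * factor

record = record_call(scale, 21)
```

The argument names are the keys here, which a plain set of members could not express.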

Perhaps we can avoid the "too heavyweight for set membership" concerns raised by Satya and Jun by renaming what we have (prov:Collection) to something more appropriate, like prov:Dictionary?


Jim is right that you can model collections with enumerated classes, but I am not sure about stating the provenance of a collection defined by an enumerated class.

We could also define a much simpler prov:Collection class that does not force map/dictionary conventions to go along with prov:Dictionary.



On Apr 18, 2012, at 2:12 PM, Jim McCusker wrote:

I think a set of key-value pairs is what's known as a map or dictionary. A collection is a set of things with a defined membership. In OWL it would probably be represented as an enumerated class.


On Wed, Apr 18, 2012 at 1:20 PM, Jun Zhao wrote:

Dear all,

I concur with what Satya wrote. The example I had in mind is collection-type entities in the blogosphere.

As we all know SIOC is a widely used vocabulary to describe entities in the online community sites, like blogs, wikis, etc. It has the concept of sioc:Container, which is defined as "a high-level concept used to group content Items together". The relationships between a sioc:Container and the sioc:Items or sioc:Posts that belong to it are described using sioc:container_of and sioc:has_container properties.

The provenance of a sioc:Container could be who is/are responsible for the container, who created this container, and when.

The provenance of a sioc:Post could include when the post was published, when it was modified, by whom, and based on which other posts, documents or data.

As you can see, I am struggling to see what role a key-value pair structure could play in the above simple scenario. But please correct me if I am wrong.



On 18/04/2012 18:35, Satya Sahoo wrote:
Hi all,
The issue I had raised last week is that collection is an important
provenance construct, but the assumption of only key-value pair based
collection is too narrow and the relations derivedByInsertionFrom and
derivedByRemovalFrom are over-specifications that are not required.

I have collected the following examples of collections, which only require
the definition of the collection in DM5 (collection of entities): they
(a) don't have a key-value structure, and (b) don't need the
derivedByInsertionFrom and derivedByRemovalFrom relations:
1. A cell line is a collection of cells used in many biomedical experiments.
The provenance of the cell line (as a collection) includes who submitted
the cell line, what method was used to authenticate it, and when the
given cell line was contaminated. The provenance of the cells in a cell
line includes the source of the cells (e.g. organism).

2. A patient cohort is a collection of patients satisfying some constraints
for a research study. The provenance of the cohort includes what
eligibility criteria were used to identify the cohort and when the cohort
was identified. The provenance of the patients in a cohort may include their
health provider, etc.

Hope this helps our discussion.



On Thu, Apr 12, 2012 at 5:06 PM, Luc Moreau wrote:

Hi Jun and Satya,

Following today's call, ACTION-76 [1] and ACTION-77 [2] were raised
against you, as we agreed.



Jim McCusker
Programmer Analyst
Krauthammer Lab, Pathology Informatics
Yale School of Medicine | (203) 785-6330

PhD Student
Tetherless World Constellation
Rensselaer Polytechnic Institute

Received on Thursday, 19 April 2012 05:32:38 UTC