Re: Fwd: Going for simplicity (was: actions related to collections)

Hi All,

In his comments, Graham put a rough upper bound on the number of concepts,
namely 40. We are under that at 32 concepts, and this includes collections.
The ontology has a few more terms, but that is because of the involvement
pattern, which I think doesn't actually increase complexity, as the pattern
is systematic.
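
To illustrate (a minimal sketch in Turtle; the ex: names are made up, and
the prov: property names follow the qualified pattern in the current PROV-O
drafts, so they may still change):

  @prefix prov: <http://www.w3.org/ns/prov#> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
  @prefix ex:   <http://example.org/> .

  # Binary form: one triple per relation.
  ex:report prov:wasGeneratedBy ex:editing .

  # Qualified form: the same shape for every relation (Generation,
  # Usage, Association, ...), so the extra classes add terms to the
  # ontology but no new patterns to learn.
  ex:report prov:qualifiedGeneration [
      a prov:Generation ;
      prov:activity ex:editing ;
      prov:atTime "2012-04-30T10:00:00Z"^^xsd:dateTime
  ] .

Once you have seen one qualified involvement, you have seen them all.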

As it stands, I think the model that the group has put together
strikes the right balance. We have a clear, very small set of starting
points, plus some additional things that are pretty core to provenance.

I think there's an argument to be made for putting collections/dictionary
in a separate document for readability purposes, but I think they should
be part of the recommendation. A lot of hard work has gone into them, and
the agreement seems to be that they are useful.


On Mon, Apr 30, 2012 at 10:32 AM, Luc Moreau <> wrote:
> Hi Stian,
> Answer interleaved.
> From: Stian Soiland-Reyes <>
> Date: 29 April 2012 20:44:16 GMT+01:00
> To: Graham Klyne <>
> Cc: Satya Sahoo <>, <>, Luc Moreau
> <>, Paolo Missier <>
> Subject: Going for simplicity (was: actions related to collections)
> 5, Insightful.
> I agree on the general principle of simplicity. I had similar feelings when
> wasQuoteOf and friends moved in, but have now grown to like the few
> essential "real world" relations rather than having only a (easily verbose
> and not very rich) entity-activity-agent model.
> As you point out, a richer standard will also enable richer integration for
> fewer clients.
> One way towards having many adapters, some rich, is a simple core model, and
> additional buy-in modules. The core gets everyone hooked, and the modules
> give richness by providing a standard extension: "hey, you are thinking
> about collections in your prov, how about checking out this bit over here".
> But we need to make the essential modules. OPM suggested adapters to make
> profiles and extensions, but I don't know of many such extensions in real
> life. For instance, DataONE is still working on agreeing on how to do
> workflow provenance using OPM.
> Modules would also work as a kind of damage control. Let's say our view of
> attribution turned out to be very wrong for digital publishing, but our
> view of derivation was a perfect fit. Adapters could choose to use PROV
> derivations and make their own, richer attribution model. With one massive
> model, we might easily put people off if one of our aspects is
> wrong/naive/difficult compared to a domain's view.
> I believe our current components in DM can form such a modularization.
> However, I have not read any recommendation about how these can be used in
> such a pick-and-choose fashion; I thought they were merely rhetorical
> groupings to ease understanding. Luc?
> Yes, I saw components as a conceptual structuring of the data model, and not
> as a way of optionally selecting which bit of the model we want to use.
> There has been (so far!) no indication from the WG that we wanted to make
> some part of the model optional.  This can be considered, of course.
> But to be effective, components need to be complementary. At the moment
> derivations and responsibility are still entangled.
> I don't think it's desirable.
> Is your suggestion that we, for instance, have /ns/prov# (core),
> /ns/prov-attribution#, etc., or simply drop everything that is not "opmv
> like"? (My question: why not then use opmv?)
> I don't think we are keen to introduce multiple namespaces.
> Luc
> --
> Stian Soiland-Reyes, myGrid team
> School of Computer Science
> The University of Manchester
> On Apr 26, 2012 6:24 PM, "Graham Klyne" <> wrote:
>> On 26/04/2012 13:39, Paolo Missier wrote:
>>> Graham
>>> you have made your point on this over and over again.
>> Yes, I've said it before, but I think not (in this context) so much as to
>> count as "over and over again".  (Previously, I've objected to using
>> collections to model provenance accounts, which was a different matter.)
>>> ... I think we get it, but I
>>> still don't see a strong argument. That is because the criteria used to
>>> define
>>> the scope here have been blurry and that has not improved with time.
>>> The comments that followed my own personal opinion on this (attached)
>>> seem to
>>> indicate that capturing the evolution of sets may be a good idea, given
>>> their
>>> pervasiveness. If this belongs to a specific domain, which domain is it?
>> Fair enough.  I'll see if I can substantiate my position...
>> First, to be clear, I'm not saying that "capturing the evolution of sets"
>> is not a good idea.  What I question is the extent to which it *should* be
>> *entirely* down to the PROV spec to achieve this.
>> We're defining a standard, and I think it's in the nature of standards for
>> use on the global Internet/Web that the criteria for defining scope are
>> blurry, because we can't expect to anticipate all of the ways in which they
>> will be used.
>> For me, the acid test will be the extent of adoption.  In my experience,
>> it is the *simple* standards (of all kinds) that get more widely adopted.
>>  TCP/IP vs OSI.  SMTP vs X.400.  HTTP vs any number of content management
>> systems.
>> I see the same for ontologies/vocabularies.  The widely used success
>> stories are ones like DC, FOAF, SIOC, SKOS, etc., which all have the
>> characteristic of focusing on a small set of core concepts.  Of course there
>> are more specialized large ontologies/vocabularies that have a strong
>> following (e.g. a number of bioinformatics standards), but within much more
>> confined communities.  (TimBL has a slide about costs of ontology vs size of
>> community - it emphasizes the
>> benefits of widespread adoption, but doesn't address costs associated with
>> the *size* of the ontology.)
>> In my view, provenance is something that /should/ be there with the likes
>> of DC and FOAF in terms of adoption.  Which for me prioritizes keeping it as
>> small as possible to maximize adoption.
>> To repeat: I'm not saying that provenance of collections is not useful.
>>  I'm sure it is very useful in many situations.  For me the test is not so
>> much what is useful as what *needs* to be in the base provenance spec by
>> virtue of the fact that it cannot reasonably be retro-fitted via available
>> extension points.  What I have not seen is an explanation of why the provenance of
>> collections cannot be handled through specialization of the core provenance
>> concepts we already have.  This might even be a separate *standard*.
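
(For concreteness, a sketch of that suggestion in Turtle, with the ex:
extension terms purely hypothetical and living outside the core namespace:

  @prefix prov: <http://www.w3.org/ns/prov#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix ex:   <http://example.org/coll#> .

  # An extension vocabulary that specializes the core.
  ex:Collection rdfs:subClassOf prov:Entity .
  ex:derivedByInsertionFrom rdfs:subPropertyOf prov:wasDerivedFrom .

  ex:c2 a ex:Collection ;
      ex:derivedByInsertionFrom ex:c1 .

Core-only consumers that know nothing of ex: would still see a plain
prov:wasDerivedFrom between ex:c2 and ex:c1 via subproperty reasoning.)
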
>> For me, all this is an application of the principles of minimum power,
>> independent invention and modularity
>> (
>> In many ways (and, to be clear, this is not a proposal, just an
>> illustration) I'd rather like to see something like OPMV go forward as a
>> base spec for provenance, because it's really clear from that what the
>> key ideas are, and how they tie together.
>> Many of the things the group spends time discussing (including, but not
>> limited to, collections) can be layered on this basic model.  The tension
>> here is that by specifying more in the base model, one achieves a greater
>> level of interoperability between systems *that fully implement the defined
>> model*, and at the same time decreases the number of systems that attempt to
>> implement the model.  This raises the question: is it more beneficial to
>> have relatively few systems implement a very rich model of provenance
>> interoperability, or to have very many systems implement a relatively weak
>> model?  And of course, it's not black-or-white ... there are reasonable
>> points in between.  I think my view is clearly to "turn the dial" to the
>> simpler end of the spectrum but, of course, YMMV.
>>> But I am sorry that you are having to hold your nose. Believe me, the
>>> provenance
>>> of a set doesn't smell that bad.
>> That was a figure of speech, and was probably an overly strong statement.
>> As I say above, I'm sure provenance of collections of various kinds is
>> useful and important - what I'm really trying to push on is how much needs
>> to be in the base provenance specs that developers will have to master.
>> I think later in the discussion I saw a mention of abstract collections
>> that could be specialized in different ways.  That, for me, could represent
>> a reasonable compromise, though my preference would be to deal with
>> collections separately.
>> Maybe what I'm doing here is making a case for modularization of the
>> provenance spec (à la PML?), rather than lumping it all into one, er, collection.
>> ...
>> Returning to your comment about blurry criteria, here are some that are
>> not blurry (though they are also unsubstantiated; there are some clues
>> at
>> * I think that if we can produce a base provenance ontology of <=8
>> classes and <=12 properties, we stand a chance of deployment at the scale of
>> FOAF (the numbers are approximately the size of FOAF core -
>> * I think a base ontology with twice the number of classes could achieve
>> less than 10% of the adoption of FOAF (e.g. compare interest in vCard vs
>> FOAF or DC at
>> * I think a base ontology with substantially more terms will receive
>> substantially less adoption.
>> The numbers here are, to be sure, very unscientific.  But it's interesting
>> that, not counting the "infrastructure" ontologies (rdf, rdfs, owl, ex), all
>> the "high interest" ontologies that I probes were also relatively small (up
>> to 40 terms overall at a rough guess)
>> On this basis, my criterion becomes very un-blurry: fewer terms is better
>> by far.
>> Of course, there's a balance to be struck, but it brings home to me that
>> each term that is added to the overall provenance ontology has to bring
>> substantial benefit if the adoption (impact) of our work is not to be
>> reduced.
>> ...
>> Finally, the reason I think that PROV *could* be as popular as FOAF is
>> because it is positioned to underpin a key missing feature of the web -
>> providing a machine-actionable basis for dealing with conflicting
>> information (trust, information quality assessment).  It could be, in a real
>> sense, the FOAF of data ("who are you?", "who do you know?", "where do you
>> come from?", etc.).
>> As yet, we don't *know* what aspects of provenance will be important in
>> this respect, though there is some research (including your own, Paolo) that
>> suggests some directions.  So, in pursuit of this goal, the thing about PROV
>> that matters almost more than anything else is scale of adoption.  So, on
>> this view, *anything* that stands in the way of adoption without providing
>> needed functionality that cannot be achieved in any other way is arguably an
>> impediment to the eventual success of PROV.
>> #g
>> --
>>> On 4/26/12 12:04 PM, Graham Klyne wrote:
>>>> I find myself somewhat concerned by what appears to be scope creep
>>>> associated with collections. It seems to me that in this area, the
>>>> provenance model is straying into the domain of application design. If
>>>> collections were just sets, I could probably hold my nose and say
>>>> nothing, but this talk of having provenance define various forms of
>>>> collection indexing seems to me to be out of scope.
>>>> So I think this is somewhat in agreement with what Satya says here,
>>>> though I remain unconvinced that the notions of collections and
>>>> derivation-by-insertion, etc., actually *need* to be in the main
>>>> provenance ontology - why not let individual applications define their
>>>> own provenance extension terms?
>>>> #g
>>>> --
>>>> On 18/04/2012 17:35, Satya Sahoo wrote:
>>>>> Hi all,
>>>>> The issue I had raised last week is that collection is an important
>>>>> provenance construct, but the assumption of only key-value pair based
>>>>> collections is too narrow, and the relations derivedByInsertionFrom and
>>>>> derivedByRemovalFrom are over-specifications that are not required.
>>>>> I have collected the following examples of collections, which only
>>>>> require the definition of collection in DM5 (a collection of entities):
>>>>> (a) they don't have a key-value structure, and (b) the
>>>>> derivedByInsertionFrom and derivedByRemovalFrom relations are not
>>>>> needed:
>>>>> 1. A cell line is a collection of cells used in many biomedical
>>>>> experiments. The provenance of the cell line (as a collection) includes:
>>>>> who submitted the cell line, what method was used to authenticate it,
>>>>> and when was the given cell line contaminated? The provenance of the
>>>>> cells in a cell line includes: what is the source of the cells (e.g.
>>>>> which organism)?
>>>>> 2. A patient cohort is a collection of patients satisfying some
>>>>> constraints for a research study. The provenance of the cohort includes:
>>>>> what eligibility criteria were used to identify the cohort, and when was
>>>>> the cohort identified? The provenance of the patients in a cohort may
>>>>> include their health provider, etc.
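
(Both examples fit the plain collection construct; a minimal sketch of the
cohort case in Turtle, with all ex: names hypothetical and the prov: terms
as in the PROV-O drafts:

  @prefix prov: <http://www.w3.org/ns/prov#> .
  @prefix ex:   <http://example.org/study#> .

  # The cohort is simply a Collection of member Entities:
  # no keys, and no insertion/removal derivations required.
  ex:cohort1 a prov:Collection ;
      prov:hadMember ex:patientA, ex:patientB ;
      prov:wasGeneratedBy ex:screening ;
      prov:wasAttributedTo ex:studyTeam .

  ex:screening a prov:Activity ;
      prov:used ex:eligibilityCriteria .

  ex:studyTeam a prov:Agent .
)
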
>>>>> Hope this helps our discussion.
>>>>> Thanks.
>>>>> Best,
>>>>> Satya
>>>>> On Thu, Apr 12, 2012 at 5:06 PM, Luc Moreau <> wrote:
>>>>>> Hi Jun and Satya,
>>>>>> Following today's call, ACTION-76 [1] and ACTION-77 [2] were raised
>>>>>> against you, as we agreed.
>>>>>> Cheers,
>>>>>> Luc
>>>>>> [1]
>>>>>> [2]
> --
> Professor Luc Moreau
> Electronics and Computer Science   tel:   +44 23 8059 4487
> University of Southampton          fax:   +44 23 8059 2865
> Southampton SO17 1BJ               email:
> United Kingdom           

Dr. Paul Groth (
Assistant Professor
Knowledge Representation & Reasoning Group
Artificial Intelligence Section
Department of Computer Science
VU University Amsterdam

Received on Monday, 30 April 2012 10:28:32 UTC