Re: actions related to collections from Graham Klyne on 2012-04-26 (public-prov-wg@w3.org from April 2012)

From: Graham Klyne <graham.klyne@zoo.ox.ac.uk>
Date: Thu, 26 Apr 2012 18:20:52 +0100
To: Paolo Missier <Paolo.Missier@ncl.ac.uk>
CC: Satya Sahoo <satya.sahoo@case.edu>, Luc Moreau <L.Moreau@ecs.soton.ac.uk>, Provenance Working Group WG <public-prov-wg@w3.org>
Message-ID: <4F9983F4.2030302@zoo.ox.ac.uk>
On 26/04/2012 13:39, Paolo Missier wrote:
> Graham
>
> you have made your point on this over and over again.

Yes, I've said it before, but not, I think, so often (in this context) as to 
count as "over and over again".  (Previously, I've objected to using collections to model 
provenance accounts, which was a different matter.)

> ... I think we get it, but I
> still don't see a strong argument. That is because the criteria used to define
> the scope here have been blurry and that has not improved with time.
> The comments that followed my own personal opinion on this (attached) seem to
> indicate that capturing the evolution of sets may be a good idea, given their
> pervasiveness. If this belongs to a specific domain, which domain is it?

Fair enough.  I'll see if I can substantiate my position...

First, to be clear, I'm not saying that "capturing the evolution of sets" is not 
a good idea.  What I question is the extent to which it *should* be *entirely* 
down to the PROV spec to achieve this.

We're defining a standard, and I think it's in the nature of standards for use 
on the global Internet/Web that the criteria for defining scope are blurry, 
because we can't expect to anticipate all of the ways in which they will be used.

For me, the acid test will be the extent of adoption.  In my experience, it is 
the *simple* standards (of all kinds) that get more widely adopted.  TCP/IP vs 
OSI.  SMTP vs X.400.  HTTP vs any number of content management systems.

I see the same for ontologies/vocabularies.  The widely used success stories are 
ones like DC, FOAF, SIOC, SKOS, etc., which all have the characteristic of 
focusing on a small set of core concepts.  Of course there are more specialized 
large ontologies/vocabularies that have a strong following (e.g. a number of 
bioinformatics standards), but within much more confined communities.  (TimBL 
has a slide about costs of ontology vs size of community 
http://www.w3.org/2006/Talks/0314-ox-tbl/#(22) - it emphasizes the benefits of 
widespread adoption, but doesn't address costs associated with the *size* of the 
ontology.)

In my view, provenance is something that /should/ be there with the likes of DC 
and FOAF in terms of adoption.  Which for me prioritizes keeping it as small as 
possible to maximize adoption.

To repeat: I'm not saying that provenance of collections is not useful.  I'm 
sure it is very useful in many situations.  For me the test is not so much what 
is useful as what *needs* to be in the base provenance spec because it 
cannot reasonably be retrofitted via available extension points.  What I have 
not seen is an explanation that the provenance of collections cannot be handled 
through specialization of the core provenance concepts we already have.  This 
might even be a separate *standard*.

For me, all this is an application of the principles of minimum power, 
independent invention and modularity 
(http://www.w3.org/DesignIssues/Principles.html).

In many ways (and, to be clear, this is not a proposal, just an illustration) 
I'd rather like to see something like OPMV go forward as a base spec for 
provenance, because it's really clear from that what are the key ideas, and how 
they tie together.

Many of the things the group spends time discussing (including, but not limited to, 
collections) can be layered on this basic model.  The tension here is that by 
specifying more in the base model, one achieves a greater level of 
interoperability between systems *that fully implement the defined model*, and 
at the same time decreases the number of systems that attempt to implement the 
model.  This raises the question: is it more beneficial to have relatively few 
systems implement a very rich model of provenance interoperability, or to have 
very many systems implement a relatively weak model?  And of course, it's not 
black-or-white ... there are reasonable points between.   I think my view is 
clearly to "turn the dial" to the simpler end of the spectrum but, of course, YMMV.

> But I am sorry that you are having to hold your nose. Believe me, the provenance
> of a set doesn't smell that bad.

That was a figure of speech, and was probably an overly strong statement.

As I say above, I'm sure provenance of collections of various kinds is useful 
and important - what I'm really trying to push on is how much needs to be in the 
base provenance specs that developers will have to master.

I think that later in the discussion I saw a mention of abstract collections that 
could be specialized in different ways.  That, for me, could represent a 
reasonable compromise, though my preference would be to deal with collections 
separately.

Maybe what I'm doing here is making a case for modularization of the provenance 
spec (à la PML?), rather than lumping it all into one, er, collection.

...

Returning to your comment about blurry criteria, here are some that are not 
blurry (though they are also unsubstantiated, but there are some clues at 
http://richard.cyganiak.de/blog/2011/02/top-100-most-popular-rdf-namespace-prefixes/):

* I think that if we can produce a base provenance ontology of <=8 classes and <=12 
properties, we stand a chance of deployment at the scale of FOAF (the numbers 
are approximately the size of FOAF core - http://xmlns.com/foaf/spec/)

* I think a base ontology with twice the number of classes could achieve less 
than 10% of the adoption of FOAF (e.g. compare interest in vCard vs FOAF or DC 
at 
http://richard.cyganiak.de/blog/2011/02/top-100-most-popular-rdf-namespace-prefixes/)

* I think a base ontology with substantially more terms will receive 
substantially less adoption.

The numbers here are, to be sure, very unscientific.  But it's interesting that, 
not counting the "infrastructure" ontologies (rdf, rdfs, owl, ex), all the "high 
interest" ontologies that I probed were also relatively small (up to 40 terms 
overall at a rough guess).

On this basis, my criterion becomes very un-blurry: fewer terms is better by far.

Of course, there's a balance to be struck, but it brings home to me that each 
term that is added to the overall provenance ontology has to bring substantial 
benefit if the adoption (impact) of our work is not to be reduced.

...

Finally, the reason I think that PROV *could* be as popular as FOAF is because 
it is positioned to underpin a key missing feature of the web - providing a 
machine actionable basis for dealing with conflicting information (trust, 
information quality assessment).  It could be, in a real sense, the FOAF of data 
("who are you?", "who do you know?", "where do you come from?", etc.).

As yet, we don't *know* what aspects of provenance will be important in this 
respect, though there is some research (including your own, Paolo) that suggests 
some directions.  So, in pursuit of this goal, the thing about PROV that matters 
almost more than anything else is scale of adoption.  So, on this view, 
*anything* that stands in the way of adoption without providing needed 
functionality that cannot be achieved in any other way is arguably an impediment 
to the eventual success of PROV.

#g
--

> On 4/26/12 12:04 PM, Graham Klyne wrote:
>> I find myself somewhat concerned by what appears to be scope creep associated
>> with collections. It seems to me that in this area, the provenance model is
>> straying into the domain of application design. If collections were just
>> sets, I could probably hold my nose and say nothing, but this talk of having
>> provenance define various forms of collection indexing seems to me to be out of
>> scope.
>>
>> So I think this is somewhat in agreement with what Satya says here, though I
>> remain unconvinced that the notions of collections and derivation-by-insertion,
>> etc., actually *need* to be in the main provenance ontology - why not let
>> individual applications define their own provenance extension terms?
>>
>> #g
>> --
>>
>> On 18/04/2012 17:35, Satya Sahoo wrote:
>>> Hi all,
>>> The issue I had raised last week is that collection is an important
>>> provenance construct, but the assumption of only key-value pair based
>>> collection is too narrow and the relations derivedByInsertionFrom,
>>> Derivation-by-Removal are over specifications that are not required.
>>>
>>> I have collected the following examples for collection, which only require
>>> the definition of the collection in DM5 (collection of entities) and they
>>> don't have (a) a key-value structure, and (b) derivedByInsertionFrom,
>>> derivedByRemovalFrom relations are not needed:
>>> 1. Cell line is a collection of cells used in many biomedical experiments.
>>> The provenance of the cell line (as a collection) includes: who submitted
>>> the cell line, what method was used to authenticate the cell line, when was
>>> the given cell line contaminated? The provenance of the cells in a cell
>>> line includes: what is the source of the cells (e.g. organism)?
>>>
>>> 2. A patient cohort is a collection of patients satisfying some constraints
>>> for a research study. The provenance of the cohort includes: what
>>> eligibility criteria were used to identify the cohort, when was the cohort
>>> identified? The provenance of the patients in a cohort may include their
>>> health provider etc.
>>>
>>> Hope this helps our discussion.
>>>
>>> Thanks.
>>>
>>> Best,
>>> Satya
>>>
>>>
>>> On Thu, Apr 12, 2012 at 5:06 PM, Luc Moreau <L.Moreau@ecs.soton.ac.uk> wrote:
>>>
>>>> Hi Jun and Satya,
>>>>
>>>> Following today's call, ACTION-76 [1] and ACTION-77 [2] were raised
>>>> against you, as we agreed.
>>>>
>>>> Cheers,
>>>> Luc
>>>>
>>>> [1]
>>>> https://www.w3.org/2011/prov/track/actions/76
>>>>
>>>> [2]
>>>> https://www.w3.org/2011/prov/track/actions/77
>>>>
>>>>
>>>>
>
>
Received on Thursday, 26 April 2012 17:23:06 UTC