- From: Graham Klyne <graham.klyne@zoo.ox.ac.uk>
- Date: Thu, 26 Apr 2012 18:20:52 +0100
- To: Paolo Missier <Paolo.Missier@ncl.ac.uk>
- CC: Satya Sahoo <satya.sahoo@case.edu>, Luc Moreau <L.Moreau@ecs.soton.ac.uk>, Provenance Working Group WG <public-prov-wg@w3.org>
On 26/04/2012 13:39, Paolo Missier wrote:
> Graham
>
> you have made your point on this over and over again.
Yes, I've said it before, but I think not (in this context) so often as to count
as "over and over again". (Previously, I objected to using collections to model
provenance accounts, which was a different matter.)
> ... I think we get it, but I
> still don't see a strong argument. That is because the criteria used to define
> the scope here have been blurry and that has not improved with time.
> The comments that followed my own personal opinion on this (attached) seem to
> indicate that capturing the evolution of sets may be a good idea, given their
> pervasiveness. If this belongs to a specific domain, which domain is it?
Fair enough. I'll see if I can substantiate my position...
First, to be clear, I'm not saying that "capturing the evolution of sets" is not
a good idea. What I question is the extent to which it *should* be *entirely*
down to the PROV spec to achieve this.
We're defining a standard, and I think it's in the nature of standards for use
on the global Internet/Web that the criteria for defining scope are blurry,
because we can't expect to anticipate all of the ways in which they will be used.
For me, the acid test will be the extent of adoption. In my experience, it is
the *simple* standards (of all kinds) that get more widely adopted. TCP/IP vs
OSI. SMTP vs X.400. HTTP vs any number of content management systems.
I see the same for ontologies/vocabularies. The widely used success stories are
ones like DC, FOAF, SIOC, SKOS, etc., which all have the characteristic of
focusing on a small set of core concepts. Of course there are more specialized
large ontologies/vocabularies that have a strong following (e.g. a number of
bioinformatics standards), but within much more confined communities. (TimBL
has a slide about costs of ontology vs size of community
http://www.w3.org/2006/Talks/0314-ox-tbl/#(22) - it emphasizes the benefits of
widespread adoption, but doesn't address costs associated with the *size* of the
ontology.)
In my view, provenance is something that /should/ be there with the likes of DC
and FOAF in terms of adoption. For me, that means prioritizing keeping it as
small as possible, to maximize adoption.
To repeat: I'm not saying that provenance of collections is not useful. I'm
sure it is very useful in many situations. For me the test is not so much what
is useful as what *needs* to be in the base provenance spec because it cannot
reasonably be retro-fitted via available extension points. What I have not seen
is an explanation of why the provenance of collections cannot be handled through
specialization of the core provenance concepts we already have. This might even
be a separate *standard*.
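To illustrate the kind of retro-fitting I have in mind, here is a minimal
sketch (the ex: namespace and all the term and resource names in it are made
up for illustration, not a proposal):

  @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix prov: <http://www.w3.org/ns/prov#> .
  @prefix ex:   <http://example.org/prov-collections#> .

  # A collection is just a specialized kind of entity ...
  ex:Collection a rdfs:Class ;
      rdfs:subClassOf prov:Entity .

  # ... and insertion/removal derivations are specialized derivations.
  ex:derivedByInsertionFrom a rdf:Property ;
      rdfs:subPropertyOf prov:wasDerivedFrom .
  ex:derivedByRemovalFrom a rdf:Property ;
      rdfs:subPropertyOf prov:wasDerivedFrom .

  # Usage: a consumer that knows only core PROV still sees an
  # ordinary entity and an ordinary derivation.
  ex:cohortV2 a ex:Collection ;
      ex:derivedByInsertionFrom ex:cohortV1 .

A plain-PROV consumer that understands rdfs:subClassOf/subPropertyOf gets the
core provenance for free; only collection-aware applications need the extra
terms.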
For me, all this is an application of the principles of minimum power,
independent invention and modularity
(http://www.w3.org/DesignIssues/Principles.html).
In many ways (and, to be clear, this is not a proposal, just an illustration)
I'd rather like to see something like OPMV go forward as a base spec for
provenance, because it's really clear from that what the key ideas are, and how
they tie together.
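For instance, the heart of OPMV, as I recall it (a sketch from memory, not a
normative extract; the example resources are invented), is just three classes
- Artifact, Process, Agent - and a handful of properties relating them:

  @prefix opmv: <http://purl.org/net/opmv/ns#> .
  @prefix :     <http://example.org/run#> .

  # An artifact is generated by a process and derived from other artifacts;
  # a process uses artifacts and is controlled by an agent.
  :cleanedData a opmv:Artifact ;
      opmv:wasGeneratedBy :cleaningRun ;
      opmv:wasDerivedFrom :rawData .

  :cleaningRun a opmv:Process ;
      opmv:used :rawData ;
      opmv:wasControlledBy :paolo .

  :paolo a opmv:Agent .

That's more or less the whole mental model a developer needs before they can
start emitting useful provenance.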
Many of the things the group spends time discussing (including, but not limited
to, collections) can be layered on this basic model. The tension here is that by
specifying more in the base model, one achieves a greater level of
interoperability between systems *that fully implement the defined model*, but
at the same time decreases the number of systems that attempt to implement the
model. This raises the question: is it more beneficial to have relatively few
systems implement a very rich model of provenance interoperability, or to have
very many systems implement a relatively weak model? And of course, it's not
black-or-white ... there are reasonable points in between. I think my view is
clearly to "turn the dial" to the simpler end of the spectrum but, of course, YMMV.
> But I am sorry that you are having to hold your nose. Believe me, the provenance
> of a set doesn't smell that bad.
That was a figure of speech, and was probably an overly strong statement.
As I say above, I'm sure provenance of collections of various kinds is useful
and important - what I'm really trying to push on is how much needs to be in the
base provenance specs that developers will have to master.
I think later in the discussion I saw a mention of abstract collections that
could be specialized in different ways. That, for me, could represent a
reasonable compromise, though my preference would be to deal with collections
separately.
Maybe what I'm doing here is making a case for modularization of the provenance
spec (à la PML?), rather than lumping it all into one, er, collection.
...
Returning to your comment about blurry criteria, here are some that are not
blurry (though they are largely unsubstantiated; there are some clues at
http://richard.cyganiak.de/blog/2011/02/top-100-most-popular-rdf-namespace-prefixes/):
* I think that if we can produce a base provenance ontology of <=8 classes and
<=12 properties, we stand a chance of deployment at the scale of FOAF (the
numbers are approximately the size of the FOAF core - http://xmlns.com/foaf/spec/)
* I think a base ontology with twice the number of classes could achieve less
than 10% of the adoption of FOAF (e.g. compare interest in vCard vs FOAF or DC at
http://richard.cyganiak.de/blog/2011/02/top-100-most-popular-rdf-namespace-prefixes/)
* I think a base ontology with substantially more terms will receive
substantially less adoption.
The numbers here are, to be sure, very unscientific. But it's interesting that,
not counting the "infrastructure" ontologies (rdf, rdfs, owl, ex), all the "high
interest" ontologies that I probed were also relatively small (up to 40 terms
overall, at a rough guess).
On this basis, my criterion becomes very un-blurry: fewer terms is better by far.
Of course, there's a balance to be struck, but it brings home to me that each
term that is added to the overall provenance ontology has to bring substantial
benefit if the adoption (impact) of our work is not to be reduced.
...
Finally, the reason I think that PROV *could* be as popular as FOAF is because
it is positioned to underpin a key missing feature of the web - providing a
machine-actionable basis for dealing with conflicting information (trust,
information quality assessment). It could be, in a real sense, the FOAF of data
("who are you?", "who do you know?", "where do you come from?", etc.).
As yet, we don't *know* what aspects of provenance will be important in this
respect, though there is some research (including your own, Paolo) that suggests
some directions. In pursuit of this goal, the thing about PROV that matters
almost more than anything else is scale of adoption. On this view, *anything*
that stands in the way of adoption, without providing needed functionality that
cannot be achieved in any other way, is arguably an impediment to the eventual
success of PROV.
#g
--
> On 4/26/12 12:04 PM, Graham Klyne wrote:
>> I find myself somewhat concerned by what appears to be scope creep associated
>> with collections. It seems to me that in this area, the provenance model is
>> straying into the domain of application design. If collections were just
>> sets, I could probably hold my nose and say nothing, but this talk of having
>> provenance define various forms of collection indexing seems to me to be out of
>> scope.
>>
>> So I think this is somewhat in agreement with what Satya says here, though I
>> remain unconvinced that the notions of collections and derivation-by-insertion,
>> etc., actually *need* to be in the main provenance ontology - why not let
>> individual applications define their own provenance extension terms?
>>
>> #g
>> --
>>
>> On 18/04/2012 17:35, Satya Sahoo wrote:
>>> Hi all,
>>> The issue I had raised last week is that collection is an important
>>> provenance construct, but the assumption of only key-value-pair-based
>>> collections is too narrow, and the relations derivedByInsertionFrom and
>>> derivedByRemovalFrom are over-specifications that are not required.
>>>
>>> I have collected the following examples of collections, which only require
>>> the definition of the collection in DM5 (a collection of entities): they
>>> (a) don't have a key-value structure, and (b) don't need the
>>> derivedByInsertionFrom and derivedByRemovalFrom relations:
>>> 1. A cell line is a collection of cells used in many biomedical experiments.
>>> The provenance of the cell line (as a collection) includes who submitted
>>> the cell line, what method was used to authenticate it, and when the given
>>> cell line was contaminated. The provenance of the cells in a cell line
>>> includes the source of the cells (e.g. organism).
>>>
>>> 2. A patient cohort is a collection of patients satisfying some constraints
>>> for a research study. The provenance of the cohort includes what
>>> eligibility criteria were used to identify the cohort and when the cohort
>>> was identified. The provenance of the patients in a cohort may include
>>> their health provider, etc.
>>>
>>> Hope this helps our discussion.
>>>
>>> Thanks.
>>>
>>> Best,
>>> Satya
>>>
>>>
>>> On Thu, Apr 12, 2012 at 5:06 PM, Luc Moreau <L.Moreau@ecs.soton.ac.uk> wrote:
>>>
>>>> Hi Jun and Satya,
>>>>
>>>> Following today's call, ACTION-76 [1] and ACTION-77 [2] were raised
>>>> against you, as we agreed.
>>>>
>>>> Cheers,
>>>> Luc
>>>>
>>>> [1] https://www.w3.org/2011/prov/track/actions/76
>>>>
>>>> [2] https://www.w3.org/2011/prov/track/actions/77
Received on Thursday, 26 April 2012 17:23:06 UTC