Re: Fwd: Going for simplicity (was: actions related to collections) from Paul Groth on 2012-05-01 (public-prov-wg@w3.org from May 2012)

From: Paul Groth <p.t.groth@vu.nl>
Date: Tue, 1 May 2012 09:55:48 +0200
To: Graham Klyne <graham.klyne@zoo.ox.ac.uk>
Cc: W3C provenance WG <public-prov-wg@w3.org>
Message-ID: <CAJCyKRqcgBA+ENXkyCU+kC3yT2V7fTUmXrwcQ7d0t7Ku=sHjcQ@mail.gmail.com>
Hi Graham,

I guess my response would be that the model has simple starting
points and that I think with the proper organization we will be fine.
There is consensus on the constructs that we currently have and there
has been movement towards consolidation.

I think the key is explainability of the model. Where should a
developer start?  I would argue that we are close to that goal.  I
would disagree for example that the qualified pattern really should be
counted as all new concepts. Once one understands the simple pattern,
it is applied consistently.

But I guess we are hand waving a bit so one would need to be more specific. :-)

I think for example the proposal on quoter is one way to reduce the
size of the model. Also, the separation of collections into a distinct
document would be good as well.

Obviously, other suggestions are appreciated.

cheers
Paul




On Tue, May 1, 2012 at 5:30 AM, Graham Klyne <graham.klyne@zoo.ox.ac.uk> wrote:
> On 30/04/2012 11:28, Paul Groth wrote:
>> Hi All,
>>
>> In Graham's comments, he put a rough number on the number of concepts
>> namely 40. We are under that at 32 concepts, and this includes collections.
>
> I was going to stand back a while from this discussion, but I think I must
> correct this (which may have been my own error) - the figure of 40 is total
> number of terms, including properties, not just concepts.
>
> Actually, I think the ideal, for *really* large scale adoption, is under 20
> terms - concepts *and* properties.
>
> (But of course the numbers are a bit arbitrary - my main point is that I think
> we are quite a way over the level of complexity that is likely to achieve really
> large-scale deployment.)
>
>> Now in the ontology we have a bit more but this is because of the
>> involvement pattern, which I think actually doesn't increase
>> complexity as the pattern is systematic.
>
> But I fear that's what developers will see.
>
> #g
> --
>
>> As it stands, I think the model that the group has put together
>> strikes the right balance. We have a clear, very small set of
>> starting points and then have some additional things that are pretty
>> core to provenance.
>>
>> I think there's an argument to be made to put collections/dictionary
>> in a separate document for readability purposes but I think they
>> should be part of the recommendation. There's been a lot of hard work
>> there and the agreement seems to be that they are useful.
>>
>> cheers
>> Paul
>>
>> On Mon, Apr 30, 2012 at 10:32 AM, Luc Moreau<L.Moreau@ecs.soton.ac.uk>  wrote:
>>> Hi Stian,
>>>
>>> Answer interleaved.
>>>
>>>
>>> From: Stian Soiland-Reyes<soiland-reyes@cs.manchester.ac.uk>
>>> Date: 29 April 2012 20:44:16 GMT+01:00
>>> To: Graham Klyne<graham.klyne@zoo.ox.ac.uk>
>>> Cc: Satya Sahoo<satya.sahoo@case.edu>,<public-prov-wg@w3.org>, Luc Moreau
>>> <L.Moreau@ecs.soton.ac.uk>, Paolo Missier<Paolo.Missier@ncl.ac.uk>
>>> Subject: Going for simplicity (was: actions related to collections)
>>>
>>>
>>> 5, Insightful.
>>>
>>> I agree on the general principle of simplicity. I had similar feelings when
>>> wasQuoteOf and friends moved in, but have now grown to like the few
>>> essential "real world" relations rather than having only an (easily verbose
>>> and not very rich) entity-activity-agent model.
>>>
>>> As you point out, a richer standard will also enable richer integration for
>>> fewer clients.
>>>
>>> One way towards having many adapters, some rich, is a simple core model, and
>>> additional buy-in modules. The core gets everyone hooked, the modules give
>>> richness by giving a standard extension, "hey, you are thinking about
>>> collections in your prov, how about checking out this bit over here".
>>>
>>> But we need to make the essential modules. OPM suggested adapters to make
>>> profiles and extensions, but I don't know of many such extensions in real
>>> life. For instance DataOne is still working on agreeing how to do workflow
>>> provenance using OPM.
>>>
>>> Modules would also work as a kind of damage control. Let's say our view of
>>> attribution turned out to be very wrong for digital publishing, however, our
>>> view of derivation was a perfect fit. Adapters could choose to use PROV
>>> derivations and make their own, richer attribution model. With one massive
>>> model, we might easily put people off if one of our aspects is
>>> wrong/naive/difficult compared to a domain's view.
>>>
>>> I believe our current components in DM can form such a modularization.
>>> However I have not read any recommendation about how these can be used in
>>> such a pick-and-choose adaption, I thought they were merely rhetorical
>>> groupings to ease understanding. Luc?
>>>
>>>
>>> Yes, I saw components as a conceptual structuring of the data model, and not
>>> as a way of optionally selecting which bit of the model we want to use.
>>>
>>> There has been (so far!) no indication from the WG that we wanted to make
>>> some part of the model optional.  This can be considered of course.
>>>
>>> But to be effective, components need to be complementary. At the moment
>>> derivations and responsibility are still entangled.
>>> I don't think it's desirable.
>>>
>>>
>>>
>>> Is your suggestion that we for instance have /ns/prov# (core),
>>> /ns/prov-attribution# etc, or simply drop everything that is not "opmv
>>> like"? (My question: why not then use opmv?)
>>>
>>>
>>> I don't think we are keen to introduce multiple namespaces.
>>>
>>> Luc
>>>
>>> --
>>> Stian Soiland-Reyes, myGrid team
>>> School of Computer Science
>>> The University of Manchester
>>>
>>> On Apr 26, 2012 6:24 PM, "Graham Klyne"<graham.klyne@zoo.ox.ac.uk>  wrote:
>>>>
>>>> On 26/04/2012 13:39, Paolo Missier wrote:
>>>>>
>>>>> Graham
>>>>>
>>>>> you have made your point on this over and over again.
>>>>
>>>>
>>>> Yes, I've said it before, but I think not (in this context) so much to
>>>> count as "over and over again".  (Previously, I've objected to using
>>>> collections to model provenance accounts, which was a different matter.)
>>>>
>>>>> ... I think we get it, but I
>>>>> still don't see a strong argument. That is because the criteria used to
>>>>> define
>>>>> the scope here have been blurry and that has not improved with time.
>>>>> The comments that followed my own personal opinion on this (attached)
>>>>> seem to
>>>>> indicate that capturing the evolution of sets may be a good idea, given
>>>>> their
>>>>> pervasiveness. If this belongs to a specific domain, which domain is it?
>>>>
>>>>
>>>> Fair enough.  I'll see if I can substantiate my position...
>>>>
>>>> First, to be clear, I'm not saying that "capturing the evolution of sets"
>>>> is not a good idea.  What I question is the extent to which it *should* be
>>>> *entirely* down to the PROV spec to achieve this.
>>>>
>>>> We're defining a standard, and I think it's in the nature of standards for
>>>> use on the global Internet/Web that the criteria for defining scope are
>>>> blurry, because we can't expect to anticipate all of the ways in which they
>>>> will be used.
>>>>
>>>> For me, the acid test will be the extent of adoption.  In my experience,
>>>> it is the *simple* standards (of all kinds) that get more widely adopted.
>>>>   TCP/IP vs OSI.  SMTP vs X.400.  HTTP vs any number of content management
>>>> systems.
>>>>
>>>> I see the same for ontologies/vocabularies.  The widely used success
>>>> stories are ones like DC, FOAF, SIOC, SKOS, etc., which all have the
>>>> characteristic of focusing on a small set of core concepts.  Of course there
>>>> are more specialized large ontologies/vocabularies that have strong
>>>> following (e.g. a number of bioinformatics standards), but within much more
>>>> confined communities.  (TimBL has a slide about costs of ontology vs size of
>>>> community http://www.w3.org/2006/Talks/0314-ox-tbl/#(22) - it emphasizes the
>>>> benefits of widespread adoption, but doesn't address costs associated with
>>>> the *size* of the ontology.)
>>>>
>>>> In my view, provenance is something that /should/ be there with the likes
>>>> of DC and FOAF in terms of adoption.  Which for me prioritizes keeping it as
>>>> small as possible to maximize adoption.
>>>>
>>>> To repeat: I'm not saying that provenance of collections is not useful.
>>>>   I'm sure it is very useful in many situations.  For me the test is not so
>>>> much what is useful as what *needs* to be in the base provenance spec because
>>>> it cannot reasonably be retro-fitted via available extension
>>>> points.  What I have not seen is an explanation that the provenance of
>>>> collections cannot be handled through specialization of the core provenance
>>>> concepts we already have.  This might even be a separate *standard*.
>>>>
>>>> For me, all this is an application of the principles of minimum power,
>>>> independent invention and modularity
>>>> (http://www.w3.org/DesignIssues/Principles.html).
>>>>
>>>> In many ways (and, to be clear, this is not a proposal, just an
>>>> illustration) I'd rather like to see something like OPMV go forward as a
>>>> base spec for provenance, because it's really clear from that what are the
>>>> key ideas, and how they tie together.
>>>>
>>>> Many of the things the group spends time discussing (including, but not
>>>> limited to, collections) can be layered on this basic model.  The tension
>>>> here is that by specifying more in the base model, one achieves a greater
>>>> level of interoperability between systems *that fully implement the defined
>>>> model*, and at the same time decreases the number of systems that attempt to
>>>> implement the model.  This raises the question: is it more beneficial to
>>>> have a relative few systems implement a very rich model of provenance
>>>> interoperability, or to have very many systems implement a relatively weak
>>>> model?  And of course, it's not black-or-white ... there are reasonable
>>>> points between.   I think my view is clearly to "turn the dial" to the
>>>> simpler end of the spectrum but, of course, YMMV.
>>>>
>>>>> But I am sorry that you are having to hold your nose. Believe me, the
>>>>> provenance
>>>>> of a set doesn't smell that bad.
>>>>
>>>>
>>>> That was a figure of speech, and was probably an overly strong statement.
>>>>
>>>> As I say above, I'm sure provenance of collections of various kinds is
>>>> useful and important - what I'm really trying to push on is how much needs
>>>> to be in the base provenance specs that developers will have to master.
>>>>
>>>> I think later in the discussion I saw a mention of abstract collections
>>>> that could be specialized in different ways.  That, for me, could represent
>>>> a reasonable compromise, though my preference would be to deal with
>>>> collections separately.
>>>>
>>>> Maybe what I'm doing here is making a case for modularization of the
>>>> provenance spec (à la PML?), rather than lumping it all into one, er, collection.
>>>>
>>>> ...
>>>>
>>>> Returning to your comment about blurry criteria, here are some that are
>>>> not blurry (though they are also unsubstantiated, but there are some clues
>>>> at
>>>> http://richard.cyganiak.de/blog/2011/02/top-100-most-popular-rdf-namespace-prefixes/):
>>>>
>>>> * I think that if we can produce a base provenance ontology of <=8
>>>> classes and <=12 properties, we stand a chance of deployment at the scale of
>>>> FOAF (the numbers are approximately the size of FOAF core -
>>>> http://xmlns.com/foaf/spec/)
>>>>
>>>> * I think a base ontology with twice the number of classes could achieve
>>>> less than 10% of the adoption of FOAF (e.g. compare interest in vCard vs
>>>> FOAF or DC at
>>>> http://richard.cyganiak.de/blog/2011/02/top-100-most-popular-rdf-namespace-prefixes/)
>>>>
>>>> * I think a base ontology with substantially more terms will receive
>>>> substantially less adoption.
>>>>
>>>> The numbers here are, to be sure, very unscientific.  But it's interesting
>>>> that, not counting the "infrastructure" ontologies (rdf, rdfs, owl, ex), all
>>>> the "high interest" ontologies that I probed were also relatively small (up
>>>> to 40 terms overall at a rough guess)
>>>>
>>>> On this basis, my criterion becomes very un-blurry: fewer terms is better
>>>> by far.
>>>>
>>>> Of course, there's a balance to be struck, but it brings home to me that
>>>> each term that is added to the overall provenance ontology has to bring
>>>> substantial benefit if the adoption (impact) of our work is not to be
>>>> reduced.
>>>>
>>>> ...
>>>>
>>>> Finally, the reason I think that PROV *could* be as popular as FOAF is
>>>> because it is positioned to underpin a key missing feature of the web -
>>>> providing a machine actionable basis for dealing with conflicting
>>>> information (trust, information quality assessment).  It could be, in a real
>>>> sense, the FOAF of data ("who are you?", "who do you know?", "where do you
>>>> come from?", etc.).
>>>>
>>>> As yet, we don't *know* what aspects of provenance will be important in
>>>> this respect, though there is some research (including your own, Paolo) that
>>>> suggests some directions.  So, in pursuit of this goal, the thing about PROV
>>>> that matters almost more than anything else is scale of adoption.  So, on
>>>> this view, *anything* that stands in the way of adoption without providing
>>>> needed functionality that cannot be achieved in any other way is arguably an
>>>> impediment to the eventual success of PROV.
>>>>
>>>> #g
>>>> --
>>>>
>>>>> On 4/26/12 12:04 PM, Graham Klyne wrote:
>>>>>>
>>>>>> I find myself somewhat concerned by what appears to be scope creep
>>>>>> associated
>>>>>> with collections. It seems to me that in this area, the provenance model
>>>>>> is
>>>>>> straying into the domain of application design. If collections were
>>>>>> just
>>>>>> sets, I could probably hold my nose and say nothing, but this talk of
>>>>>> having
>>>>>> provenance define various forms of collection indexing seems to me to be
>>>>>> out of
>>>>>> scope.
>>>>>>
>>>>>> So I think this is somewhat in agreement with what Satya says here,
>>>>>> though I
>>>>>> remain unconvinced that the notions of collections and
>>>>>> derivation-by-insertion,
>>>>>> etc., actually *need* to be in the main provenance ontology - why not
>>>>>> let
>>>>>> individual applications define their own provenance extension terms?
>>>>>>
>>>>>> #g
>>>>>> --
>>>>>>
>>>>>> On 18/04/2012 17:35, Satya Sahoo wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>> The issue I had raised last week is that collection is an important
>>>>>>> provenance construct, but the assumption of only key-value pair based
>>>>>>> collection is too narrow and the relations derivedByInsertionFrom,
>>>>>>> Derivation-by-Removal are over specifications that are not required.
>>>>>>>
>>>>>>> I have collected the following examples for collection, which only
>>>>>>> require
>>>>>>> the definition of the collection in DM5 (collection of entities) and
>>>>>>> they
>>>>>>> don't have (a) a key-value structure, and (b) derivedByInsertionFrom,
>>>>>>> derivedByRemovalFrom relations are not needed:
>>>>>>> 1. Cell line is a collection of cells used in many biomedical
>>>>>>> experiments.
>>>>>>> The provenance of the cell line (as a collection) includes, who
>>>>>>> submitted
>>>>>>> the cell line, what method was used to authenticate the cell line, when
>>>>>>> was
>>>>>>> the given cell line contaminated? The provenance of the cells in a cell
>>>>>>> line includes, what is the source of the cells (e.g. organism)?
>>>>>>>
>>>>>>> 2. A patient cohort is a collection of patients satisfying some
>>>>>>> constraints
>>>>>>> for a research study. The provenance of the cohort includes, what
>>>>>>> eligibility criteria were used to identify the cohort, when was the
>>>>>>> cohort
>>>>>>> identified? The provenance of the patients in a cohort may include
>>>>>>> their
>>>>>>> health provider etc.
>>>>>>>
>>>>>>> Hope this helps our discussion.
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> Best,
>>>>>>> Satya
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 12, 2012 at 5:06 PM, Luc
>>>>>>> Moreau <L.Moreau@ecs.soton.ac.uk> wrote:
>>>>>>>
>>>>>>>> Hi Jun and Satya,
>>>>>>>>
>>>>>>>> Following today's call, ACTION-76 [1] and ACTION-77 [2] were raised
>>>>>>>> against you, as we agreed.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Luc
>>>>>>>>
>>>>>>>> [1]
>>>>>>>>
>>>>>>>> https://www.w3.org/2011/prov/track/actions/76
>>>>>>>>
>>>>>>>> [2]
>>>>>>>>
>>>>>>>> https://www.w3.org/2011/prov/track/actions/77
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Professor Luc Moreau
>>> Electronics and Computer Science   tel:   +44 23 8059 4487
>>> University of Southampton          fax:   +44 23 8059 2865
>>> Southampton SO17 1BJ               email: l.moreau@ecs.soton.ac.uk
>>> United Kingdom                     http://www.ecs.soton.ac.uk/~lavm
>>
>>
>>
>



-- 
Dr. Paul Groth (p.t.groth@vu.nl)
http://www.few.vu.nl/~pgroth/
Assistant Professor
Knowledge Representation & Reasoning Group
Artificial Intelligence Section
Department of Computer Science
VU University Amsterdam
Received on Tuesday, 1 May 2012 07:56:18 UTC