Re: Fwd: Going for simplicity (was: actions related to collections)

Hi Stian,

Answer interleaved.
>
>> *From:* Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
>> *Date:* 29 April 2012 20:44:16 GMT+01:00
>> *To:* Graham Klyne <graham.klyne@zoo.ox.ac.uk>
>> *Cc:* Satya Sahoo <satya.sahoo@case.edu>, <public-prov-wg@w3.org>,
>> Luc Moreau <L.Moreau@ecs.soton.ac.uk>, Paolo Missier
>> <Paolo.Missier@ncl.ac.uk>
>> *Subject:* *Going for simplicity (was: actions related to collections)*
>>
>>
>> 5, Insightful.
>>
>> I agree on the general principle of simplicity. I had similar 
>> feelings when wasQuoteOf and friends moved in, but have now grown to 
>> like the few essential "real world" relations, rather than having only 
>> an (easily verbose and not very rich) entity-activity-agent model.
>>
>> As you point out, a richer standard will also enable richer 
>> integration for fewer clients.
>>
>> One way towards having many adopters, some rich, is a simple core 
>> model plus additional buy-in modules. The core gets everyone hooked; 
>> the modules give richness by providing a standard extension: "hey, you 
>> are thinking about collections in your prov, how about checking out 
>> this bit over here".
>>
>> But we need to make the essential modules. OPM suggested that adopters 
>> make profiles and extensions, but I don't know of many such 
>> extensions in real life. For instance, DataONE is still working on 
>> agreeing how to do workflow provenance using OPM.
>>
>> Modules would also work as a kind of damage control. Let's say our 
>> view of attribution turned out to be very wrong for digital 
>> publishing, while our view of derivation was a perfect fit. 
>> Adopters could choose to use PROV derivations and make their own, 
>> richer attribution model. With one massive model, we might easily put 
>> people off if one of our aspects is wrong/naive/difficult compared 
>> to a domain's view.
>>
>> I believe our current components in DM can form such a 
>> modularization. However, I have not read any recommendation about how 
>> these can be used in such a pick-and-choose adoption; I thought they 
>> were merely rhetorical groupings to ease understanding. Luc?
>>

Yes, I saw components as a conceptual structuring of the data model, and 
not as a way of optionally selecting which bit of the model we want to use.

There has been (so far!) no indication from the WG that we wanted to 
make some part of the model optional.  This can be considered of course.

But to be effective, components need to be complementary. At the moment, 
derivations and responsibility are still entangled.
I don't think that's desirable.


>> Is your suggestion that we, for instance, have /ns/prov# (core), 
>> /ns/prov-attribution#, etc., or simply drop everything that is not 
>> "OPMV-like"? (My question: why not then use OPMV?)
>>

I don't think we are keen to introduce multiple namespaces.

Luc
>>
>> -- 
>> Stian Soiland-Reyes, myGrid team
>> School of Computer Science
>> The University of Manchester
>>
>> On Apr 26, 2012 6:24 PM, "Graham Klyne" <graham.klyne@zoo.ox.ac.uk> wrote:
>>
>>     On 26/04/2012 13:39, Paolo Missier wrote:
>>
>>         Graham
>>
>>         you have made your point on this over and over again.
>>
>>
>>     Yes, I've said it before, but I think not (in this context) so
>>     often as to count as "over and over again".  (Previously, I've
>>     objected to using collections to model provenance accounts, which
>>     was a different matter.)
>>
>>         ... I think we get it, but I
>>         still don't see a strong argument. That is because the
>>         criteria used to define
>>         the scope here have been blurry and that has not improved
>>         with time.
>>         The comments that followed my own personal opinion on this
>>         (attached) seem to
>>         indicate that capturing the evolution of sets may be a good
>>         idea, given their
>>         pervasiveness. If this belongs to a specific domain, which
>>         domain is it?
>>
>>
>>     Fair enough.  I'll see if I can substantiate my position...
>>
>>     First, to be clear, I'm not saying that "capturing the evolution
>>     of sets" is not a good idea.  What I question is the extent to
>>     which it *should* be *entirely* down to the PROV spec to achieve
>>     this.
>>
>>     We're defining a standard, and I think it's in the nature of
>>     standards for use on the global Internet/Web that the criteria
>>     for defining scope are blurry, because we can't expect to
>>     anticipate all of the ways in which they will be used.
>>
>>     For me, the acid test will be the extent of adoption.  In my
>>     experience, it is the *simple* standards (of all kinds) that get
>>     more widely adopted.  TCP/IP vs OSI.  SMTP vs X.400.  HTTP vs any
>>     number of content management systems.
>>
>>     I see the same for ontologies/vocabularies.  The widely used
>>     success stories are ones like DC, FOAF, SIOC, SKOS, etc., which
>>     all have the characteristic of focusing on a small set of core
>>     concepts.  Of course there are more specialized large
>>     ontologies/vocabularies that have a strong following (e.g. a number
>>     of bioinformatics standards), but within much more confined
>>     communities.  (TimBL has a slide about costs of ontology vs size
>>     of community http://www.w3.org/2006/Talks/0314-ox-tbl/#(22) - it
>>     emphasizes the benefits of widespread adoption, but doesn't
>>     address costs associated with the *size* of the ontology.)
>>
>>     In my view, provenance is something that /should/ be there with
>>     the likes of DC and FOAF in terms of adoption.  That, for me,
>>     prioritizes keeping it as small as possible to maximize adoption.
>>
>>     To repeat: I'm not saying that provenance of collections is not
>>     useful.  I'm sure it is very useful in many situations.  For me
>>     the test is not so much what is useful as what *needs* to be in
>>     the base provenance spec, by virtue of the fact that it cannot
>>     reasonably be retro-fitted via available extension points.  What
>>     I have not seen is an explanation of why the provenance of
>>     collections cannot be handled through specialization of the core
>>     provenance concepts we already have.  This might even be a
>>     separate *standard*.
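>>
>>     (A purely illustrative sketch of what I mean by specialization --
>>     the "ex:" namespace and terms are invented for this example, not
>>     proposed names, assuming ordinary RDFS extension mechanisms:)
>>
>>         @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
>>         @prefix prov: <http://www.w3.org/ns/prov#> .
>>         @prefix ex:   <http://example.org/prov-collections#> .
>>
>>         # A collections vocabulary defined outside the base spec:
>>         ex:Collection rdfs:subClassOf prov:Entity .
>>         ex:derivedByInsertionFrom rdfs:subPropertyOf prov:wasDerivedFrom .
>>
>>     A client that knows only core PROV still sees an entity and a
>>     derivation; a collections-aware client sees the richer terms.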
>>
>>     For me, all this is an application of the principles of
>>     minimum power, independent invention and modularity
>>     (http://www.w3.org/DesignIssues/Principles.html).
>>
>>     In many ways (and, to be clear, this is not a proposal, just an
>>     illustration) I'd rather like to see something like OPMV go
>>     forward as a base spec for provenance, because it's really clear
>>     from that what the key ideas are, and how they tie together.
>>
>>     Many of the things the group spends time discussing (including,
>>     but not limited to, collections) can be layered on this basic model.
>>      The tension here is that by specifying more in the base model,
>>     one achieves a greater level of interoperability between systems
>>     *that fully implement the defined model*, and at the same time
>>     decreases the number of systems that attempt to implement the
>>     model.  This raises the question: is it more beneficial to have
>>     relatively few systems implement a very rich model of provenance
>>     interoperability, or to have very many systems implement a
>>     relatively weak model?  And of course, it's not black-or-white
>>     ... there are reasonable points in between.   I think my view is
>>     clearly to "turn the dial" to the simpler end of the spectrum
>>     but, of course, YMMV.
>>
>>         But I am sorry that you are having to hold your nose. Believe
>>         me, the provenance
>>         of a set doesn't smell that bad.
>>
>>
>>     That was a figure of speech, and was probably an overly strong
>>     statement.
>>
>>     As I say above, I'm sure provenance of collections of various
>>     kinds is useful and important - what I'm really trying to push on
>>     is how much needs to be in the base provenance specs that
>>     developers will have to master.
>>
>>     I think that later in the discussion I saw a mention of abstract
>>     collections that could be specialized in different ways.  That,
>>     for me, could represent a reasonable compromise, though my
>>     preference would be to deal with collections separately.
>>
>>     Maybe what I'm doing here is making a case for modularization of
>>     the provenance spec (a la PML?), rather than lumping it all into
>>     one, er, collection.
>>
>>     ...
>>
>>     Returning to your comment about blurry criteria, here are some
>>     that are not blurry (though they are also unsubstantiated, but
>>     there are some clues at
>>     http://richard.cyganiak.de/blog/2011/02/top-100-most-popular-rdf-namespace-prefixes/):
>>
>>     * I think that if we can produce a base provenance ontology of
>>     <=8 classes and <=12 properties, we stand a chance of deployment at
>>     the scale of FOAF (the numbers are approximately the size of FOAF
>>     core - http://xmlns.com/foaf/spec/)
>>
>>     * I think a base ontology with twice the number of classes could
>>     achieve less than 10% of the adoption of FOAF (e.g. compare
>>     interest in vCard vs FOAF or DC at
>>     http://richard.cyganiak.de/blog/2011/02/top-100-most-popular-rdf-namespace-prefixes/)
>>
>>     * I think a base ontology with substantially more terms will
>>     receive substantially less adoption.
>>
>>     The numbers here are, to be sure, very unscientific.  But it's
>>     interesting that, not counting the "infrastructure" ontologies
>>     (rdf, rdfs, owl, ex), all the "high interest" ontologies that I
>>     probed were also relatively small (up to 40 terms overall, at a
>>     rough guess).
>>
>>     On this basis, my criterion becomes very un-blurry: fewer terms
>>     is better by far.
>>
>>     Of course, there's a balance to be struck, but it brings home to
>>     me that each term that is added to the overall provenance
>>     ontology has to bring substantial benefit if the adoption
>>     (impact) of our work is not to be reduced.
>>
>>     ...
>>
>>     Finally, the reason I think that PROV *could* be as popular as
>>     FOAF is because it is positioned to underpin a key missing
>>     feature of the web - providing a machine actionable basis for
>>     dealing with conflicting information (trust, information quality
>>     assessment).  It could be, in a real sense, the FOAF of data
>>     ("who are you?", "who do you know?", "where do you come from?",
>>     etc.).
>>
>>     As yet, we don't *know* what aspects of provenance will be
>>     important in this respect, though there is some research
>>     (including your own, Paolo) that suggests some directions.  So,
>>     in pursuit of this goal, the thing about PROV that matters almost
>>     more than anything else is scale of adoption.  So, on this view,
>>     *anything* that stands in the way of adoption without providing
>>     needed functionality that cannot be achieved in any other way is
>>     arguably an impediment to the eventual success of PROV.
>>
>>     #g
>>     --
>>
>>         On 4/26/12 12:04 PM, Graham Klyne wrote:
>>
>>             I find myself somewhat concerned by what appears to be
>>             scope creep associated
>>             with collections. It seems to me that in this area, the
>>             provenance model is
>>             straying into the domain of application design. If
>>             collections were just
>>             sets, I could probably hold my nose and say nothing, but
>>             this talk of having
>>             provenance define various forms of collection indexing
>>             seems to me to be out of
>>             scope.
>>
>>             So I think this is somewhat in agreement with what Satya
>>             says here, though I
>>             remain unconvinced that the notions of collections and
>>             derivation-by-insertion,
>>             etc., actually *need* to be in the main provenance
>>             ontology - why not let
>>             individual applications define their own provenance
>>             extension terms?
>>
>>             #g
>>             --
>>
>>             On 18/04/2012 17:35, Satya Sahoo wrote:
>>
>>                 Hi all,
>>                 The issue I had raised last week is that collection
>>                 is an important
>>                 provenance construct, but the assumption of only
>>                 key-value pair based
>>                 collections is too narrow, and the relations
>>                 derivedByInsertionFrom and
>>                 derivedByRemovalFrom are over-specifications that
>>                 are not required.
>>
>>                 I have collected the following examples of
>>                 collections, which only require
>>                 the definition of the collection in DM5 (a collection
>>                 of entities): (a) they
>>                 don't have a key-value structure, and (b) the
>>                 derivedByInsertionFrom and
>>                 derivedByRemovalFrom relations are not needed:
>>                 1. A cell line is a collection of cells used in many
>>                 biomedical experiments.
>>                 The provenance of the cell line (as a collection)
>>                 includes: who submitted
>>                 the cell line, what method was used to authenticate
>>                 the cell line, and when
>>                 the given cell line was contaminated. The provenance of
>>                 the cells in a cell
>>                 line includes: what is the source of the cells (e.g.
>>                 the organism)?
>>
>>                 2. A patient cohort is a collection of patients
>>                 satisfying some constraints
>>                 for a research study. The provenance of the cohort
>>                 includes: what
>>                 eligibility criteria were used to identify the
>>                 cohort, and when the cohort was
>>                 identified. The provenance of the patients in a
>>                 cohort may include their
>>                 health provider, etc.
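>>
>>                 (A rough Turtle sketch of example 1 -- the names are
>>                 invented for illustration, and I am assuming the DM5
>>                 collection maps to something like prov:Collection and
>>                 prov:hadMember:)
>>
>>                     @prefix prov: <http://www.w3.org/ns/prov#> .
>>                     @prefix :     <http://example.org/lab#> .
>>
>>                     :cellLine1 a prov:Collection ;
>>                         prov:hadMember :cell1, :cell2 ;            # plain membership
>>                         prov:wasAttributedTo :submittingLab ;      # who submitted it
>>                         prov:wasGeneratedBy :authenticationStep .  # how it was authenticated
>>
>>                     :cell1 prov:wasDerivedFrom :sourceOrganism .   # provenance of a member
>>
>>                 No derivedByInsertionFrom or derivedByRemovalFrom
>>                 assertions are needed.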
>>
>>                 Hope this helps our discussion.
>>
>>                 Thanks.
>>
>>                 Best,
>>                 Satya
>>
>>
>>                 On Thu, Apr 12, 2012 at 5:06 PM, Luc
>>                 Moreau <L.Moreau@ecs.soton.ac.uk> wrote:
>>
>>                     Hi Jun and Satya,
>>
>>                     Following today's call, ACTION-76 [1] and
>>                     ACTION-77 [2] were raised
>>                     against you, as we agreed.
>>
>>                     Cheers,
>>                     Luc
>>
>>                     [1] https://www.w3.org/2011/prov/track/actions/76
>>
>>                     [2] https://www.w3.org/2011/prov/track/actions/77
>>

-- 
Professor Luc Moreau
Electronics and Computer Science   tel:   +44 23 8059 4487
University of Southampton          fax:   +44 23 8059 2865
Southampton SO17 1BJ               email: l.moreau@ecs.soton.ac.uk
United Kingdom                     http://www.ecs.soton.ac.uk/~lavm

Received on Monday, 30 April 2012 08:33:21 UTC