Re: A proposal for reorganizing PROV materials from Paolo Missier on 2012-05-08 (public-prov-wg@w3.org from May 2012)

From: Paolo Missier <Paolo.Missier@ncl.ac.uk>
Date: Tue, 08 May 2012 16:03:50 +0100
To: W3C provenance WG <public-prov-wg@w3.org>
Message-ID: <4FA935D6.50808@ncl.ac.uk>
Hi Graham,

I have a naive question on the W3C model: is there a notion of different "compliance levels" with respect to a recommendation? This probably
echoes Luc's earlier comment on your proposal -- it is unclear to me what the consequences are of cutting through the corpus of
existing material in a particular way. Can an organization be partially compliant just by implementing the "core"? (This is
genuinely a reflection of my ignorance!)

In the specifics, two comments. I don't think that directing developers to the primer is an admission of failure. I have used it
as the entry point for students on a number of local projects now, and it did a nice job of preparing them for the prescriptive
language of the DM.
The second comment is that I wouldn't relegate PROV-N to the semantics docs. Developers need to be aware of PROV-N both to generate 
and consume provenance, regardless of the formal semantics (which most developers will probably ignore).

  But while I am happy that PROV goes beyond OPMV in many ways, I am also worried about some of the specific complications that we 
are introducing in the model, see for instance the ongoing discussion on the various wasStartedBy* relations. My concrete suggestion 
is that, if we decide that it is ok to keep these relations in all their subtlety, at the very least we need to offer a 
non-normative "pattern book" specifically targeted at developers who need to generate "correct" provenance. It should reflect and be 
consistent with the constraints but never mention them.   Thoughts?

-Paolo


On 5/8/12 1:20 PM, Graham Klyne wrote:
> On 06/05/2012 12:01, Paul Groth wrote:
>> It would really be good to get specific suggestions from you. What
>> should be cut? What should be changed?
> <TL:DR>
> For "normal" developers:
> 1. A simple structural core model/vocabulary for provenance, also identifying
> extension points
> 2. Common extension terms
> 3. Ontology (i.e. expressing provenance in RDF)
> 4. A simple guide for generating provenance information
>
> For advanced users of provenance:
> 5. Formal semantics (incorporating PROV-N)
> 6. An advanced guide for using and interpreting provenance
> </TL:DR>
>
> ...
>
> Paul, I've been thinking about your question, and will try to articulate here my
> thoughts.  They will be quite radical, and I don't really expect the group to
> accept them - but I hope they may trigger some useful reflection.  (Separating
> collections is a useful step, but I feel it's rather nibbling at the edge of the
> complexity problem rather than facing it head-on.)
>
> Before diving in, I think it's worth reviewing my motivation for this...
>
>
> At the heart of my position is the question:
>
>     "For provenance, what does success look like?"
>
> (a) Maybe it looks like this:  rich and fully worked out specifications which
> are shown to address a range of described use-cases, complete with a consistent
> underlying theory that can be used to construct useful proofs around provenance
> information, reviewed and accepted for standards-track publication in the W3C.
> Software implementations that capture and exploit this provenance information in
> all its richness, and peer reviewed papers showing how provenance information,
> if provided according to the specification, can be used to underpin a range of
> trust issues around data on the web.
>
> (b) Or maybe like this:  a compact easily-grasped structure that makes it easy
> for developers to attach available information to their published datasets with
> just a few extra lines of code.  So easy to understand and apply that it becomes
> the norm to provide for every published dataset on the web, so that provenance
> information about data becomes as ubiquitous as data on the web, as ubiquitous
> as FOAF information about people.
>
> I think we are pretty much on course for (a), which is a perfectly reasonable
> position, but for me the massive potential we have for real impact is (b), which
> I think will be much harder to achieve on the basis of the current specifications.
>
> (My following comments are based in part on my experience as a developer working
> with other complex ontologies (notably FRBR and CIDOC-CRM):  by isolating and
> clearly explaining the structural core, the whole ontology becomes much easier to
> approach and utilize.)
>
>
> So what does it take to stand a chance of achieving (b)?  My thoughts:
>
> 1. Identify the simple, structural core of provenance and describe that in a
> normative self-contained document for developers, with sufficient rigor and
> detail that developers who follow the spec can consistently generate basic
> provenance information structures, and with enough simplicity that developers
> whose primary interest is not provenance *can* follow the spec.  This should be
> fewer than 20 terms overall (the current "starting point" consists of 13 terms;
> OPMV (http://open-biomed.sourceforge.net/opmv/ns.html) has 15).
>
> This structural core should also identify the intended extension points, and how
> to add the "epistemic" aspects of provenance.  (That's a term I've adopted for
> this purpose, meaning the vocabulary terms that convey specific knowledge in
> conjunction with the underlying provenance structure; e.g. the specific role of
> an agent in an activity, the author of a document.  Is there a more widely used
> term for this?)  The document at http://code.google.com/p/opmv/wiki/OPMVGuide2
> (esp. section 3) covers many of the relevant issues, including how to use common
> provenance-related vocabularies in concert with the structural core.
>
> (NOTE: I say "normative" here, because I think the approach of directing
> developers first to a non-normative primer is a kind of admission of failure,
> and still leaves a developer needing to master the normative documents if they
> are to be confident that their code is generating valid provenance information.)
>
> This could use information currently in the Primer (section 2, but not the stuff
> about specialization/alternative) and/or Ontology documents (section 3.1).
>
>
> 2. Introduce "epistemic" provenance concepts that deal with common specific
> requirements (e.g. collections, quotation, etc.), without formalization.  I
> would expect this to be organized as reference material, consisting of several
> optional and free-standing sub-sections (or even separate documents).  Examples
> of the kind of material might be
> http://code.google.com/p/opmv/wiki/GuideOfCommonModule,
> http://code.google.com/p/opmv/wiki/OPMVExtensionsDataCollections.
>
> This would cover the parts of the model corresponding to "Expanded terms" and
> "Dictionary terms" in the ontology document, and maybe aspects of "Qualified
> terms" (see below).
>
>
> 3. Ontology - specific terms for representing provenance in RDF.  The current
> provenance document seems to me to be pretty well organized from a high-level
> view.  (My assumption is that any of the subsections of "expanded terms",
> "qualified terms" and "Dictionary terms" can be skipped by anyone who does not
> need access to the capabilities they provide.)
>
> I have not been involved in the discussions about qualified terms, and I am
> somewhat concerned by the level of complexity they introduce into the RDF model
> (22 additional classes and 26 properties).  I can only hope that most
> applications that generate provenance information do not have to be concerned
> with these.  (Looking at figure 2 in the ontology document, it seems to me that
> for many practical purposes the intent of these properties could be captured by
> properties applied directly to the Activity ... it seems there's a kind of
> "double reification" going on here with respect to the naive presentation of
> provenance via something like DC.  In practice, if I were developing an
> application around this model using RDF that had to work with data at any
> reasonable scale, I'd probably end up introducing such properties in any case
> for performance reasons - cf. http://code.google.com/p/milarq/).
>
>
> 4. Describe how to generate provenance information in very simple terms for
> developers who are not and do not want to be specialists in provenance
> information (e.g. think of a developer creating a web site using Drupal - we
> want it to be really easy for them to design provenance information into their
> system).
>
>
> 5. Formal semantics, including the formal definition of PROV-N upon which it is
> based.  This would include material from
> http://www.w3.org/2011/prov/wiki/FormalSemanticsWD3
>
>
> 6. Describe how to consume/interpret provenance information, in particular with
> reference to the formal semantics.  This would be aimed at more specialist users
> (and creators) of provenance information, and would address the subtleties such
> as specialization, alternative, etc.  Among other things, it would cover more
> formal aspects such as constraints, inferences, mappings from common patterns,
> mapping from subproperties of the basic structural properties, and other
> simplified ways of expressing information, to the qualified terms pattern, etc.
> Much of the material currently in the DM "constraints" document might end up here.
>
> ...
>
> In summary:
>
> 1. A simple structural core model/vocabulary for provenance (Normative)
>      This should be the entry point, easy to read and absorb, for all users.
> 2. Common extension terms (Normative)
>      This should be structured more as a reference work,
>      so relevant parts are easily accessed and others can be ignored.
> 3. Ontology (i.e. expressing provenance in RDF) (Normative)
>      Pretty much as the current document.
> 4. A simple guide for generating provenance information (Informative)
>      This would contain primer material dealing with the core concepts.
>
> For most developers, the above would be all they need to know about.
>
> 5. Formal semantics (incorporating PROV-N) (Normative)
>      A dense, formal description of PROV-N syntax and model theoretic
>      formal semantics for a strict interpretation of the provenance model.
> 6. An advanced guide for using and interpreting provenance (Informative)
>      For advanced developers of provenance applications and/or theory,
>      exploring and explaining the more formal aspects of provenance and how
>      they might affect applications that use provenance.
>
> ...
>
> So those are my thoughts.  They involve a fairly radical reorganization of the
> material we have, but I don't think that they call for fundamental changes to
> the technical consensus, or for the creation of significant new material.  Existing
> material may need sub-editing, heavily in places.
>
> #g
> --
>
>


-- 
-----------  ~oo~  --------------
Paolo Missier - Paolo.Missier@newcastle.ac.uk, pmissier@acm.org
School of Computing Science, Newcastle University,  UK
http://www.cs.ncl.ac.uk/people/Paolo.Missier
Received on Tuesday, 8 May 2012 15:04:32 UTC