A proposal for reorganizing PROV materials (was: Complexity/simplicty redux) from Graham Klyne on 2012-05-08 (public-prov-wg@w3.org from May 2012)

From: Graham Klyne <graham.klyne@zoo.ox.ac.uk>
Date: Tue, 08 May 2012 13:20:28 +0100
To: Paul Groth <p.t.groth@vu.nl>
CC: W3C provenance WG <public-prov-wg@w3.org>
Message-ID: <4FA90F8C.1030405@zoo.ox.ac.uk>
On 06/05/2012 12:01, Paul Groth wrote:
> It would really be good to get specific suggestions from you. What
> should be cut? What should be changed?

<TL:DR>
For "normal" developers:
1. A simple structural core model/vocabulary for provenance, also identifying 
extension points
2. Common extension terms
3. Ontology (i.e. expressing provenance in RDF)
4. A simple guide for generating provenance information

For advanced users of provenance:
5. Formal semantics (incorporating PROV-N)
6. An advanced guide for using and interpreting provenance
</TL:DR>

...

Paul, I've been thinking about your question, and will try to articulate here my 
thoughts.  They will be quite radical, and I don't really expect the group to 
accept them - but I hope they may trigger some useful reflection.  (Separating 
collections is a useful step, but I feel it's rather nibbling at the edge of the 
complexity problem rather than facing it head-on.)

Before diving in, I think it's worth reviewing my motivation for this...


At the heart of my position is the question:

   "For provenance, what does success look like?"

(a) Maybe it looks like this:  rich and fully worked out specifications which 
are shown to address a range of described use-cases, complete with a consistent 
underlying theory that can be used to construct useful proofs around provenance 
information, reviewed and accepted for standards-track publication in the W3C. 
Software implementations that capture and exploit this provenance information in 
all its richness, and peer reviewed papers showing how provenance information, 
if provided according to the specification, can be used to underpin a range of 
trust issues around data on the web.

(b) Or maybe like this:  a compact easily-grasped structure that makes it easy 
for developers to attach available information to their published datasets with 
just a few extra lines of code.  So easy to understand and apply that it becomes 
the norm to provide for every published dataset on the web, so that provenance 
information about data becomes as ubiquitous as data on the web, as ubiquitous 
as FOAF information about people.

I think we are pretty much on course for (a), which is a perfectly reasonable 
position, but for me the massive potential we have for real impact is (b), which 
I think will be much harder to achieve on the basis of the current specifications.

(My following comments are based in part on my experience as a developer working 
with other complex ontologies (notably FRBR and CIDOC-CRM):  by isolating and 
clearly explaining the structural core, the whole ontology becomes much easier to 
approach and utilize.)


So what does it take to stand a chance of achieving (b)?  My thoughts:

1. Identify the simple, structural core of provenance and describe that in a 
normative self-contained document for developers, with sufficient rigor and 
detail that developers who follow the spec can consistently generate basic 
provenance information structures, and with enough simplicity that developers 
whose primary interest is not provenance *can* follow the spec.  This should be 
fewer than 20 terms overall (the current "starting point" consists of 13 terms; 
OPMV (http://open-biomed.sourceforge.net/opmv/ns.html) has 15).

This structural core should also identify the intended extension points, and how 
to add the "epistemic" aspects of provenance.  (That's a term I've adopted for 
this purpose, meaning the vocabulary terms that convey specific knowledge in 
conjunction with the underlying provenance structure; e.g. the specific role of 
an agent in an activity, the author of a document.  Is there a more widely used 
term for this?)  The document at http://code.google.com/p/opmv/wiki/OPMVGuide2 
(esp. section 3) covers many of the relevant issues, including how to use common 
provenance-related vocabularies in concert with the structural core.

(NOTE: I say "normative" here, because I think the approach of directing 
developers first to a non-normative primer is a kind of admission of failure, 
and still leaves a developer needing to master the normative documents if they 
are to be confident that their code is generating valid provenance information.)

This could use information currently in the Primer (section 2, but not the stuff 
about specialization/alternative) and/or Ontology documents (section 3.1).
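
As a rough illustration of how little code a small structural core might demand 
of a developer, here is a minimal Python sketch that emits a handful of 
provenance statements as N-Triples.  The term names (prov:Entity, 
prov:wasGeneratedBy, etc.) and the namespace are taken from the PROV-O 
vocabulary as later published, so they are assumptions relative to the drafts 
discussed here; the example.org URIs and the function name are hypothetical.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def core_provenance(entity, activity, agent, source=None):
    """Return N-Triples lines describing one entity, activity and agent.

    All arguments are absolute URIs given as plain strings (hypothetical API).
    """
    triples = [
        (entity, RDF_TYPE, PROV + "Entity"),
        (activity, RDF_TYPE, PROV + "Activity"),
        (agent, RDF_TYPE, PROV + "Agent"),
        (entity, PROV + "wasGeneratedBy", activity),
        (activity, PROV + "wasAssociatedWith", agent),
        (entity, PROV + "wasAttributedTo", agent),
    ]
    if source is not None:
        # Optional link back to the input the activity worked from.
        triples.append((activity, PROV + "used", source))
        triples.append((entity, PROV + "wasDerivedFrom", source))
    return ["<%s> <%s> <%s> ." % t for t in triples]


if __name__ == "__main__":
    for line in core_provenance(
        "http://example.org/dataset/2",
        "http://example.org/activity/cleanup",
        "http://example.org/people/alice",
        source="http://example.org/dataset/1",
    ):
        print(line)
```

The sketch deliberately covers only three classes and five properties; anything 
beyond that would be left to the extension points.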


2. Introduce "epistemic" provenance concepts that deal with common specific 
requirements (e.g. collections, quotation, etc.), without formalization.  I 
would expect this to be organized as reference material, consisting of several 
optional and free-standing sub-sections (or even separate documents).  Examples 
of the kind of material might be 
http://code.google.com/p/opmv/wiki/GuideOfCommonModule, 
http://code.google.com/p/opmv/wiki/OPMVExtensionsDataCollections.

This would cover the parts of the model corresponding to "Expanded terms" and 
"Dictionary terms" in the ontology document, and maybe aspects of "Qualified 
terms" (see below).


3. Ontology - specific terms for representing provenance in RDF.  The current 
provenance document seems to me to be pretty well organized from a high-level 
view.  (My assumption is that any of the subsections of "expanded terms", 
"qualified terms" and "Dictionary terms" can be skipped by anyone who does not 
need access to the capabilities they provide.)

I have not been involved in the discussions about qualified terms, and I am 
somewhat concerned by the level of complexity they introduce into the RDF model 
(22 additional classes and 26 properties).  I can only hope that most 
applications that generate provenance information do not have to be concerned 
with these.  (Looking at figure 2 in the ontology document, it seems to me that 
for many practical purposes the intent of these properties could be captured by 
properties applied directly to the Activity ... it seems there's a kind of 
"double reification" going on here with respect to the naive presentation of 
provenance via something like DC.  In practice, if I were developing an 
application around this model using RDF that had to work with data at any 
reasonable scale, I'd probably end up introducing such properties in any case 
for performance reasons - cf. http://code.google.com/p/milarq/).
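
One possible reading of "properties applied directly to the Activity" is 
sketched below in Python: the intermediate node of a qualified association is 
collapsed into a single direct triple whose property encodes the role.  This is 
only a sketch under stated assumptions.  The names prov:qualifiedAssociation, 
prov:agent, prov:hadRole and prov:wasAssociatedWith come from PROV-O as later 
published; the shortcut property (ex:editedBy), the role URI and the function 
name are hypothetical.

```python
PROV = "http://www.w3.org/ns/prov#"


def flatten_associations(triples, role_to_property):
    """Derive direct (activity, property, agent) triples from qualified ones.

    `triples` is an iterable of (subject, predicate, object) strings.
    `role_to_property` maps a role URI to a hypothetical shortcut property.
    """
    triples = list(triples)
    agent_of = {s: o for s, p, o in triples if p == PROV + "agent"}
    role_of = {s: o for s, p, o in triples if p == PROV + "hadRole"}
    direct = []
    for activity, p, assoc in triples:
        if p != PROV + "qualifiedAssociation":
            continue
        agent = agent_of.get(assoc)
        if agent is None:
            continue  # incomplete qualified node; nothing to flatten
        # Fall back to the unqualified property when the role is unmapped.
        prop = role_to_property.get(role_of.get(assoc),
                                    PROV + "wasAssociatedWith")
        direct.append((activity, prop, agent))
    return direct


if __name__ == "__main__":
    EX = "http://example.org/"
    qualified = [
        (EX + "edit1", PROV + "qualifiedAssociation", EX + "assoc1"),
        (EX + "assoc1", PROV + "agent", EX + "alice"),
        (EX + "assoc1", PROV + "hadRole", EX + "roles/editor"),
    ]
    print(flatten_associations(qualified, {EX + "roles/editor": EX + "editedBy"}))
```

A query for "who edited this?" then needs one triple pattern instead of three, 
which is the kind of saving that matters at scale.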


4. Describe how to generate provenance information in very simple terms for 
developers who are not and do not want to be specialists in provenance 
information (e.g. think of a developer creating a web site using Drupal - we 
want it to be really easy for them to design provenance information into their 
system).


5. Formal semantics, including the formal definition of PROV-N upon which it is 
based.  This would include material from 
http://www.w3.org/2011/prov/wiki/FormalSemanticsWD3


6. Describe how to consume/interpret provenance information, in particular with 
reference to the formal semantics.  This would be aimed at more specialist users 
(and creators) of provenance information, and would address the subtleties such 
as specialization, alternative, etc.  Among other things, it would cover more 
formal aspects such as constraints, inferences, mappings from common patterns, 
mapping from subproperties of the basic structural properties, and other 
simplified ways of expressing information, to the qualified terms pattern, etc. 
  Much of the material currently in the DM "constraints" document might end up here.
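
To make "mapping from subproperties of the basic structural properties" a 
little more concrete, the following Python sketch applies the standard RDFS 
sub-property rule: a statement made with a specialized property also yields the 
statement with its parent property.  It is a simplified illustration, not a 
proposed algorithm; each property is assumed to have at most one parent, and 
the ex: property name and the function name are hypothetical.

```python
def infer_super_properties(triples, sub_property_of):
    """Close a set of triples under a single-parent sub-property mapping.

    `triples` is an iterable of (subject, predicate, object) tuples.
    `sub_property_of` maps a property to its parent property.
    """
    known = set(triples)
    frontier = list(known)
    while frontier:
        s, p, o = frontier.pop()
        parent = sub_property_of.get(p)
        if parent is not None and (s, parent, o) not in known:
            known.add((s, parent, o))
            frontier.append((s, parent, o))  # the parent may itself have a parent
    return known


if __name__ == "__main__":
    data = [("ex:edit1", "ex:editedBy", "ex:alice")]
    hierarchy = {"ex:editedBy": "prov:wasAssociatedWith"}
    for triple in sorted(infer_super_properties(data, hierarchy)):
        print(triple)
```

A consumer that understands only the structural core can then work from the 
inferred statements without knowing the specialized vocabulary.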

...

In summary:

1. A simple structural core model/vocabulary for provenance (Normative)
    This should be the entry point, easy to read and absorb, for all users.
2. Common extension terms (Normative)
    This should be structured more as a reference work,
    so relevant parts are easily accessed and others can be ignored.
3. Ontology (i.e. expressing provenance in RDF) (Normative)
    Pretty much as the current document.
4. A simple guide for generating provenance information (Informative)
    This would contain primer material dealing with the core concepts.

For most developers, the above would be all they need to know about.

5. Formal semantics (incorporating PROV-N) (Normative)
    A dense, formal description of PROV-N syntax and model theoretic
    formal semantics for a strict interpretation of the provenance model.
6. An advanced guide for using and interpreting provenance (Informative)
    For advanced developers of provenance applications and/or theory,
    exploring and explaining the more formal aspects of provenance and how
    they might affect applications that use provenance.

...

So those are my thoughts.  They involve a fairly radical reorganization of the 
material we have, but I don't think that they call for fundamental changes to 
the technical consensus, or for the creation of significant new material.  Existing 
material may need sub-editing, heavily in places.

#g
--
Received on Tuesday, 8 May 2012 12:21:13 UTC