Re: A proposal for reorganizing PROV materials

On 08/05/2012 16:03, Paolo Missier wrote:
> Hi Graham,
>
> I have a naive question on the W3C model: is there a notion of different
> "compliance levels" wrt a recommendation? this probably echoes Luc's earlier
> comment on your proposal -- it is unclear to me what the consequences are of
> cutting through the corpus of existing material in a particular way. Can an
> organization be partially compliant just by implementing the "core"? (this is
> genuinely a reflection of my ignorance!)

Hi Paolo,

To respond to your specific question:  as far as I'm aware, W3C specs do not 
generally define different compliance levels.  There are exceptions, e.g. the 
XML spec defines notions of well-formedness and validity, but the few
cases I'm aware of are specification-specific, and not part of a general 
pattern.  Generally, I'd say that compliance levels are not regarded as a good 
thing, as they admit a basis for non-interoperability between "conforming 
implementations".

But returning to the present discussion:

I don't think my comments about core and non-core terms have anything to do with 
compliance levels.  It is entirely to do with presentation and accessibility of 
the material.  I don't think you'd claim that an application that does not 
generate every defined provenance term is non-conformant.  Similarly, consuming 
applications that choose to ignore some provenance properties are not thereby 
non-conformant, as long as they accept validly constructed provenance 
expressions and correctly interpret the terms they do use.  So presenting a set 
of central "core" terms on which others are more or less dependent isn't 
introducing a new level of conformance, it's just structuring the presentation of 
the concepts.

> In the specifics, two comments. I don't think that directing developers to the
> primer is an admission of failure. I have used it as the entry point for students
> for a number of local projects now and it did a nice job of preparing them for
> the prescriptive language of the DM.

Well, I was speaking from personal experience of working with and using 
standards over several years, and your experience may well be different.  It's 
just that most of the standards I've used that I regarded as well-presented have 
not needed a primer.  Most of the IETF standards that underpin core Internet 
applications (SMTP, MIME, HTTP to name a few) don't have primers.  And they are 
quite easy for developers to use and reference.

Also, for most of the developers we need to target to get really wide scale 
deployment, this will be a blip on the side of their main project, at the level 
of a chore: adding a feature requested by a marketing department.  A cookbook 
approach would probably be their first port of 
call, but (if they are remotely conscientious) they will want to be able to 
quickly cross-check that against a normative specification.  Having to read a 
primer to understand the normative specification doesn't help here.  Or a 
related scenario would be a marketing department asking "how much effort to add 
provenance to our output data?".  To answer this, I would want to consult the 
normative spec for no more than 15-30 minutes (not a primer) and if that looks 
complicated the answer might be "lots".  The marketing department may prefer to 
fund some other "checkbox" feature that looks cheaper.

> The second comment is that I wouldn't relegate PROV-N to the semantics docs.
> Developers need to be aware of PROV-N both to generate and consume provenance,
> regardless of the formal semantics (which most developers will probably ignore).

I tentatively disagree.  My sense is that the main use of PROV-N in the 
specifications is to ground the more formal discussions.  (Part of my proposal 
was to move the discussion of constraints and inferences to being non-normative, 
as they should all follow from a normative formal semantics specification.)

I think the real large-scale use of provenance on the web will just use RDF, for 
which a developer can go straight to the ontology document (which, so far as I've 
looked at it, is rather well organized and presented) without being aware 
of PROV-N.  (A possible flaw in my approach here is if PROV-N is needed to 
understand a description of the core structural components - except that the 
ontology does a pretty good job of introducing them, IMO.)
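
To give a sense of what I have in mind, here's a minimal sketch in Python using 
rdflib (the prov: terms are my reading of the ontology draft; the resource names 
and the example scenario are invented):

    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    # Terms from the PROV ontology draft; EX is an invented example namespace.
    PROV = Namespace("http://www.w3.org/ns/prov#")
    EX = Namespace("http://example.org/")

    g = Graph()
    g.bind("prov", PROV)

    # A dataset (entity) generated by a processing run (activity),
    # attributed to the person (agent) who ran it.
    g.add((EX.dataset1, RDF.type, PROV.Entity))
    g.add((EX.run1, RDF.type, PROV.Activity))
    g.add((EX.alice, RDF.type, PROV.Agent))
    g.add((EX.dataset1, PROV.wasGeneratedBy, EX.run1))
    g.add((EX.dataset1, PROV.wasAttributedTo, EX.alice))
    g.add((EX.run1, PROV.wasAssociatedWith, EX.alice))

    print(g.serialize(format="turtle"))

If attaching basic provenance really is at this level of effort, a developer 
need never see PROV-N at all.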

However, I am willing to be convinced that PROV-N does serve a purpose in 
explaining the data model.  In the current document structure, it has been moved 
to a separate document from PROV-DM, so the data model description has to be 
somewhat free-standing, even though it does make some use of PROV-N constructs. 
(I've commented elsewhere about this separation.)

> But while I am happy that PROV goes beyond OPMV in many ways, I am also worried
> about some of the specific complications that we are introducing in the model,
> see for instance the ongoing discussion on the various wasStartedBy* relations.
> My concrete suggestion is that, if we decide that it is ok to keep these
> relations in all their subtlety, at the very least we need to offer a
> non-normative "pattern book" specifically targeted at developers who need to
> generate "correct" provenance. It should reflect and be consistent with the
> constraints but never mention them. Thoughts?

I think I'd go along with that, as long as one doesn't need a non-normative 
pattern book to use the fundamentals.  "never mention" may be a bit extreme.  As 
far as I can tell, many of the constraints attempt to capture intuitions about 
the nature of provenance, and I think it's OK to discuss those intuitions in a 
cookbook.  Something I'd be wary of is that some of those intuitions may be 
misleading unless they are entailed by the formal semantics.

#g
--

> On 5/8/12 1:20 PM, Graham Klyne wrote:
>> On 06/05/2012 12:01, Paul Groth wrote:
>>> It would really be good to get specific suggestions from you. What
>>> should be cut? What should be changed?
>> <TL;DR>
>> For "normal" developers:
>> 1. A simple structural core model/vocabulary for provenance, also identifying
>> extension points
>> 2. Common extension terms
>> 3. Ontology (i.e. expressing provenance in RDF)
>> 4. A simple guide for generating provenance information
>>
>> For advanced users of provenance:
>> 5. Formal semantics (incorporating PROV-N)
>> 6. An advanced guide for using and interpreting provenance
>> </TL;DR>
>>
>> ...
>>
>> Paul, I've been thinking about your question, and will try to articulate here my
>> thoughts. They will be quite radical, and I don't really expect the group to
>> accept them - but I hope they may trigger some useful reflection. (Separating
>> collections is a useful step, but I feel it's nibbling at the edge of the 
>> complexity problem rather than facing it head-on.)
>>
>> Before diving in, I think it's worth reviewing my motivation for this...
>>
>>
>> At the heart of my position is the question:
>>
>> "For provenance, what does success look like?"
>>
>> (a) Maybe it looks like this: rich and fully worked out specifications which
>> are shown to address a range of described use-cases, complete with a consistent
>> underlying theory that can be used to construct useful proofs around provenance
>> information, reviewed and accepted for standards-track publication in the W3C.
>> Software implementations that capture and exploit this provenance information in
>> all its richness, and peer reviewed papers showing how provenance information,
>> if provided according to the specification, can be used to underpin a range of
>> trust issues around data on the web.
>>
>> (b) Or maybe like this: a compact easily-grasped structure that makes it easy
>> for developers to attach available information to their published datasets with
>> just a few extra lines of code. So easy to understand and apply that it becomes
>> the norm to provide for every published dataset on the web, so that provenance
>> information about data becomes as ubiquitous as data on the web, as ubiquitous
>> as FOAF information about people.
>>
>> I think we are pretty much on course for (a), which is a perfectly reasonable
>> position, but for me the massive potential we have for real impact is (b), which
>> I think will be much harder to achieve on the basis of the current
>> specifications.
>>
>> (My following comments are based in part on my experience as a developer working
>> with other complex ontologies (notably FRBR and CIDOC-CRM): by isolating and
>> clearly explaining the structural core, the whole ontology becomes much easier 
>> to approach and utilize.)
>>
>>
>> So what does it take to stand a chance of achieving (b)? My thoughts:
>>
>> 1. Identify the simple, structural core of provenance and describe that in a
>> normative self-contained document for developers, with sufficient rigor and
>> detail that developers who follow the spec can consistently generate basic
>> provenance information structures, and with enough simplicity that developers
>> whose primary interest is not provenance *can* follow the spec. This should be
>> fewer than 20 terms overall (the current "starting point" consists of 13 terms;
>> OPMV (http://open-biomed.sourceforge.net/opmv/ns.html) has 15).
>>
>> This structural core should also identify the intended extension points, and how
>> to add the "epistemic" aspects of provenance. (That's a term I've adopted for
>> this purpose, meaning the vocabulary terms that convey specific knowledge in
>> conjunction with the underlying provenance structure; e.g. the specific role of
>> an agent in an activity, the author of a document. Is there a more widely used
>> term for this?) The document at http://code.google.com/p/opmv/wiki/OPMVGuide2
>> (esp. section 3) covers many of the relevant issues, including how to use common
>> provenance-related vocabularies in concert with the structural core.
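>>
>> As a rough sketch of the structural core working in concert with an "epistemic" 
>> vocabulary (Python/rdflib again; Dublin Core's creator term stands in for the 
>> epistemic layer, and the resource names are invented):
>>
>>     from rdflib import Graph, Namespace
>>     from rdflib.namespace import RDF, DCTERMS
>>
>>     PROV = Namespace("http://www.w3.org/ns/prov#")  # ontology draft terms
>>     EX = Namespace("http://example.org/")           # invented example names
>>
>>     g = Graph()
>>     # Structural core: the report is attributed to some agent.
>>     g.add((EX.report, RDF.type, PROV.Entity))
>>     g.add((EX.bob, RDF.type, PROV.Agent))
>>     g.add((EX.report, PROV.wasAttributedTo, EX.bob))
>>     # "Epistemic" layer: a common vocabulary (Dublin Core here) says what
>>     # kind of attribution this is, in this case authorship.
>>     g.add((EX.report, DCTERMS.creator, EX.bob))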
>>
>> (NOTE: I say "normative" here, because I think the approach of directing
>> developers first to a non-normative primer is a kind of admission of failure,
>> and still leaves a developer needing to master the normative documents if they 
>> are to be confident that their code is generating valid provenance information.)
>>
>> This could use information currently in the Primer (section 2, but not the stuff
>> about specialization/alternative) and/or Ontology documents (section 3.1).
>>
>>
>> 2. Introduce "epistemic" provenance concepts that deal with common specific
>> requirements (e.g. collections, quotation, etc.), without formalization. I
>> would expect this to be organized as reference material, consisting of several
>> optional and free-standing sub-sections (or even separate documents). Examples
>> of the kind of material might be
>> http://code.google.com/p/opmv/wiki/GuideOfCommonModule,
>> http://code.google.com/p/opmv/wiki/OPMVExtensionsDataCollections.
>>
>> This would cover the parts of the model corresponding to "Expanded terms" and
>> "Dictionary terms" in the ontology document, and maybe aspects of "Qualified
>> terms" (see below).
>>
>>
>> 3. Ontology - specific terms for representing provenance in RDF. The current
>> provenance document seems to me to be pretty well organized from a high-level
>> view. (My assumption is that any of the subsections of "expanded terms",
>> "qualified terms" and "Dictionary terms" can be skipped by anyone who does not
>> need access to the capabilities they provide.)
>>
>> I have not been involved in the discussions about qualified terms, and I am
>> somewhat concerned by the level of complexity they introduce into the RDF model
>> (22 additional classes and 26 properties). I can only hope that most
>> applications that generate provenance information do not have to be concerned
>> with these. (Looking at figure 2 in the ontology document, it seems to me that
>> for many practical purposes the intent of these properties could be captured by
>> properties applied directly to the Activity ... it seems there's a kind of
>> "double reification" going on here with respect to the naive presentation of
>> provenance via something like DC. In practice, if I were developing an
>> application around this model using RDF that had to work with data at any
>> reasonable scale, I'd probably end up introducing such properties in any case
>> for performance reasons - cf. http://code.google.com/p/milarq/).
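>>
>> To make the "double reification" concern concrete, here is my rough reading of 
>> the qualified pattern from the ontology document (term names as I understand 
>> them from that draft; the resources and the shortcut property are invented), 
>> next to the kind of direct property I'd expect to introduce:
>>
>>     from rdflib import Graph, Namespace, BNode
>>     from rdflib.namespace import RDF
>>
>>     PROV = Namespace("http://www.w3.org/ns/prov#")  # ontology draft terms
>>     EX = Namespace("http://example.org/")           # invented example names
>>
>>     g = Graph()
>>
>>     # Qualified pattern: the association between activity and agent is
>>     # reified as a node of its own so that a role can hang off it.
>>     assoc = BNode()
>>     g.add((EX.run1, PROV.wasAssociatedWith, EX.alice))
>>     g.add((EX.run1, PROV.qualifiedAssociation, assoc))
>>     g.add((assoc, RDF.type, PROV.Association))
>>     g.add((assoc, PROV.agent, EX.alice))
>>     g.add((assoc, PROV.hadRole, EX.operator))
>>
>>     # The kind of direct shortcut I'd expect to introduce for performance:
>>     # a single application-defined (hypothetical) property on the Activity.
>>     g.add((EX.run1, EX.operatedBy, EX.alice))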
>>
>>
>> 4. Describe how to generate provenance information in very simple terms for
>> developers who are not and do not want to be specialists in provenance
>> information (e.g. think of a developer creating a web site using Drupal - we
>> want it to be really easy for them to design provenance information into their
>> system).
>>
>>
>> 5. Formal semantics, including the formal definition of PROV-N upon which it is
>> based. This would include material from
>> http://www.w3.org/2011/prov/wiki/FormalSemanticsWD3
>>
>>
>> 6. Describe how to consume/interpret provenance information, in particular with
>> reference to the formal semantics. This would be aimed at more specialist users
>> (and creators) of provenance information, and would address the subtleties such
>> as specialization, alternative, etc. Among other things, it would cover more
>> formal aspects such as constraints, inferences, and mappings to the qualified 
>> terms pattern from common patterns, from subproperties of the basic structural 
>> properties, and from other simplified ways of expressing information.
>> Much of the material currently in the DM "constraints" document might end up
>> here.
>>
>> ...
>>
>> In summary:
>>
>> 1. A simple structural core model/vocabulary for provenance (Normative)
>> This should be the entry point, easy to read and absorb, for all users.
>> 2. Common extension terms (Normative)
>> This should be structured more as a reference work,
>> so relevant parts are easily accessed and others can be ignored.
>> 3. Ontology (i.e. expressing provenance in RDF) (Normative)
>> Pretty much as the current document.
>> 4. A simple guide for generating provenance information (Informative)
>> This would contain primer material dealing with the core concepts.
>>
>> For most developers, the above would be all they need to know about.
>>
>> 5. Formal semantics (incorporating PROV-N) (Normative)
>> A dense, formal description of PROV-N syntax and model theoretic
>> formal semantics for a strict interpretation of the provenance model.
>> 6. An advanced guide for using and interpreting provenance (Informative)
>> For advanced developers of provenance applications and/or theory,
>> exploring and explaining the more formal aspects of provenance and how
>> they might affect applications that use provenance.
>>
>> ...
>>
>> So those are my thoughts. They involve a fairly radical reorganization of the
>> material we have, but I don't think that they call for fundamental changes to
>> the technical consensus, or for the creation of significant new material. Existing
>> material may need sub-editing, heavily in places.
>>
>> #g
>> --
>>
>>
>
>

Received on Tuesday, 8 May 2012 18:58:30 UTC