Re: Review of PROV-DM (WD4) up to section 5 from Luc Moreau on 2012-02-23 (public-prov-wg@w3.org from February 2012)

From: Luc Moreau <L.Moreau@ecs.soton.ac.uk>
Date: Thu, 23 Feb 2012 13:51:46 +0000
To: public-prov-wg@w3.org
Message-ID: <EMEW3|cfea733aac3d7367e2598b3a524ef29bo1MDpp08L.Moreau|ecs.soton.ac.uk|4F464472>
Hi Graham,

I am sorry, but I don't understand which document you have reviewed.
http://dvcs.w3.org/hg/prov/raw-file/7aadc6332722/model/ProvenanceModel.html
is WD3.

What needed to be reviewed is:
http://dvcs.w3.org/hg/prov/raw-file/default/model/working-copy/towards-wd4.html
http://dvcs.w3.org/hg/prov/raw-file/default/model/working-copy/prov-dm-constraints.html
http://dvcs.w3.org/hg/prov/raw-file/default/model/working-copy/prov-asn.html

as indicated on http://www.w3.org/2011/prov/wiki/ProvDMWorkingDraft4

Regards,
Luc



On 02/23/2012 01:16 PM, Graham Klyne wrote:
> Reviewing: 
> http://dvcs.w3.org/hg/prov/raw-file/7aadc6332722/model/ProvenanceModel.html 
>
>
> Summary: I'm sorry to say that I don't think the document even starts 
> to bring in the kind of simplification discussed at the F2F meeting, 
> which is required if this spec is to gain traction with web developers.
>
> I find the document is still difficult to read, and in a full morning 
> of reviewing it I've only got as far as section 5.  I think further 
> *radical* simplification is required for the data model description, 
> and I think it's possible without losing any essential information 
> about the model.
>
> ...
>
> (Nit: when I load this document from a local copy of the repository, I 
> get an error reported indicating a problem with fetching the CSS.   It 
> loads OK from the above URI.  Is there a problematic relative URI 
> reference in the source document?)
>
> ...
>
> I thought we'd agreed at F2F to provide a simple "scruffy" 
> introduction to the DM (part 1), then introduce the requirement and 
> refinements for more formally tractable provenance expressions that 
> can be used to build accurate historical records over multiple related 
> artifacts (part 2).   The document I'm reading does very little that I 
> can see to make the prov-dm more approachable, as was indicated that 
> we need to do at the F2F.  As far as I can tell, the only thing that 
> has been done in this direction is to *add* a new section on 
> interpretation. This, of itself, does nothing to simplify the DM 
> description.
>
> I think we should be placing far more emphasis on making it as simple 
> as we possibly can for information providers to publish provenance.  
> Consider that the primary beneficiaries of provenance information are 
> the *consumers* of published information, not the *publishers*, so if 
> we make life unnecessarily hard for publishers we're shooting 
> ourselves in the collective foot.  From this, I think the initial 
> introduction to the DM needs to be radically simplified to the extent 
> that a developer can spend 10-15 minutes glancing at it and think "oh 
> yes, I can easily add this to my output data".  If necessary, we push 
> some of the work of understanding what needs to be done to harmonize 
> the data to make it more suitable for building a historical record 
> towards the consumer.
>
> ...
>
> With this in mind:
>
> Section 2:
>
> The introductory material in section 2.1 is unhelpful, and I propose 
> it be removed from the introduction.  Most of this material is not 
> important until we come to consider the more formal aspects of the 
> DM.  With the exception of 2.1.2.1 about events, which I think should 
> be introduced in the PROV-DM core model section.  Similarly sections 
> 2.2 and 2.3 (maybe moving the two introductory sentences of 2.2 into 
> section 2.4).  Thus section 2 would become just a very brief intro to 
> the notation used for describing ASN, and maybe this could be moved 
> into the PROV-DM core section (sect 5).
>
> Section 3 looks generally useful.  But it still mentions an "account 
> record", which I understood was being dropped.  It also mentions 
> "alternateOf" and "specializationOf" which are not necessary for a 
> "scruffy" introduction to provenance, so I suggest mention of these is 
> dropped from here.  I suggest dropping the sentence about core and 
> common relations - it's just noise.  With the removal of accounts, I 
> think the whole purpose of notes/annotation records *as part of the 
> provenance model* has become moot, and suggest that these be dropped 
> from the spec.  There's nothing to prevent annotations being added to 
> the provenance data as rdfs:comment or rdfs:label values.  I suggest 
> dropping the mention of extensibility points: again, it's just noise 
> at this point.
>
> Section 4:  to my mind, this example section adds no useful 
> information and doesn't help understanding of the model (on account of being 
> harder to follow than the ASN model description), and suggest that it 
> be dropped.  Alternatively, I suggest moving it to an appendix.
>
> Section 5: this is the vital core of this document.  Section 3 
> provides a very useful high-level overview, so this section can just 
> get down to describing the constructs.
>
> I note that ASN is mis-named: it's not really an *abstract* syntax 
> notation; it's quite concrete, so it's more like a 
> (technology-neutral) functional syntax notation.  @@raise separate issue 
> for this?
>
> Section 5.1:  prov-dm is a data model, not an implementation, right?  
> So why do we need to introduce "housekeeping constructs ... to 
> facilitate their interchange"?  Suggest dropping most of the 
> discussion of "record container", and simply introduce the 
> "recordContainer" and "namespaceDeclaration" productions along with 
> production for "record".
>
>
> Section 5.2.1:  Entity record
>
> Suggest drop "In PROV-DM, " - it's redundant.
>
> Suggest the examples focus more on web documents, with "car" as more 
> of an afterthought.  Primary use will probably be to describe web 
> documents, so let's keep this at front-of-mind?
>
> Suggest dropping all mentions of "asserters viewpoint" and "situation 
> in the world" - these don't matter for the "scruffy" view of provenance.
>
> Suggest dropping the idea that the attributes somehow define the 
> entity ("whose situation in the world is represented by the 
> attribute-value pairs").  They're just there to provide information 
> about the entity, and as hooks for interoperability.  (I argued 
> previously for dropping attributes completely, but was persuaded 
> otherwise by the interoperability argument from the provenance 
> challenges - don't try to make more of them.)
>
> Suggest drop issue mentioning "characterization interval" - I think 
> it's now a non-issue.
>
> I think the issue of uniqueness of identifiers should be dealt with in 
> the introduction to ASN, not under the individual elements.
>
> Under "further considerations", suggest dropping all but 3rd and 6th 
> bullets. In the 6th bullet, I don't understand the stuff about "a 
> namespace also declares the number of occurrences...".  I have deep 
> concern about what this might be trying to say.  In any case, 
> shouldn't this be covered under a description of the namespace, if 
> needed?
>
> I think the material about "activities" and "plans" really doesn't 
> belong in this section.
>
>
> Section 5.2.2 Activity record
>
> Suggest drop "In PROV-DM, " - it's redundant.
>
> Didn't we discuss replacing the start, end times by events?  I don't 
> recall the outcome - I'm just mentioning this in case it's been missed.
>
> For the example, I suggest leading on something to do with information 
> on the web.
>
> It was a surprise to me to learn that PROV-DM has reserved 
> attributes.  If attributes are in the model to support 
> interoperability with other provenance frameworks (which is my 
> understanding from previous discussions), this feels like a poor 
> design choice.  Maybe it should be a separate parameter?  In any case, 
> I think the intent of this "subtyping" needs to be explained.
>
> If this is to be a "scruffy" introduction, I think the reference to 
> start-view-end is not needed here.  In any case, the cross-reference 
> is almost impossible to locate in a printed copy of the spec.
>
> I think the issue of uniqueness of identifiers should be dealt with in 
> the introduction to ASN, not under the individual elements.
>
> Suggest dropping the "further considerations" bullets.
>
> Did we not agree that activities *would* be allowable as entities 
> (especially if entities are just stuff that can be identified)?
>
>
> Section 5.2.3, Agent record
>
> Having introduced a framework for subtyping for activities, why not 
> use the same approach for different types of agents ... especially 
> considering that two major agent types are defined by reference to 
> existing foaf definitions?  I suggest not asserting the claim that the 
> agent types are mutually exclusive.
>
> Suggest drop reference to "situation in the world".
>
> Suggest drop discussion of inferences of agent records - if needed, 
> they should come later along with a more formal ("non-scruffy") 
> treatment of the data model.
>
>
> Section 5.2.4, Note record
>
> I think this should be dropped from the data model.  I don't see that 
> it serves any needed *provenance* function.  "extra information" can 
> be added by format-specific extensions.  As such, this record type 
> only adds noise to the specification.
>
>
> Section 5.3.1.1 Generation record
>
> I believe the ASN syntax given verges on being ambiguous, and is 
> unnecessarily tricky to parse by a human or machine consumer; e.g. 
> consider:
>
>   wasGeneratedBy(a,b)
>   wasGeneratedBy(a,b,)
>
> The presence of the trailing comma in the second example completely 
> changes the parse tree productions associated with a and b.  I think 
> it would be much easier if ASN simply required a dummy activity 
> identifier to be provided; i.e. don't make aidentifier optional.  
> Indeed, rather than allowing optional identifiers anywhere in the ASN, 
> one might use a placeholder (e.g. '_') for any unspecified identifier, 
> which would make the overall syntax much more regular.
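The placeholder idea above can be sketched as follows. This is only a minimal illustration, assuming a hypothetical fixed field layout (id, entity, activity, time); it does not reproduce the actual ASN production in the draft. With every position mandatory and '_' standing for "unspecified", each argument is read by position alone, and a form with a trailing comma is rejected rather than parsed differently.

```python
import re

# Hypothetical fixed field layout for a generation record. This is an
# assumption for illustration, not the production given in the draft.
FIELDS = ("id", "entity", "activity", "time")


def parse_generation(text):
    """Parse 'wasGeneratedBy(f1,f2,f3,f4)'; '_' marks an unspecified field."""
    match = re.fullmatch(r"\s*wasGeneratedBy\((.*)\)\s*", text)
    if match is None:
        raise ValueError("not a wasGeneratedBy record: %r" % text)
    parts = [part.strip() for part in match.group(1).split(",")]
    # Every position must be present, so arity never changes the meaning
    # of an earlier argument.
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(parts)))
    if any(part == "" for part in parts):
        raise ValueError("empty field; use '_' for an unspecified identifier")
    return {name: (None if part == "_" else part)
            for name, part in zip(FIELDS, parts)}
```

Under this assumed layout, `parse_generation("wasGeneratedBy(_,e1,a1,_)")` yields a record whose id and time are unspecified, while `wasGeneratedBy(a,b,)` raises an error instead of silently selecting a different production.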
>
> Since the id is used only for annotations, I suggest dropping it (see 
> section 5.2.4 comment above).
>
> If this is to be a "scruffy" introduction, I think the reference to 
> generation-within-activity is not needed here.  In any case, the 
> cross-reference is almost impossible to locate in a printed copy of 
> the spec.  Suggest drop this.
>
> Similarly, suggest dropping the structural constraint here.
>
>
> Section 5.3.1.2 Usage record
>
> Suggest drop "In PROV-DM, " - it's redundant.
>
> Why is there an identifier for a usage record?
>
> Suggest lead with example of consuming a web resource.
>
> Suggest drop reference to annotation record (see above note about 5.2.4)
>
> Suggest drop reference to interpretation here
>
>
> Section 5.3.2.1 Association record
>
> Para 3: Suggest drop first sentence, and simplify; i.e. just say; 
> "Activities may reflect the execution of a plan..."
>
> Para 4: there is quite a bit of redundancy redundancy here.  Suggest:
> [[
> A plan is the description of a set of actions or steps intended by one 
> or more agents to achieve some goal. PROV-DM is not prescriptive about 
> the nature of plans, their representation, the actions and steps they 
> consist of, and their intended goals. A plan can be a workflow for a 
> scientific experiment, a recipe for a cooking activity, or a list of 
> instructions for a micro-processor execution. Plans are entities, 
> which may have associated provenance. An activity may be associated 
> with multiple plans, allowing for descriptions of activities initially 
> associated with a plan, which was changed, on the fly, as the activity 
> progresses. Plans can be successfully executed or they can fail. We 
> expect applications to exploit PROV-DM extensibility mechanisms to 
> capture the rich nature of plans and associations between activities 
> and plans.
> ]]
>
> Para 5: I see no value in cross-referencing the responsibility record 
> here. Suggest dropping this paragraph.
>
> Why is there an identifier for an association record?
>
>
> Section 5.3.2.2 Start and End records
>
> This seems to overlap with start, end parameters on an activity.   
> It's not immediately clear how they play together.
>
> Should this record not describe an "event"?  Then the id should 
> identify the start/end event, not the record.  cf. Issue 207.
>
> Identifiers should denote activities and agents, *not records*.
>
>
> Section 5.3.3.1 Responsibility record
>
> Suggest drop "To promote take-up... " and instead lead with a simple 
> introduction of what the record describes.
>
> Para 3: It seems to me that the responsibility record should stand 
> independently of any association record.  Suggest drop "Given an 
> activity association record... (...)"
>
> Why is there an identifier for a responsibility record?
>
>
> Section 5.3.3.2 Derivation record
>
> Suggest drop "In PROV-DM, "
>
> This whole section seems way too complicated.  My understanding is that 
> the "Common relations" section is intended to cover those useful 
> short-cut expressions that can be expressed with less convenience in 
> the core model.  As such, I think the derivation record should be a 
> "common" rather than a "core" relation.
>
> Aside from that, I really don't see the utility of all this stuff 
> about precise and imprecise derivations.  I think there is just one 
> useful relation to define, roughly corresponding to "imprecise 
> n-derivation record" here:
>
> - I note that the "imprecise 1-derivation record" and "imprecise 
> n-derivation record" are not syntactically distinguishable, so there's 
> no point in discussing the difference.
>
> - the "precise 1-derivation record" can be expressed using an 
> activity, usage and generation record: I'm not convinced this 
> alternative syntax is really buying anything worthwhile.
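To illustrate the second point, here is a rough sketch of how a precise 1-derivation could be rewritten as the three simpler records it stands for. The tuple shapes and the function name are hypothetical and chosen for illustration only; they are not the draft's ASN productions.

```python
def expand_precise_derivation(derived, source, activity):
    """Return activity, usage and generation records carrying the same facts.

    Records are illustrative tuples: (record kind, argument, ...).
    """
    return [
        ("activity", activity),                 # the activity itself
        ("used", activity, source),             # the activity used the source entity
        ("wasGeneratedBy", derived, activity),  # and generated the derived entity
    ]
```

For example, `expand_precise_derivation("e2", "e1", "a")` lists an activity `a` that used `e1` and generated `e2`, which is the information the dedicated derivation syntax would otherwise carry in one record.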
>
> Suggest radical simplification along these lines, and move to section 
> 6.  Don't introduce all the formal stuff until a later section 
> handling more formal treatments.
>
>
> Section 5.3.3.3 Alternate and Specialization records
>
> In considering a "scruffy" view of provenance, these relations aren't 
> really needed.  However, they do underpin a more formal treatment in 
> the face of dynamic resources.
>
> I would give serious consideration to introducing these later, when 
> the more formal treatment of dynamic resources is considered.
>
>
> Section 5.3.4.  Annotation record
>
> I think this serves no needed purpose, and should be dropped.  (See 
> earlier comments for section 5.2.4.)
>
>
> Section 5.4.1 Account record
>
> I understood we'd agreed to drop this.
>
>
> Section 5.4.2 Record container
>
> I think this is mainly an artifact of the ASN syntax, and should be 
> introduced more briefly in the introductory section 5.1 (see previous 
> comments)
>
>
> Section 5.5.1 Attribute
>
> I think the "optional-attribute-value" productions covered in section 
> 5.2.1 (Entity) should be covered here since they apply to multiple 
> record types.
>
> I would prefer to see attribute names presented as being IRIs in the 
> data model, with the namespace-qualified CURIE syntax available as a 
> convenience in the ASN presentation.
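A minimal sketch of that split, assuming the data model holds full IRIs and the ASN presentation writes them as prefix:local names resolved against declared namespaces. The function name and error handling are illustrative assumptions, not part of any draft.

```python
def expand_curie(curie, namespaces):
    """Expand 'prefix:local' to a full IRI using a prefix -> namespace map."""
    prefix, sep, local = curie.partition(":")
    if not sep:
        raise ValueError("no prefix in %r" % curie)
    if prefix not in namespaces:
        raise ValueError("undeclared namespace prefix: %r" % prefix)
    # The data model would store only the resulting IRI; the prefixed
    # form is a presentation convenience.
    return namespaces[prefix] + local


# Example namespace declaration; only the rdfs namespace IRI is a real one.
NAMESPACES = {"rdfs": "http://www.w3.org/2000/01/rdf-schema#"}
```

With this declaration, `expand_curie("rdfs:label", NAMESPACES)` gives `http://www.w3.org/2000/01/rdf-schema#label`.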
>
> I think the predefined attribute names should be dealt with in a 
> separate section.  I'm actually not convinced this is the best design 
> choice for properties with DM-defined meaning, as opposed to (say) 
> using separate record parameters, but that's more of a style issue 
> than a fundamental objection.
>
> As indicated earlier, I think the whole discussion of derivation steps 
> is too much detail, and I don't see the value, and would suggest 
> dropping the prov:steps attribute.
>
> For attribute prov:label:  why not just use rdfs:label?
>
>
> Section 5.5.2 Identifiers
>
> The text says they are *qualified* names, but in most of the example 
> they are not.  Also, some identifiers are described as having local 
> scope: this is not compatible with using *qualified* names which are 
> essentially IRIs.
>
> The text describes identifiers as denoting *records* (e.g. entity 
> record) - I think this is wrong, and in any case is inconsistent with 
> text elsewhere in the document.  They should denote "entity", 
> "activity", "agent", etc.
>
>
> Section 5.5.3 Literal
>
> "A PROV-DM Literal represents a value whose interpretation is outside 
> the scope of PROV-DM."  What a Terrible Failure... the whole point of 
> languages introducing literals is precisely that their interpretation 
> *is* defined by the language.  If not, they might as well be names.
>
> I think the intent is that their interpretation is defined by 
> reference to the corresponding xsd datatype definition, or some other 
> datatype definition, that is effectively incorporated by reference.
>
> I'd suggest that an interpretation of literals is provided by:
> - http://www.w3.org/TR/rdf-mt/#gddenot
> - http://www.w3.org/TR/rdf-mt/#DTYPEINTERP
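In the spirit of the datatype interpretation referenced above, a typed literal denotes the value obtained by applying its datatype's lexical-to-value mapping to its lexical form. A small sketch, assuming only three XSD datatypes and using Python built-ins as approximate mappings:

```python
XSD = "http://www.w3.org/2001/XMLSchema#"


def _parse_boolean(lexical):
    # xsd:boolean accepts the lexical forms true, false, 1 and 0.
    if lexical in ("true", "1"):
        return True
    if lexical in ("false", "0"):
        return False
    raise ValueError("not an xsd:boolean lexical form: %r" % lexical)


# Datatype IRI -> lexical-to-value mapping. Python's int() is an
# approximation: it accepts a few forms that xsd:integer does not.
LEXICAL_TO_VALUE = {
    XSD + "string": str,
    XSD + "integer": int,
    XSD + "boolean": _parse_boolean,
}


def literal_value(lexical, datatype_iri):
    """Return the value denoted by a typed literal, or fail if unrecognised."""
    try:
        mapping = LEXICAL_TO_VALUE[datatype_iri]
    except KeyError:
        raise ValueError("unrecognised datatype: %s" % datatype_iri)
    return mapping(lexical)
```

The point of the sketch is that the interpretation is fixed by the referenced datatype definition rather than being left outside the model.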
>
> Section 5.5.4 Time
>
> No syntax production provided or indicated.
>
> I think it's unnecessary and inappropriate to indicate where time is 
> used.  It's just something to go wrong as the document evolves.
>
>
> Section 5.5.5 Asserter
>
> Do we really still need this (now accounts are gone)?  Suggest dropping.
>
>
> Section 5.5.6 Namespace
>
> I'd suggest covering this with the introduction of the record 
> container syntax production
>
>
> Section 5.5.7 Location
>
> Do we have any explicit use of this?  If not, I'd suggest dropping it.
>
> ...
>
> I'm out of time and stopping my review here.  There's a general 
> pattern here that I'd also apply to section 6.
>
> I'd then take section 7 and (probably) expand it into several 
> sections ("Part 2") introducing and describing a more formal treatment 
> of provenance that can be used to bridge from and refine the "scruffy" 
> view to something that can be assembled and processed according to 
> inferences that flow from the formal semantics.  A key point to 
> introduce here would be that it is possible to create provenance 
> statements that cannot possibly satisfy the formal semantics, and to 
> indicate what additional constraints and disciplines should be applied 
> to ensure that they can (and hence to make the inferences that flow 
> from those semantics valid).
>
> #g
> -- 
>
>

-- 
Professor Luc Moreau
Electronics and Computer Science   tel:   +44 23 8059 4487
University of Southampton          fax:   +44 23 8059 2865
Southampton SO17 1BJ               email: l.moreau@ecs.soton.ac.uk
United Kingdom                     http://www.ecs.soton.ac.uk/~lavm
Received on Thursday, 23 February 2012 13:52:23 UTC