Re: Review of PROV-DM (WD4) up to section 5

I now realize I spent all morning reviewing the WRONG DOCUMENT :(

I've now taken a quick look at 
http://dvcs.w3.org/hg/prov/raw-file/a5f7ff3d6b30/model/working-copy/towards-wd4.html 
- I think this does start to address some of the provenance complexity issues, 
but I also think many of the comments I made do still apply:

Section 2:  I think much of the material here could be in the core 
specification.  But it's much easier to follow than the previous material.  The 
diagram is less clear to me than the older diagram, but I think that's just a 
placeholder.  If the overview text is retained, I think it might be helpful to 
have the overview diagram first.

Section 3:  I still find the example not-very-helpful at this point.  It uses 
ASN expressions before they have been defined.  I'd suggest having it as an 
appendix.  I find the process vs authors view approach confusing.

Section 4:  many of my previous comments (to previous section 5) are addressed 
here, but I still think Note/annotations is superfluous, and derivation is 
over-complicated.   I'm not seeing the syntax distinguished symbol production 
(that used to be provenanceContainer).  I think several of my previous comments 
about identifiers, attributes and qualified names still apply.

Out of time - need to join telecon now.

#g
--


On 23/02/2012 13:16, Graham Klyne wrote:
> Reviewing:
> http://dvcs.w3.org/hg/prov/raw-file/7aadc6332722/model/ProvenanceModel.html
>
> Summary: I'm sorry to say that I don't think the document even starts to bring
> in the kind of simplification discussed at the F2F meeting, which is required if
> this spec is to gain traction with web developers.
>
> I find the document is still difficult to read, and in a full morning of
> reviewing it I've only got as far as section 5. I think further *radical*
> simplification is required for the data model description, and I think it's
> possible without losing any essential information about the model.
>
> ...
>
> (Nit: when I load this document from a local copy of the repository, I get an
> error reported indicating a problem with fetching the CSS. It loads OK from the
> above URI. Is there a problematic relative URI reference in the source document?)
>
> ...
>
> I thought we'd agreed at F2F to provide a simple "scruffy" introduction to the
> DM (part 1), then introduce the requirement and refinements for more formally
> tractable provenance expressions that can be used to build accurate historical
> records over multiple related artifacts (part 2). The document I'm reading does
> very little that I can see to make the prov-dm more approachable, as was
> indicated that we need to do at the F2F. As far as I can tell, the only thing
> that has been done in this direction is to *add* a new section on interpretation.
> This, of itself, does nothing to simplify the DM description.
>
> I think we should be placing far more emphasis on making it as simple as we
> possibly can for information providers to publish provenance. Consider that the
> primary beneficiaries of provenance information are the *consumers* of published
> information, not the *publishers*, so if we make life unnecessarily hard for
> publishers we're shooting ourselves in the collective foot. From this, I think
> the initial introduction to the DM needs to be radically simplified to the
> extent that a developer can spend 10-15 minutes glancing at it and think "oh
> yes, I can easily add this to my output data". If necessary, we push some of the
> work of understanding what needs to be done to harmonize the data to make it
> more suitable for building a historical record towards the consumer.
>
> ...
>
> With this in mind:
>
> Section 2:
>
> The introductory material in section 2.1 is unhelpful, and I propose it be
> removed from the introduction. Most of this material is not important until we
> come to consider the more formal aspects of the DM. The exception is
> 2.1.2.1 about events, which I think should be introduced in the PROV-DM core
> model section. Similarly, I'd remove sections 2.2 and 2.3 (maybe moving the two
> introductory sentences of 2.2 into section 2.4). Thus section 2 would become just a very
> brief intro to the notation used for describing ASN, and maybe this could be
> moved into the PROV-DM core section (sect 5).
>
> Section 3 looks generally useful. But it still mentions an "account record",
> which I understood was being dropped. It also mentions "alternateOf" and
> "specializationOf" which are not necessary for a "scruffy" introduction to
> provenance, so I suggest mention of these is dropped from here. I suggest
> dropping the sentence about core and common relations - it's just noise. With
> the removal of accounts, I think the whole purpose of notes/annotation records
> *as part of the provenance model* has become moot, and suggest that these be
> dropped from the spec. There's nothing to prevent annotations being added to the
> provenance data as rdfs:comment or rdfs:label values. I suggest dropping the
> mention of extensibility points: again, it's just noise at this point.
>
> Section 4: to my mind, this example section adds no useful information and
> doesn't help understanding of the model (on account of being harder to follow than
> the ASN model description), and I suggest that it be dropped. Alternatively, I suggest
> moving it to an appendix.
>
> Section 5: this is the vital core of this document. Section 3 provides a very
> useful high-level overview, so this section can just get down to describing the
> constructs.
>
> I note that ASN is mis-named: it's not really an *abstract* syntax notation;
> it's quite concrete, so it's more like a (technology-neutral) functional syntax
> notation. @@raise separate issue for this?
>
> Section 5.1: prov-dm is a data model, not an implementation, right? So why do we
> need to introduce "housekeeping constructs ... to facilitate their interchange"?
> Suggest dropping most of the discussion of "record container", and simply
> introduce the "recordContainer" and "namespaceDeclaration" productions along
> with production for "record".
>
>
> Section 5.2.1: Entity record
>
> Suggest drop "In PROV-DM, " - it's redundant.
>
> Suggest the examples focus more on web documents, with "car" as more of an
> afterthought. Primary use will probably be to describe web documents, so let's
> keep this at front-of-mind?
>
> Suggest dropping all mentions of "asserters viewpoint" and "situation in the
> world" - these don't matter for the "scruffy" view of provenance.
>
> Suggest dropping the idea that the attributes somehow define the entity ("whose
> situation in the world is represented by the attribute-value pairs"). They're
> just there to provide information about the entity, and as hooks for
> interoperability. (I argued previously for dropping attributes completely, but
> was persuaded otherwise by the interoperability argument from the provenance
> challenges - don't try to make more of them.)
>
> Suggest drop issue mentioning "characterization interval" - I think it's now a
> non-issue.
>
> I think the issue of uniqueness of identifiers should be dealt with in the
> introduction to ASN, not under the individual elements.
>
> Under "further considerations", suggest dropping all but 3rd and 6th bullets. In
> the 6th bullet, I don't understand the stuff about "a namespace also declares
> the number of occurrences...". I have deep concern about what this might be
> trying to say. In any case, shouldn't this be covered under a description of the
> namespace, if needed?
>
> I think the material about "activities" and "plans" really doesn't belong in
> this section.
>
>
> Section 5.2.2 Activity record
>
> Suggest drop "In PROV-DM, " - it's redundant.
>
> Didn't we discuss replacing the start, end times by events? I don't recall the
> outcome - I'm just mentioning this in case it's been missed.
>
> For the example, I suggest leading on something to do with information on the web.
>
> It was a surprise to me to learn that PROV-DM has reserved attributes. If
> attributes are in the model to support interoperability with other provenance
> frameworks (which is my understanding from previous discussions), this feels
> like a poor design choice. Maybe it should be a separate parameter? In any case,
> I think the intent of this "subtyping" needs to be explained.
>
> If this is to be a "scruffy" introduction, I think the reference to
> start-view-end is not needed here. In any case, the cross-reference is almost
> impossible to locate in a printed copy of the spec.
>
> I think the issue of uniqueness of identifiers should be dealt with in the
> introduction to ASN, not under the individual elements.
>
> Suggest dropping the "further considerations" bullets.
>
> Did we not agree that activities *would* be allowable as entities (especially if
> entities are just stuff that can be identified)?
>
>
> Section 5.2.3, Agent record
>
> Having introduced a framework for subtyping for activities, why not use the same
> approach for different types of agents ... especially considering that two major
> agent types are defined by reference to existing foaf definitions? I suggest not
> asserting the claim that the agent types are mutually exclusive.
>
> Suggest drop reference to "situation in the world".
>
> Suggest drop discussion of inferences of agent records - if needed, they should
> come later along with a more formal ("non-scruffy") treatment of the data model.
>
>
> Section 5.2.4, Note record
>
> I think this should be dropped from the data model. I don't see that it serves
> any needed *provenance* function. "extra information" can be added by
> format-specific extensions. As such, this record type only adds noise to the
> specification.
>
>
> Section 5.3.1.1 generation record
>
> I believe the ASN syntax given verges on being ambiguous, and is unnecessarily
> tricky to parse by a human or machine consumer; e.g. consider:
>
> wasGeneratedBy(a,b)
> wasGeneratedBy(a,b,)
>
> The presence of the trailing comma in the second example completely changes the
> parse tree productions associated with a and b. I think it would be much easier
> if ASN simply required a dummy activity identifier to be provided; i.e. don't
> make aidentifier optional. Indeed, rather than allowing optional identifiers
> anywhere in the ASN, one might use a placeholder (e.g. '_') for any unspecified
> identifier, which would make the overall syntax much more regular.
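
A minimal sketch of the placeholder idea above, in Python. The slot names (id, entity, activity) and their order are my assumptions for illustration, not taken from the draft's grammar; the point is only that a fixed number of positions, with '_' for "unspecified", keeps each argument in the same slot regardless of what is omitted.

```python
import re

PLACEHOLDER = "_"
# Hypothetical slot names for a generation record; the draft's real
# grammar may name or order these differently.
GENERATION_SLOTS = ("id", "entity", "activity")


def parse_record(text, slots):
    """Parse 'name(arg,...)' where every slot must hold a value or '_'."""
    match = re.fullmatch(r"\s*(\w+)\((.*)\)\s*", text)
    if match is None:
        raise ValueError(f"not a record expression: {text!r}")
    name, body = match.groups()
    args = [arg.strip() for arg in body.split(",")]
    # Reject missing or empty positions instead of guessing what was meant.
    if len(args) != len(slots) or "" in args:
        raise ValueError(f"{name} needs exactly {len(slots)} arguments: {text!r}")
    # '_' marks an unspecified identifier; it never shifts the other slots.
    values = {slot: (None if arg == PLACEHOLDER else arg)
              for slot, arg in zip(slots, args)}
    return name, values


print(parse_record("wasGeneratedBy(_,e1,a1)", GENERATION_SLOTS))
```

Under this scheme both wasGeneratedBy(a,b) and wasGeneratedBy(a,b,) are simply rejected, so a reader never has to work out which production a trailing comma selects.
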
>
> Since the id is used only for annotations, I suggest dropping it (see section
> 5.2.4 comment above).
>
> If this is to be a "scruffy" introduction, I think the reference to
> generation-within-activity is not needed here. In any case, the cross-reference
> is almost impossible to locate in a printed copy of the spec. Suggest drop this.
>
> Similarly, suggest dropping the structural constraint here.
>
>
> Section 5.3.1.2 Usage record
>
> Suggest drop "In PROV-DM, " - it's redundant.
>
> Why is there an identifier for a usage record?
>
> Suggest lead with example of consuming a web resource.
>
> Suggest drop reference to annotation record (see above note about 5.2.4)
>
> Suggest drop reference to interpretation here
>
>
> Section 5.3.2.1 Association record
>
> Para 3: Suggest drop first sentence, and simplify; i.e. just say; "Activities
> may reflect the execution of a plan..."
>
> Para 4: there is quite a bit of redundancy redundancy here. Suggest:
> [[
> A plan is the description of a set of actions or steps intended by one or more
> agents to achieve some goal. PROV-DM is not prescriptive about the nature of
> plans, their representation, the actions and steps they consist of, and their
> intended goals. A plan can be a workflow for a scientific experiment, a recipe
> for a cooking activity, or a list of instructions for a micro-processor
> execution. Plans are entities, which may have associated provenance. An activity
> may be associated with multiple plans, allowing for descriptions of activities
> initially associated with a plan, which was changed, on the fly, as the activity
> progresses. Plans can be successfully executed or they can fail. We expect
> applications to exploit PROV-DM extensibility mechanisms to capture the rich
> nature of plans and associations between activities and plans.
> ]]
>
> Para 5: I see no value in cross-referencing the responsibility record here.
> Suggest dropping this paragraph.
>
> Why is there an identifier for an association record?
>
>
> Section 5.3.2.2 Start and End records
>
> This seems to overlap with start, end parameters on an activity. It's not
> immediately clear how they play together.
>
> Should this record not describe an "event"? Then the id should identify the
> start/end event, not the record. cf. Issue 207.
>
> Identifiers should denote activities and agents, *not records*.
>
>
> Section 5.3.3.1 Responsibility record
>
> Suggest drop "To promote take-up... " and instead lead with a simple
> introduction of what the record describes.
>
> Para 3: It seems to me that the responsibility record should stand independently
> of any association record. Suggest drop "Given an activity association record...
> (...)"
>
> Why is there an identifier for a responsibility record?
>
>
> Section 5.3.3.2 Derivation record
>
> Suggest drop "In PROV-DM, "
>
> This whole section seems way too complicated. My understanding is that the
> "Common relations" section is intended to cover those useful short-cut
> expressions that can be expressed with less convenience in the core model. As
> such, I think the derivation record should be a "common" rather than a "core"
> relation.
>
> Aside from that, I really don't see the utility of all this stuff about precise
> and imprecise derivations. I think there is just one useful relation to define,
> roughly corresponding to "imprecise n-derivation record" here:
>
> - I note that the "imprecise 1-derivation record" and "imprecise n-derivation
> record" are not syntactically distinguishable, so there's no point in discussing
> the difference.
>
> - the "precise 1-derivation record" can be expressed using an activity, usage
> and generation record: I'm not convinced this alternative syntax is really
> buying anything worthwhile.
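
To illustrate the second point, here is a rough Python sketch of spelling out a "precise" derivation with the three core records. The tuple shapes and the argument order used(activity, entity) / wasGeneratedBy(entity, activity) are my assumptions for illustration, and the helper names are hypothetical; this is not the draft's definition.

```python
def expand_precise_derivation(derived, source, activity):
    """Return core records saying 'derived' came from 'source' via 'activity'."""
    return [
        ("activity", activity),
        ("used", activity, source),             # the activity consumed the source
        ("wasGeneratedBy", derived, activity),  # and produced the derived entity
    ]


def to_asn(record):
    """Render a record tuple as an ASN-like functional expression."""
    name, *args = record
    return f"{name}({','.join(args)})"


for record in expand_precise_derivation("e2", "e1", "a1"):
    print(to_asn(record))
```

If three short records carry the same information, a dedicated precise-derivation syntax looks like a convenience rather than a core construct.
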
>
> Suggest radical simplification along these lines, and move to section 6. Don't
> introduce all the formal stuff until a later section handling more formal
> treatments.
>
>
> Section 5.3.3.3 Alternate and Specialization records
>
> In considering a "scruffy" view of provenance, these relations aren't really
> needed. However, they do underpin a more formal treatment in the face of dynamic
> resources.
>
> I would give serious consideration to introducing these later, when the more
> formal treatment of dynamic resources is considered.
>
>
> Section 5.3.4. Annotation record
>
> I think this serves no needed purpose, and should be dropped. (See earlier
> comments for section 5.2.4.)
>
>
> Section 5.4.1 Account record
>
> I understood we'd agreed to drop this.
>
>
> Section 5.4.2 Record container
>
> I think this is mainly an artifact of the ASN syntax, and should be introduced
> more briefly in the introductory section 5.1 (see previous comments)
>
>
> Section 5.5.1 Attribute
>
> I think the "optional-attribute-value" productions covered in section 5.2.1
> (Entity) should be covered here since they apply to multiple record types.
>
> I would prefer to see attribute names presented as being IRIs in the data model,
> with the namespace-qualified CURIE syntax available as a convenience in the ASN
> presentation.
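
A small sketch of what I mean, in Python: the model would only ever see the full IRI, and the prefixed form is expanded when the ASN is read. The prefix table, the example namespace and the function name are hypothetical, chosen only for illustration.

```python
# Hypothetical prefix table, standing in for the namespace declarations
# that would accompany a set of records.
NAMESPACES = {"ex": "http://example.org/terms/"}


def expand_qualified_name(qname, namespaces):
    """Expand 'prefix:local' into a full IRI using declared namespaces."""
    prefix, sep, local = qname.partition(":")
    if not sep:
        raise ValueError(f"no prefix in {qname!r}")
    if prefix not in namespaces:
        raise KeyError(f"undeclared prefix {prefix!r}")
    return namespaces[prefix] + local


print(expand_qualified_name("ex:version", NAMESPACES))
```

Two records that use different prefixes for the same namespace then compare equal on attribute name, which is the interoperability property that matters.
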
>
> I think the predefined attribute names should be dealt with in a separate
> section. I'm actually not convinced this is the best design choice for
> properties with DM-defined meaning, as opposed to (say) using separate record
> parameters, but that's more of a style issue than a fundamental objection.
>
> As indicated earlier, I think the whole discussion of derivation steps is too
> much detail, and I don't see the value, and would suggest dropping the
> prov:steps attribute.
>
> For attribute prov:label: why not just use rdfs:label?
>
>
> Section 5.5.2 Identifiers
>
> The text says they are *qualified* names, but in most of the examples they are
> not. Also, some identifiers are described as having local scope: this is not
> compatible with using *qualified* names which are essentially IRIs.
>
> The text describes identifiers as denoting *records* (e.g. entity record) - I
> think this is wrong, and in any case is inconsistent with text elsewhere in the
> document. They should denote "entity", "activity", "agent", etc.
>
>
> Section 5.5.3 Literal
>
> "A PROV-DM Literal represents a value whose interpretation is outside the scope
> of PROV-DM." What a Terrible Failure... the whole point of languages introducing
> literals is precisely that their interpretation *is* defined by the language.
> If not, they might as well be names.
>
> I think the intent is that their interpretation is defined by reference to the
> corresponding xsd datatype definition, or some other datatype definition, that
> is effectively incorporated by reference.
>
> I'd suggest that an interpretation of literals is provided by:
> - http://www.w3.org/TR/rdf-mt/#gddenot
> - http://www.w3.org/TR/rdf-mt/#DTYPEINTERP
>
> Section 5.5.4 Time
>
> No syntax production provided or indicated.
>
> I think it's unnecessary and inappropriate to indicate where time is used. It's
> just something to go wrong as the document evolves.
>
>
> Section 5.5.5 Asserter
>
> Do we really still need this (now accounts are gone)? Suggest dropping.
>
>
> Section 5.5.6 Namespace
>
> I'd suggest covering this with the introduction of the record container syntax
> production.
>
>
> Section 5.5.7 Location
>
> Do we have any explicit use of this? If not, I'd suggest dropping it.
>
> ...
>
> I'm out of time and stopping my review here. There's a general pattern here that
> I'd also apply to section 6.
>
> I'd then take section 7 and (probably) expand it into several sections ("Part
> 2") introducing and describing a more formal treatment of provenance that can be
> used to bridge from and refine the "scruffy" view to something that can be
> assembled and processed according to inferences that flow from the formal
> semantics. A key point to introduce here would be that it is possible to create
> provenance statements that cannot possibly satisfy the formal semantics, and to
> indicate what additional constraints and disciplines should be applied to ensure
> that they can (and hence to make the inferences that flow from those semantics
> valid).
>
> #g
> --
>
>

Received on Thursday, 23 February 2012 15:59:31 UTC