Re: Review of PROV-DM (WD4) up to section 5

Tracker, this is now ISSUE-274

On 23/02/2012 16:59, Graham Klyne wrote:
> I now realize I spent all morning reviewing the WRONG DOCUMENT :(
>
> I've now taken a quick look at 
> http://dvcs.w3.org/hg/prov/raw-file/a5f7ff3d6b30/model/working-copy/towards-wd4.html 
> - I think this does start to address some of the provenance complexity 
> issues, but I also think many of the comments I made do still apply:
>
> Section 2:  I think much of the material here could be in the core 
> specification.  But it's much easier to follow than the previous 
> material.  The diagram is less clear to me than the older diagram, but 
> I think that's just a placeholder.  If the overview text is retained, 
> I think it might be helpful to have the overview diagram first.
>
> Section 3:  I still find the example not-very-helpful at this point.  
> It uses ASN expressions before they have been defined.  I'd suggest 
> having it as an appendix.  I find the process vs authors view approach 
> is confusing.
>
> Section 4:  many of my previous comments (to previous section 5) are 
> addressed here, but I still think Note/annotations is superfluous, and 
> derivation is over-complicated.   I'm not seeing the syntax 
> distinguished symbol production (that used to be 
> provenanceContainer).  I think several of my previous comments about 
> identifiers, attributes and qualified names still apply.
>
> Out of time - need to join telecon now.
>
> #g
> -- 
>
>
> On 23/02/2012 13:16, Graham Klyne wrote:
>> Reviewing:
>> http://dvcs.w3.org/hg/prov/raw-file/7aadc6332722/model/ProvenanceModel.html 
>>
>>
>> Summary: I'm sorry to say that I don't think the document even starts 
>> to bring
>> in the kind of simplification discussed at the F2F meeting, which is 
>> required if
>> this spec is to gain traction with web developers.
>>
>> I find the document is still difficult to read, and in a full morning of
>> reviewing it I've only got as far as section 5. I think further 
>> *radical*
>> simplification is required for the data model description, and I 
>> think it's
>> possible without losing any essential information about the model.
>>
>> ...
>>
>> (Nit: when I load this document from a local copy of the repository, 
>> I get an
>> error reported indicating a problem with fetching the CSS. It loads 
>> OK from the
>> above URI. Is there a problematic relative URI reference in the 
>> source document?)
>>
>> ...
>>
>> I thought we'd agreed at F2F to provide a simple "scruffy" 
>> introduction to the
>> DM (part 1), then introduce the requirement and refinements for more 
>> formally
>> tractable provenance expressions that can be used to build accurate 
>> historical
>> records over multiple related artifacts (part 2). The document I'm 
>> reading does
>> very little that I can see to make the prov-dm more approachable, as was
>> indicated that we need to do at the F2F. As far as I can tell, the 
>> only thing
>> that has been done in this direction is to *add* a new section on 
>> interpretation.
>> This, of itself, does nothing to simplify the DM description.
>>
>> I think we should be placing far more emphasis on making it as simple 
>> as we
>> possibly can for information providers to publish provenance. 
>> Consider that the
>> primary beneficiaries of provenance information are the *consumers* 
>> of published
>> information, not the *publishers*, so if we make life unnecessarily 
>> hard for
>> publishers we're shooting ourselves in the collective foot. From 
>> this, I think
>> the initial introduction to the DM needs to be radically simplified 
>> to the
>> extent that a developer can spend 10-15 minutes glancing at it and 
>> think "oh
>> yes, I can easily add this to my output data". If necessary, we push 
>> some of the
>> work of understanding what needs to be done to harmonize the data to 
>> make it
>> more suitable for building a historical record towards the consumer.
>>
>> ...
>>
>> With this in mind:
>>
>> Section 2:
>>
>> The introductory material in section 2.1 is unhelpful, and I propose 
>> it be
>> removed from the introduction. Most of this material is not important 
>> until we
>> come to consider the more formal aspects of the DM. With the 
>> exception of
>> 2.1.2.1 about events, which I think should be introduced in the 
>> PROV-DM core
>> model section. Similarly sections 2.2 and 2.3 (maybe moving the two 
>> introductory
>> sentences of 2.2 into section 2.4). Thus section 2 would become just 
>> a very
>> brief intro to the notation used for describing ASN, and maybe this 
>> could be
>> moved into the PROV-DM core section (sect 5).
>>
>> Section 3 looks generally useful. But it still mentions an "account 
>> record",
>> which I understood was being dropped. It also mentions "alternateOf" and
>> "specializationOf" which are not necessary for a "scruffy" 
>> introduction to
>> provenance, so I suggest mention of these is dropped from here. I 
>> suggest
>> dropping the sentence about core and common relations - it's just 
>> noise. With
>> the removal of accounts, I think the whole purpose of 
>> notes/annotation records
>> *as part of the provenance model* has become moot, and suggest that 
>> these be
>> dropped from the spec. There's nothing to prevent annotations being 
>> added to the
>> provenance data as rdfs:comment or rdfs:label values. I suggest 
>> dropping the
>> mention of extensibility points: again, it's just noise at this point.
>>
>> Section 4: to my mind, this example section adds no useful 
>> information and
>> doesn't help understanding of the model (on account of being harder to 
>> follow than the
>> ASN model description), and suggest that it be dropped. 
>> Alternatively, I suggest
>> moving it to an appendix.
>>
>> Section 5: this is the vital core of this document. Section 3 
>> provides a very
>> useful high-level overview, so this section can just get down to 
>> describing the
>> constructs.
>>
>> I note that ASN is mis-named: it's not really an *abstract* syntax 
>> notation;
>> it's quite concrete, so it's more like a (technology-neutral) 
>> functional syntax
>> notation. @@raise separate issue for this?
>>
>> Section 5.1: prov-dm is a data model, not an implementation, right? 
>> So why do we
>> need to introduce "housekeeping constructs ... to facilitate their 
>> interchange"?
>> Suggest dropping most of the discussion of "record container", and 
>> simply
>> introduce the "recordContainer" and "namespaceDeclaration" 
>> productions along
>> with production for "record".
>>
>>
>> Section 5.2.1: Entity record
>>
>> Suggest drop "In PROV-DM, " - it's redundant.
>>
>> Suggest the examples focus more on web documents, with "car" as more 
>> of an
>> afterthought. Primary use will probably be to describe web documents, 
>> so let's
>> keep this at front-of-mind?
>>
>> Suggest dropping all mentions of "asserters viewpoint" and "situation 
>> in the
>> world" - these don't matter for the "scruffy" view of provenance.
>>
>> Suggest dropping the idea that the attributes somehow define the 
>> entity ("whose
>> situation in the world is represented by the attribute-value pairs"). 
>> They're
>> just there to provide information about the entity, and as hooks for
>> interoperability. (I argued previously for dropping attributes 
>> completely, but
>> was persuaded otherwise by the interoperability argument from the 
>> provenance
>> challenges - don't try to make more of them.)
>>
>> Suggest drop issue mentioning "characterization interval" - I think 
>> it's now a
>> non-issue.
>>
>> I think the issue of uniqueness of identifiers should be dealt with 
>> in the
>> introduction to ASN, not under the individual elements.
>>
>> Under "further considerations", suggest dropping all but 3rd and 6th 
>> bullets. In
>> the 6th bullet, I don't understand the stuff about "a namespace also 
>> declares
>> the number of occurrences...". I have deep concern about what this 
>> might be
>> trying to say. In any case, shouldn't this be covered under a 
>> description of the
>> namespace, if needed?
>>
>> I think the material about "activities" and "plans" really doesn't 
>> belong in
>> this section.
>>
>>
>> Section 5.2.2 Activity record
>>
>> Suggest drop "In PROV-DM, " - it's redundant.
>>
>> Didn't we discuss replacing the start, end times by events? I don't 
>> recall the
>> outcome - I'm just mentioning this in case it's been missed.
>>
>> For the example, I suggest leading on something to do with 
>> information on the web.
>>
>> It was a surprise to me to learn that PROV-DM has reserved 
>> attributes. If
>> attributes are in the model to support interoperability with other 
>> provenance
>> frameworks (which is my understanding from previous discussions), 
>> this feels
>> like a poor design choice. Maybe it should be a separate parameter? 
>> In any case,
>> I think the intent of this "subtyping" needs to be explained.
>>
>> If this is to be a "scruffy" introduction, I think the reference to
>> start-view-end is not needed here. In any case, the cross-reference 
>> is almost
>> impossible to locate in a printed copy of the spec.
>>
>> I think the issue of uniqueness of identifiers should be dealt with 
>> in the
>> introduction to ASN, not under the individual elements.
>>
>> Suggest dropping the "further considerations bullets."
>>
>> Did we not agree that activities *would* be allowable as entities 
>> (especially if
>> entities are just stuff that can be identified)?
>>
>>
>> Section 5.2.3, Agent record
>>
>> Having introduced a framework for subtyping for activities, why not 
>> use the same
>> approach for different types of agents ... especially considering 
>> that two major
>> agent types are defined by reference to existing foaf definitions? I 
>> suggest not
>> asserting the claim that the agent types are mutually exclusive.
>>
>> Suggest drop reference to "situation in the world".
>>
>> Suggest drop discussion of inferences of agent records - if needed, 
>> they should
>> come later along with a more formal ("non-scruffy") treatment of the 
>> data model.
>>
>>
>> Section 5.2.4, Note record
>>
>> I think this should be dropped from the data model. I don't see that 
>> it serves
>> any needed *provenance* function. "extra information" can be added by
>> format-specific extensions. As such, this record type only adds noise 
>> to the
>> specification.
>>
>>
>> Section 5.3.1.1 generation record
>>
>> I believe the ASN syntax given verges on being ambiguous, and is 
>> unnecessarily
>> tricky to parse by a human or machine consumer; e.g. consider:
>>
>> wasGeneratedBy(a,b)
>> wasGeneratedBy(a,b,)
>>
>> The presence of the trailing comma in the second example completely 
>> changes the
>> parse tree productions associated with a and b. I think it would be 
>> much easier
>> if ASN simply required a dummy activity identifier to be provided; 
>> i.e. don't
>> make aidentifier optional. Indeed, rather than allowing optional 
>> identifiers
>> anywhere in the ASN, one might use a placeholder (e.g. '_') for any 
>> unspecified
>> identifier, which would make the overall syntax much more regular.
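
A minimal sketch of the placeholder idea in the paragraph above, in Python. It assumes a regularized syntax in which every argument position is mandatory and '_' stands for an unspecified identifier. The arity table and the function name parse_record are hypothetical, chosen for illustration; they are not taken from the PROV-DM draft.

```python
import re

# Hypothetical fixed arity, for illustration only (not from PROV-DM).
ARITY = {"wasGeneratedBy": 3}

# Matches 'name(...)' and captures the name and the raw argument list.
RECORD = re.compile(r"^\s*(\w+)\s*\((.*)\)\s*$")


def parse_record(text):
    """Parse 'name(arg, arg, ...)' where every position is mandatory
    and '_' marks an unspecified identifier (returned as None)."""
    match = RECORD.match(text)
    if match is None:
        raise ValueError(f"not a record: {text!r}")
    name, body = match.groups()
    if name not in ARITY:
        raise ValueError(f"unknown record type: {name!r}")
    args = [part.strip() for part in body.split(",")]
    if len(args) != ARITY[name]:
        raise ValueError(f"{name} expects {ARITY[name]} arguments, got {len(args)}")
    if any(arg == "" for arg in args):
        # A trailing comma yields an empty position, which is rejected
        # instead of silently changing the meaning of the other arguments.
        raise ValueError(f"empty argument position in {text!r}")
    return name, [None if arg == "_" else arg for arg in args]
```

With a fixed arity, both wasGeneratedBy(a,b) and wasGeneratedBy(a,b,) are rejected outright, and wasGeneratedBy(a,b,_) is the only way to leave the third position unspecified, so each argument keeps the same position in every record.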
>>
>> Since the id is used only for annotations, I suggest dropping it (see 
>> section
>> 5.2.4 comment above).
>>
>> If this is to be a "scruffy" introduction, I think the reference to
>> generation-within-activity is not needed here. In any case, the 
>> cross-reference
>> is almost impossible to locate in a printed copy of the spec. Suggest 
>> drop this.
>>
>> Similarly, suggest dropping the structural constraint here.
>>
>>
>> Section 5.3.1.2 Usage record
>>
>> Suggest drop "In PROV-DM, " - it's redundant.
>>
>> Why is there an identifier for a usage record?
>>
>> Suggest lead with example of consuming a web resource.
>>
>> Suggest drop reference to annotation record (see above note about 5.2.4)
>>
>> Suggest drop reference to interpretation here
>>
>>
>> Section 5.3.2.1 Association record
>>
>> Para 3: Suggest drop first sentence, and simplify; i.e. just say; 
>> "Activities
>> may reflect the execution of a plan..."
>>
>> Para 4, there's quite a bit of redundancy redundancy here. Suggest:
>> [[
>> A plan is the description of a set of actions or steps intended by 
>> one or more
>> agents to achieve some goal. PROV-DM is not prescriptive about the 
>> nature of
>> plans, their representation, the actions and steps they consist of, 
>> and their
>> intended goals. A plan can be a workflow for a scientific experiment, 
>> a recipe
>> for a cooking activity, or a list of instructions for a micro-processor
>> execution. Plans are entities, which may have associated provenance. 
>> An activity
>> may be associated with multiple plans, allowing for descriptions of 
>> activities
>> initially associated with a plan, which was changed, on the fly, as 
>> the activity
>> progresses. Plans can be successfully executed or they can fail. We 
>> expect
>> applications to exploit PROV-DM extensibility mechanisms to capture 
>> the rich
>> nature of plans and associations between activities and plans.
>> ]]
>>
>> Para 5: I see no value in cross-referencing the responsibility record 
>> here.
>> Suggest dropping this paragraph.
>>
>> Why is there an identifier for an association record?
>>
>>
>> Section 5.3.2.2 Start and End records
>>
>> This seems to overlap with start, end parameters on an activity. It's 
>> not
>> immediately clear how they play together.
>>
>> Should this record not describe an "event"? Then the id should 
>> identify the
>> start/end event, not the record. cf. Issue 207.
>>
>> Identifiers should denote activities and agents, *not records*.
>>
>>
>> Section 5.3.3.1 Responsibility record
>>
>> Suggest drop "To promote take-up... " and instead lead with a simple
>> introduction of what the record describes.
>>
>> Para 3: It seems to me that the responsibility record should stand 
>> independently
>> of any association record. Suggest drop "Given an activity 
>> association record...
>> (...)"
>>
>> Why is there an identifier for a responsibility record?
>>
>>
>> Section 5.3.3.2 Derivation record
>>
>> Suggest drop "In PROV-DM, "
>>
>> This whole section seems way too complicated. My understanding is that 
>> the
>> "Common relations" section is intended to cover those useful short-cut
>> expressions that can be expressed with less convenience in the core 
>> model. As
>> such, I think the derivation record should be a "common" rather than 
>> a "core"
>> relation.
>>
>> Aside from that, I really don't see the utility of all this stuff 
>> about precise
>> and imprecise derivations. I think there is just one useful relation 
>> to define,
>> roughly corresponding to "imprecise n-derivation record" here:
>>
>> - I note that the "imprecise 1-derivation record" and "imprecise 
>> n-derivation
>> record" are not syntactically distinguishable, so there's no point in 
>> discussing
>> the difference.
>>
>> - the "precise 1-derivation record" can be expressed using an 
>> activity, usage
>> and generation record: I'm not convinced this alternative syntax is 
>> really
>> buying anything worthwhile.
>>
>> Suggest radical simplification along these lines, and move to section 
>> 6. Don't
>> introduce all the formal stuff until a later section handling more 
>> formal
>> treatments.
>>
>>
>> Section 5.3.3.3 Alternate and Specialization records
>>
>> In considering a "scruffy" view of provenance, these relations aren't 
>> really
>> needed. However, they do underpin a more formal treatment in the face 
>> of dynamic
>> resources.
>>
>> I would give serious consideration to introducing these later, when 
>> the more
>> formal treatment of dynamic resources is considered.
>>
>>
>> Section 5.3.4. Annotation record
>>
>> I think this serves no needed purpose, and should be dropped. (See 
>> earlier
>> comments for section 5.2.4.)
>>
>>
>> Section 5.4.1 Account record
>>
>> I understood we'd agreed to drop this.
>>
>>
>> Section 5.4.2 Record container
>>
>> I think this is mainly an artifact of the ASN syntax, and should be 
>> introduced
>> more briefly in the introductory section 5.1 (see previous comments)
>>
>>
>> Section 5.5.1 Attribute
>>
>> I think the "optional-attribute-value" productions covered in section 
>> 5.2.1
>> (Entity) should be covered here since they apply to multiple record 
>> types.
>>
>> I would prefer to see attribute names presented as being IRIs in the 
>> data model,
>> with the namespace-qualified CURIE syntax available as a convenience 
>> in the ASN
>> presentation.
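
A minimal sketch of the split suggested above, in Python: the data model holds full IRIs, and the prefix:local form is only a shorthand expanded at the syntax layer. The prefix table and the name expand_qname are assumptions for illustration; in particular the prov namespace IRI shown may not be the one used in this draft.

```python
# Hypothetical prefix declarations; the IRIs here are assumptions.
NAMESPACES = {
    "prov": "http://www.w3.org/ns/prov#",
    "ex": "http://example.org/",
}


def expand_qname(qname, namespaces):
    """Expand 'prefix:local' to a full IRI by concatenating the
    namespace IRI bound to the prefix with the local part."""
    prefix, sep, local = qname.partition(":")
    if not sep:
        raise ValueError(f"not a qualified name: {qname!r}")
    if prefix not in namespaces:
        raise KeyError(f"undeclared prefix: {prefix!r}")
    return namespaces[prefix] + local
```

Under this reading, two attribute names are the same attribute exactly when their expanded IRIs are equal, whatever prefixes were used to write them.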
>>
>> I think the predefined attribute names should be dealt with in a 
>> separate
>> section. I'm actually not convinced this is the best design choice for
>> properties with DM-defined meaning, as opposed to (say) using 
>> separate record
>> parameters, but that's more of a style issue than a fundamental 
>> objection.
>>
>> As indicated earlier, I think the whole discussion of derivation 
>> steps is too
>> much detail, and I don't see the value, and would suggest dropping the
>> prov:steps attribute.
>>
>> For attribute prov:label: why not just use rdfs:label?
>>
>>
>> Section 5.5.2 Identifiers
>>
>> The text says they are *qualified* names, but in most of the example 
>> they are
>> not. Also, some identifiers are described as having local scope: this 
>> is not
>> compatible with using *qualified* names which are essentially IRIs.
>>
>> The text describes identifiers as denoting *records* (e.g. entity 
>> record) - I
>> think this is wrong, and in any case is inconsistent with text 
>> elsewhere in the
>> document. They should denote "entity", "activity", "agent", etc.
>>
>>
>> Section 5.5.3 Literal
>>
>> "A PROV-DM Literal represents a value whose interpretation is outside 
>> the scope
>> of PROV-DM." What a Terrible Failure... the whole point of languages 
>> introducing
>> literals is precisely that their interpretation *is* defined by the 
>> language.
>> If not, they might as well be names.
>>
>> I think the intent is that their interpretation is defined by 
>> reference to the
>> corresponding xsd datatype definition, or some other datatype 
>> definition, that
>> is effectively incorporated by reference.
>>
>> I'd suggest that an interpretation of literals is provided by:
>> - http://www.w3.org/TR/rdf-mt/#gddenot
>> - http://www.w3.org/TR/rdf-mt/#DTYPEINTERP
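
A rough Python sketch of what "interpretation defined by the datatype" could look like: each recognized datatype IRI supplies a lexical-to-value mapping, and a literal denotes the value that mapping yields. The table L2V and the function literal_value are hypothetical names, and Python's own conversions only approximate the XSD lexical rules.

```python
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"


def parse_boolean(lexical):
    """Lexical-to-value mapping for xsd:boolean."""
    if lexical in ("true", "1"):
        return True
    if lexical in ("false", "0"):
        return False
    raise ValueError(f"not a boolean lexical form: {lexical!r}")


# Hypothetical table: datatype IRI -> lexical-to-value mapping.
# int() and Decimal() are looser than the XSD lexical spaces.
L2V = {
    XSD + "string": str,
    XSD + "integer": int,
    XSD + "decimal": Decimal,
    XSD + "boolean": parse_boolean,
}


def literal_value(lexical, datatype):
    """Return the value denoted by a typed literal, or raise
    ValueError if the datatype is not recognized."""
    try:
        convert = L2V[datatype]
    except KeyError:
        raise ValueError(f"unrecognized datatype: {datatype!r}") from None
    return convert(lexical)
```

The point of the sketch is that "42" and "042" typed as xsd:integer denote the same value, which is exactly what is lost if literals are treated as opaque names.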
>>
>> Section 5.5.4 Time
>>
>> No syntax production provided or indicated.
>>
>> I think it's unnecessary and inappropriate to indicate where time is 
>> used. It's
>> just something to go wrong as the document evolves.
>>
>>
>> Section 5.5.5 Asserter
>>
>> Do we really still need this (now accounts are gone)? Suggest dropping.
>>
>>
>> Section 5.5.6 Namespace
>>
>> I'd suggest covering this with the introduction of the record 
>> container syntax
>> production
>>
>>
>> Section 5.5.7 Location
>>
>> Do we have any explicit use of this? If not, I'd suggest dropping it.
>>
>> ...
>>
>> I'm out of time and stopping my review here. There's a general 
>> pattern here that
>> I'd also apply to section 6.
>>
>> I'd then take section 7 and (probably) expand it into several 
>> sections ("Part
>> 2") introducing and describing a more formal treatment of provenance 
>> that can be
>> used to bridge from and refine the "scruffy" view to something that 
>> can be
>> assembled and processed according to inferences that flow from the 
>> formal
>> semantics. A key point to introduce here would be that it is possible 
>> to create
>> provenance statements that cannot possibly satisfy the formal 
>> semantics, and to
>> indicate what additional constraints and disciplines should be 
>> applied to ensure
>> that they can (and hence to make the inferences that flow from those 
>> semantics
>> valid).
>>
>> #g
>> -- 
>>
>>
>

Received on Wednesday, 29 February 2012 05:25:19 UTC