Re: PROV-ISSUE-331: feedback on PROV-DM WD5 from Paul Groth on 2012-04-06 (public-prov-wg@w3.org from April 2012)

From: Paul Groth <p.t.groth@vu.nl>
Date: Fri, 6 Apr 2012 22:36:38 +0200
To: Graham Klyne <graham.klyne@zoo.ox.ac.uk>
Cc: W3C provenance WG <public-prov-wg@w3.org>
Message-ID: <CAJCyKRrBD0tDCy_yAAzfGUr5EjY3BvFbNSn8mHO-D_RmF4vV=Q@mail.gmail.com>

Hi Graham,

Just for clarification, given that you think prov-dm is not ready for
release, it's important to understand what exactly could be done to
get it to the point where it is.

Reading through your points, it seems to me that your comments are
primarily editorial, in that it's the explanation, definition and
organization of the terms that is the issue. Is that a correct
interpretation?

If not, can you identify the specific things that would need to be
addressed for us to move forward on prov-dm?

Regards
Paul


On Fri, Apr 6, 2012 at 9:51 PM, Graham Klyne <graham.klyne@zoo.ox.ac.uk> wrote:
> Re:
> http://dvcs.w3.org/hg/prov/raw-file/default/model/releases/ED-prov-dm-20120402/prov-dm.html
> (Retrieved on 2012-04-03)
>
> While this has many improvements over previous documents, I still feel that
> there are several respects in which the document does not really serve its
> intended purpose.
>
> Generally, I found the tone and phrasing were more akin to academic rhetoric,
> whose purpose is to persuade a peer of the truth of some proposition, than a
> technical standard whose aim should be to *specify*, *inform* and where
> necessary to *explain*.  Especially for developers who will have to use this
> material as a reference source.  Thus, I found much of what I read, particularly
> in the introductory section, had far too much justification (some of which was
> obvious, other aspects of which were just "noise") which didn't help to
> understand what was being presented, or how to use it.
>
> I also still have problems with the overall organization.  In particular, I
> (still) find the example in section 3 breaks the hoped-for flow between the
> section 2 overview (which I also now think is mis-titled) and the provenance
> expression details in section 4.  I also don't think the final two subsections
> of section 2 belong there, as they deal with provenance expression details, not
> concepts.
>
> Finally, I found many examples of unusual or awkward phrasing which I found to
> be unhelpful, confusing or in some cases just plain wrong.
>
> To summarize: if we expect the next public working draft to be nearly ready for
> last call, then I don't think this document is ready for release.
>
> Details follow.
>
> ...
>
>
> == Abstract ==
>
> The phrase "derivations between entities" is strange and confusing.  I think you
> mean something like "derivation of entities from other entities".
>
> "Properties that link entities that refer to a same thing".  I think this is
> just wrong:  I don't believe that entities *refer*.  I think you mean something
> like "Properties that link entities that are based on the same thing".
>
> "collections of entities, whose provenance itself can be tracked" - this feels
> vaguely ungrammatical, and I'm not quite sure what this is trying to express.
> In any case, I'll argue later that I don't see why this is necessary as part of
> the provenance core model.  (What I'm not seeing here is anything I can
> recognize as the notion of accounts, which allow for provenance of provenance to
> be expressed.)
>
> Here, and later in the document, there are references to "natural language".  I
> believe this is a term of art that is meaningful only to those who have exposure
> to formal languages, as a way of distinguishing, and may be confusing to some
> readers.  In the abstract, I'd suggest just dropping this - the rest of the
> sentence carries the intended meaning.
>
> I'm not sure what you mean by "systematically defines".  Just "defines" would
> do, I think.
>
> == Status of this document ==
>
> The heading "how to read this document" is, I think, both patronizing and
> inaccurate.  And the following comments seem to significantly replicate the
> content of the preceding text.  I'd suggest moving descriptive material about
> the documents into the preceding text, and drop the stuff that tries to tell
> people what to read.
>
> "Fourth public working draft".  Really!!  Are we really up to 4 with this?  I
> lose count.
>
> == Introduction ==
>
> "how it should be integrated with other diverse information sources".  I find
> this phrase to be vague and unclear, and hence unhelpful.  I'd suggest dropping
> this, and changing "... help those users to make trust judgements" in the next
> sentence to read:
>
> "... help those users to decide which information to include in their analyses,
> and which to exclude."
>
> "The idea that ... a pragmatic approach is to consider ..." adds no useful
> value.  I suggest replacing all of this with "We consider ...".
>
> "the vision is that" is pure noise.  Suggest deleting this.  This whole
> paragraph seems to be an unnecessary repetition of what the previous says.
> While I sometimes think that a repeated summary can be useful, in this case I
> think it would be more helpful to simplify the preceding paragraph.
>
> The material that starts with "A set of specifications, ..." seems to be pure
> repetition of material contained in the "status of this document" - is it really
> necessary to repeat it here?
>
> The listing of "components!" seems to be greatly redundant.  Each component is
> both numbered (N) and introduced as "component N".  I think a simple numbered
> list without the "component N" tags would suffice.
>
> Two paragraphs starting with "This specification intentionally presents..." -
> these paragraphs are loaded with unnecessary self-justification.  I think a
> simpler statement would do, along the lines of:
>
> "This specification presents the key concepts of the PROV data model and
> provenance expressions, without specific concern for how they are applied.  A
> companion document [PROV-DM-CONSTRAINTS] discusses some possible constraints on
> the application of this model, and corresponding useful inferences that may be
> available when those constraints are known to be satisfied."
>
> [[The next comment is rendered moot if the previous one is accepted...]]
> Paragraph: "However, if data changes...".  To an uninitiated reader, it is not
> at all clear what is meant by "data" here.  I'd suggest something like "If a
> thing about which provenance is expressed is subject to change, it is
> challenging to express its provenance precisely (e.g. the data from which a
> daily weather report is derived will change from day to day)."  Drop the
> reference to other metadata here - it adds nothing of value.
>
> @@(note to self) raise a separate issue about how to describe this "refinement".
> I know I have argued for "refinement" over the idea of an "updated" or
> "modified" provenance model, but the term is still a bit vague.  I find myself
> leaning toward a notion of a "strict" interpretation of provenance that in turn
> allows certain inferences to be drawn if the supplied provenance satisfies
> certain strictness criteria (constraints).
>
> == 1.2 PROV namespace ==
>
> This section glibly introduces the notion of a "namespace" without explaining
> (or citing) what it means.
>
> "The PROV namespace is http://www.w3.org/prov#".  This is WRONG.
> http://www.w3.org/prov# is a URI, not a namespace (or, more precisely, it's a
> string that conforms to URI syntax).
>
> What should be said is something like: "The names for concepts, attributes and
> other reserved names introduced by this document belong to a namespace
> identified by the URI http://www.w3.org/prov#".
>
> And: what is the consequence of these names belonging to a namespace?  I think
> it would be appropriate to cite the corresponding XML and RDF documents that
> deal with namespace issues [1] [2].
>
> [1] http://www.w3.org/TR/REC-xml-names/
>
> [2] http://www.w3.org/TR/REC-rdf-syntax/ (sections 6.1.2, 6.1.4, etc.  These
> define how RDF/XML forms a URI-reference by appending a local name to a
> namespace URI.)
>
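To illustrate the point about names belonging to a namespace identified by a URI, here is a minimal Python sketch. It assumes the RDF/XML convention cited in [2], where a full URI is formed by appending a local name to the namespace URI. The `expand` helper, the `prov` prefix binding and the local name `Entity` are hypothetical, chosen only for illustration.

```python
# Namespace URI as quoted in the review; it identifies the namespace,
# and the names themselves are formed by appending local names to it.
PROV_NS = "http://www.w3.org/prov#"


def expand(prefixes, qname):
    """Expand a prefixed name such as 'prov:Entity' into a full URI.

    `prefixes` maps a prefix to a namespace URI (hypothetical helper,
    not an API taken from any PROV document).
    """
    prefix, sep, local = qname.partition(":")
    if not sep or not local:
        raise ValueError(f"not a prefixed name: {qname!r}")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix!r}")
    # Concatenation only: namespace URI followed by the local name.
    return prefixes[prefix] + local


prefixes = {"prov": PROV_NS}
full_uri = expand(prefixes, "prov:Entity")
```

Under these assumptions, `full_uri` is the namespace URI followed by `Entity`. The namespace URI on its own names nothing in the model, which is the distinction the suggested wording draws.
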
> == Section 2, PROV-DM starting points ==
>
> I think this section is mis-titled.
>
> I think it should be: "2. Introduction to provenance concepts", since that is
> what most of the section is about.
>
> In light of this, the final two sub-sections seem mis-placed, and I suggest they
> should be part of the early material in section 4.
>
> "... that a novice reader would write in a first instance".  Yuk!  How
> patronizing!  Also, a reference here to "natural language" (see previous).  I
> would phrase this whole paragraph thus:
>
> "This section introduces provenance concepts with informal descriptions and
> illustrative examples.  Later (section @@ref), we describe how these concepts
> are described using PROV-DM types and relations."
>
> (where @@ref should be in another section that actually deals with PROV-DM terms.)
>
> == 2.1 Entity and Activity ==
>
> "The term things encompasses..." - I find this phrasing awkward and potentially
> confusing - are we talking here about things or entities?  I suggest simply
> "These encompass ..."
>
> The final sentence is mostly noise.  Why not just "Any Web resource may be an
> entity."?
>
> "For the purpose of this specification..." is just noise.  Also, confusing
> reference to "entities" and "things".  Suggest for this para:  "An entity is a
> thing one wants to provide provenance for, which may be physical, digital,
> conceptual, or otherwise; entities may be real or imaginary."
>
> "This action can take multiple forms: ..." - this is confusing; are we talking
> about a single activity having multiple forms, or different activities having
> different forms.  I think you mean the latter, hence I suggest: "An activity is
> something that occurs over a period of time and acts upon or with entities. They
> may include consuming, processing, transforming, modifying, relocating, using,
> generating, or other associations with entities."
>
>
> == 2.2, et seq. ==
>
> I find similar issues with the wording of subsequent sections, but I haven't
> gone through every one for lack of time.  But I hope you get the general thrust
> from the above.
>
>
> == 2.3 Agents and other types of entities ==
>
> I think this exhibits poor organization of the material.  I think Agents and
> Plans are related, and suggest a sub-section for them.  Collections and accounts
> don't have any obvious relationship, and IMO should be separated.
>
> Concerning collections, it is not at all clear to me that these need to be in
> the core PROV-DM.  By including them here, you impose a particular view of
> collections that may not be appropriate  (somewhere, though I can't immediately
> find where, there is mention of a collection being a key-value map).  Domains
> that deal with collections have their own models for these, so why not let this
> be an aspect for domain-specific extension?
>
>
> I think accounts should have a section of their own, since they underpin the key
> feature of supporting provenance-of-provenance.
>
> However, I have a problem with the description "An account is an entity that
> contains a bundle of provenance descriptions."  I think that this should be "An
> account *is* an entity that is a bundle of provenance descriptions."  That is, I
> don't think the core DM needs to or should expose the notion of containment,
> since that begs more questions.
>
> == 2.4 Attribution, association and responsibility ==
>
> I find the expression of these ideas to be hopelessly muddled, and incoherent.
> In particular, it seems to be self-contradictory with respect to the notion of
> "responsibility" (also with section 2.3):
>
> "An agent is a type of entity that bears some form of responsibility for an
> activity taking place."
> "Software for checking the use of grammar in a document may be defined as an agent"
> "Agents are defined as having some kind of responsibility for activities."
> "[an association may be] an XSLT transform launched by a user ..."
> "An activity association is an assignment of responsibility to an agent for an
> activity"
> "Responsibility is the fact that an agent is accountable for ..."
>
> At heart, I think the problem here is the notion that agents are "responsible".
> Especially when "responsibility" is later defined in terms of accountability -
> I can't see a software agent as being accountable.  I don't know how to make
> sense of this, so it's hard for me to suggest alternatives.
>
> == Section 2.5, Simplified overview diagram ==
> == Section 2.6, PROV-N ... ==
>
> See earlier comments.  These are about PROV-DM terms, not provenance concepts, so
> I don't really think they belong here.
>
> I'd move them to the start of section 4.
>
> == Section 3, Illustration... ==
>
> I *still* think the positioning of this example disrupts the logical flow from
> concepts (section 2) to PROV-DM expressions (section 4).
>
> (I haven't reviewed the content of this section.)
>
>
> == 4. PROV-DM types and relations ==
>
> The enumeration of components seems to be repetitive.  Numbered items *and*
> component numbers?  (See earlier comment.)
>
> "In the first column, one finds concept names directly linking to their English
> definition. In the second column, ...".  Why not just use column headings in the
> table?  The reference to "English" description seems redundant.
>
> "In the rest of the section, each concept and relation is defined, in English
> initially, followed by a more formal definition and some example."  Similar
> comment.  Suggest:
> "In the rest of the section, each type and relation is defined informally,
> followed by a summary of the information used to represent the concept, and
> illustrated with PROV-N examples."
>
> == 4.1.1 Entity ==
>
> "An entity is a thing one wants to provide provenance for. For the purpose of
> this specification, things can be physical, digital, conceptual, or otherwise;
> things may be real or imaginary."  confuses entities and things again.  Suggest:
> "An entity is a thing one wants to provide provenance for. It can be physical,
> digital, conceptual, or otherwise, and may be real or imaginary."
>
> "An entity, written entity(id, [attr1=val1, ...]) in PROV-N, contains:" - I
> think this is wrong - an entity does not (in general) *contain*.  Suggest:
> "An entity, written entity(id, [attr1=val1, ...]) in PROV-N, has:"
>
> "id: an identifier for an entity;" - this is redundant and potentially
> confusing.  Suggest "id: an identifier".
>
> "attributes: an optional set of attribute-value pairs ((attr1, val1), ...)
> representing this entity's situation in the world." - I find this phrasing
> awkward and unclear.  Suggest:
> "attributes: an optional set of attribute-value pairs ((attr1, val1), ...)
> representing additional information about this entity."
>
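The suggested wording (an entity *has* an identifier and an optional set of attribute-value pairs) can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Entity` class, its `to_provn` method, the `ex:` names, and the choice to render every value as a double-quoted string are hypothetical and are not taken from the PROV-DM draft.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """An entity has an identifier and optional attribute-value pairs."""

    id: str
    # Attribute name -> value; empty when no attributes are given.
    attributes: dict = field(default_factory=dict)

    def to_provn(self):
        """Render in the entity(id, [attr1=val1, ...]) form quoted above."""
        if not self.attributes:
            return f"entity({self.id})"
        # Assumption: all values are rendered as double-quoted strings.
        pairs = ", ".join(f'{k}="{v}"' for k, v in self.attributes.items())
        return f"entity({self.id}, [{pairs}])"


plain = Entity("ex:report")
versioned = Entity("ex:report", {"ex:version": "2"})
```

In this sketch the attributes are simply additional information attached to the entity. Nothing is "contained" by it, which matches the has/contains distinction drawn in the comment.
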
> == 4.1.2, et seq ==
>
> (Similar editorial comments to those for 4.1.1 Entity.  I'm not repeating them
> all now for lack of time.)
>
>
> == Section 4.1.5 Start ==
>
> I find this whole section is confusing.  Starting with:
>
> "trigger: an optional identifier (e) for the entity triggering the activity;" -
> do you really mean to allow *any* entity here, rather than just agents?
>
> Looking forward to the example, I find the idea that an email (qua entity) can
> "trigger" an activity is incoherent.  Suppose the email is drafted and never
> sent.  It still exists as an entity, but can't be said to actually *trigger*
> anything.  For me, it is the act of actually sending (or receiving) an email
> that may trigger something, not the email as a passive entity.
>
>
> == Section 4.1.6, End ==
>
> (Similar comments to those above.)
>
>
> == Section 4.1.7, Communication ==
>
> It seems strange to me, given the pattern used for other concepts/expressions,
> that the communicated entity cannot be optionally named.  I find myself
> wondering if I've understood the definition properly.
>
>
> == Section 4.2.1, Agent ==
>
> Continues the muddle about responsibility.  I don't know what it all means
> (especially when the agent is running software).  See previous comments.
>
> Awkward and unnecessary phrase "situation in the world" again.  See earlier for
> suggested phrasing.
>
>
> == Section 4.3.1 Derivation ==
>
> "A derivation is a transformation of an entity into another, a construction of
> an entity into another, or an update of an entity, resulting in a new one."
> seems ungrammatical.  Suggest:
> "A derivation is a transformation of an entity into another, a construction of
> an entity *from* another, or an update of an entity, resulting in a new one."
>
>
> == Section 4.5 Collections ==
>
> I'm not understanding why this needs to be part of the core PROV-DM, and cannot
> be handled by domain-specific notions of aggregation.
>
> The stated goal is that "it is also of interest to be able to express the
> provenance of the collection itself" - this could be done equally well with a
> domain-specific collection notion, AFAICT.
>
> See also earlier comments.
>
>
> == Section 4.6, Annotations ==
>
> I'm still not seeing why these are needed as part of the core DM. There's no
> associated inference that I am aware of, and additional information can be added
> via attributes, so I'm not seeing what useful additional expressive capability
> this affords.
>
>
> == Section 4.7.4 Attribute ==
>
> Is an attribute really just a qualified name, or is it a pair consisting of a
> qualified name and a value?
>
>
> == Section 5, Extensibility points ==
>
> This section makes little sense to me.  The obvious extensibility points of
> sub-typing and sub-properties of defined PROV-DM terms aren't mentioned.
>
> The use of new attributes seems reasonable, though it's not entirely clear how
> they act as extension points, and the mention of "perspective on the world"
> doesn't mean anything to me.
>
> I cannot see how notes, which are defined to be pretty much semantics-free, can
> be described as an extensibility point - they don't actually add any expressive
> power that I can see.
>
> The remaining points I just don't get.
>
> I think this whole notion of extensibility needs to be treated more carefully
> and comprehensively if it is to be taken seriously.  Otherwise expect developers
> to ignore this and just use extensibility options in the representation
> substrate (e.g. RDF) used.
>
> == Section 6 ==
>
> I think this section is completely redundant and out-of-place, and could be
> removed without any loss.
>
> ...
>
> That's it for now.
>
> (BTW, my email access is patchy, so I may not be able to respond promptly to any
> follow-up discussion.)
>
> #g
> --



-- 
Dr. Paul Groth (p.t.groth@vu.nl)
http://www.few.vu.nl/~pgroth/
Assistant Professor
Knowledge Representation & Reasoning Group
Artificial Intelligence Section
Department of Computer Science
VU University Amsterdam
Received on Friday, 6 April 2012 20:37:08 UTC