PROV-ISSUE-331: feedback on PROV-DM WD5

(Retrieved on 2012-04-03)

While this has many improvements over previous documents, I still feel that 
there are several respects in which the document does not really serve its 
intended purpose.

Generally, I found the tone and phrasing were more akin to academic rhetoric, 
whose purpose is to persuade a peer of the truth of some proposition, than a 
technical standard whose aim should be to *specify*, *inform* and where 
necessary to *explain*, especially for developers who will have to use this 
material as a reference source.  Thus, I found much of what I read, particularly 
in the introductory section, had far too much justification (some of which was 
obvious, other aspects of which were just "noise") which didn't help me to 
understand what was being presented, or how to use it.

I also still have problems with the overall organization.  In particular, I 
(still) find the example in section 3 breaks the hoped-for flow between the 
section 2 overview (which I also now think is mis-titled) and the provenance 
expression details in section 4.  I also don't think the final two subsections 
of section 2 belong there, as they deal with provenance expression details, not 
provenance concepts.

Finally, I found many examples of unusual or awkward phrasing which I found to 
be unhelpful, confusing or in some cases just plain wrong.

To summarize: if we expect the next public working draft to be nearly ready for 
last call, then I don't think this document is ready for release.

Details follow.


== Abstract ==

The phrase "derivations between entities" is strange and confusing.  I think you 
mean something like "derivation of entities from other entities".

"Properties that link entities that refer to a same thing".  I think this is 
just wrong:  I don't believe that entities *refer*.  I think you mean something 
like "Properties that link entities that are based on the same thing".

"collections of entities, whose provenance itself can be tracked" - this feels 
vaguely ungrammatical, and I'm not quite sure what this is trying to express. 
In any case, I'll argue later that I don't see why this is necessary as part of 
the provenance core model.  (What I'm not seeing here is anything I can 
recognize as the notion of accounts, which allow for provenance of provenance to 
be expressed.)

Here, and later in the document, there are references to "natural language".  I 
believe this is a term of art that is meaningful only to those who have exposure 
to formal languages, as a way of distinguishing, and may be confusing to some 
readers.  In the abstract, I'd suggest just dropping this - the rest of the 
sentence carries the intended meaning.

I'm not sure what you mean by "systematically defines".  Just "defines" would 
do, I think.

== Status of this document ==

The heading "how to read this document" is, I think, both patronizing and 
inaccurate.  And the following comments seem to significantly replicate the 
content of the preceding text.  I'd suggest moving descriptive material about 
the documents into the preceding text, and dropping the stuff that tries to tell 
people what to read.

"Fourth public working draft".  Really!!  Are we really up to 4 with this?  I 
lose count.

== Introduction ==

"how it should be integrated with other diverse information sources".  I find 
this phrase to be vague and unclear, and hence unhelpful.  I'd suggest dropping 
this, and changing "... help those users to make trust judgements" in the next 
sentence to read:

"... help those users to decide which information to include in their analyses, 
and which to exclude."

"The idea that ... a pragmatic approach is to consider ..." adds no useful 
value.  I suggest replacing all of this with "We consider ...".

"the vision is that" is pure noise.  Suggest deleting this.  This whole 
paragraph seems to be an unnecessary repetition of what the previous says. 
While I sometimes think that a repeated summary can be useful, in this case I 
think it would be more helpful to simplify the preceding paragraph.

The material that starts with "A set of specifications, ..." seems to be pure 
repetition of material contained in the "status of this document" - is it really 
necessary to repeat it here?

The listing of "components!" seems to be greatly redundant.  Each component is 
both numbered (N) and introduced as "component N".  I think a simple numbered 
list without the "component N" tags would suffice.

Two paragraphs starting with "This specification intentionally presents..." - 
these paragraphs are loaded with unnecessary self-justification.  I think a 
simpler statement would suffice, along the lines of:

"This specification presents the key concepts of the PROV data model and 
provenance expressions, without specific concern for how they are applied.  A 
companion document [PROV-DM-CONSTRAINTS] discusses some possible constraints on 
the application of this model, and corresponding useful inferences that may be 
available when those constraints are known to be satisfied."

[[The next comment is rendered moot if the previous one is accepted...]]
Paragraph: "However, if data changes...".  To an uninitiated reader, it is not 
at all clear what is meant by "data" here.  I'd suggest something like "If a 
thing about which provenance is expressed is subject to change, it is 
challenging to express its provenance precisely (e.g. the data from which a 
daily weather report is derived will change from day to day)."  Drop the 
reference to other metadata here - it adds nothing of value.

@@(note to self) raise a separate issue about how to describe this "refinement". 
  I know I have argued for "refinement" over the idea of an "updated" or 
"modified" provenance model, but the term is still a bit vague.  I find myself 
leaning toward a notion of a "strict" interpretation of provenance that in turn 
allows certain inferences to be drawn if the supplied provenance satisfies 
certain strictness criteria (constraints).

== 1.2 PROV namespace ==

This section glibly introduces the notion of a "namespace" without explaining 
(or citing) what it means.

"The PROV namespace is ...".  This is WRONG.  The value given is a URI, not a 
namespace (or, more precisely, it's a string that conforms to URI syntax).

What should be said is something like: "The names for concepts, attributes and 
other reserved names introduced by this document belong to a namespace 
identified by the URI".

And: what is the consequence of these names belonging to a namespace?  I think 
it would be appropriate to cite the corresponding XML and RDF documents that 
deal with namespace issues [1] [2].


[2] (sections 6.1.2, 6.1.4, etc.  These 
define how RDF/XML forms a URI-reference by appending a local name to a 
namespace URI.)
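
To make the distinction concrete, here is a minimal Python sketch of how a 
name in a namespace relates to the namespace URI.  It assumes a placeholder 
namespace URI (http://example.org/ns#, which is NOT the PROV namespace) and a 
hypothetical helper called expand; neither comes from the PROV documents.

```python
# Placeholder namespace URI, for illustration only (not the PROV namespace).
EXAMPLE_NS = "http://example.org/ns#"


def expand(namespace_uri: str, local_name: str) -> str:
    """Form a URI reference by appending a local name to a namespace URI.

    This mirrors the simple concatenation that RDF/XML uses; the function
    name and signature are hypothetical.
    """
    return namespace_uri + local_name


# The namespace is the set of names; the URI merely identifies it.
print(expand(EXAMPLE_NS, "entity"))  # http://example.org/ns#entity
```

The point is that the URI identifies the namespace, and each reserved name is 
identified by a URI formed from the namespace URI plus the local name.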

== Section 2, PROV-DM starting points ==

I think this section is mis-titled.

I think it should be: "2. Introduction to provenance concepts", since that is 
what most of the section is about.

In light of this, the final two sub-sections seem mis-placed, and I suggest they 
should be part of the early material in section 4.

"... that a novice reader would write in a first instance".  Yuk!  How 
patronizing!  Also, a reference here to "natural language" (see previous).  I 
would phrase this whole paragraph thus:

"This section introduces provenance concepts with informal descriptions and 
illustrative examples.  Later (section @@ref), we describe how these concepts 
are described using PROV-DM types and relations."

(where @@ref should be in another section that actually deals with PROV-DM terms.)

== 2.1 Entity and Activity ==

"The term things encompasses..." - I find this phrasing awkward and potentially 
confusing - are we talking here about things or entities?  I suggest simply 
"These encompass ..."

The final sentence is mostly noise.  Why not just "Any Web resource may be an 
entity"?

"For the purpose of this specification..." is just noise.  Also, confusing 
reference to "entities" and "things".  Suggest for this para:  "An entity is a 
thing one wants to provide provenance for, which may be physical, digital, 
conceptual, or otherwise; entities may be real or imaginary."

"This action can take multiple forms: ..." - this is confusing; are we talking 
about a single activity having multiple forms, or different activities having 
different forms?  I think you mean the latter, hence I suggest: "An activity is 
something that occurs over a period of time and acts upon or with entities. 
Activities may include consuming, processing, transforming, modifying, relocating, using, 
generating, or other associations with entities."

== 2.2, et seq. ==

I find similar issues with the wording of subsequent sections, but I haven't 
gone through every one for lack of time.  But I hope you get the general thrust 
from the above.

== 2.3 Agents and other types of entities ==

I think this exhibits poor organization of the material.  I think Agents and 
Plans are related, and suggest a sub-section for them.  Collections and accounts 
don't have any obvious relationship, and IMO should be separated.

Concerning collections, it is not at all clear to me that these need to be in 
the core PROV-DM.  By including them here, you impose a particular view of 
collections that may not be appropriate  (somewhere, though I can't immediately 
find where, there is mention of a collection being a key-value map).  Domains 
that deal with collections have their own models for these, so why not let this 
be an aspect for domain-specific extension?

I think accounts should have a section of their own, since they underpin the key 
feature of supporting provenance-of-provenance.

However, I have a problem with the description "An account is an entity that 
contains a bundle of provenance descriptions."  I think that this should be "An 
account *is* an entity that is a bundle of provenance descriptions."  That is, I 
don't think the core DM needs to or should expose the notion of containment, 
since that begs more questions.

== 2.4 Attribution, association and responsibility ==

I find the expression of these ideas to be hopelessly muddled, and incoherent. 
In particular, it seems to be self-contradictory with respect to the notion of 
"responsibility" (also with section 2.3):

"An agent is a type of entity that bears some form of responsibility for an 
activity taking place."
"Software for checking the use of grammar in a document may be defined as an agent"
"Agents are defined as having some kind of responsibility for activities."
"[an association may be] an XSLT transform launched by a user ..."
"An activity association is an assignment of responsibility to an agent for an 
activity ..."
"Responsibility is the fact that an agent is accountable for ..."

At heart, I think the problem here is the notion that agents are "responsible". 
  Especially when "responsibility" is later defined in terms of accountability - 
I can't see a software agent as being accountable.  I don't know how to make 
sense of this, so it's hard for me to suggest alternatives.

== Section 2.5, Simplified overview diagram ==
== Section 2.6, PROV-N ... ==

See earlier comments.  These are about PROV-DM terms, not provenance concepts, so 
I don't really think they belong here.

I'd move them to the start of section 4.

== Section 3, Illustration... ==

I *still* think the positioning of this example disrupts the logical flow from 
concepts (section 2) to PROV-DM expressions (section 4).

(I haven't reviewed the content of this section.)

== 4. PROV-DM types and relations ==

The enumeration of components seems to be repetitive.  Numbered items *and* 
component numbers?  (See earlier comment.)

"In the first column, one finds concept names directly linking to their English 
definition. In the second column, ...".  Why not just use column headings in the 
table?  The reference to "English" description seems redundant.

"In the rest of the section, each concept and relation is defined, in English 
initially, followed by a more formal definition and some example."  Similar 
comment.  Suggest:
"In the rest of the section, each type and relation is defined informally, 
followed by a summary of the information used to represent the concept, and 
illustrated with PROV-N examples."

== 4.1.1 Entity ==

"An entity is a thing one wants to provide provenance for. For the purpose of 
this specification, things can be physical, digital, conceptual, or otherwise; 
things may be real or imaginary."  confuses entities and things again.  Suggest:
"An entity is a thing one wants to provide provenance for. It can be physical, 
digital, conceptual, or otherwise, and may be real or imaginary."

"An entity, written entity(id, [attr1=val1, ...]) in PROV-N, contains:" - I 
think this is wrong - an entity does not (in general) *contain*.  Suggest:
"An entity, written entity(id, [attr1=val1, ...]) in PROV-N, has:"

"id: an identifier for an entity;" - this is redundant and potentially 
confusing.  Suggest "id: an identifier".

"attributes: an optional set of attribute-value pairs ((attr1, val1), ...) 
representing this entity's situation in the world." - I find this phrasing 
awkward and unclear.  Suggest:
"attributes: an optional set of attribute-value pairs ((attr1, val1), ...) 
representing additional information about this entity."
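
As a rough illustration of the "has" reading (an entity has an identifier and 
an optional set of attribute-value pairs, rather than containing them), here 
is a minimal Python sketch.  The class name Entity, the method to_provn, the 
ex: prefix and the choice to quote every value as a string are my assumptions, 
as is omitting the bracketed list when there are no attributes; this is not 
taken from the PROV-N grammar.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical in-memory form of a PROV-DM entity expression."""

    id: str                                          # an identifier
    attributes: dict = field(default_factory=dict)   # optional attribute-value pairs

    def to_provn(self) -> str:
        """Render in the entity(id, [attr1=val1, ...]) style quoted above.

        Assumption: values are written as quoted strings, and the attribute
        list is left out entirely when it is empty.
        """
        if not self.attributes:
            return f"entity({self.id})"
        attrs = ", ".join(f'{name}="{value}"' for name, value in self.attributes.items())
        return f"entity({self.id}, [{attrs}])"


report = Entity("ex:report", {"ex:title": "Daily weather"})
print(report.to_provn())  # entity(ex:report, [ex:title="Daily weather"])
```

Nothing here implies containment: the attributes are simply additional 
information associated with the identified entity.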

== 4.1.2, et seq ==

(Similar editorial comments to those for 4.1.1 Entity.  I'm not repeating them 
all now for lack of time.)

== Section 4.1.5 Start ==

I find this whole section is confusing.  Starting with:

"trigger: an optional identifier (e) for the entity triggering the activity;" - 
do you really mean to allow *any* entity here, rather than just agents?

Looking forward to the example, I find the idea that an email (qua entity) can 
"trigger" an activity is incoherent.  Suppose the email is drafted and never 
sent.  It still exists as an entity, but can't be said to actually *trigger* 
anything.  For me, it is the act of actually sending (or receiving) an email 
that may trigger something, not the email as a passive entity.

== Section 4.1.6, End ==

(Similar comments to those above.)

== Section 4.1.7, Communication ==

It seems strange to me, given the pattern used for other concepts/expressions, 
that the communicated entity cannot be optionally named.  I find myself 
wondering if I've understood the definition properly.

== Section 4.2.1, Agent ==

Continues the muddle about responsibility.  I don't know what it all means 
(especially when the agent is running software).  See previous comments.

Awkward and unnecessary phrase "situation in the world" again.  See earlier for 
suggested phrasing.

== Section 4.3.1 Derivation ==

"A derivation is a transformation of an entity into another, a construction of 
an entity into another, or an update of an entity, resulting in a new one." 
seems ungrammatical.  Suggest:
"A derivation is a transformation of an entity into another, a construction of 
an entity *from* another, or an update of an entity, resulting in a new one."

== Section 4.5 Collections ==

I'm not understanding why this needs to be part of the core PROV-DM, and cannot 
be handled by domain specific notions of aggregation.

The stated goal is that "it is also of interest to be able to express the 
provenance of the collection itself" - this could be done equally well with a 
domain-specific collection notion, AFAICT.

See also earlier comments.

== Section 4.6, Annotations ==

I'm still not seeing why these are needed as part of the core DM. There's no 
associated inference that I am aware of, and additional information can be added 
via attributes, so I'm not seeing what useful additional expressive capability 
this affords.

== Section 4.7.4 Attribute ==

Is an attribute really just a qualified name, or is it a pair consisting of a 
qualified name and a value?
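
To show why the answer matters, here is a minimal Python sketch of the two 
readings.  The type names QualifiedName and AttributeValuePair, and the ex: 
prefix, are hypothetical and chosen only for illustration.

```python
from typing import Any, NamedTuple


class QualifiedName(NamedTuple):
    """A prefix plus a local name, e.g. ex:title (hypothetical type)."""
    prefix: str
    local: str


class AttributeValuePair(NamedTuple):
    """A qualified name together with a value (hypothetical type)."""
    name: QualifiedName
    value: Any


# Reading 1: an attribute is just the qualified name.
attribute_as_name = QualifiedName("ex", "title")

# Reading 2: an attribute is the pair of a qualified name and a value.
attribute_as_pair = AttributeValuePair(QualifiedName("ex", "title"), "Daily weather")

# Under reading 2 the name is only one component of the attribute.
print(attribute_as_pair.name == attribute_as_name)  # True
```

Whichever reading is intended, the text should use it consistently, since the 
two lead to different data structures in an implementation.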

== Section 5, Extensibility points ==

This section makes little sense to me.  The obvious extensibility points of 
sub-typing and sub-properties of defined PROV-DM terms aren't mentioned.

The use of new attributes seems reasonable, though it's not entirely clear how 
they act as extension points, and the mention of "perspective on the world" 
doesn't mean anything to me.

I cannot see how notes, which are defined to be pretty much semantics-free, can 
be described as an extensibility point - they don't actually add any expressive 
power that I can see.

The remaining points I just don't get.

I think this whole notion of extensibility needs to be treated more carefully 
and comprehensively if it is to be taken seriously.  Otherwise expect developers 
to ignore this and just use extensibility options in the representation 
substrate (e.g. RDF) used.

== Section 6 ==

I think this section is completely redundant and out-of-place, and could be 
removed without any loss.


That's it for now.

(BTW, my email access is patchy, so I may not be able to respond promptly to any 
follow-up discussion.)


Received on Friday, 6 April 2012 19:52:16 UTC