PROV-ISSUE-331: feedback on PROV-DM WD5 from Graham Klyne on 2012-04-06 (public-prov-wg@w3.org from April 2012)

From: Graham Klyne <graham.klyne@zoo.ox.ac.uk>
Date: Fri, 06 Apr 2012 20:51:13 +0100
To: W3C provenance WG <public-prov-wg@w3.org>
Message-ID: <4F7F4931.2010108@zoo.ox.ac.uk>

Re:
http://dvcs.w3.org/hg/prov/raw-file/default/model/releases/ED-prov-dm-20120402/prov-dm.html
(Retrieved on 2012-04-03)

While this has many improvements over previous documents, I still feel that
there are several respects in which the document does not really serve its
intended purpose.

Generally, I found the tone and phrasing were more akin to academic rhetoric,
whose purpose is to persuade a peer of the truth of some proposition, than to a
technical standard whose aim should be to *specify*, *inform* and where
necessary to *explain*, especially for developers who will have to use this
material as a reference source. Thus, I found much of what I read, particularly
in the introductory section, had far too much justification (some of which was
obvious, other aspects of which were just "noise") which didn't help me to
understand what was being presented, or how to use it.

I also still have problems with the overall organization. In particular, I
(still) find the example in section 3 breaks the hoped-for flow between the
section 2 overview (which I also now think is mis-titled) and the provenance
expression details in section 4. I also don't think the final two subsections
of section 2 belong there, as they deal with provenance expression details, not
concepts.

Finally, I found many examples of unusual or awkward phrasing which I found to
be unhelpful, confusing or in some cases just plain wrong.

To summarize: if we expect the next public working draft to be nearly ready for
last call, then I don't think this document is ready for release.

Details follow.

...

== Abstract ==

The phrase "derivations between entities" is strange and confusing. I think you
mean something like "derivation of entities from other entities".

"Properties that link entities that refer to a same thing". I think this is
just wrong: I don't believe that entities *refer*. I think you mean something
like "Properties that link entities that are based on the same thing".

"collections of entities, whose provenance itself can be tracked" - this feels
vaguely ungrammatical, and I'm not quite sure what this is trying to express.
In any case, I'll argue later that I don't see why this is necessary as part of
the provenance core model. (What I'm not seeing here is anything I can
recognize as the notion of accounts, which allow for provenance of provenance to
be expressed.)

Here, and later in the document, there are references to "natural language". I
believe this is a term of art that is meaningful only to those who have exposure
to formal languages, as a way of distinguishing the two, and may be confusing to
some readers. In the abstract, I'd suggest just dropping this - the rest of the
sentence carries the intended meaning.

I'm not sure what you mean by "systematically defines". Just "defines" would
do, I think.

== Status of this document ==

The heading "how to read this document" is, I think, both patronizing and
inaccurate. And the following comments seem to significantly replicate the
content of the preceding text. I'd suggest moving descriptive material about
the documents into the preceding text, and dropping the stuff that tries to tell
people what to read.

"Fourth public working draft". Really!! Are we really up to 4 with this? I
lose count.

== Introduction ==

"how it should be integrated with other diverse information sources". I find
this phrase to be vague and unclear, and hence unhelpful. I'd suggest dropping
this, and changing "... help those users to make trust judgements" in the next
sentence to read:

"... help those users to decide which information to include in their analyses,
and which to exclude."

"The idea that ... a pragmatic approach is to consider ..." adds no useful
value. I suggest replacing all of this with "We consider ...".

"the vision is that" is pure noise. Suggest deleting this. This whole
paragraph seems to be an unnecessary repetition of what the previous says.
While I sometimes think that a repeated summary can be useful, in this case I
think it would be more helpful to simplify the preceding paragraph.

The material that starts with "A set of specifications, ..." seems to be pure
repetition of material contained in the "status of this document" - is it really
necessary to repeat it here?

The listing of "components" seems to be largely redundant. Each component is
both numbered (N) and introduced as "component N". I think a simple numbered
list without the "component N" tags would suffice.

Two paragraphs starting with "This specification intentionally presents..." -
these paragraphs are loaded with unnecessary self-justification. I'd suggest a
simpler statement along the lines of:

"This specification presents the key concepts of the PROV data model and
provenance expressions, without specific concern for how they are applied. A
companion document [PROV-DM-CONSTRAINTS] discusses some possible constraints on
the application of this model, and corresponding useful inferences that may be
available when those constraints are known to be satisfied."

[[The next comment is rendered moot if the previous one is accepted...]]
Paragraph: "However, if data changes...". To an uninitiated reader, it is not
at all clear what is meant by "data" here. I'd suggest something like "If a
thing about which provenance is expressed is subject to change, it is
challenging to express its provenance precisely (e.g. the data from which a
daily weather report is derived will change from day to day)." Drop the
reference to other metadata here - it adds nothing of value.

@@(note to self) raise a separate issue about how to describe this "refinement".
I know I have argued for "refinement" over the idea of an "updated" or
"modified" provenance model, but the term is still a bit vague. I find myself
leaning toward a notion of a "strict" interpretation of provenance that in turn
allows certain inferences to be drawn if the supplied provenance satisfies
certain strictness criteria (constraints).

== 1.2 PROV namespace ==

This section glibly introduces the notion of a "namespace" without explaining
(or citing) what it means.

"The PROV namespace is http://www.w3.org/prov#". This is WRONG.
http://www.w3.org/prov# is a URI, not a namespace (or, more precisely, it's a
string that conforms to URI syntax).

What should be said is something like: "The names for concepts, attributes and
other reserved names introduced by this document belong to a namespace
identified by the URI http://www.w3.org/prov#".

And: what is the consequence of these names belonging to a namespace? I think
it would be appropriate to cite the corresponding XML and RDF documents that
deal with namespace issues [1] [2].

[1] http://www.w3.org/TR/REC-xml-names/

[2] http://www.w3.org/TR/REC-rdf-syntax/ (sections 6.1.2, 6.1.4, etc. These
define how RDF/XML forms a URI-reference by appending a local name to a
namespace URI.)
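
To make the point in [2] concrete, here is a minimal sketch, assuming plain
string concatenation of a namespace URI and a local name. The helper name
`expand` and the local name "Entity" are illustrative assumptions, not taken
from the draft.

```python
# Namespace URI as quoted from the draft under review.
PROV_NS = "http://www.w3.org/prov#"


def expand(namespace_uri: str, local_name: str) -> str:
    """Form a full URI by appending a local name to a namespace URI."""
    return namespace_uri + local_name


# "Entity" is a hypothetical local name used only for illustration.
entity_uri = expand(PROV_NS, "Entity")
print(entity_uri)
```

The URI identifies the namespace; each name that belongs to it is then the
namespace URI with the local name appended.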

== Section 2, PROV-DM starting points ==

I think this section is mis-titled.

I think it should be: "2. Introduction to provenance concepts", since that is
what most of the section is about.

In light of this, the final two sub-sections seem mis-placed, and I suggest they
should be part of the early material in section 4.

"... that a novice reader would write in a first instance". Yuk! How
patronizing! Also, a reference here to "natural language" (see previous). I
would phrase this whole paragraph thus:

"This section introduces provenance concepts with informal descriptions and
illustrative examples. Later (section @@ref), we describe how these concepts
are described using PROV-DM types and relations."

(where @@ref should be in another section that actually deals with PROV-DM terms.)

== 2.1 Entity and Activity ==

"The term things encompasses..." - I find this phrasing awkward and potentially
confusing - are we talking here about things or entities? I suggest simply
"These encompass ..."

The final sentence is mostly noise. Why not just "Any Web resource may be an
entity."?

"For the purpose of this specification..." is just noise. Also, confusing
reference to "entities" and "things". Suggest for this para: "An entity is a
thing one wants to provide provenance for, which may be physical, digital,
conceptual, or otherwise; entities may be real or imaginary."

"This action can take multiple forms: ..." - this is confusing; are we talking
about a single activity having multiple forms, or different activities having
different forms? I think you mean the latter, hence I suggest: "An activity is
something that occurs over a period of time and acts upon or with entities.
Activities may include consuming, processing, transforming, modifying,
relocating, using,
generating, or other associations with entities."

== 2.2, et seq. ==

I find similar issues with the wording of subsequent sections, but I haven't
gone through every one for lack of time. But I hope you get the general thrust
from the above.

== 2.3 Agents and other types of entities ==

I think this exhibits poor organization of the material. I think Agents and
Plans are related, and suggest a sub-section for them. Collections and accounts
don't have any obvious relationship, and IMO should be separated.

Concerning collections, it is not at all clear to me that these need to be in
the core PROV-DM. By including them here, you impose a particular view of
collections that may not be appropriate (somewhere, though I can't immediately
find where, there is mention of a collection being a key-value map). Domains
that deal with collections have their own models for these, so why not let this
be an aspect for domain-specific extension?

I think accounts should have a section of their own, since they underpin the key
feature of supporting provenance-of-provenance.

However, I have a problem with the description "An account is an entity that
contains a bundle of provenance descriptions." I think that this should be "An
account *is* an entity that is a bundle of provenance descriptions." That is, I
don't think the core DM needs to or should expose the notion of containment,
since that begs more questions.

== 2.4 Attribution, association and responsibility ==

I find the expression of these ideas to be hopelessly muddled, and incoherent.
In particular, it seems to be self-contradictory with respect to the notion of
"responsibility" (also with section 2.3):

"An agent is a type of entity that bears some form of responsibility for an
activity taking place."
"Software for checking the use of grammar in a document may be defined as an agent"
"Agents are defined as having some kind of responsibility for activities."
"[an association may be] an XSLT transform launched by a user ..."
"An activity association is an assignment of responsibility to an agent for an
activity"
"Responsibility is the fact that an agent is accountable for ..."

At heart, I think the problem here is the notion that agents are "responsible".
Especially when "responsibility" is later defined in terms of accountability -
I can't see a software agent as being accountable. I don't know how to make
sense of this, so it's hard for me to suggest alternatives.

== Section 2.5, Simplified overview diagram ==
== Section 2.6, PROV-N ... ==

See earlier comments. These are about PROV-DM terms, not provenance concepts, so
I don't really think they belong here.

I'd move them to the start of section 4.

== Section 3, Illustration... ==

I *still* think the positioning of this example disrupts the logical flow from
concepts (section 2) to PROV-DM expressions (section 4).

(I haven't reviewed the content of this section.)

== 4. PROV-DM types and relations ==

The enumeration of components seems to be repetitive. Numbered items *and*
component numbers? (See earlier comment.)

"In the first column, one finds concept names directly linking to their English
definition. In the second column, ...". Why not just use column headings in the
table? The reference to "English" description seems redundant.

"In the rest of the section, each concept and relation is defined, in English
initially, followed by a more formal definition and some example." Similar
comment. Suggest:
"In the rest of the section, each type and relation is defined informally,
followed by a summary of the information used to represent the concept, and
illustrated with PROV-N examples."

== 4.1.1 Entity ==

"An entity is a thing one wants to provide provenance for. For the purpose of
this specification, things can be physical, digital, conceptual, or otherwise;
things may be real or imaginary." confuses entities and things again. Suggest:
"An entity is a thing one wants to provide provenance for. It can be physical,
digital, conceptual, or otherwise, and may be real or imaginary."

"An entity, written entity(id, [attr1=val1, ...]) in PROV-N, contains:" - I
think this is wrong - an entity does not (in general) *contain*. Suggest:
"An entity, written entity(id, [attr1=val1, ...]) in PROV-N, has:"

"id: an identifier for an entity;" - this is redundant and potentially
confusing. Suggest "id: an identifier".

"attributes: an optional set of attribute-value pairs ((attr1, val1), ...)
representing this entity's situation in the world." - I find this phrasing
awkward and unclear. Suggest:
"attributes: an optional set of attribute-value pairs ((attr1, val1), ...)
representing additional information about this entity."
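
As a rough illustration of the "has" reading suggested above, here is a
minimal sketch in which an entity has an identifier and an optional set of
attribute-value pairs, and is rendered in the entity(id, [attr1=val1, ...])
form. The class name, the example identifiers and the simplistic quoting of
values as strings are all assumptions made for illustration. The sketch also
assumes the attribute list may be omitted when empty.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical model: an entity *has* an id and optional attributes."""
    id: str
    attributes: dict[str, str] = field(default_factory=dict)

    def to_prov_n(self) -> str:
        # Assumption: with no attributes, the attribute list is left out.
        if not self.attributes:
            return f"entity({self.id})"
        # Assumption: every value is rendered as a double-quoted string.
        pairs = ", ".join(f'{k}="{v}"' for k, v in self.attributes.items())
        return f"entity({self.id}, [{pairs}])"


report = Entity("ex:report1", {"ex:title": "Weather report"})
print(report.to_prov_n())
```

Nothing in this model needs a notion of containment: the attributes are
simply additional information about the entity.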

== 4.1.2, et seq ==

(Similar editorial comments to those for 4.1.1 Entity. I'm not repeating them
all now for lack of time.)

== Section 4.1.5 Start ==

I find this whole section is confusing. Starting with:

"trigger: an optional identifier (e) for the entity triggering the activity;" -
do you really mean to allow *any* entity here, rather than just agents?

Looking forward to the example, I find the idea that an email (qua entity) can
"trigger" an activity is incoherent. Suppose the email is drafted and never
sent. It still exists as an entity, but can't be said to actually *trigger*
anything. For me, it is the act of actually sending (or receiving) an email
that may trigger something, not the email as a passive entity.

== Section 4.1.6, End ==

(Similar comments to those above.)

== Section 4.1.7, Communication ==

It seems strange to me, given the pattern used for other concepts/expressions,
that the communicated entity cannot be optionally named. I find myself
wondering if I've understood the definition properly.

== Section 4.2.1, Agent ==

Continues the muddle about responsibility. I don't know what it all means
(especially when the agent is running software). See previous comments.

Awkward and unnecessary phrase "situation in the world" again. See earlier for
suggested phrasing.

== Section 4.3.1 Derivation ==

"A derivation is a transformation of an entity into another, a construction of
an entity into another, or an update of an entity, resulting in a new one."
seems ungrammatical. Suggest:
"A derivation is a transformation of an entity into another, a construction of
an entity *from* another, or an update of an entity, resulting in a new one."

== Section 4.5 Collections ==

I'm not understanding why this needs to be part of the core PROV-DM, and cannot
be handled by domain-specific notions of aggregation.

The stated goal is that "it is also of interest to be able to express the
provenance of the collection itself" - this could be done equally well with a
domain-specific collection notion, AFAICT.