Review of PROV-DM (WD4) up to section 5 from Graham Klyne on 2012-02-23 (public-prov-wg@w3.org from February 2012)

From: Graham Klyne <graham.klyne@zoo.ox.ac.uk>
Date: Thu, 23 Feb 2012 13:16:20 +0000
To: W3C provenance WG <public-prov-wg@w3.org>
Message-ID: <4F463C24.4030300@zoo.ox.ac.uk>
Reviewing: 
http://dvcs.w3.org/hg/prov/raw-file/7aadc6332722/model/ProvenanceModel.html

Summary: I'm sorry to say that I don't think the document even starts to bring 
in the kind of simplification discussed at the F2F meeting, which is required if 
this spec is to gain traction with web developers.

I find the document is still difficult to read, and in a full morning of 
reviewing it I've only got as far as section 5.  I think further *radical* 
simplification is required for the data model description, and I think it's 
possible without losing any essential information about the model.

...

(Nit: when I load this document from a local copy of the repository, I get an 
error reported indicating a problem with fetching the CSS.   It loads OK from 
the above URI.  Is there a problematic relative URI reference in the source 
document?)

...

I thought we'd agreed at F2F to provide a simple "scruffy" introduction to the 
DM (part 1), then introduce the requirement and refinements for more formally 
tractable provenance expressions that can be used to build accurate historical 
records over multiple related artifacts (part 2).   The document I'm reading 
does very little that I can see to make the prov-dm more approachable, as was 
indicated that we need to do at the F2F.  As far as I can tell, the only thing 
that has been done in this direction is to *add* a new section on interpretation. 
This, of itself, does nothing to simplify the DM description.

I think we should be placing far more emphasis on making it as simple as we 
possibly can for information providers to publish provenance.  Consider that the 
primary beneficiaries of provenance information are the *consumers* of published 
information, not the *publishers*, so if we make life unnecessarily hard for 
publishers we're shooting ourselves in the collective foot.  From this, I think 
the initial introduction to the DM needs to be radically simplified to the 
extent that a developer can spend 10-15 minutes glancing at it and think "oh 
yes, I can easily add this to my output data".  If necessary, we push some of 
the work of understanding what needs to be done to harmonize the data to make it 
more suitable for building a historical record towards the consumer.

...

With this in mind:

Section 2:

The introductory material in section 2.1 is unhelpful, and I propose it be 
removed from the introduction.  Most of this material is not important until we 
come to consider the more formal aspects of the DM.  With the exception of 
2.1.2.1 about events, which I think should be introduced in the PROV-DM core 
model section.  Similarly sections 2.2 and 2.3 (maybe moving the two 
introductory sentences of 2.2 into section 2.4).  Thus section 2 would become 
just a very brief intro to the notation used for describing ASN, and maybe this 
could be moved into the PROV-DM core section (sect 5).

Section 3 looks generally useful.  But it still mentions an "account record", 
which I understood was being dropped.  It also mentions "alternateOf" and 
"specializationOf" which are not necessary for a "scruffy" introduction to 
provenance, so I suggest mention of these is dropped from here.  I suggest 
dropping the sentence about core and common relations - it's just noise.  With 
the removal of accounts, I think the whole purpose of notes/annotation records 
*as part of the provenance model* has become moot, and suggest that these be 
dropped from the spec.  There's nothing to prevent annotations being added to 
the provenance data as rdfs:comment or rdfs:label values.  I suggest dropping 
the mention of extensibility points: again, it's just noise at this point.

Section 4:  to my mind, this example section adds no useful information and 
doesn't help understanding of the model (on account of being harder to follow 
than the ASN model description), and I suggest that it be dropped.  
Alternatively, I suggest moving it to an appendix.

Section 5: this is the vital core of this document.  Section 3 provides a very 
useful high-level overview, so this section can just get down to describing the 
constructs.

I note that ASN is mis-named: it's not really an *abstract* syntax notation; 
it's quite concrete, so it's more like a (technology-neutral) functional syntax 
notation.  @@raise separate issue for this?

Section 5.1:  prov-dm is a data model, not an implementation, right?  So why do 
we need to introduce "housekeeping constructs ... to facilitate their 
interchange"?  Suggest dropping most of the discussion of "record container", 
and simply introduce the "recordContainer" and "namespaceDeclaration" 
productions along with production for "record".


Section 5.2.1:  Entity record

Suggest drop "In PROV-DM, " - it's redundant.

Suggest the examples focus more on web documents, with "car" as more of an 
afterthought.  Primary use will probably be to describe web documents, so let's 
keep this at front-of-mind?

Suggest dropping all mentions of "asserters viewpoint" and "situation in the 
world" - these don't matter for the "scruffy" view of provenance.

Suggest dropping the idea that the attributes somehow define the entity ("whose 
situation in the world is represented by the attribute-value pairs").  They're 
just there to provide information about the entity, and as hooks for 
interoperability.  (I argued previously for dropping attributes completely, but 
was persuaded otherwise by the interoperability argument from the provenance 
challenges - don't try to make more of them.)

Suggest drop issue mentioning "characterization interval" - I think it's now a 
non-issue.

I think the issue of uniqueness of identifiers should be dealt with in the 
introduction to ASN, not under the individual elements.

Under "further considerations", suggest dropping all but 3rd and 6th bullets. 
In the 6th bullet, I don't understand the stuff about "a namespace also declares 
the number of occurrences...".  I have deep concern about what this might be 
trying to say.  In any case, shouldn't this be covered under a description of 
the namespace, if needed?

I think the material about "activities" and "plans" really doesn't belong in 
this section.


Section 5.2.2 Activity record

Suggest drop "In PROV-DM, " - it's redundant.

Didn't we discuss replacing the start, end times by events?  I don't recall the 
outcome - I'm just mentioning this in case it's been missed.

For the example, I suggest leading on something to do with information on the web.

It was a surprise to me to learn that PROV-DM has reserved attributes.  If 
attributes are in the model to support interoperability with other provenance 
frameworks (which is my understanding from previous discussions), this feels 
like a poor design choice.  Maybe it should be a separate parameter?  In any 
case, I think the intent of this "subtyping" needs to be explained.

If this is to be a "scruffy" introduction, I think the reference to 
start-view-end is not needed here.  In any case, the cross-reference is almost 
impossible to locate in a printed copy of the spec.

I think the issue of uniqueness of identifiers should be dealt with in the 
introduction to ASN, not under the individual elements.

Suggest dropping the "further considerations" bullets.

Did we not agree that activities *would* be allowable as entities (especially if 
entities are just stuff that can be identified)?


Section 5.2.3, Agent record

Having introduced a framework for subtyping for activities, why not use the same 
approach for different types of agents ... especially considering that two major 
agent types are defined by reference to existing foaf definitions?  I suggest 
not asserting the claim that the agent types are mutually exclusive.

Suggest drop reference to "situation in the world".

Suggest drop discussion of inferences of agent records - if needed, they should 
come later along with a more formal ("non-scruffy") treatment of the data model.


Section 5.2.4, Note record

I think this should be dropped from the data model.  I don't see that it serves 
any needed *provenance* function.  "extra information" can be added by 
format-specific extensions.  As such, this record type only adds noise to the 
specification.


Section 5.3.1.1 Generation record

I believe the ASN syntax given verges on being ambiguous, and is unnecessarily 
tricky to parse by a human or machine consumer; e.g. consider:

   wasGeneratedBy(a,b)
   wasGeneratedBy(a,b,)

The presence of the trailing comma in the second example completely changes the 
parse tree productions associated with a and b.  I think it would be much easier 
if ASN simply required a dummy activity identifier to be provided; i.e. don't 
make aidentifier optional.  Indeed, rather than allowing optional identifiers 
anywhere in the ASN, one might use a placeholder (e.g. '_') for any unspecified 
identifier, which would make the overall syntax much more regular.
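
To illustrate the placeholder suggestion, here is a minimal sketch of a parser for a regularised form of the notation.  It is an illustration only, not the ASN grammar from the draft: the function name parse_record, the fixed arity, and the meaning of each argument position are all assumptions made for the example.  Every position must be filled, and '_' stands for an unspecified identifier, so a trailing comma is simply an error rather than a different parse.

```python
import re

# Hypothetical, simplified grammar: name(arg, arg, ...), where every
# position must be filled and '_' marks an unspecified identifier.
RECORD = re.compile(r"^\s*(\w+)\s*\(\s*(.*?)\s*\)\s*$")


def parse_record(text, arity):
    """Parse 'name(a,b,...)' into (name, [args]); '_' becomes None."""
    match = RECORD.match(text)
    if match is None:
        raise ValueError(f"not a record: {text!r}")
    name, body = match.groups()
    args = [part.strip() for part in body.split(",")]
    # Reject missing or empty positions instead of treating them as optional.
    if len(args) != arity or any(arg == "" for arg in args):
        raise ValueError(f"{name} expects exactly {arity} non-empty arguments")
    return name, [None if arg == "_" else arg for arg in args]
```

With this shape, parse_record("wasGeneratedBy(a,b,_)", 3) always yields three positions, while "wasGeneratedBy(a,b,)" is rejected outright.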

Since the id is used only for annotations, I suggest dropping it (see section 
5.2.4 comment above).

If this is to be a "scruffy" introduction, I think the reference to 
generation-within-activity is not needed here.  In any case, the cross-reference 
is almost impossible to locate in a printed copy of the spec.  Suggest drop this.

Similarly, suggest dropping the structural constraint here.


Section 5.3.1.2 Usage record

Suggest drop "In PROV-DM, " - it's redundant.

Why is there an identifier for a usage record?

Suggest lead with example of consuming a web resource.

Suggest drop reference to annotation record (see above note about 5.2.4)

Suggest drop reference to interpretation here


Section 5.3.2.1 Association record

Para 3: Suggest drop first sentence, and simplify; i.e. just say; "Activities 
may reflect the execution of a plan..."

Para 4, there's quite a bit of redundancy redundancy here.  Suggest:
[[
A plan is the description of a set of actions or steps intended by one or more 
agents to achieve some goal. PROV-DM is not prescriptive about the nature of 
plans, their representation, the actions and steps they consist of, and their 
intended goals. A plan can be a workflow for a scientific experiment, a recipe 
for a cooking activity, or a list of instructions for a micro-processor 
execution. Plans are entities, which may have associated provenance. An activity 
may be associated with multiple plans, allowing for descriptions of activities 
initially associated with a plan, which was changed, on the fly, as the activity 
progresses. Plans can be successfully executed or they can fail. We expect 
applications to exploit PROV-DM extensibility mechanisms to capture the rich 
nature of plans and associations between activities and plans.
]]

Para 5: I see no value in cross-referencing the responsibility record here. 
Suggest dropping this paragraph.

Why is there an identifier for an association record?


Section 5.3.2.2 Start and End records

This seems to overlap with start, end parameters on an activity.   It's not 
immediately clear how they play together.

Should this record not describe an "event"?  Then the id should identify the 
start/end event, not the record.  cf. Issue 207.

Identifiers should denote activities and agents, *not records*.


Section 5.3.3.1 Responsibility record

Suggest drop "To promote take-up... " and instead lead with a simple 
introduction of what the record describes.

Para 3: It seems to me that the responsibility record should stand independently 
of any association record.  Suggest drop "Given an activity association 
record... (...)"

Why is there an identifier for a responsibility record?


Section 5.3.3.2 Derivation record

Suggest drop "In PROV-DM, "

This whole section seems way too complicated.  My understanding is that the 
"Common relations" section is intended to cover those useful short-cut 
expressions that can be expressed with less convenience in the core model.  As 
such, I think the derivation record should be a "common" rather than a "core" 
relation.

Aside from that, I really don't see the utility of all this stuff about precise 
and imprecise derivations.  I think there is just one useful relation to define, 
roughly corresponding to "imprecise n-derivation record" here:

- I note that the "imprecise 1-derivation record" and "imprecise n-derivation 
record" are not syntactically distinguishable, so there's no point in discussing 
the difference.

- the "precise 1-derivation record" can be expressed using an activity, usage 
and generation record: I'm not convinced this alternative syntax is really 
buying anything worthwhile.

Suggest radical simplification along these lines, and move to section 6.  Don't 
introduce all the formal stuff until a later section handling more formal 
treatments.
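
As a rough sketch of the point about the "precise 1-derivation record": the same information can be written as an activity, a usage and a generation.  The tuple layout and the function name expand_precise_derivation are hypothetical and chosen only for this illustration; they are not the record syntax defined in the draft.

```python
def expand_precise_derivation(derived, source, activity):
    """Rewrite one 'derived from source via activity' statement as three records.

    The tuples are a hypothetical in-memory form: (record kind, arguments...).
    """
    return [
        ("activity", activity),                 # the activity itself
        ("used", activity, source),             # the activity used the source
        ("wasGeneratedBy", derived, activity),  # and generated the derived entity
    ]
```

For example, expand_precise_derivation("e2", "e1", "a") gives one activity record, one usage of e1 by a, and one generation of e2 by a.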


Section 5.3.3.3 Alternate and Specialization records

In considering a "scruffy" view of provenance, these relations aren't really 
needed.  However, they do underpin a more formal treatment in the face of 
dynamic resources.

I would give serious consideration to introducing these later, when the more 
formal treatment of dynamic resources is considered.


Section 5.3.4.  Annotation record

I think this serves no needed purpose, and should be dropped.  (See earlier 
comments for section 5.2.4.)


Section 5.4.1 Account record

I understood we'd agreed to drop this.


Section 5.4.2 Record container

I think this is mainly an artifact of the ASN syntax, and should be introduced 
more briefly in the introductory section 5.1 (see previous comments)


Section 5.5.1 Attribute

I think the "optional-attribute-value" productions covered in section 5.2.1 
(Entity) should be covered here since they apply to multiple record types.

I would prefer to see attribute names presented as being IRIs in the data model, 
with the namespace-qualified CURIE syntax available as a convenience in the ASN 
presentation.
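
A minimal sketch of what that could look like, assuming the namespace declarations are available as a prefix-to-IRI mapping: the data model only ever sees the full IRI, and the prefixed form is expanded on the way in.  The function name expand_qname and the example.org namespace are assumptions for illustration.

```python
def expand_qname(qname, namespaces):
    """Expand 'prefix:local' to a full IRI using declared namespaces."""
    prefix, sep, local = qname.partition(":")
    if not sep:
        raise ValueError(f"not a prefixed name: {qname!r}")
    if prefix not in namespaces:
        raise KeyError(f"undeclared namespace prefix: {prefix!r}")
    # Plain concatenation of the namespace IRI and the local part.
    return namespaces[prefix] + local


# Hypothetical namespace declaration, as might appear in a record container.
namespaces = {"ex": "http://example.org/"}
attribute_iri = expand_qname("ex:version", namespaces)
```

Here attribute_iri is "http://example.org/version", which is what the model would compare and exchange, whatever prefix a given serialization happened to use.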

I think the predefined attribute names should be dealt with in a separate 
section.  I'm actually not convinced this is the best design choice for 
properties with DM-defined meaning, as opposed to (say) using separate record 
parameters, but that's more of a style issue than a fundamental objection.

As indicated earlier, I think the whole discussion of derivation steps is too 
much detail, and I don't see the value, and would suggest dropping the 
prov:steps attribute.

For attribute prov:label:  why not just use rdfs:label?


Section 5.5.2 Identifiers

The text says they are *qualified* names, but in most of the examples they are 
not.  Also, some identifiers are described as having local scope: this is not 
compatible with using *qualified* names which are essentially IRIs.

The text describes identifiers as denoting *records* (e.g. entity record) - I 
think this is wrong, and in any case is inconsistent with text elsewhere in the 
document.  They should denote "entity", "activity", "agent", etc.


Section 5.5.3 Literal

"A PROV-DM Literal represents a value whose interpretation is outside the scope 
of PROV-DM."  What a Terrible Failure... the whole point of languages 
introducing literals is precisely that their interpretation *is* defined by the 
language.  If not, they might as well be names.

I think the intent is that their interpretation is defined by reference to the 
corresponding xsd datatype definition, or some other datatype definition, that 
is effectively incorporated by reference.

I'd suggest that an interpretation of literals is provided by:
- http://www.w3.org/TR/rdf-mt/#gddenot
- http://www.w3.org/TR/rdf-mt/#DTYPEINTERP
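
To make the idea concrete, here is a minimal sketch of a datatype-driven interpretation: a literal is a lexical form plus a datatype IRI, and the datatype supplies the mapping from lexical form to value.  The table below covers only a handful of XSD datatypes and uses Python's own parsers as stand-ins, which do not match the XSD lexical spaces exactly; the names LEXICAL_TO_VALUE and literal_value are assumptions for the example.

```python
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"

# Datatype IRI -> lexical-to-value mapping (approximate stand-ins only).
LEXICAL_TO_VALUE = {
    XSD + "integer": int,
    XSD + "decimal": Decimal,
    XSD + "string": str,
    XSD + "boolean": lambda s: {"true": True, "1": True,
                                "false": False, "0": False}[s],
}


def literal_value(lexical, datatype_iri):
    """Return the value denoted by a typed literal, via its datatype."""
    try:
        mapping = LEXICAL_TO_VALUE[datatype_iri]
    except KeyError:
        raise ValueError(f"unrecognized datatype: {datatype_iri}") from None
    return mapping(lexical)
```

The point of the sketch is that the interpretation is fully determined once the datatype is known; a literal with an unrecognized datatype is reported as such rather than silently treated as a name.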

Section 5.5.4 Time

No syntax production provided or indicated.

I think it's unnecessary and inappropriate to indicate where time is used.  It's 
just something to go wrong as the document evolves.


Section 5.5.5 Asserter

Do we really still need this (now accounts are gone)?  Suggest dropping.


Section 5.5.6 Namespace

I'd suggest covering this with the introduction of the record container syntax 
production


Section 5.5.7 Location

Do we have any explicit use of this?  If not, I'd suggest dropping it.

...

I'm out of time and stopping my review here.  There's a general pattern here 
that I'd also apply to section 6.

I'd then take section 7 and (probably) expand it into several sections ("Part 
2") introducing and describing a more formal treatment of provenance that can be 
used to bridge from and refine the "scruffy" view to something that can be 
assembled and processed according to inferences that flow from the formal 
semantics.  A key point to introduce here would be that it is possible to create 
provenance statements that cannot possibly satisfy the formal semantics, and to 
indicate what additional constraints and disciplines should be applied to ensure 
that they can (and hence to make the inferences that flow from those semantics 
valid).

#g
--
Received on Thursday, 23 February 2012 13:18:01 UTC