- From: Graham Klyne <graham.klyne@zoo.ox.ac.uk>
- Date: Thu, 23 Feb 2012 13:16:20 +0000
- To: W3C provenance WG <public-prov-wg@w3.org>
Reviewing:
http://dvcs.w3.org/hg/prov/raw-file/7aadc6332722/model/ProvenanceModel.html
Summary: I'm sorry to say that I don't think the document even starts to bring
in the kind of simplification discussed at the F2F meeting, which is required if
this spec is to gain traction with web developers.
I find the document is still difficult to read, and in a full morning of
reviewing it I've only got as far as section 5. I think further *radical*
simplification is required for the data model description, and I think it's
possible without losing any essential information about the model.
...
(Nit: when I load this document from a local copy of the repository, I get an
error reported indicating a problem with fetching the CSS. It loads OK from
the above URI. Is there a problematic relative URI reference in the source
document?)
...
I thought we'd agreed at F2F to provide a simple "scruffy" introduction to the
DM (part 1), then introduce the requirement and refinements for more formally
tractable provenance expressions that can be used to build accurate historical
records over multiple related artifacts (part 2). The document I'm reading
does very little that I can see to make the prov-dm more approachable, as was
indicated that we need to do at the F2F. As far as I can tell, the only thing
that has been done in this direction is to *add* a new section on interpretation.
This, of itself, does nothing to simplify the DM description.
I think we should be placing far more emphasis on making it as simple as we
possibly can for information providers to publish provenance. Consider that the
primary beneficiaries of provenance information are the *consumers* of published
information, not the *publishers*, so if we make life unnecessarily hard for
publishers we're shooting ourselves in the collective foot. From this, I think
the initial introduction to the DM needs to be radically simplified to the
extent that a developer can spend 10-15 minutes glancing at it and think "oh
yes, I can easily add this to my output data". If necessary, we push some of
the work of understanding what needs to be done to harmonize the data to make it
more suitable for building a historical record towards the consumer.
...
With this in mind:
Section 2:
The introductory material in section 2.1 is unhelpful, and I propose it be
removed from the introduction. Most of this material is not important until we
come to consider the more formal aspects of the DM. The exception is
2.1.2.1 about events, which I think should be introduced in the PROV-DM core
model section. I would likewise remove sections 2.2 and 2.3 (maybe moving the two
introductory sentences of 2.2 into section 2.4). Thus section 2 would become
just a very brief intro to the notation used for describing ASN, and maybe this
could be moved into the PROV-DM core section (sect 5).
Section 3 looks generally useful. But it still mentions an "account record",
which I understood was being dropped. It also mentions "alternateOf" and
"specializationOf" which are not necessary for a "scruffy" introduction to
provenance, so I suggest mention of these is dropped from here. I suggest
dropping the sentence about core and common relations - it's just noise. With
the removal of accounts, I think the whole purpose of notes/annotation records
*as part of the provenance model* has become moot, and suggest that these be
dropped from the spec. There's nothing to prevent annotations being added to
the provenance data as rdfs:comment or rdfs:label values. I suggest dropping
the mention of extensibility points: again, it's just noise at this point.
Section 4: to my mind, this example section adds no useful information and
doesn't help understanding of the model (on account of being harder to follow
than the ASN model description), so I suggest that it be dropped. Alternatively,
I suggest moving it to an appendix.
Section 5: this is the vital core of this document. Section 3 provides a very
useful high-level overview, so this section can just get down to describing the
constructs.
I note that ASN is mis-named: it's not really an *abstract* syntax notation;
it's quite concrete, so it's more like a (technology-neutral) functional syntax
notation. @@raise separate issue for this?
Section 5.1: prov-dm is a data model, not an implementation, right? So why do
we need to introduce "housekeeping constructs ... to facilitate their
interchange"? Suggest dropping most of the discussion of "record container",
and simply introduce the "recordContainer" and "namespaceDeclaration"
productions along with production for "record".
Section 5.2.1: Entity record
Suggest drop "In PROV-DM, " - it's redundant.
Suggest the examples focus more on web documents, with "car" as more of an
afterthought. Primary use will probably be to describe web documents, so let's
keep this at front-of-mind?
Suggest dropping all mentions of "asserters viewpoint" and "situation in the
world" - these don't matter for the "scruffy" view of provenance.
Suggest dropping the idea that the attributes somehow define the entity ("whose
situation in the world is represented by the attribute-value pairs"). They're
just there to provide information about the entity, and as hooks for
interoperability. (I argued previously for dropping attributes completely, but
was persuaded otherwise by the interoperability argument from the provenance
challenges - don't try to make more of them.)
Suggest drop issue mentioning "characterization interval" - I think it's now a
non-issue.
I think the issue of uniqueness of identifiers should be dealt with in the
introduction to ASN, not under the individual elements.
Under "further considerations", suggest dropping all but 3rd and 6th bullets.
In the 6th bullet, I don't understand the stuff about "a namespace also declares
the number of occurrences...". I have deep concern about what this might be
trying to say. In any case, shouldn't this be covered under a description of
the namespace, if needed?
I think the material about "activities" and "plans" really doesn't belong in
this section.
Section 5.2.2 Activity record
Suggest drop "In PROV-DM, " - it's redundant.
Didn't we discuss replacing the start, end times by events? I don't recall the
outcome - I'm just mentioning this in case it's been missed.
For the example, I suggest leading on something to do with information on the web.
It was a surprise to me to learn that PROV-DM has reserved attributes. If
attributes are in the model to support interoperability with other provenance
frameworks (which is my understanding from previous discussions), this feels
like a poor design choice. Maybe it should be a separate parameter? In any
case, I think the intent of this "subtyping" needs to be explained.
If this is to be a "scruffy" introduction, I think the reference to
start-view-end is not needed here. In any case, the cross-reference is almost
impossible to locate in a printed copy of the spec.
I think the issue of uniqueness of identifiers should be dealt with in the
introduction to ASN, not under the individual elements.
Suggest dropping the "further considerations" bullets.
Did we not agree that activities *would* be allowable as entities (especially if
entities are just stuff that can be identified)?
Section 5.2.3, Agent record
Having introduced a framework for subtyping for activities, why not use the same
approach for different types of agents ... especially considering that two major
agent types are defined by reference to existing foaf definitions? I suggest
not asserting the claim that the agent types are mutually exclusive.
Suggest drop reference to "situation in the world".
Suggest drop discussion of inferences of agent records - if needed, they should
come later along with a more formal ("non-scruffy") treatment of the data model.
Section 5.2.4, Note record
I think this should be dropped from the data model. I don't see that it serves
any needed *provenance* function. "extra information" can be added by
format-specific extensions. As such, this record type only adds noise to the
specification.
Section 5.3.1.1 generation record
I believe the ASN syntax given verges on being ambiguous, and is unnecessarily
tricky to parse by a human or machine consumer; e.g. consider:
wasGeneratedBy(a,b)
wasGeneratedBy(a,b,)
The presence of the trailing comma in the second example completely changes the
parse tree productions associated with a and b. I think it would be much easier
if ASN simply required a dummy activity identifier to be provided; i.e. don't
make aidentifier optional. Indeed, rather than allowing optional identifiers
anywhere in the ASN, one might use a placeholder (e.g. '_') for any unspecified
identifier, which would make the overall syntax much more regular.
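To illustrate the placeholder idea, here is a minimal sketch in Python. It is
not the ASN grammar from the document: the slot names in GENERATION_SLOTS and
the parse_record helper are hypothetical, chosen only to show that a fixed
number of positions with '_' for "unspecified" leaves no room for a trailing
comma to change the meaning.

```python
import re

PLACEHOLDER = "_"

# Hypothetical slot layout for illustration only; not taken from the spec.
GENERATION_SLOTS = ("entity", "activity", "time")


def parse_record(text, slots):
    """Parse 'name(arg1,arg2,...)' where every slot must be written out.

    An unspecified slot is written as '_' and comes back as None.
    Missing or empty positions are rejected rather than guessed at.
    """
    match = re.fullmatch(r"\s*(\w+)\((.*)\)\s*", text)
    if match is None:
        raise ValueError(f"not a record expression: {text!r}")
    name, body = match.groups()
    args = [arg.strip() for arg in body.split(",")]
    if len(args) != len(slots):
        raise ValueError(f"{name} expects {len(slots)} arguments, got {len(args)}")
    if any(arg == "" for arg in args):
        raise ValueError(f"{name} has an empty argument; use '{PLACEHOLDER}'")
    values = {
        slot: (None if arg == PLACEHOLDER else arg)
        for slot, arg in zip(slots, args)
    }
    return name, values
```

With this shape, wasGeneratedBy(a,b) and wasGeneratedBy(a,b,) are both simply
rejected, and the intended reading has to be spelled out, e.g.
wasGeneratedBy(a,b,_).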
Since the id is used only for annotations, I suggest dropping it (see section
5.2.4 comment above).
If this is to be a "scruffy" introduction, I think the reference to
generation-within-activity is not needed here. In any case, the cross-reference
is almost impossible to locate in a printed copy of the spec. Suggest drop this.
Similarly, suggest dropping the structural constraint here.
Section 5.3.1.2 Usage record
Suggest drop "In PROV-DM, " - it's redundant.
Why is there an identifier for a usage record?
Suggest lead with example of consuming a web resource.
Suggest drop reference to annotation record (see above note about 5.2.4)
Suggest drop reference to interpretation here
Section 5.3.2.1 Association record
Para 3: Suggest drop first sentence, and simplify; i.e. just say: "Activities
may reflect the execution of a plan..."
Para 4: there is quite a bit of redundancy here. Suggest:
[[
A plan is the description of a set of actions or steps intended by one or more
agents to achieve some goal. PROV-DM is not prescriptive about the nature of
plans, their representation, the actions and steps they consist of, and their
intended goals. A plan can be a workflow for a scientific experiment, a recipe
for a cooking activity, or a list of instructions for a micro-processor
execution. Plans are entities, which may have associated provenance. An activity
may be associated with multiple plans, allowing for descriptions of activities
initially associated with a plan, which was changed, on the fly, as the activity
progresses. Plans can be successfully executed or they can fail. We expect
applications to exploit PROV-DM extensibility mechanisms to capture the rich
nature of plans and associations between activities and plans.
]]
Para 5: I see no value in cross-referencing the responsibility record here.
Suggest dropping this paragraph.
Why is there an identifier for an association record?
Section 5.3.2.2 Start and End records
This seems to overlap with start, end parameters on an activity. It's not
immediately clear how they play together.
Should this record not describe an "event"? Then the id should identify the
start/end event, not the record. cf. Issue 207.
Identifiers should denote activities and agents, *not records*.
Section 5.3.3.1 Responsibility record
Suggest drop "To promote take-up... " and instead lead with a simple
introduction of what the record describes.
Para 3: It seems to me that the responsibility record should stand independently
of any association record. Suggest drop "Given an activity association
record... (...)"
Why is there an identifier for a responsibility record?
Section 5.3.3.2 Derivation record
Suggest drop "In PROV-DM, "
This whole section seems way too complicated. My understanding is that the
"Common relations" section is intended to cover those useful short-cut
expressions that can be expressed with less convenience in the core model. As
such, I think the derivation record should be a "common" rather than a "core"
relation.
Aside from that, I really don't see the utility of all this stuff about precise
and imprecise derivations. I think there is just one useful relation to define,
roughly corresponding to "imprecise n-derivation record" here:
- I note that the "imprecise 1-derivation record" and "imprecise n-derivation
record" are not syntactically distinguishable, so there's no point in discussing
the difference.
- the "precise 1-derivation record" can be expressed using an activity, usage
and generation record: I'm not convinced this alternative syntax is really
buying anything worthwhile.
Suggest radical simplification along these lines, and move to section 6. Don't
introduce all the formal stuff until a later section handling more formal
treatments.
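As a rough sketch of the second point above (Python; the record types and
helper names are hypothetical, not part of PROV-DM), a derivation that names
its activity can be written as one usage record plus one generation record,
and read back by matching on the shared activity. Note this is the "scruffy"
reading: a shared activity alone does not prove that a particular input
influenced a particular output.

```python
from typing import NamedTuple


class Used(NamedTuple):
    activity: str
    entity: str


class WasGeneratedBy(NamedTuple):
    entity: str
    activity: str


def expand_derivation(derived, source, activity):
    """Spell out 'derived came from source via activity' as two core records."""
    return [Used(activity, source), WasGeneratedBy(derived, activity)]


def derivations(records):
    """List (derived, source, activity) triples for records sharing an activity."""
    used = [r for r in records if isinstance(r, Used)]
    generated = [r for r in records if isinstance(r, WasGeneratedBy)]
    return sorted(
        (g.entity, u.entity, g.activity)
        for g in generated
        for u in used
        if g.activity == u.activity
    )
```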
Section 5.3.3.3 Alternate and Specialization records
In considering a "scruffy" view of provenance, these relations aren't really
needed. However, they do underpin a more formal treatment in the face of
dynamic resources.
I would give serious consideration to introducing these later, when the more
formal treatment of dynamic resources is considered.
Section 5.3.4. Annotation record
I think this serves no needed purpose, and should be dropped. (See earlier
comments for section 5.2.4.)
Section 5.4.1 Account record
I understood we'd agreed to drop this.
Section 5.4.2 Record container
I think this is mainly an artifact of the ASN syntax, and should be introduced
more briefly in the introductory section 5.1 (see previous comments)
Section 5.5.1 Attribute
I think the "optional-attribute-value" productions covered in section 5.2.1
(Entity) should be covered here since they apply to multiple record types.
I would prefer to see attribute names presented as being IRIs in the data model,
with the namespace-qualified CURIE syntax available as a convenience in the ASN
presentation.
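A minimal sketch of what I mean (Python; the "ex" prefix, the example.org
namespace and the expand_curie helper are made up for illustration): the data
model only ever sees the full IRI, and the prefixed form is expanded when the
ASN text is read.

```python
def expand_curie(curie, namespaces):
    """Expand 'prefix:local' to a full IRI using declared namespaces."""
    prefix, separator, local = curie.partition(":")
    if not separator:
        raise ValueError(f"not a prefixed name: {curie!r}")
    if prefix not in namespaces:
        raise KeyError(f"undeclared namespace prefix: {prefix!r}")
    return namespaces[prefix] + local


# Hypothetical namespace declaration, as a record container might carry.
NAMESPACES = {"ex": "http://example.org/"}
```

Here expand_curie("ex:version", NAMESPACES) gives
"http://example.org/version", which is the attribute name as far as the data
model is concerned.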
I think the predefined attribute names should be dealt with in a separate
section. I'm actually not convinced this is the best design choice for
properties with DM-defined meaning, as opposed to (say) using separate record
parameters, but that's more of a style issue than a fundamental objection.
As indicated earlier, I think the whole discussion of derivation steps is too
much detail, and I don't see the value, and would suggest dropping the
prov:steps attribute.
For attribute prov:label: why not just use rdfs:label?
Section 5.5.2 Identifiers
The text says they are *qualified* names, but in most of the examples they are
not. Also, some identifiers are described as having local scope: this is not
compatible with using *qualified* names which are essentially IRIs.
The text describes identifiers as denoting *records* (e.g. entity record) - I
think this is wrong, and in any case is inconsistent with text elsewhere in the
document. They should denote "entity", "activity", "agent", etc.
Section 5.5.3 Literal
"A PROV-DM Literal represents a value whose interpretation is outside the scope
of PROV-DM." What a Terrible Failure... the whole point of languages
introducing literals is precisely that their interpretation *is* defined by the
language. If not, they might as well be names.
I think the intent is that their interpretation is defined by reference to the
corresponding xsd datatype definition, or some other datatype definition, that
is effectively incorporated by reference.
I'd suggest that an interpretation of literals is provided by:
- http://www.w3.org/TR/rdf-mt/#gddenot
- http://www.w3.org/TR/rdf-mt/#DTYPEINTERP
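In the spirit of those references, a simplified sketch (Python; the table and
the literal_value helper are my own illustration, and only three XSD datatypes
are covered): each datatype IRI selects a lexical-to-value mapping, so the
meaning of a literal is fixed by its datatype rather than left open.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"


def _integer(lexical):
    # Accept only an optional sign followed by decimal digits.
    if not re.fullmatch(r"[+-]?[0-9]+", lexical):
        raise ValueError(f"not an integer lexical form: {lexical!r}")
    return int(lexical)


def _boolean(lexical):
    table = {"true": True, "1": True, "false": False, "0": False}
    if lexical not in table:
        raise ValueError(f"not a boolean lexical form: {lexical!r}")
    return table[lexical]


# Datatype IRI -> lexical-to-value mapping (deliberately incomplete).
LEXICAL_TO_VALUE = {
    XSD + "string": str,
    XSD + "integer": _integer,
    XSD + "boolean": _boolean,
}


def literal_value(lexical, datatype):
    """Return the value denoted by a typed literal, or raise ValueError."""
    if datatype not in LEXICAL_TO_VALUE:
        raise ValueError(f"unrecognised datatype: {datatype!r}")
    return LEXICAL_TO_VALUE[datatype](lexical)
```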
Section 5.5.4 Time
No syntax production provided or indicated.
I think it's unnecessary and inappropriate to indicate where time is used. It's
just something to go wrong as the document evolves.
Section 5.5.5 Asserter
Do we really still need this (now accounts are gone)? Suggest dropping.
Section 5.5.6 Namespace
I'd suggest covering this with the introduction of the record container syntax
production
Section 5.5.7 Location
Do we have any explicit use of this? If not, I'd suggest dropping it.
...
I'm out of time and stopping my review here. There's a general pattern here
that I'd also apply to section 6.
I'd then take section 7 and (probably) expand it into several sections ("Part
2") introducing and describing a more formal treatment of provenance that can be
used to bridge from and refine the "scruffy" view to something that can be
assembled and processed according to inferences that flow from the formal
semantics. A key point to introduce here would be that it is possible to create
provenance statements that cannot possibly satisfy the formal semantics, and to
indicate what additional constraints and disciplines should be applied to ensure
that they can (and hence to make the inferences that flow from those semantics
valid).
#g
--
Received on Thursday, 23 February 2012 13:18:01 UTC