Comments on the current data model from Simon Miles on 2011-09-24 (public-prov-wg@w3.org from September 2011)

From: Simon Miles <simon.miles@kcl.ac.uk>
Date: Sat, 24 Sep 2011 17:55:41 +0100
To: Provenance Working Group WG <public-prov-wg@w3.org>
Message-ID: <CAKc1nHd7KZPnS6B789+RydD-HWn1VF_OXbrx5JAgQsdPCVguww@mail.gmail.com>

Luc, Paolo,

Here are my comments on the current data model document, annotated with
(T) for typo/text clarity or (C) for content comment/question. I think
most/all comments are small enough that an issue need not be raised.

Throughout:
(T) Sections are referred to in the text by "Section Entity", "Section
Process Execution" etc. Shouldn't these be the section numbers?
(T) There seems to be inconsistency in symbols following the change
from roles to qualifiers. Sometimes "q" is used in constraint
definitions, examples etc. and sometimes "r" is used. I suggest it
would be clearer to always use "q".
(T) There are a few instances of "characterised" in amongst the majority
"characterized" spelling.
(C) At least one standard qualifier name, "role", is used in the
document, but it is not clear what namespace this name is in. Does it
mean no other "role"s from domain-specific ontologies may be used in
Prov data?

Sec 2.1:
(T) paragraph 1: "Words such thing or activity" should be "Words such
as 'thing' or 'activity'"
(C) paragraph 2: The first mention of "provenance" in the document
proper is in the second paragraph of this section, and is a bit out of
the blue ("unambiguously report provenance"). Can we add some
intuition about what provenance is (for this data model)?
(T) Example paragraph 1: "perspectives about a resource" should be
"perspectives on a resource"
(C) Example paragraph 1: "the report independent of where it is hosted
over time" - I suggest also saying "and of its content over time", to
distinguish this entity from the report version entity above it
(C) paragraph 6: "punctual events"? "punctual" as most commonly used
implies prior planning of when something should occur. I'm not sure
what you are intending in this context.
(C) paragraph 6: "a partial order exists between events". I assume you
mean a temporal order? What kind of ordering do you mean?
(C) paragraph 6: "global notion of time and Lamport's style clocks" -
this seems like a weirdly specific level of detail for this overview
section, especially considering that many other aspects of the model
are not mentioned at all in the overview.

Sec 2.3:
(C) Regarding the note (not attempting to ensure consistency of an
asserter) - this seems practical. I'm not sure how we could enforce
consistency in any circumstance, only define what it means or say it
is application specific.

Sec 4.1:
(T) "We denote this e1." and the same for e2 etc. It is not entirely
clear whether "this" refers to the event or the entity.

Sec 4.2:
(C) The fact that Alice is the creator of e1 seems to be expressed
twice, first as an attribute "creator=Alice", and secondly as the
"creator" role of an agent in the creation process. I don't think it
is a good idea for either clarity of use of the model or for ensuring
interoperability for there to be multiple ways to express the same
thing, if it can be at all avoided. Even if we cannot stop someone
using either method, can't we say which they *should* use to aid
interoperability?
(T) "Generation expressions... represent the event at which a file is
created". The surrounding text is generic rather than specific to the
example, implying this should be "entity" rather than "file".
Otherwise, readers may assume that all entities are files or that
generation only applies to files.
(T) Paragraph on wasComplementOf: in "attribute content" and
"attribute spellchecked", fixed width font (or another font) should be
used for the attribute names to show they are names, else the sentence
can be read in strange ways.

Sec 4.3:
(T) Fig 1: The arrow from pe2 to a3 points in a different direction to the
other "agent" links. It is also not clear if an "agent" link is the
same as a "wasControlledBy" link. If so, the pe2-a3 arrow direction
makes most sense, as the others seem to be saying the agent was
controlled by the process execution.

Sec 5.1:
(T) The last sentence, regarding a "house-keeping construct" is rather
opaque. I'm not sure what the reader is supposed to understand from
this.

Sec 5.2.1:
(C) First sentence: "entity expression" is given exactly the same
definition that "entity" was given in Section 4. I think having two terms
for the same thing will cause confusion. I like addition of
"expressions" to the model in general, though, as I think this greatly
clarifies what is intended.
(C) "the meaning of attribute in the context of a process execution
expression is similar to the meaning of attribute for entity
expression" - I think the meaning should be exactly the same, not just
similar, else there will be confusion.
(C) Following from the above point: "A process execution expression's
attribute remains constant for the duration of the activity" - OK, but
does it also characterise the process execution, e.g. is the start
time part of what distinguishes one execution from others?
(T) "noted processExecution" - I think you mean "denoted" (or
"written" or "expressed")

Sec 5.2.3:
(T) "representation a characterized thing" - missing "of"
(T) Last sentence, "On the contrary" should be "On the other hand",
and "inferred" should be "infer"

Sec 5.2.4:
(T) Last sentence: "expectede"

Sec 5.3.3.1:
(C) I suggest that, as accounts are not introduced until later in the
document, the generation-unicity constraint will not make sense here.
Moreover, I think the constraint is more about accounts and what it
means for them to be consistent than it is about generation events or
process executions. Therefore, I suggest moving this constraint to the
section on accounts.
(C) Given that constraint derivation-events applies, don't we just
have two ways of saying the same thing? Why use the long form of
wasDerivedFrom when the same can be expressed using wasGeneratedBy and
used? Which variety *should* be used?
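
As a rough illustration of the redundancy in question, the following is
a minimal Python sketch, not anything taken from the data model
document. The class names, field names and argument order are
assumptions made for the example, as is the reading that the long form
of wasDerivedFrom carries an execution and two qualifiers. Under that
reading, the long form expands mechanically into one wasGeneratedBy and
one used expression:

```python
from typing import NamedTuple, Tuple


# Hypothetical record types; field names and order are assumptions,
# not the notation of the data model document.
class WasDerivedFrom(NamedTuple):
    derived: str               # e2, the derived entity
    source: str                # e1, the entity it was derived from
    execution: str             # pe, the process execution involved
    generation_qualifier: str  # q2
    use_qualifier: str         # q1


class WasGeneratedBy(NamedTuple):
    entity: str
    execution: str
    qualifier: str


class Used(NamedTuple):
    execution: str
    entity: str
    qualifier: str


def expand(d: WasDerivedFrom) -> Tuple[WasGeneratedBy, Used]:
    """Return the generation and use expressions implied by the long form."""
    generated = WasGeneratedBy(d.derived, d.execution, d.generation_qualifier)
    used = Used(d.execution, d.source, d.use_qualifier)
    return generated, used


long_form = WasDerivedFrom("e2", "e1", "pe", "q2", "q1")
generated, used = expand(long_form)
```

The sketch only shows the direction from the long form to the two
shorter expressions; whether the two shorter expressions alone are
meant to carry the same meaning is exactly the question raised above.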

Sec 5.3.3.2:
(T?) Constraint "derivation-linked-independent" seems to be a
tautology. I guess this is a typo?

Sec 5.3.3.3:
(T) Paragraph 4: "In other word" should be "In other words"

Sec 5.3.4:
(C) This section seems to be confusingly expressed, implying that
non-agent entities can control executions, whereas the control-agent
constraint (in the section on agents) contradicts this. It is probably
just a matter of clarifying the text, e.g. if you mean that a
non-agent entity can be asserted to be controlling an execution but
from this inferred to be an agent.
(T) The text may be read to imply that a control link has only one
qualifier, role, whereas I guess you mean that, like use/generate, it
can have multiple "modalities" as part of the qualifier?

Sec 5.3.5:
(C) I can see this section causing some difficulty... While that may
just be the nature of the topic, there seems an important thing
missing: what has complementarity got to do with provenance? In other
words, what value (with regard to provenance) is there in asserting
complementarity?
(C) The text suddenly starts talking about "properties" from the
second paragraph. What are these, and do they have any relation to
attributes?
(C) Should the justification of why the complementarity relation is
not transitive be in this document? I would expect this document to
just state that it is not transitive and, for brevity and simplicity,
leave justifications to another document.

Sec 5.3.6:
(C) Similarly to above, I'm not sure the justification of why
wasInformedBy is not transitive should be in this document.

Sec 5.3.8:
(C) Constraint participation: This seems odd to me. In what
circumstances would you not know or want to assert which of the three
possibilities (used/controlled/complement) applied for a given entity
and execution? Is hadParticipant as defined really useful?

Sec 5.3.9:
(C) Grammar definition: I don't understand what the
"relationIdentification" stuff is about or what all the identifiers
identify.

Sec 5.4.1:
(C) This appears to be yet another way to say the same thing,
following the comment on Sec 4.2 above. If A is an "asserter" of
expression E, then we can either (i) express E to be an entity and use
an attribute "asserter=A"; (ii) express E to be an entity and A to be
an agent playing "role=asserter"; or (iii) put A in the "asserter"
slot of an "account" expression containing E. Why do we need all three
ways? Isn't method (ii) most consistent with the rest of the model?
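
To make the interoperability cost concrete, here is a minimal Python
sketch of the three encodings, with A as the asserter of E. The
dictionary layouts, key names and the identifier "acc1" are all
hypothetical and are not the document's ASN; they only stand in for the
three methods listed above. A consumer that wants to find the asserter
then needs one code path per encoding:

```python
# Hypothetical, simplified encodings; none of these structures come
# from the data model document itself.

# (i) asserter recorded as an attribute of the entity
as_attribute = {"entity": "E", "attributes": {"asserter": "A"}}

# (ii) asserter recorded as an agent playing a role
as_role = {"entity": "E", "agent": "A", "qualifier": {"role": "asserter"}}

# (iii) asserter recorded in the asserter slot of an account
as_account = {"account": "acc1", "asserter": "A", "expressions": ["E"]}


def asserter_of(record):
    """Recover the asserter from any of the three encodings, else None."""
    if "account" in record:
        return record.get("asserter")
    if record.get("qualifier", {}).get("role") == "asserter":
        return record.get("agent")
    return record.get("attributes", {}).get("asserter")


results = [asserter_of(r) for r in (as_attribute, as_role, as_account)]
```

All three records describe the same fact, yet the lookup has to know
about each shape separately, which is the burden a single recommended
method would remove.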

Sec 5.4.2:
(T) Second sentence: "return all the provenance assertions" - all the
assertions? or just "all the assertions in the container"?
(C) Under the definition given, you cannot have expressions in a
container but not in an account. Does this imply that every Prov
expression is made accessible as part of an account? I think this
would be a good thing for clarity, but it is not explicit in the
document (and also differs from OPM).

Section 5.5.1:
(C) I agree with the first note. If it is mandatory to say something
but what we say can be nothing, that means that it is not
mandatory at all. The "mandatory" thing seems to be just saying
something about the ASN, and so is irrelevant as the ASN is just there
to make the model concrete and readable.

Sec 5.5.4:
(C) Second note: Wouldn't this mean that either account IDs or entity
IDs can never be URIs, as a sequence of URIs would itself not be a
URI? If so, that seems to make RDF serialisation difficult to achieve.

Sec 5.5.6:
(C) I don't see the connection between the section's introductory text
and the content of the subsections.

Sec 5.7.1:
(C) I think this section needs something introductory to say why it is
relevant to the data model, i.e. what has it to do with provenance,
why is it useful in the context of provenance, why is it standardised
rather than application-specific?
(C) If my record of what occurred does not start with an empty
container, but one with contents, how do I say that the elements are
part of the container? Do I have to model this as a series of
wasAddedTo links, even if I know nothing about how the elements were
added? Or is it out of scope of the standard?

Sec 5.7.2:
(C) I don't see how wasQuoteOf is a sub-relation of wasRevisionOf, or
wasAttributedTo a sub-relation of wasEventuallyDerivedFrom, when the
super-relations do not contain reference to any agents but the
sub-relations do. What does it mean?
(T) Last sentence of 5.7.2.2: "wasQuoteOf" should be "wasAttributedTo"

Thanks,
Simon

-- 
Dr Simon Miles
Lecturer, Department of Informatics
King's College London, WC2R 2LS, UK
+44 (0)20 7848 1166
Received on Saturday, 24 September 2011 16:56:09 UTC