Review of provenance model draft from Graham Klyne on 2011-07-28 (public-prov-wg@w3.org from July 2011)

From: Graham Klyne <graham.klyne@zoo.ox.ac.uk>
Date: Thu, 28 Jul 2011 22:38:07 +0100
To: W3C provenance WG <public-prov-wg@w3.org>
Message-ID: <4E31D6BF.7070105@zoo.ox.ac.uk>
With reference to:
http://dvcs.w3.org/hg/prov/raw-file/default/model/ProvenanceModel.html
Retrieved at about 17:30 on 28-Jul-2011

As promised, I've taken a tilt at reviewing the model draft.  I must say, I've 
found it to be really hard going - many of the notions described are not making 
sense to me, and the language used sometimes seems to be unnecessarily obscure.

After a mammoth session going through this, I really don't have the time or 
energy to split my comments out into separate issues.  I think many of them are 
purely editorial in nature, and as such could be cleaned up relatively easily. 
There are some substantive comments that I may separate out as formal issues 
later, but I'm rather hoping that won't be needed.

My comments follow:


3.1 Notation used is obscure.  What does [...[ mean?  Should be explained.

For a general audience, examples based on Unix command shell commands are 
probably not very helpful.

What is "characterized entity represented by the file"?  As this is an example, 
just say "crime statistics" - would that be a correct interpretation?


3.2 where did 'e0' come from? - it's not mentioned in 3.1.  What is it intended 
to denote?

The "agent" statements are completely impenetrable to me.

How is the notation to be interpreted?  It looks a bit like some kind of 
deviant Prolog, but either I've forgotten some of the basic constructs, or it's 
not entirely clear how the deviant bits are meant to be interpreted.


3.3 graphical representation: could be very useful, and would be much easier to 
follow if the illustration included a key.

What does it mean for an agent to be linked to a BOB as opposed to a process 
execution (cf. Alice and e0)?


4. About the Provenance Language

Introduction of "characterized entities" - if this is something that really 
needs to be said, I think it needs to be clarified.  I spent some time thinking 
about these two sentences, trying to work out if they could ever be completely 
correct, or just not understanding what they are intended to convey:
[[
Furthermore, this specification is concerned with characterized entities, that 
is, entities and their situation in the world, as perceived by their asserters.

In the rest of the document, we are concerned with the representation of such 
entities; their situation in the world will be represented using sets of attributes.
]]

Why "characterized entities" as opposed to "perceived entities"?  What's the 
important distinction here?

The only interpretation I've found that makes sense to me is that the document 
is concerning itself with entities that are characterized by the values of some 
bounded set of attributes.  But that interpretation, if correct, is not obvious 
to me from the wording here.


"PIL is a language by which representations of the world can be expressed using 
terms that are drawn from a controlled vocabulary. "
I'm not sure how to interpret this.  Does this "controlled vocabulary" include, 
for example, numbers? Is this controlled vocabulary expected to be the complete 
set of terms used in PIL expressions?


"These representations are relative to an asserter, and in that sense constitute 
assertions about the world."
What is this trying to say?  I think you might mean something like:
"These representations are relative to the context of an asserter, and in that 
sense constitute perceptions about the world."
which ties back to the earlier statement about "as perceived by their asserters".

"All assertions in PIL SHOULD be interpreted as a record of what has happened, 
as opposed to what may or will happen."
I feel we should find a way to strengthen this SHOULD to a MUST, but comments 
from earlier discussions make this tricky to get right.  Maybe:
"All assertions in PIL MUST be interpreted as a record of what has happened or 
been observed in some context, as opposed to what might happen or potential 
observations."  In this, I am using the reference to a context to provide just 
enough wiggle-room for description in future or imagined contexts.

"This specification does not prescribe the means by which assertions are made, 
for example on the basis of observations, inferences, or any other means."
The phrasing "... assertions are made" here is jarring, if not confusing - I 
would think that assertions are made in PIL for the purposes of this spec. 
Suggest "... how assertions are arrived at, ..."

"The language introduces a notion of "provenance container", which provides a 
default scope for assertions."
The term "container" here is suggestive of a physical or logical encapsulation, 
which I don't think is meant.  How about "provenance context"?

[[
... The model may define additional scoping rules for assertions. Identifiers 
can safely be used within that scope. Optionally, identifiers can be exported so 
that they can be used outside their default scope. The language does not 
prescribe the mechanisms by which identifiers are generated.
]]
This spec is describing a data model, *not* a language.  It says so at the top.
As such I think it's entirely inappropriate to start defining linguistic 
constructs such as identifiers and scoping.  Assuming the actual language used 
will be RDF, I'm not seeing how what you describe will be possible.

"In this specification, when an assertion is defined to refer to another 
assertion about something, it does so by means of that thing's identifier."
I don't understand what this is trying to say.


5.1 BOB

"A BOB represents an identifiable characterized entity."

What does it mean to be "characterized" here?   What does this tell us?  What 
does it mean to not be "characterized"?  If this refers to the attribute-based 
assertions mentioned earlier, does this mean that if there are no such 
assertions, an entity cannot be a "BOB"?

[[
A BOB assertion is about a characterized entity, whose situation in the world is 
variant. A BOB assertion is made at a particular point and is invariant, in the 
sense that all the attributes are assigned a value as part of that assertion.
]]

This section is, according to its heading, about "BOB".  But this is defining a 
different concept, so shouldn't this be in a separate section?

It seems to me that what we're talking about here is a "provenance assertion". 
I think it would be clearer to just describe that, e.g.
[[
A provenance assertion is about an entity, whose situation in the world is 
generally assumed to be variable.
]]

I either don't understand or don't agree with the second part of that 
description.  The notion of assigning values as part of an assertion seems 
wrong to me (I think the notion of constraining attributes is the job of the 
IVP-of relation).  I would expect something like:
[[
A provenance assertion is made at a particular point and is invariant, in the 
sense that the attributes it mentions do not change for the entity concerned.
]]

[[
A BOB assertion must describe a characterized entity over a continuous time 
interval in the world (which may collapse into a single instant). Characterizing 
an entity over multiple time intervals requires multiple BOB assertions, each 
with its own identifier. Some attributes may retain their values across multiple 
assertions.
]]
This constraint seems rather unnecessary, and maybe counter-productive.

Suppose we want to describe the collective observations of a particular 
telescope when pointed at a particular region of the sky.  This might actually 
consist of  a (possibly unknown) number of disjoint time-segments caused by the 
rotation of the earth and other factors. I can't see any clear benefit in being 
forced to treat these observation-sets as distinct entities.

[[
There is no assumption that the set of attributes is complete and that the 
attributes are independent/orthogonal of each other.
]]
I don't see this adding any useful information here.  Remove?


5.2 Process Execution

Thinking about today's teleconference (28 July) and reading this, I'm seeing the 
key distinction between Entity and Process execution being like the 
philosophical distinction between continuants (endurant) and occurrents 
(perdurant) 
(http://en.wikipedia.org/wiki/Formal_ontology#Common_terms_in_formal_ontologies)


5.3 Generation

"characterized entitity" is clumsy - suggest just "entity" (or whatever term is 
selected for "BOB").

If I had not previously read about OPM, I'd be completely confused by the 
introduction of "role" here.   Following the hyperlink here does not help at all.

[[
Given an assertion isGeneratedBy(x,pe,r) or isGeneratedBy(x,pe,r,t), the 
activity denoted by pe and the entities used by pe dermine values of some of x's 
attributes.
]]
I've no idea what this is trying to say.


5.4 Use

Same problem with 'role' as above.

[[
A reference to a given BOB may appear in multiple use assertions that refer to a 
given process execution, but each of those use assertions must have a distinct role.
]]
In light of the above, this seems nonsensical to me.

[[
Given an assertion uses(pe,x,r) or uses(pe,x,r,t), at least one value of x's 
attributes is a pre-condition for the activity denoted by pe to terminate.
]]
As written this doesn't make sense - a value of an attribute being a 
precondition seems like a type error to me.  I think you mean something like 
availability of an attribute value.  But even that is hard to follow.  Suggest 
simplifying this to just:
[[
Given an assertion uses(pe,x,r) or uses(pe,x,r,t), existence of x is a 
pre-condition for the activity denoted by pe to terminate.
]]


5.5 Derivation

[[
Given an assertion isDerivedFrom(B,A), one can infer that the use of 
characterized entity denoted by A precedes the generation of the characterized 
entity denoted by B.
]]
Where does this notion of "use" come from in the absence of some referenced 
activity?

Concerning transitivity of derivation:

Suppose:
A has attributes a0, a1
B having attributes b0, b1 is derived from A, with b0 being dependent on a0
C having attributes c0, c1, is derived from B with c1 being dependent on b1

So none of the attributes of C can be said to be directly or indirectly 
dependent on attributes of A, which by the given definition is a requirement for 
derivation of C from A.  Thus, as defined, derivation cannot be transitive.
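
The counterexample above can be checked mechanically.  What follows is a 
minimal sketch in Python, assuming that "derivation" means "some attribute of 
the later entity depends, directly or indirectly, on some attribute of the 
earlier one".  The dictionary encoding and the function names are illustrative 
assumptions and are not taken from the draft.

```python
# Hypothetical encoding of the example: each attribute maps to the set of
# attributes it directly depends on. Only the two stated dependencies exist.
depends_on = {
    "b0": {"a0"},  # B is derived from A because b0 depends on a0
    "c1": {"b1"},  # C is derived from B because c1 depends on b1
}

attributes = {"A": {"a0", "a1"}, "B": {"b0", "b1"}, "C": {"c0", "c1"}}


def reachable(attr):
    """Return every attribute that attr depends on, directly or indirectly."""
    seen = set()
    stack = [attr]
    while stack:
        for dep in depends_on.get(stack.pop(), set()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def attribute_derived(later, earlier):
    """True if some attribute of later depends on some attribute of earlier."""
    return any(reachable(x) & attributes[earlier] for x in attributes[later])


print(attribute_derived("B", "A"))  # B from A holds
print(attribute_derived("C", "B"))  # C from B holds
print(attribute_derived("C", "A"))  # C from A does not hold
```

Under this reading the two single-step derivations hold but the composed one 
does not, because b1 (which c1 depends on) has no dependency on anything in A.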

I don't really know if derivation should or should not be transitive, but the 
above seems to me like a problem of spurious over-specification.   My suggestion 
for now would be to focus on what really matters and see what logical properties 
fall out later.


5.8 IVP of

The revised (w.r.t. 
http://www.w3.org/2011/prov/wiki/F2F1ConceptDefinitions#IVP_of) treatment of 
IVP-of, and relabeling as "complement-of" completely overturns my understanding 
of what this was intended to capture. I understood the whole point of A IVP-of B 
was intended to capture the notion that A denotes a contextually constrained 
form of the entity denoted by B.  I don't see what useful purpose this relation 
serves.

From a practical perspective, given the asymmetric nature of IVP-of (as was) it 
is easy to express the effect of complement-of in RDF by introducing a new 
entity node.  But I see no way of constructing the strict constraining role of 
IVP using complement-of.


5.9 Time

[[
Time is defined according to [ISO8601].
]]

I don't think it is appropriate for an open standard to be normatively dependent 
on a standard that is available only on payment of a charge for access.  In this 
case, we could make reference to the XML Schema datatypes, which would also 
require us to think about my next point...

As far as I'm aware, ISO 8601 covers both points in time and time intervals.  As 
such a bare reference to ISO 8601 is not really an adequate definition:  which 
do we want?  I suspect http://www.w3.org/TR/xmlschema-2/#dateTime.
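
To illustrate the point-versus-interval ambiguity, here is a small sketch in 
Python.  It uses the standard library's datetime.fromisoformat purely as a 
stand-in for "a parser that accepts a single date-time"; it is not an 
xsd:dateTime validator, and the helper name parse_instant is an assumption 
made for this example.  The interval string uses the ISO 8601 
"start/end" form.

```python
from datetime import datetime, timezone


def parse_instant(text):
    """Parse a single date-time; return None if text is not one instant."""
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        return None


# A single point in time (the timestamp of this message).
instant = parse_instant("2011-07-28T22:38:07+01:00")

# A time interval written as start/end: not a single date-time value.
interval = parse_instant("2011-07-28T17:30:00+01:00/2011-07-28T22:38:07+01:00")

print(instant.astimezone(timezone.utc).isoformat())
print(interval)
```

A definition that names only ISO 8601 leaves open whether the second string is 
an acceptable "time"; naming a single-instant datatype would rule it out.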


5.10 Recipe Link

I don't see what useful purpose this serves.


5.11 Role

I can't completely follow the description given.


5.13 Ordering of Processes

This section confusingly changes the style of presentation from sections 
dedicated to specific concepts to a vague discussion of possible relationships 
between things.


5.14 Revision

This seems to be just a different form of Derivation that happens to mention an 
agent.  I'm not sure why I'd choose one over the other.

I think this may be unnecessary - would not a similar effect be achieved by 
having a process execution of "revision" that uses b1, generates b2 and is 
controlled by ag (possibly with role "revise"?).
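
The alternative can be sketched as data.  The following Python fragment is 
only an illustration under stated assumptions: "uses" and "isGeneratedBy" are 
the assertion names from the draft, while the "processExecution" and 
"isControlledBy" kinds, the tuple layout, and the role names other than 
"revise" are hypothetical.

```python
# Assertions as (kind, subject, object, role) tuples. "uses" and
# "isGeneratedBy" follow the draft; the other kinds, the tuple layout and
# the roles "original"/"revised" are hypothetical.
assertions = [
    ("processExecution", "pe", "revision", None),
    ("uses", "pe", "b1", "original"),
    ("isGeneratedBy", "b2", "pe", "revised"),
    ("isControlledBy", "pe", "ag", "revise"),
]


def revisions(assertions):
    """Return (new, old, agent) triples implied by revision-typed executions."""
    revising = {s for kind, s, o, _ in assertions
                if kind == "processExecution" and o == "revision"}
    found = []
    for kind, new, pe, _ in assertions:
        if kind != "isGeneratedBy" or pe not in revising:
            continue
        olds = [o for k, s, o, _ in assertions if k == "uses" and s == pe]
        agents = [o for k, s, o, _ in assertions
                  if k == "isControlledBy" and s == pe]
        found.extend((new, old, agent) for old in olds for agent in agents)
    return found


print(revisions(assertions))
```

If the same (new, old, agent) information can be recovered from existing 
constructs in this way, a dedicated Revision relation looks redundant.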


5.16 Provenance Container

It's not clear what this is intended to be (maybe unsurprising, since the 
definition is absent).  But it looks as if it's intended to be a syntactical kind 
of thing, which I feel is out of place in a data model description (especially 
if we're expecting to use RDF to represent the data).  The next version of RDF 
will probably formally define named graphs - I'm not seeing what additional 
definition would be needed here.
Received on Thursday, 28 July 2011 21:38:41 UTC