Re: Review of provenance model draft

Thanks, Graham, for the extensive comments.

I raised issues on your behalf, since it's easier for us to discuss
issues separately and track them.

Luc

On 07/28/2011 10:38 PM, Graham Klyne wrote:
> With reference to:
> http://dvcs.w3.org/hg/prov/raw-file/default/model/ProvenanceModel.html
> Retrieved at about 17:30 on 28-Jul-2011
>
> As promised, I've taken a tilt at reviewing the model draft.  I must 
> say, I've found it to be really hard going - many of the notions 
> described are not making sense to me, and the language used sometimes 
> seems to be unnecessarily obscure.
>
> After a mammoth session going through this, I really don't have the
> time or energy to split my comments out into separate issues.  I think 
> many of them are purely editorial in nature, and as such could be 
> cleaned up relatively easily. There are some substantive comments that 
> I may separate out as formal issues later, but I'm rather hoping that 
> won't be needed.
>
> My comments follow:
>
>
> 3.1 Notation used is obscure.  What does [...[ mean?  Should be 
> explained.
>
> For a general audience, examples based on Unix command shell commands 
> are probably not very helpful.
>
> What is "characterized entity represented by the file"?  As this is an
> example, just say "crime statistics" - would that be a correct 
> interpretation?
>
>
> 3.2 where did 'e0' come from? - it's not mentioned in 3.1.  What is it 
> intended to denote?
>
> The "agent" statements are completely impenetrable to me.
>
> How is the notation to be interpreted?  It looks a bit like some kind
> of deviant Prolog, but either I've forgotten some of the basic 
> constructs, or it's not entirely clear how the deviant bits are meant 
> to be interpreted.
>
>
> 3.3 graphical representation: could be very useful, and would be much 
> easier to follow if the illustration included a key
>
> What does it mean for an agent to be linked to a BOB as opposed to a 
> process execution (cf. Alice and e0).
>
>
> 4. About the Provenance Language
>
> Introduction of "characterized entities" - if this is something that 
> really needs to be said, I think it needs to be clarified.  I spent 
> some time thinking about these two sentences, trying to work out if 
> they could ever be completely correct, or just not understanding what 
> they are intended to convey:
> [[
> Furthermore, this specification is concerned with characterized 
> entities, that is, entities and their situation in the world, as 
> perceived by their asserters.
>
> In the rest of the document, we are concerned with the representation 
> of such entities; their situation in the world will be represented 
> using sets of attributes.
> ]]
>
> Why "characterized entities" as opposed to "perceived entities"?
> What's the important distinction here?
>
> The only interpretation I've found that makes sense to me is that the 
> document is concerning itself with entities that are characterized by 
> the values of some bounded set of attributes.  But that 
> interpretation, if correct, is not obvious to me from the wording here.
>
>
> "PIL is a language by which representations of the world can be 
> expressed using terms that are drawn from a controlled vocabulary. "
> I'm not sure how to interpret this.  Does this "controlled vocabulary"
> include, for example, numbers? Is this controlled vocabulary expected
> to be the complete set of terms used in PIL expressions?
>
>
> "These representations are relative to an asserter, and in that sense 
> constitute assertions about the world."
> What is this trying to say?  I think you might mean something like:
> "These representations are relative to the context of an asserter, and 
> in that sense constitute perceptions about the world."
> which ties back to the earlier statement about "as perceived by their 
> asserters".
>
> "All assertions in PIL SHOULD be interpreted as a record of what has 
> happened, as opposed to what may or will happen."
> I feel we should find a way to strengthen this SHOULD to a MUST, but 
> comments from earlier discussions make this tricky to get right.  Maybe:
> "All assertions in PIL MUST be interpreted as a record of what has 
> happened or been observed in some context, as opposed to what might 
> happen or potential observations."  In this, I am using the reference 
> to a context to provide just enough wiggle-room for description in 
> future or imagined contexts.
>
> "This specification does not prescribe the means by which assertions 
> are made, for example on the basis of observations, inferences, or any 
> other means."
> The phrasing "... assertions are made" here is jarring, if not 
> confusing - I would think that assertions are made in PIL for the 
> purposes of this spec. Suggest "... how assertions are arrived at, ..."
>
> "The language introduces a notion of "provenance container", which 
> provides a default scope for assertions."
> The term "container" here is suggestive of a physical or logical
> encapsulation, which I don't think is meant.  How about "provenance 
> context"?
>
> [[
> ... The model may define additional scoping rules for assertions. 
> Identifiers can safely be used within that scope. Optionally, 
> identifiers can be exported so that they can be used outside their 
> default scope. The language does not prescribe the mechanisms by which 
> identifiers are generated.
> ]]
> This spec is describing a data model, *not* a language.  It says so at 
> the top.  As such I think it's entirely inappropriate to start 
> defining linguistic constructs such as identifiers and scoping.  
> Assuming the actual language used will be RDF,  I'm not seeing how 
> what you describe will be possible.
>
> "In this specification, when an assertion is defined to refer to 
> another assertion about something, it does so by means of that thing's 
> identifier."
> I don't understand what this is trying to say.
>
>
> 5.1 BOB
>
> "A BOB represents an identifiable characterized entity."
>
> What does it mean to be "characterized" here?   What does this tell 
> us?  What does it mean to not be "characterized"?  If this refers to 
> the attribute-based assertions mentioned earlier, does this mean that 
> if there are no such assertions, an entity cannot be a "BOB"?
>
> [[
> A BOB assertion is about a characterized entity, whose situation in 
> the world is variant. A BOB assertion is made at a particular point 
> and is invariant, in the sense that all the attributes are assigned a 
> value as part of that assertion.
> ]]
>
> This section is, according to its heading, about "BOB".  But this is 
> defining a different concept, so shouldn't this be in a separate section?
>
> It seems to me that what we're talking about here is a "provenance 
> assertion". I think it would be clearer to just describe that, e.g.
> [[
> A provenance assertion is about an entity, whose situation in the 
> world is generally assumed to be variable.
> ]]
>
> I either don't understand or don't agree with the second part of that 
> description.  The notion of assigning values as part of an assertion
> seems wrong to me (I think the notion of constraining attributes is 
> the job of the IVP-of relation).  I would expect something like:
> [[
> A provenance assertion is made at a particular point and is invariant, 
> in the sense that the attributes it mentions do not change for the 
> entity concerned.
> ]]
>
> [[
> A BOB assertion must describe a characterized entity over a continuous 
> time interval in the world (which may collapse into a single instant). 
> Characterizing an entity over multiple time intervals requires 
> multiple BOB assertions, each with its own identifier. Some attributes 
> may retain their values across multiple assertions.
> ]]
> This constraint seems rather unnecessary, and maybe counter-productive.
>
> Suppose we want to describe the collective observations of a 
> particular telescope when pointed at a particular region of the sky.  
> This might actually consist of  a (possibly unknown) number of 
> disjoint time-segments caused by the rotation of the earth and other 
> factors. I can't see any clear benefit in being forced to treat these 
> observation-sets as distinct entities.
>
> [[
> There is no assumption that the set of attributes is complete and that 
> the attributes are independent/orthogonal of each other.
> ]]
> I don't see this adding any useful information here.  Remove?
>
>
> 5.2 Process Execution
>
> Thinking about today's teleconference (28 July) and reading this, I'm 
> seeing the key distinction between Entity and Process execution being 
> like the philosophical distinction between continuants (endurant) and 
> occurrents (perdurant) 
> (http://en.wikipedia.org/wiki/Formal_ontology#Common_terms_in_formal_ontologies) 
>
>
>
> 5.3 Generation
>
> "characterized entity" is clumsy - suggest just "entity" (or
> whatever term is selected for "BOB").
>
> If I had not previously read about OPM, I'd be completely confused by 
> the introduction of "role" here.   Following the hyperlink here does 
> not help at all.
>
> [[
> Given an assertion isGeneratedBy(x,pe,r) or isGeneratedBy(x,pe,r,t), 
> the activity denoted by pe and the entities used by pe determine values
> of some of x's attributes.
> ]]
> I've no idea what this is trying to say.
>
>
> 5.4 Use
>
> Same problem with 'role' as above.
>
> [[
> A reference to a given BOB may appear in multiple use assertions that 
> refer to a given process execution, but each of those use assertions 
> must have a distinct role.
> ]]
> In light of the above, this seems nonsensical to me.
>
> [[
> Given an assertion uses(pe,x,r) or uses(pe,x,r,t), at least one value 
> of x's attributes is a pre-condition for the activity denoted by pe to 
> terminate.
> ]]
> As written this doesn't make sense - a value of an attribute being a 
> precondition seems like a type error to me.  I think you mean 
> something like availability of an attribute value.  But even that is 
> hard to follow.  Suggest simplifying this to just:
> [[
> Given an assertion uses(pe,x,r) or uses(pe,x,r,t), existence of x is a 
> pre-condition for the activity denoted by pe to terminate.
> ]]
>
>
> 5.5 Derivation
>
> [[
> Given an assertion isDerivedFrom(B,A), one can infer that the use of 
> characterized entity denoted by A precedes the generation of the 
> characterized entity denoted by B.
> ]]
> Where does this notion of "use" come from in the absence of some 
> referenced activity?
>
> Concerning transitivity of derivation:
>
> Suppose:
> A has attributes a0, a1
> B having attributes b0, b1 is derived from A, with b0 being dependent 
> on a0
> C having attributes c0, c1, is derived from B with c1 being dependent 
> on b1
>
> So none of the attributes of C can be said to be directly or 
> indirectly dependent on attributes of A, which by the given definition 
> is a requirement for derivation of C from A.  Thus, as defined, 
> derivation cannot be transitive.
>
> I don't really know if derivation should or should not be transitive, 
> but the above seems to me like a problem of spurious 
> over-specification.   My suggestion for now would be to focus on what 
> really matters and see what logical properties fall out later.
>
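
The transitivity example above can be checked mechanically. The following
is only a minimal sketch: it assumes that "derived from" means "some
attribute of the later entity depends, directly or indirectly, on some
attribute of the earlier one", which is one reading of the draft's
definition. The names DEPENDS_ON, ATTRIBUTES, transitive_closure and
is_derived_from are hypothetical and appear nowhere in the draft.

```python
# Hypothetical attribute-level dependencies taken from the example above.
# Each pair (x, y) reads "attribute x depends on attribute y".
DEPENDS_ON = {("b0", "a0"), ("c1", "b1")}

# Attributes carried by each entity in the example.
ATTRIBUTES = {"A": {"a0", "a1"}, "B": {"b0", "b1"}, "C": {"c0", "c1"}}


def transitive_closure(pairs):
    """Return the transitive closure of a set of (x, y) pairs."""
    closure = set(pairs)
    while True:
        # Compose every (x, y) with every (y, w) to obtain (x, w).
        extra = {(x, w) for (x, y) in closure for (z, w) in closure if y == z}
        if extra <= closure:
            return closure
        closure |= extra


def is_derived_from(later, earlier, depends_on=DEPENDS_ON):
    """True if some attribute of `later` depends, directly or indirectly,
    on some attribute of `earlier` (the assumed reading of the draft)."""
    closure = transitive_closure(depends_on)
    return any((x, y) in closure
               for x in ATTRIBUTES[later]
               for y in ATTRIBUTES[earlier])


print(is_derived_from("B", "A"))  # B is derived from A
print(is_derived_from("C", "B"))  # C is derived from B
print(is_derived_from("C", "A"))  # but no attribute chain links C to A
```

Under that assumed reading, B derives from A and C derives from B, yet no
attribute of C depends on any attribute of A, so the attribute-based
definition does not give a transitive relation.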
>
> 5.8 IVP of
>
> The revised (w.r.t. 
> http://www.w3.org/2011/prov/wiki/F2F1ConceptDefinitions#IVP_of) 
> treatment of IVP-of, and relabeling as "complement-of" completely 
> overturns my understanding of what this was intended to capture. I 
> understood the whole point of A IVP-of B was intended to capture the 
> notion that A denotes a contextually constrained form of the entity 
> denoted by B.  I don't see what useful purpose this relation serves.
>
> From a practical perspective, given the asymmetric nature of IVP-of 
> (as was) it is easy to express the effect of complement-of in RDF by 
> introducing a new entity node.  But I see no way of constructing the 
> strict constraining role of IVP using complement-of.
>
>
> 5.9 Time
>
> [[
> Time is defined according to [ISO8601].
> ]]
>
> I don't think it is appropriate for an open standard to be normatively
> dependent on a standard that is available only on payment of a charge
> for access.  In this case, we could make reference to the XML Schema
> datatypes, which would also require us to think about my next point...
>
> As far as I'm aware, ISO 8601 covers both points in time and time 
> intervals.  As such, a bare reference to ISO 8601 is not really an
> adequate definition:  which do we want?  I suspect 
> http://www.w3.org/TR/xmlschema-2/#dateTime.
>
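
To make the instant-versus-interval distinction concrete, here is a small
sketch, offered as an illustration only. It assumes timestamps written
with an explicit numeric UTC offset, and it recognises only the
"start/end" form of an interval. The function name classify_time is
hypothetical, and Python's datetime.fromisoformat accepts a format that
is close to, but not the same as, the xsd:dateTime lexical form.

```python
from datetime import datetime


def classify_time(text):
    """Classify a time string as an instant or a start/end interval.

    Assumption: an interval is written as "<start>/<end>"; other interval
    forms (for example, those based on durations) are not handled here.
    """
    if "/" in text:
        start, end = text.split("/", 1)
        return ("interval",
                datetime.fromisoformat(start),
                datetime.fromisoformat(end))
    return ("instant", datetime.fromisoformat(text))


instant = classify_time("2011-07-28T17:30:00+00:00")
interval = classify_time("2011-07-28T17:30:00+00:00/2011-07-29T09:18:05+00:00")
print(instant[0], interval[0])
```

A model that says only "time" leaves it open which of these two shapes a
given value is meant to have.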
>
> 5.10 Recipe Link
>
> I don't see what useful purpose this serves.
>
>
> 5.11 Role
>
> I can't completely follow the description given.
>
>
> 5.13 Ordering of Processes
>
> This section confusingly changes the style of presentation from 
> sections dedicated to specific concepts to a vague discussion of 
> possible relationships between things.
>
>
> 5.14 Revision
>
> This seems to be just a different form of Derivation that happens to 
> mention an agent.  I'm not sure why I'd choose one over the other.
>
> I think this may be unnecessary - would not a similar effect be
> achieved by having a process execution of "revision" that uses b1,
> generates b2 and is controlled by ag (possibly with role "revise")?
>
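
The alternative suggested above can be sketched as plain tuples. This is
illustrative only: the uses(pe,x,r) and isGeneratedBy(x,pe,r) argument
orders follow the assertions quoted earlier in this review, while the
isControlledBy name, the identifier "pe_rev" and the role names
"original" and "revised" are assumptions made for this example.

```python
def revision_as_process_execution(b1, b2, ag, pe="pe_rev"):
    """Express "b2 is a revision of b1 by agent ag" without a dedicated
    Revision relation, using one process execution `pe` instead."""
    return [
        ("uses", pe, b1, "original"),          # pe used the earlier entity
        ("isGeneratedBy", b2, pe, "revised"),  # pe generated the later one
        ("isControlledBy", pe, ag, "revise"),  # ag controlled pe (assumed name)
    ]


for assertion in revision_as_process_execution("b1", "b2", "ag"):
    print(assertion)
```

If these three assertions carry the same information, a separate Revision
relation would be a shorthand rather than a new concept.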
>
> 5.16 Provenance Container
>
> It's not clear what this is intended to be (maybe unsurprising, since 
> the definition is absent).  But it looks as if it's intended to be a
> syntactical kind of thing, which I feel is out of place in a data 
> model description (especially if we're expecting to use RDF to 
> represent the data).  The next version of RDF will probably formally 
> define named graphs - I'm not seeing what additional definition would 
> be needed here.
>
>

-- 
Professor Luc Moreau
Electronics and Computer Science   tel:   +44 23 8059 4487
University of Southampton          fax:   +44 23 8059 2865
Southampton SO17 1BJ               email: l.moreau@ecs.soton.ac.uk
United Kingdom                     http://www.ecs.soton.ac.uk/~lavm

Received on Friday, 29 July 2011 09:18:05 UTC