Reviewing PROV-DM

I've started another full review of PROV-DM, and have so far got up to about 
section 5.4.

While the content is making much more sense than it did last time I reviewed it, 
I am finding some of the text to be repetitive, confusing and in some cases 
strangely phrased.  I think a main goal of this document needs to be to offer an 
approachable description of the underlying data model and ASN notation that can 
be used by developers and information designers.  I think the document could 
benefit from a serious round of sub-editing (without intending to change the 
substantive content).

I also think that a refactoring of the DM concepts (without fundamentally 
changing the underlying intended semantics) could help to eliminate a lot of 
repetitive text.  These comments relate to the recent "domain of discourse" 
vote, but I'm coming at this from a more holistic perspective.

It seems to me that the domain of discourse contains the following concepts:
   Entity
   Activity
   Agent
   Event
   Plan
   Account
in that these are the various things about which the provenance language aims to 
make assertions, and that all of these could be considered types of Entity (with 
the possible exception of Event).  I think we've already established that most 
if not all of these are kinds of entity.

If the descriptions were refactored around such a structure, I believe much of 
the repetitive description of attributes could be focused in one place.  I would 
be inclined to separate attributes from the other type declarations, so we'd end 
up with primitive ASN expressions like these:

   Entity(id)
   Activity(id, start?, end?)
   Agent(id)
   Plan(id)
   Event(id, time?)
   Account(id)
   Attributes(id, [attr1=val1, attr2=val2, ...])

Where the Attributes expression could be applied to any of the preceding 
concepts, and the description of attributes would consequently be needed only 
once.  The main objection I see to this is that it would mean that, say, the ASN 
expression:

   Entity(id, [attr1=val1, attr2=val2, ...])

would be replaced by two expressions:

   Entity(id)
   Attributes(id, [attr1=val1, attr2=val2, ...])

I would counter this by having the ASN (but not the underlying model) allow the 
first form as syntactic sugar for the second.
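
To make the sugar idea concrete, here is a minimal Python sketch of how a 
parser might expand the combined form into the two primitive expressions.  
The names Decl, Attributes and desugar are my own illustrative assumptions, 
not anything defined by PROV-DM or the ASN.

```python
from dataclasses import dataclass

# The six concepts listed above; the set name is illustrative only.
CONCEPTS = {"Entity", "Activity", "Agent", "Plan", "Event", "Account"}


@dataclass(frozen=True)
class Decl:
    """A primitive type declaration such as Entity(id) (hypothetical class)."""
    concept: str
    id: str


@dataclass(frozen=True)
class Attributes:
    """A primitive Attributes(id, [...]) expression (hypothetical class)."""
    id: str
    attrs: tuple  # tuple of (name, value) pairs, in the order given


def desugar(concept, id, attrs=None):
    """Expand Concept(id, [attrs]) into Concept(id) plus Attributes(id, [attrs])."""
    if concept not in CONCEPTS:
        raise ValueError(f"unknown concept: {concept}")
    exprs = [Decl(concept, id)]
    if attrs:
        # Attributes are emitted once, separately from the type declaration.
        exprs.append(Attributes(id, tuple(attrs.items())))
    return exprs
```

For example, desugar("Entity", "e1", {"attr1": "val1"}) yields a Decl for e1 
followed by a single Attributes expression for e1, while desugar("Agent", "a1") 
yields only the Decl.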

...

I also felt that the handling of Activity start and end was not consistent: 
according to the text, the times given correspond to Events.  So why not have 
them *be* Events?  That would mean we have a total of 6 event types rather than 
just 4, but the description of the "Lamport clock" timelines could be focused on 
the description of Event alone.
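
As a rough sketch of what I mean, assuming Python dataclasses as a stand-in 
for the model: the Activity carries optional start and end Events rather than 
bare times, so anything said about ordering applies to Event alone.  The class 
and function names here are hypothetical, and the time value is just an 
example string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Event(id, time?) from the list above; time is optional."""
    id: str
    time: Optional[str] = None


@dataclass(frozen=True)
class Activity:
    """Activity(id, start?, end?) with start and end held as Events (assumption)."""
    id: str
    start: Optional[Event] = None
    end: Optional[Event] = None


def events_of(activity):
    """Return the Events attached to an activity, start first, skipping absent ones."""
    return [e for e in (activity.start, activity.end) if e is not None]


# Example: an activity whose start is known but whose end is not.
a1 = Activity("a1", start=Event("ev1", "2012-01-30T11:08:02Z"))
```

With this shape, events_of(a1) returns just the start Event, and any timeline 
rules can be written once against Event.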

...

I think all of this could be done with minimal change to the underlying 
semantics, and that coupled with a significant round of sub-editing and 
reorganization of some of the text could lead to a document that is much easier 
to follow.

#g
--

Received on Monday, 30 January 2012 11:08:02 UTC