Provenance model document is over-complicated and hard to understand from Graham Klyne on 2011-08-25 (public-prov-wg@w3.org from August 2011)

From: Graham Klyne <GK@ninebynine.org>
Date: Thu, 25 Aug 2011 18:03:44 +0100
To: W3C provenance WG <public-prov-wg@w3.org>
Message-ID: <4E568070.7060708@ninebynine.org>
With reference to:

http://dvcs.w3.org/hg/prov/raw-file/71fa5079e6b3/model/ProvenanceModel.html

(Note this is the Mercurial version at 25-Aug-2011, about 16:30UTC)

...

This is my second attempt to read this document, and while I have slightly 
better comprehension than the first time round, I'm still finding it very hard 
going.  Rather than go through many small points, I'm going to try to focus on 
high-level issues.

I am really concerned that the direction of development of this specification 
will make it unusable for (what I take to be) its intended purpose, which is to 
provide a reference for developers who are generating and consuming provenance 
information.  For this, what is needed first and foremost is a simple data 
model with a straightforward mapping to one or more common web representations. 
From the WG charter, I expect RDF to be one of those.  If this is not provided, 
then I fear the specification will render itself irrelevant.

I find the example in section 3, and the examples associated with the individual 
definitions, are really unhelpful.  In particular, the "Abstract Syntax 
Notation" is used without any explanation of what it means.  (There is an 
appendix with a BNF syntax for ASN, but no explanation of how to read or 
interpret it.)

What I think would be *far* more useful than section 3 in its current form is a 
"50,000-foot" view of the model, outlining the main classes and their possible 
relationships.  A very nice example of such a view for a simpler provenance 
ontology (OPMV) can be seen in 
http://open-biomed.sourceforge.net/opmv/ns.html#sec-desc.  Coupled with a 1-2 
line summary of each class and relation, this would go a long way to making the 
whole framework more approachable.

In section 3.3, the diagram key shows what the different line styles represent, 
but not the different shapes.

...

5.1 Entity

In sections 4 and 5.1, I think the definition of Entity, and its distinction as 
a "characterized thing" is unhelpful and confusing.  It is not clear to me what 
the distinction of being "characterized" means in any formal sense.  There has 
been a lot of discussion on the mailing list about this, but I haven't seen a 
single argument that shows why it is *needed* to distinguish an "Entity" in any 
way from anything that can be identified; i.e. any web resource.

It seems to me that the primitive provenance assertions one might make about an 
"Entity" (e.g. dc:creator, doap:release, to use examples from OPMV) can be made 
regardless of any claim that the Entity is a "characterized thing".  Similarly, 
I see no breakage arising from appropriate application of provenance relations 
(such as derivedFrom, generatedBy, etc.) to arbitrary entities.  Using RDF, the 
normal approach would be to infer from the existence of such relations some 
information about the type of the things they relate.  If they are used 
inappropriately, the resulting expression may disagree with reality, or be 
unsatisfiable - we can't stop people from making nonsense statements on the web.
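
As a rough illustration of the inference described above, here is a minimal Python sketch that derives type information purely from how relations are used, without requiring anything to be declared as a "characterized thing" first.  The domain/range tables, the class name "ProcessExecution" and the ex: identifiers are assumptions made up for this example; they are not taken from the model document or from OPMV.

```python
# Hypothetical domain/range declarations for two provenance relations.
# These names are illustrative assumptions, not definitions from any spec.
DOMAIN = {"derivedFrom": "Entity", "generatedBy": "Entity"}
RANGE = {"derivedFrom": "Entity", "generatedBy": "ProcessExecution"}


def infer_types(triples):
    """Map each resource to the set of types implied by the relations it appears in."""
    inferred = {}
    for subject, predicate, obj in triples:
        if predicate in DOMAIN:
            inferred.setdefault(subject, set()).add(DOMAIN[predicate])
        if predicate in RANGE:
            inferred.setdefault(obj, set()).add(RANGE[predicate])
    return inferred


# Plain identifiers with no prior typing; the types follow from usage alone.
triples = [
    ("ex:report-v2", "derivedFrom", "ex:report-v1"),
    ("ex:report-v2", "generatedBy", "ex:edit-session"),
]
types = infer_types(triples)
```

If a relation is applied to something it does not fit, the inferred types simply describe a state of affairs that disagrees with reality; nothing in the mechanism itself needs to prevent that.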

....

5.8 isComplementOf

I struggle to understand from the text what practical application there is for 
the isComplementOf relation.  From mailing list discussion, I come to an 
understanding that it might be useful for talking about different provenance 
accounts, to understand when they are referring to the same underlying thing 
("Royal Society", etc.)

But I find the description given is somewhat complicated and confusing, in part 
because of the enforced distinction between things and "Entities".  I find it 
much more natural to think of views (i.e. where I think we started out with 
"IVPof"), where different accounts may use different views (roughly corresponding 
to different observational constraints) of some thing, and the views can 
themselves be seen as things.  E.g. in the examples given, M1, M2, etc. might be 
things corresponding to "Royal Society" in the period(s) when its membership was 
as indicated in each case, and L1, L2, etc. might correspond to it being located 
at specific places.  The "established on" view could be considered as the 
underlying "Royal Society" itself, as that never changes.

From this notion of "view", the "complementOf" relation is easily derived:  if 
two things are both views of some underlying thing, then they are complementOf 
each other.
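
The derivation can be sketched in a few lines of Python.  This is only an illustration under simplifying assumptions: each view is taken to be a view of exactly one underlying thing, the function name complement_pairs is invented here, and the identifiers (M1, M2, L1, RoyalSociety, X1, OtherThing) are placeholders loosely following the examples in the document.

```python
from itertools import combinations


def complement_pairs(view_of):
    """Derive complementOf pairs from a mapping of view -> underlying thing.

    Two distinct views are complementOf each other when they share the same
    underlying thing.  The relation is symmetric, so each pair is a frozenset.
    """
    views_by_thing = {}
    for view, thing in view_of.items():
        views_by_thing.setdefault(thing, []).append(view)

    pairs = set()
    for views in views_by_thing.values():
        for a, b in combinations(views, 2):
            pairs.add(frozenset((a, b)))
    return pairs


# Hypothetical views: two membership periods and one location period of the
# same thing, plus an unrelated view of something else.
view_of = {
    "M1": "RoyalSociety",
    "M2": "RoyalSociety",
    "L1": "RoyalSociety",
    "X1": "OtherThing",
}
pairs = complement_pairs(view_of)
```

Nothing here refers to attributes or to "characterized entities"; the only input is which thing each view is a view of.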

This all seems much simpler, more intuitive and less complicated to me than the 
contortions used to explain complementOf in terms of "characterized entities" 
and attributes.

....

5.16 Provenance container

Leaving aside, for now, the notion of accounts, I find this concept completely 
unnecessary.  Further, it is defined as having a set of "provenance constructs", 
but I see no defined concept for these, so the definition given is incomplete.

I think it would be easier to have a "Provenance expression" (roughly 
corresponding to an RDF expression) that is a provenance assertion about some 
thing or things, which can be evaluated to be true or false.  At this level, I 
see no need for having any kind of visibility of what the inside of such an 
expression looks like.  The main requirement to be a valid provenance expression 
is that it is not dependent on any ways in which the referenced thing or things 
may vary (i.e. talks only about invariant aspects of the thing(s)).

I accept some form of containment is needed for accounts, as it is important for 
some purposes to be able to consider them separately - but I can't see why that 
containment isn't just part of what it is to be an account.

My message to Daniel 
(http://lists.w3.org/Archives/Public/public-prov-wg/2011Aug/0242.html) expands a 
little on how I see this.
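
A minimal Python sketch of the arrangement suggested above, in which an account is itself the container and no separate "provenance container" concept is needed.  The class and field names are invented for illustration and are not proposed vocabulary; the account never inspects the inside of an expression, it only holds it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProvenanceExpression:
    """An assertion about some thing or things (hypothetical triple-like shape)."""
    subject: str
    relation: str
    obj: str


@dataclass
class Account:
    """An account directly holds the expressions it asserts."""
    asserter: str
    expressions: set = field(default_factory=set)

    def add(self, expression):
        self.expressions.add(expression)

    def asserts(self, expression):
        # Membership test only; the expression is treated as opaque.
        return expression in self.expressions


# Two accounts kept separate, so their assertions can be considered separately.
e = ProvenanceExpression("ex:report-v2", "derivedFrom", "ex:report-v1")
alice = Account("ex:alice")
bob = Account("ex:bob")
alice.add(e)
```

Containment is then just part of what it is to be an account: alice asserts the expression and bob does not, without any further container type.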

....

In summary:  I've tried to focus on high-level issues that I think make this 
provenance model document difficult to understand, hence difficult to use in 
discussion of other areas of the provenance WG's work, and which I also fear 
may have the effect of it being ignored by developers in favour of simpler 
specifications like OPMV:
- I think a short, high-level overview is needed to illustrate how the various 
ideas work together
- I think that too much formality is invoked too early on in the document, and 
that the formalism used is inadequately described. Also, I note that the 
formalism is applied to the examples, not to the definitions themselves, which 
seems a little odd to me.
- I think the definition of "Entity" is unnecessarily complex, and has been 
giving rise to much confusion in working group discussions.
- I think the definition of "Entity" gives rise to an over-complicated 
definition of "complementOf".  I think the original notion of "IVPof" was more 
useful, and got lost along the way.
- I think the notion of provenance containers as a separate concept is not 
needed, and that a simpler concept of provenance expression (or assertion), 
which is just another "thing", could be all that is needed.  Accounts could then 
be implicit containers for provenance expressions.

#g
--
Received on Thursday, 25 August 2011 17:05:25 UTC