Issues of PROV Constraints from Antoine Zimmermann on 2012-10-22 (public-prov-comments@w3.org from October 2012)

From: Antoine Zimmermann <antoine.zimmermann@emse.fr>
Date: Mon, 22 Oct 2012 15:04:19 +0200
To: public-prov-comments@w3.org
Message-ID: <50854453.2010406@emse.fr>
Dear PROV working group,



I'm sorry to send my comments so late, but I hope you will take a close 
attention to what I'm saying here.


I've read PROV data model and PROV constraints and while I find PROV-DM 
satisfying, I have concerns about PROV Constraints.

The document tries to define a kind of semantics for the PROV-DM, but 
it's not actually defining a proper semantics at all. And since it does 
not truly define a formal semantics, it has to redefine the terms that 
are borrowed from logic, and subtly diverge from the standard notions 
found in logic. For someone who knows about logic, it makes the document 
larger, more complicated than it should be, more difficult to read, and 
more difficult to implement. For someone who does not know about logic, 
it's not made at all simpler, on the contrary.

To give an example before getting into the details, let us examine the 
notion of PROV-equivalence, in Section 2.4. Equivalence is a well known 
and understood notion in maths and logics, which is necessarily 
reflexive. In PROV, equivalence isn't reflexive! For X to be 
PROV-equivalent to Y, it is required that both X and Y are PROV-valid, 
which is the PROV term used to say "consistent".

If a proper semantics was defined, there would be no way the document 
diverge from legacy logical concepts. There would be less chances for 
mistakes in the definitions and in the rules laters. There would be more 
flexibililty in the way implementations support the semantics.

I see two escapes for this:
  1)  define an ad hoc model theory for PROV-DM, in the line of what's 
sketched in PROV-SEM;
  2)  rely on a well known logic (such as FOL), to which all PROV 
statements are mapped, and add FOL axioms to model the constraints.

1) is ambitious and I understand there is not enough time to investigate 
it enough. But 2) is really easy. PROV statements already look a lot 
like FOL predicates, and PROV Constraints already rely heavily on FOL 
notions and notations.

Definitions and inferences of 4.1 are not needed if the semantics is 
defined on the abstract syntax. The abstract syntax is not PROV-N and 
does not have "syntactic sugar" or "syntactic shortcuts". Meaning, 
relations in abstract syntax always contain all the arguments, optional 
or not, possibly with existential variables. The short forms only exist 
in the surface syntax.

Inferences in 4.2, 4.3, 4.4 and 4.5 are all writable as FOL axioms.

Constraints require equality and an infinite total order (Time with 
order "precedes"). This does not make the logic more complicated than FOL.


Detailed comments:
-----------------

Section 1.2:
"""
We define validity and equivalence in terms of a concept called 
normalization.
"""

This is not required and suggests that implementations have to rely on 
normalisation. The choice of formalisation, being entirely procedural, 
imposes that normalisation is explicitely mentioned in almost all 
definitions, which in turn asks for repeatedly marking that 
implementations are not forced to produce normal forms. This obfuscates 
the document, which could be smaller and simpler.


"""
Definitions, inferences, and uniqueness constraints can be applied to 
normalize PROV instances, and event ordering, typing, and impossibility 
constraints can be checked on the normal form to determine validity. 
Equivalence of two PROV instances can be determined by comparing their 
normal forms.
"""

This also strongly suggests normalising by "applying" constraints, as 
later defined, although it is said later that normalisation is not 
required in implementations.


"""
For PROV documents, validity and equivalence amount to checking the 
validity or pairwise equivalence of their respective documents.
"""

should be: "... of their respective instances."


"""
This specification defines how validity and equivalence are to be 
checked, if an application elects to support them at all.
"""

It was said before that implementations can choose the way they check 
validity or equivalence. Again, this is a very strong suggestion that 
implementation should be made according to the presented procedures.


"""
Applications [...] should treat equivalent instances or documents in the 
same way.
"""

Does it mean that an instance in a bundle can be safely replaced by 
another equivalent instance inside the bundle?  This is important 
because it may be understood that a bundle identifier identifies the 
exact set of PROV statements given in a PROV document, or that the 
bundle identifier simply identifies the logical content of the bundle, 
or even any content that logically "implies" the content of the bundle 
in the said document.
The definitions of equivalence and inference suggests that it is the 
later option. Note that this is something that may have to be understood 
when writing PROV documents as RDF datasets, given that RDF 1.1 is not 
specifying any particular relationship between a graph IRI and the 
content of a named graph.


Section 2.4:

"""
Merging

Merging is an operation that takes two terms and compares them to see if 
they are equal, or can be made equal by substituting an existential 
variable with another term.
"""

Normally, merging is an operation that takes two terms and produce a 
single term. I do not understand what this definition has to do with 
"merging". After, it is said:

"""
Merging two terms t,t' results in either substitution S such that S(t) = 
S(t'), or failure indicating that there is no substitution that can be 
applied to both t and t' to make them equal.
"""

Now, merging results in a substitution. Again, it is at odd with 
standard definitions of "merging".


"""
Applying definitions, inferences, and constraints
"""

This section is really not needed if all is defined in terms of FOL. 
This section very strongly suggests one way of implementing 
PROV-CONSTRAINTS, which is not good. The description, and what's 
presented later in the normative sections, almost imposes inference 
materialisation as the normal way to implement all the logical 
operations on PROV statements (even though the spec says later, in 
passing, that implementations that lead to the same results are OK).

Using the notion of "failure" again strongly suggests that the spec is 
implemented with applications of operations indicated in this section 
and in Section 6, when in fact any consistency checker would be alright too.


"""
Checking ordering, typing, and impossibility constraints

The ordering, typing, and impossibility constraints are checked rather 
than applied.  This means that they do not generate new formulas 
expressible in PROV,
"""

This sentence again strongly suggests that constraints are "applied", in 
the sense defined previously, indicating that implementations should 
follow the operations mentioned previously, when in fact there is no 
reason to force any materialisation or "normalisastion" at all.
This is again supported by the following sentence:

"Checking such constraints follows a saturation strategy similar to that 
for normalization"

What follows is again dictating an implementation when everything is in 
fact just logical formulas that have to be consistency-checked.

PROV then uses the word "valid" where logic would use "consistent". In 
logic, a formula is valid iff it is necessarily true (it's a tautology). 
The use of a different term does not help non-logician (they would not 
be disturbed by the term "consistent") but may be a problem for 
logicians who are used to other terms.


"""
Equivalence and Isomorphism

[...]  This is similar to the notion of equivalence used in [RDF]
"""

RDF does not defined its own notion of equivalence. Equivalence in RDF 
is simply imposed by its formal semantics, following the standard notion 
of equivalence in all logics.


"""
[...] to be considered equivalent, the two instances must also be valid
"""

This is very strange, as "equivalence" is defined as a non-reflexive 
relationship. Again, the definition of equivalence would simply follow 
from a proper formalisation in FOL or model theory.

Moreover, again, the paragraph strongly suggests relying on "applying" 
normalisation to check equivalence.

See comments on Section 6.1.


"""
 From Instances to Bundles and Documents

[...] a toplevel instance consisting of the set of statements not 
appearing within a bundle
"""

This /seems/ to indicate that a statement in the top-level instance 
cannot be repeated in a bundle, while the PROV data model does not have 
this constraint.


"""
Analogously to blank nodes in [RDF], the scope of an existential 
variable in PROV is the instance level
"""

This is true for RDF graphs, but RDF doesn't say anything about the 
scope of bnodes in complex structures formed out of multiple RDF graphs, 
such as RDF datasets. This is therefore a constraint that must be 
indicated and highlighted in PROV.


Section 3:

This section uses vocabulary that is not yet defined normatively at this 
position in the document.

"""
If determining whether two PROV instances or documents are equivalent, 
an application must determine whether their normal forms are equal
"""

This is once more dictating a way of implementing, while relying on 
normal forms should be completely irrelevant, as indicated by the 
following sentence.


Section 4.1

This section defines equivalence of syntactic constructs that are purely 
a PROV-N issue. There is no reason to put this in the logic of PROV 
structures.


Section 5.1

Merging is defined in a strange way, with a procedure, which makes it 
difficult to grasp what notion it is supposed to convey.


Section 5.2

It is never specified explicitly that "strictly-precedes" is 
irreflexive. Therefore, the algorithm given in Section 6.1 for checking 
validity, which relies on this assumption, provides strictly more 
information than the definition of validity.


Section 6.1

"""
A PROV instance is valid if its normal form exists and satisfies all of 
the validity constraints
"""

Here, it is not clear what "satisfies" means. Satisfaction is a standard 
term in logic, provided that a proper semantics is defined. Here, there 
is no proper semantics so the notion of satisfaction would have to be 
defined. Assuming that PROV statements are indeed mapped to FOL, and 
constraints expressed as FOL axioms, then the standard notion of 
"satisfaction" would not work here. For example, the empty document 
doesn't satisfy any of the constraints (an empty FOL theory only 
satisfies tautologies, a.k.a., valid formulas, in the usual logical 
sense of "valid").

There wouldn't be any problem and it would be very much shorter with a 
proper first order semantics.


In this section, "equivalence" does not require validity, as opposed to 
Section 2.4.

Equivalence is defined according to the notion of isomorphic normal 
form, which is never normatively defined (the only definition being in 
the non-normative Section 2.4). In any case, there is no reason to rely 
on isomorphims at all, if only the logic of PROV was properly defined.

Unfortunately, since equivalence is given in terms of normal forms, and 
normal forms are given purely in procedural terms, there can be no 
sensible justification for 2 instances to be non-equivalent. The 
justification would be "because the procedure gave us different 
results", which isn't satisfactory at all. This is again a drawback of 
not providing an appropriate logical account of PROV statements.


Best regards,
-- 
Antoine Zimmermann
ISCOD / LSTI - Institut Henri Fayol
École Nationale Supérieure des Mines de Saint-Étienne
158 cours Fauriel
42023 Saint-Étienne Cedex 2
France
Tél:+33(0)4 77 42 66 03
Fax:+33(0)4 77 42 66 66
http://zimmer.aprilfoolsreview.com/
Received on Monday, 22 October 2012 13:04:48 UTC