# Issues of PROV Constraints

From: Antoine Zimmermann <antoine.zimmermann@emse.fr>
Date: Mon, 22 Oct 2012 15:04:19 +0200
Message-ID: <50854453.2010406@emse.fr>

```Dear PROV working group,

I'm sorry to send my comments so late, but I hope you will take a close
attention to what I'm saying here.

I've read PROV data model and PROV constraints and while I find PROV-DM
satisfying, I have concerns about PROV Constraints.

The document tries to define a kind of semantics for the PROV-DM, but
it's not actually defining a proper semantics at all. And since it does
not truly define a formal semantics, it has to redefine the terms that
are borrowed from logic, and subtly diverge from the standard notions
found in logic. For someone who knows about logic, it makes the document
larger, more complicated than it should be, more difficult to read, and
more difficult to implement. For someone who does not know about logic,
it's not made at all simpler, on the contrary.

To give an example before getting into the details, let us examine the
notion of PROV-equivalence, in Section 2.4. Equivalence is a well known
and understood notion in maths and logics, which is necessarily
reflexive. In PROV, equivalence isn't reflexive! For X to be
PROV-equivalent to Y, it is required that both X and Y are PROV-valid,
which is the PROV term used to say "consistent".

If a proper semantics was defined, there would be no way the document
diverge from legacy logical concepts. There would be less chances for
mistakes in the definitions and in the rules laters. There would be more
flexibililty in the way implementations support the semantics.

I see two escapes for this:
1)  define an ad hoc model theory for PROV-DM, in the line of what's
sketched in PROV-SEM;
2)  rely on a well known logic (such as FOL), to which all PROV
statements are mapped, and add FOL axioms to model the constraints.

1) is ambitious and I understand there is not enough time to investigate
it enough. But 2) is really easy. PROV statements already look a lot
like FOL predicates, and PROV Constraints already rely heavily on FOL
notions and notations.

Definitions and inferences of 4.1 are not needed if the semantics is
defined on the abstract syntax. The abstract syntax is not PROV-N and
does not have "syntactic sugar" or "syntactic shortcuts". Meaning,
relations in abstract syntax always contain all the arguments, optional
or not, possibly with existential variables. The short forms only exist
in the surface syntax.

Inferences in 4.2, 4.3, 4.4 and 4.5 are all writable as FOL axioms.

Constraints require equality and an infinite total order (Time with
order "precedes"). This does not make the logic more complicated than FOL.

-----------------

Section 1.2:
"""
We define validity and equivalence in terms of a concept called
normalization.
"""

This is not required and suggests that implementations have to rely on
normalisation. The choice of formalisation, being entirely procedural,
imposes that normalisation is explicitely mentioned in almost all
definitions, which in turn asks for repeatedly marking that
implementations are not forced to produce normal forms. This obfuscates
the document, which could be smaller and simpler.

"""
Definitions, inferences, and uniqueness constraints can be applied to
normalize PROV instances, and event ordering, typing, and impossibility
constraints can be checked on the normal form to determine validity.
Equivalence of two PROV instances can be determined by comparing their
normal forms.
"""

This also strongly suggests normalising by "applying" constraints, as
later defined, although it is said later that normalisation is not
required in implementations.

"""
For PROV documents, validity and equivalence amount to checking the
validity or pairwise equivalence of their respective documents.
"""

should be: "... of their respective instances."

"""
This specification defines how validity and equivalence are to be
checked, if an application elects to support them at all.
"""

It was said before that implementations can choose the way they check
validity or equivalence. Again, this is a very strong suggestion that
implementation should be made according to the presented procedures.

"""
Applications [...] should treat equivalent instances or documents in the
same way.
"""

Does it mean that an instance in a bundle can be safely replaced by
another equivalent instance inside the bundle?  This is important
because it may be understood that a bundle identifier identifies the
exact set of PROV statements given in a PROV document, or that the
bundle identifier simply identifies the logical content of the bundle,
or even any content that logically "implies" the content of the bundle
in the said document.
The definitions of equivalence and inference suggests that it is the
later option. Note that this is something that may have to be understood
when writing PROV documents as RDF datasets, given that RDF 1.1 is not
specifying any particular relationship between a graph IRI and the
content of a named graph.

Section 2.4:

"""
Merging

Merging is an operation that takes two terms and compares them to see if
they are equal, or can be made equal by substituting an existential
variable with another term.
"""

Normally, merging is an operation that takes two terms and produce a
single term. I do not understand what this definition has to do with
"merging". After, it is said:

"""
Merging two terms t,t' results in either substitution S such that S(t) =
S(t'), or failure indicating that there is no substitution that can be
applied to both t and t' to make them equal.
"""

Now, merging results in a substitution. Again, it is at odd with
standard definitions of "merging".

"""
Applying definitions, inferences, and constraints
"""

This section is really not needed if all is defined in terms of FOL.
This section very strongly suggests one way of implementing
PROV-CONSTRAINTS, which is not good. The description, and what's
presented later in the normative sections, almost imposes inference
materialisation as the normal way to implement all the logical
operations on PROV statements (even though the spec says later, in
passing, that implementations that lead to the same results are OK).

Using the notion of "failure" again strongly suggests that the spec is
implemented with applications of operations indicated in this section
and in Section 6, when in fact any consistency checker would be alright too.

"""
Checking ordering, typing, and impossibility constraints

The ordering, typing, and impossibility constraints are checked rather
than applied.  This means that they do not generate new formulas
expressible in PROV,
"""

This sentence again strongly suggests that constraints are "applied", in
the sense defined previously, indicating that implementations should
follow the operations mentioned previously, when in fact there is no
reason to force any materialisation or "normalisastion" at all.
This is again supported by the following sentence:

"Checking such constraints follows a saturation strategy similar to that
for normalization"

What follows is again dictating an implementation when everything is in
fact just logical formulas that have to be consistency-checked.

PROV then uses the word "valid" where logic would use "consistent". In
logic, a formula is valid iff it is necessarily true (it's a tautology).
The use of a different term does not help non-logician (they would not
be disturbed by the term "consistent") but may be a problem for
logicians who are used to other terms.

"""
Equivalence and Isomorphism

[...]  This is similar to the notion of equivalence used in [RDF]
"""

RDF does not defined its own notion of equivalence. Equivalence in RDF
is simply imposed by its formal semantics, following the standard notion
of equivalence in all logics.

"""
[...] to be considered equivalent, the two instances must also be valid
"""

This is very strange, as "equivalence" is defined as a non-reflexive
relationship. Again, the definition of equivalence would simply follow
from a proper formalisation in FOL or model theory.

Moreover, again, the paragraph strongly suggests relying on "applying"
normalisation to check equivalence.

"""
From Instances to Bundles and Documents

[...] a toplevel instance consisting of the set of statements not
appearing within a bundle
"""

This /seems/ to indicate that a statement in the top-level instance
cannot be repeated in a bundle, while the PROV data model does not have
this constraint.

"""
Analogously to blank nodes in [RDF], the scope of an existential
variable in PROV is the instance level
"""

This is true for RDF graphs, but RDF doesn't say anything about the
scope of bnodes in complex structures formed out of multiple RDF graphs,
such as RDF datasets. This is therefore a constraint that must be
indicated and highlighted in PROV.

Section 3:

This section uses vocabulary that is not yet defined normatively at this
position in the document.

"""
If determining whether two PROV instances or documents are equivalent,
an application must determine whether their normal forms are equal
"""

This is once more dictating a way of implementing, while relying on
normal forms should be completely irrelevant, as indicated by the
following sentence.

Section 4.1

This section defines equivalence of syntactic constructs that are purely
a PROV-N issue. There is no reason to put this in the logic of PROV
structures.

Section 5.1

Merging is defined in a strange way, with a procedure, which makes it
difficult to grasp what notion it is supposed to convey.

Section 5.2

It is never specified explicitly that "strictly-precedes" is
irreflexive. Therefore, the algorithm given in Section 6.1 for checking
validity, which relies on this assumption, provides strictly more
information than the definition of validity.

Section 6.1

"""
A PROV instance is valid if its normal form exists and satisfies all of
the validity constraints
"""

Here, it is not clear what "satisfies" means. Satisfaction is a standard
term in logic, provided that a proper semantics is defined. Here, there
is no proper semantics so the notion of satisfaction would have to be
defined. Assuming that PROV statements are indeed mapped to FOL, and
constraints expressed as FOL axioms, then the standard notion of
"satisfaction" would not work here. For example, the empty document
doesn't satisfy any of the constraints (an empty FOL theory only
satisfies tautologies, a.k.a., valid formulas, in the usual logical
sense of "valid").

There wouldn't be any problem and it would be very much shorter with a
proper first order semantics.

In this section, "equivalence" does not require validity, as opposed to
Section 2.4.

Equivalence is defined according to the notion of isomorphic normal
form, which is never normatively defined (the only definition being in
the non-normative Section 2.4). In any case, there is no reason to rely
on isomorphims at all, if only the logic of PROV was properly defined.

Unfortunately, since equivalence is given in terms of normal forms, and
normal forms are given purely in procedural terms, there can be no
sensible justification for 2 instances to be non-equivalent. The
justification would be "because the procedure gave us different
results", which isn't satisfactory at all. This is again a drawback of
not providing an appropriate logical account of PROV statements.

Best regards,
--
Antoine Zimmermann
ISCOD / LSTI - Institut Henri Fayol
École Nationale Supérieure des Mines de Saint-Étienne
158 cours Fauriel
42023 Saint-Étienne Cedex 2
France
Tél:+33(0)4 77 42 66 03
Fax:+33(0)4 77 42 66 66
http://zimmer.aprilfoolsreview.com/
```
Received on Monday, 22 October 2012 13:04:48 UTC

This archive was generated by hypermail 2.4.0 : Friday, 17 January 2020 16:50:04 UTC