Re: PROV comments from Clark&Parsia

Hi again,

I just want to respond to a couple of points, and have snipped the ones where we seem to be in agreement.

On Jan 14, 2013, at 8:44 PM, Héctor Pérez-Urbina <hector@clarkparsia.com> wrote:

> Hi James,
> 
> Thank you for your response.
> 
> 
> On Mon, Jan 14, 2013 at 10:49 AM, Evren Sirin <evren@clarkparsia.com> wrote:
<snip>
> > 1. The flow control arrows in Figure 1 seem to be backwards.
> 
> I believe this is because the arrows correspond to
> "derivation/generation/use" steps, which go backwards in time in PROV.
> 
> The caption of the figure is "Overview of the Validation Process" so I assumed that the process would start with the PROV instance(s) and then continue to 'Validity' after 'Normalization Activity'. In any case, this is a minor detail.
>  

Agreed; I have no problem with reversing the arrows, and on reflection I tend to agree that this is clearer (since we're describing a process in general, rather than a specific run of one :)


> 
> > 3. Since uniqueness constraints are ‘applied’ and can derive new
> > atoms, it is misleading to call them constraints. The same applies to
> > typing constraints.
> 
> This is a reasonable view, but I'd prefer to stick with the existing
> terminology.  The rationale for it is as follows:
> 
> - uniqueness "constraints" do not infer new PROV statements, rather
> they result in either merging two statements that contain compatible
> information, or in failure. 
> - typing "constraints" do not infer new PROV statements, rather they
> infer auxiliary non-PROV atomic formulas about typing of identifiers,
> which can in turn lead to detecting invalidity.
> - ordering "constraints" do not infer new PROV statements, rather they
> infer auxiliary ordering formulas that could in turn lead to
> invalidity.
> 
> I can see typeOf and (strictly) precedes as nonPROV predicates. But why is a statement resulting from a uniqueness constraint not considered a PROV statement? 
> 
> I'll use an example to clarify my point. Assume variable names start with an x or a y, and all other terms are assumed to be constants. Let's take Constraint 24:
> 
> C24. IF wasGeneratedBy(x0; x1, x2, x3, x4) and wasGeneratedBy(y0; x1, x2, y3, y4), THEN x0 = y0.
> 
> Now, let's suppose we have the following PROV instance I:
> 
> A1. wasGeneratedBy(gen1; e,a, t1, []).
> A2. wasGeneratedBy(y; e,a, t1, [attr1=value1]).
> 
> The enforcement process is as follows:
> 
> Given a constraint:
> 
> IF hyp1 and ... and hypn THEN t1 = u1 and ... and tn = un.
> 
> 1. Suppose PROV instance I contains all of the hypotheses hyp1 and ... and hypn.
> 
> There's an issue here. Instance I does not contain the body atoms of C24; it contains A1 and A2. You should probably say that "I contains atoms that can be unified with hyp1, ..., hypn via some substitution P".
> 

Yes, this is something that should be explained more clearly. In fact, what you want to do in general is freshen all of the variable names in hyp1...hypn to new (i.e. unused elsewhere) names, and then find a substitution that simultaneously unifies each hypothesis with an atom in the instance.
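
To make the intended behavior concrete, here is a minimal sketch in Python (purely illustrative, not part of the spec). It assumes the variable convention from your example below, namely that names starting with 'x' or 'y' are variables and everything else is a constant, and that atoms are (predicate, argument-tuple) pairs:

```python
def is_var(t):
    # Assumption borrowed from the example: names starting with 'x'
    # or 'y' are variables; everything else is a constant.
    return t[:1] in ("x", "y")

def walk(t, s):
    """Resolve a term through the substitution s."""
    while is_var(t) and t in s:
        t = s[t]
    return t

def unify_terms(a, b, s):
    """Extend substitution s to make terms a and b equal, or fail."""
    a, b = walk(a, s), walk(b, s)
    if a == b:
        return s
    if is_var(a):
        return {**s, a: b}
    if is_var(b):
        return {**s, b: a}
    return None  # two distinct constants: no unifier exists

def match_hypotheses(hyps, instance, s=None):
    """Yield substitutions that simultaneously unify each hypothesis
    with some atom of the instance (hypotheses are assumed to have
    already been freshened, so their variables don't clash with the
    instance's)."""
    s = {} if s is None else s
    if not hyps:
        yield s
        return
    (pred, args), rest = hyps[0], hyps[1:]
    for fpred, fargs in instance:
        if fpred != pred or len(fargs) != len(args):
            continue
        s2 = s
        for a, b in zip(args, fargs):
            s2 = unify_terms(a, b, s2)
            if s2 is None:
                break
        if s2 is not None:
            yield from match_hypotheses(rest, instance, s2)
```

On your C24 example, one of the yielded substitutions sends x0 to gen1 and y0 to y, and unifying the conclusion x0 = y0 then binds y to gen1, exactly as in your step-by-step trace.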

> 2. Attempt to unify all of the equated terms in the conclusion t1 = u1 and ... and tn = un.
> 3. If unification fails, then the constraint is unsatisfiable, so application of the constraint to I fails. If this failure occurs during normalization prior to validation, then I is invalid, as explained in Section 6.
> 
> We have that P(x0) = gen1 and P(y0) = y, so we need to unify gen1 and y, resulting in S = {y -> gen1}.
> 

[Minor] At this point it becomes clear that you intend "y" to be a variable, and not a constant (literal or URI).  This distinction matters because if both y and gen1 were distinct constants then unification would just fail here.

> 4. If unification succeeds with a substitution S, then S is applied to the instance I, yielding result S(I).
> 
> Here's a minor issue. It's unnecessary to apply S to the whole instance, it suffices to apply it to the atoms involved; namely, A1 and A2. Anyways, S(I) would look like:
> 
> A1. wasGeneratedBy(gen1; e,a, t1, []).
> A2. wasGeneratedBy(y; e,a, t1, [attr1=value1]).
> A3. wasGeneratedBy(gen1; e,a, t2, [attr1=value1]). // S(A2) = A3
> 

Minor point: In a separate (off-list) email you corrected A3 to:

> A3. wasGeneratedBy(gen1; e,a, t1, [attr1=value1]). // S(A2) = A3

Two substantive points:

1.  There could potentially be other occurrences of y in other statements in I.  These also need to be replaced with S(y), which is why we specify that S needs to be applied to the whole instance.  (There are more efficient ways of obtaining the same behavior without visiting the whole instance, but the result must match what the specification describes.)  For example, suppose the instance as a whole were:

A0. wasDerivedFrom(d; e,e0,a,y,u).
A1. wasGeneratedBy(gen1; e,a, t1, []).
A2. wasGeneratedBy(y; e,a, t1, [attr1=value1]).

Then if you just apply S to A1 and A2, you get

A0. wasDerivedFrom(d; e,e0,a,y,u).
A1'. wasGeneratedBy(gen1; e,a, t1, []).
A2'. wasGeneratedBy(gen1; e,a, t1, [attr1=value1]).

and the connection between the generation involved in derivation step A0 and that described by A2 is lost.  (In this case, it so happens that other constraints may force us to re-discover that y = gen1, but this is just luck.  If we replaced the generations by uses this would not be the case.)  

Generally, once we discover that a variable is equal to some other term, we should substitute the variable everywhere, as is done in the "TGD/EGD chase" in relational databases, which partly inspires what we are doing.

2.  For a uniqueness constraint, after applying the substitution S to I (or just to A1 and A2), the specification doesn't say to add a new atom to I; rather, we have:

S(A1) = A1' =  wasGeneratedBy(gen1; e,a, t1, []).
S(A2) = A2' =  wasGeneratedBy(gen1; e,a, t1, [attr1=value1]).

This in turn is an instance of the KEY constraint on generation, so the two statements above will be merged; here, merging just has the effect of removing A1', which is redundant, leaving A2', which collects everything known about gen1 in one place.
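
Putting points 1 and 2 together, the whole step could be sketched like this (again illustrative Python, not the spec's notation; I'm encoding atoms as tuples whose last component is a frozenset of attribute pairs, and restricting the merge to wasGeneratedBy for brevity):

```python
def apply_subst(instance, s):
    """Point 1: apply substitution s uniformly to every atom, so
    equalities propagate through the whole instance (as in the
    TGD/EGD chase)."""
    def resolve(t):
        while t in s:
            t = s[t]
        return t
    return [(pred,) + tuple(resolve(t) for t in terms) + (attrs,)
            for (pred, *terms, attrs) in instance]

def merge_generations(instance):
    """Point 2: merge wasGeneratedBy atoms sharing an identifier,
    unioning their attribute sets rather than adding new atoms; a
    clash in the remaining arguments signals invalidity."""
    merged, rest = {}, []
    for atom in instance:
        pred, *terms, attrs = atom
        if pred != "wasGeneratedBy":
            rest.append(atom)
            continue
        ident = terms[0]
        if ident in merged:
            old_terms, old_attrs = merged[ident]
            if old_terms != tuple(terms):
                raise ValueError(f"uniqueness violation on {ident}")
            merged[ident] = (old_terms, old_attrs | attrs)
        else:
            merged[ident] = (tuple(terms), attrs)
    return rest + [("wasGeneratedBy", *t, a) for t, a in merged.values()]
```

Running merge_generations(apply_subst(I, {"y": "gen1"})) on the three-atom instance above keeps the derivation A0 connected to gen1 and leaves a single generation statement carrying attr1=value1, which is the behavior the spec intends.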

> We have inferred some attributes for gen1. I don't see why A3 is not to be considered a PROV statement.

I agree that it is a statement, but to be pedantic, what I wrote was that uniqueness/key constraints do not infer *new* statements, that is, statements with additional information not present in the instance before.  Instead, such constraints propagate equality information through variables, or merge statements that provide complementary, compatible information about the same identified object/relationship.  At an RDF/OWL level (I believe) these would all translate to functional or inverse functional relationships, not subclass axioms.  In databases, these are almost always called functional dependencies or key constraints.

Does this help?  In any case, as noted below, we're happy to revisit the terminology - to me, though, "uniqueness inferences" does not sound right.

>  
> - only "inferences" result in new PROV statements being added to the
> instance, and only "constraints" can result in failure.
> 
> Of course, logically, distinguishing between PROV statements and
> auxiliary atoms is arbitrary, and there is no real reason to
> distinguish between inferences and constraints - we could simply call
> everything an inference.  If there is a general consensus that the
> existing terminology is confusing to implementors we will revisit
> this.
> 
> I wonder if it would be clearer to capture the notion of validity (and the different kinds of constraint) simply with impossibility constraints. For example, after capturing the semantics of precedes and strictlyPrecedes, we could have:
> 
> strictlyPrecedes(x, x) -> FALSE.
> 

Yes, we could and maybe should do this, but arguably this could be considered a change to the normative content of the specification, which is not to be taken lightly.  For precedes/strictlyPrecedes, we would also need a number of additional rules for reflexivity, transitivity, and the inclusion of strictlyPrecedes in precedes.  Instead we have left these as prose definitions, but nothing stops you from implementing them using the exact same mechanisms as the inferences.
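
To illustrate that the prose rules really are implementable with the same machinery, here is a minimal sketch (hypothetical encoding, not from the spec: ordering facts as pairs, with strictness tracked per derived edge):

```python
def ordering_invalid(precedes, strictly_precedes):
    """Saturate the ordering facts under transitivity and the
    inclusion of strictlyPrecedes in precedes, then fail exactly
    when some element strictly precedes itself, i.e. the
    impossibility rule strictlyPrecedes(x, x) -> FALSE."""
    # edge map: (a, b) -> True iff some derivation of "a precedes b"
    # uses at least one strict step
    edges = {e: False for e in precedes}
    for e in strictly_precedes:
        edges[e] = True
    changed = True
    while changed:
        changed = False
        for (a, b), s1 in list(edges.items()):
            for (c, d), s2 in list(edges.items()):
                if b != c:
                    continue
                strict = s1 or s2
                if (a, d) not in edges or (strict and not edges[(a, d)]):
                    edges[(a, d)] = strict  # new or newly-strict edge
                    changed = True
    return any(strict for (a, b), strict in edges.items() if a == b)
```

Note that reflexivity of (non-strict) precedes never triggers a failure here, so only transitivity, the inclusion, and irreflexivity of strictlyPrecedes do any work.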

> Or for every pair <A,B> of disjoint types, we could have:
> 
> typeOf(x, A), typeOf(x, B) -> FALSE.
> 
> and so on.

Constraint 55 is of this form for A = entity and B = activity.  We do not assume that any other pairs of distinct types are disjoint.
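
Checking that one disjointness pair is straightforward; a sketch (hypothetical encoding: typeOf facts as (identifier, type) pairs):

```python
def disjointness_violations(type_atoms):
    """Constraint 55 read as an impossibility rule: report every
    identifier typed as both an entity and an activity."""
    types = {}
    for ident, ty in type_atoms:
        types.setdefault(ident, set()).add(ty)
    return sorted(i for i, ts in types.items()
                  if {"entity", "activity"} <= ts)
```

Other pairs of types (e.g. entity and agent) are deliberately not checked, since the spec does not assume them disjoint.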

>  
> 
> > 4. The definition of enforcement of uniqueness constraints states one
> > should apply the resulting substitution to the whole PROV instance.
> > However, the scope of the variables is not sets of rules.
> 
> I'm not sure I understand this comment.  I believe you are indicating
> that the explanation of what it means to apply a uniqueness constraint
> is unclear, because it has nonlocal effects on the whole instance that
> go beyond the statements mentioned in the constraint.  Will explaining
> this more clearly (possibly via an example) help, or am I missing the
> point?
> 
> I mentioned this point briefly in the example. In order to enforce a uniqueness constraint, it is not necessary to apply S to the whole instance.

See above.  It is intentional that the specification says to apply S to the whole instance.  It is possible (and encouraged) to use techniques that avoid visiting the whole instance while calculating S(I); for example, there are lots of tricks in the logic programming/theorem proving literature for doing unification efficiently.
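
One such trick, sketched in illustrative Python (the names here are my own, not from the spec): maintain an index from variables to the atoms mentioning them, so that computing S(I) only visits the affected atoms while producing the same result as substituting over the whole instance.

```python
from collections import defaultdict

def build_var_index(instance, is_var):
    """Map each variable to the set of atom positions where it occurs."""
    index = defaultdict(set)
    for i, (pred, *terms) in enumerate(instance):
        for t in terms:
            if is_var(t):
                index[t].add(i)
    return index

def apply_subst_indexed(instance, index, s):
    """Apply substitution s, visiting only atoms that contain a
    substituted variable; behaviorally identical to applying s to
    the whole instance."""
    touched = set().union(*(index.get(v, set()) for v in s)) if s else set()
    out = list(instance)
    for i in touched:
        pred, *terms = instance[i]
        out[i] = (pred, *[s.get(t, t) for t in terms])
    return out
```

The index must of course be kept up to date as atoms are rewritten, which is the usual bookkeeping cost of this kind of optimization.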

I will try to make this point (and the intended behavior of uniqueness constraint application) clearer since it is evidently ambiguous, e.g. by adapting the example you gave.

Thanks again,
--James
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Received on Monday, 14 January 2013 22:37:00 UTC