Re: Best practice for specialization

On 04/02/2012 04:33 AM, Tom De Nies wrote:
> I agree with Jim, that option 2 would be the safer option here.
>
> Since we are discussing best practices, I would assume that the best
> practice would be to account for these "unexpected" events. If a
> document is able to change, even when it is not expected to, one should
> always provide the possibility to retain a correct provenance account.
>
> As you said, option 2 retains the correctness of the original account
> provided with :doc, and increments it with the version-specific provenance.
> I think it is indeed a good idea to include this in the primer.

We've been working on a related use case concerning equivalence
through reproducibility.

From some input data L0, using activity A1, I derive a new dataset L1v1,
then I do some work with L1v1, analyzing it, using it as model input,
whatever.

entity(L0)                 # The input level 0 data
entity(L1v1)               # Version 1 of the level 1 data
activity(A1)
used(A1, L0)
wasGeneratedBy(L1v1, A1)

Then we discover a better way to create L1, so we make a new dataset
L1v2 with a new activity A2.  L1v1 was really big, so we delete it.

entity(L1v2)               # Version 2 of the level 1 data
activity(A2)
used(A2, L0)
wasGeneratedBy(L1v2, A2)

Some people like L1v2, but others question some of the published work
and models that used L1v1, so they reproduce it.

They try to follow all the inputs and remake it identically to the
way it was made before (not a trivial task), so we end up with L1v1r1.

entity(L1v1r1)             # Reproduction 1 of version 1 of the level 1 data
activity(A3)
used(A3, L0)
wasGeneratedBy(L1v1r1, A3)


While L1v2 is different from L1v1 by design (version 2 is a better way
of making it), L1v1r1 is intended to be equivalent to L1v1 (difficult
to prove in the general case, but if we have represented and conveyed
sufficient information about A1, A3 should be our best reproduction of
the generation process).


While they are (should be) equivalent in content (assuming we got the
reproduction right), they are certainly distinct entities.


Now someone writes a paper describing work based on L1v1, and someone
else writes a paper describing work based on L1v1r1.


I want to examine assertions about the two papers to determine if they
are writing about the 'same' dataset.
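To make that question concrete, here is a minimal sketch in Python. It encodes the assertions above as plain tuples and walks back from each paper to whatever it depends on. The paper identifiers P1 and P2, the wasDerivedFrom links from papers to datasets, and the helper name sources are assumptions made for illustration; they are not part of the account above.

```python
# Assertions from this message as (relation, subject, object) tuples.
STATEMENTS = [
    ("used", "A1", "L0"), ("wasGeneratedBy", "L1v1", "A1"),
    ("used", "A2", "L0"), ("wasGeneratedBy", "L1v2", "A2"),
    ("used", "A3", "L0"), ("wasGeneratedBy", "L1v1r1", "A3"),
    # Hypothetical: P1 and P2 stand for the two papers.
    ("wasDerivedFrom", "P1", "L1v1"),
    ("wasDerivedFrom", "P2", "L1v1r1"),
]


def sources(entity, statements=STATEMENTS):
    """Collect every entity that `entity` depends on, directly or indirectly."""
    found = set()
    todo = [entity]
    while todo:
        current = todo.pop()
        parents = set()
        for rel, subj, obj in statements:
            if subj != current:
                continue
            if rel == "wasDerivedFrom":
                parents.add(obj)
            elif rel == "wasGeneratedBy":
                # Step through the generating activity to what it used.
                parents.update(
                    o for r, s, o in statements if r == "used" and s == obj
                )
        for parent in parents - found:
            found.add(parent)
            todo.append(parent)
    return found


p1_sources = sources("P1")
p2_sources = sources("P2")
shared = p1_sources & p2_sources
```

Under these assumptions the two papers share only L0. Nothing in the account as written connects L1v1 to L1v1r1, so a query based on identity alone reports two different datasets.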

In one sense, they are not.  L1v1 is not L1v1r1.  They were made at
different times by different people, and we might have screwed up
trying to reproduce A1 with A3 so they might actually be very
different.  (Like a French translation of an English book might not be
equivalent if the translator screwed up.)

In another sense, L1v1r1 is intended to be equivalent to L1v1 (if we
are claiming a process is reproducible, it should be possible to
reproduce it.)


Is L1v1r1 alternateOf L1v1?
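For what it is worth, here is a small Python sketch of what a "yes" would buy us at query time. It assumes alternateOf is read as symmetric and transitive, which is my assumption here rather than something settled above, and the function name same_dataset is hypothetical. It does not settle whether asserting the relation is appropriate.

```python
def same_dataset(a, b, alternates):
    """True if a and b are linked by a chain of alternateOf pairs.

    `alternates` is an iterable of (x, y) pairs, each read in both
    directions (an assumption about how alternateOf behaves).
    """
    neighbours = {}
    for x, y in alternates:
        neighbours.setdefault(x, set()).add(y)
        neighbours.setdefault(y, set()).add(x)

    seen = {a}
    todo = [a]
    while todo:
        for n in neighbours.get(todo.pop(), ()):
            if n not in seen:
                seen.add(n)
                todo.append(n)
    return b in seen


# With the assertion, the two papers' datasets are grouped together.
with_assertion = same_dataset("L1v1", "L1v1r1", [("L1v1r1", "L1v1")])
# Without it, they stay distinct, as does L1v2 in either case.
without_assertion = same_dataset("L1v1", "L1v1r1", [])
v2_grouped = same_dataset("L1v1", "L1v2", [("L1v1r1", "L1v1")])
```

The grouping is only as trustworthy as the claim that A3 really reproduced A1, which is exactly the concern raised above.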


Curt

Received on Monday, 2 April 2012 12:58:17 UTC