Re: Best practice for specialization from Timothy Lebo on 2012-04-02 (public-prov-wg@w3.org from April 2012)

From: Timothy Lebo <lebot@rpi.edu>
Date: Mon, 2 Apr 2012 09:25:48 -0400
To: Curt Tilmes <Curt.Tilmes@nasa.gov>
Cc: <public-prov-wg@w3.org>
Message-Id: <EAC33E04-4887-4148-BE67-AA41703C2D3B@rpi.edu>
Curt,

I listed this example at http://www.w3.org/2011/prov/wiki/PROV_OWL_ontology_component_examples#NASA_reproducing_big_datasets

so that hopefully someday it can make it to 

http://www.w3.org/2011/prov/wiki/PROV_examples

Regards,
Tim


On Apr 2, 2012, at 8:57 AM, Curt Tilmes wrote:

> On 04/02/2012 04:33 AM, Tom De Nies wrote:
>> I agree with Jim, that option 2 would be the safer option here.
>> 
>> Since we are discussing best practices, I would assume that the best
>> practice would be to account for these "unexpected" events. If a
>> document is able to change, even when it is not expected to, one should
>> always provide the possibility to retain a correct provenance account.
>> 
>> As you said, option 2 retains the correctness of the original account
>> provided with :doc, and increments it with the version-specific provenance.
>> I think it is indeed a good idea to include this in the primer.
> 
> We've been working on a related use case concerning equivalence
> through reproducibility.
> 
> From some input data L0, using activity A1, I derive a new dataset L1v1,
> then I do some work with L1v1, analyzing it, using it as model input,
> whatever.
> 
> entity(L0)                 # The input level 0 data
> entity(L1v1)               # Version 1 of the level 1 data
> activity(A1)
> used(A1, L0)
> wasGeneratedBy(L1v1, A1)
> 
> Then we discover a better way to create L1, so we make a new dataset
> L1v2 with a new activity A2.  L1v1 was really big, so we delete it.
> 
> entity(L1v2)               # Version 2 of the level 1 data
> activity(A2)
> used(A2, L0)
> wasGeneratedBy(L1v2, A2)
> 
> Some people like L1v2, but others question some of the published work
> and models that used L1v1, so they reproduce it.
> 
> They try to follow all the inputs and remake it identically to the
> way they did before (not a trivial task), so we end up with L1v1r1.
> 
> entity(L1v1r1)             # Reproduction 1 of version 1 of the level 1 data
> activity(A3)
> used(A3, L0)
> wasGeneratedBy(L1v1r1, A3)
> 
> 
> While L1v2 is different from L1v1 by design (version 2 is a better way
> of making it), L1v1r1 is intended to be equivalent to L1v1 (difficult
> to prove in the general case, but if we have represented and conveyed
> sufficient information about A1, A3 should be our best reproduction of
> the generation process).
> 
> 
> While they are (should be) equivalent in content (assuming we got the
> reproduction right), they are certainly distinct entities.
> 
> 
> Now someone writes a paper describing work based on L1v1, and someone
> else writes a paper describing work based on L1v1r1.
> 
> 
> I want to examine assertions about the two papers to determine if they
> are writing about the 'same' dataset.
> 
> In one sense, they are not.  L1v1 is not L1v1r1.  They were made at
> different times by different people, and we might have screwed up
> trying to reproduce A1 with A3 so they might actually be very
> different.  (Like a French translation of an English book might not be
> equivalent if the translator screwed up.)
> 
> In another sense, L1v1r1 is intended to be equivalent to L1v1 (if we
> are claiming a process is reproducible, it should be possible to
> reproduce it.)
> 
> 
> Is L1v1r1 alternateOf L1v1?
> 
> 
> Curt
> 
>
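
As a rough illustration of the question in the quoted message, the used/wasGeneratedBy statements can be loaded into plain Python sets and compared. This is only a sketch: the tuple layout and the helper names inputs_of and share_inputs are hypothetical, not part of any PROV tooling, and sharing inputs is not claimed to be the definition of alternateOf.

```python
# Hypothetical encoding of the statements from the quoted example.
# Each pair mirrors one used(...) or wasGeneratedBy(...) line.
used = {("A1", "L0"), ("A2", "L0"), ("A3", "L0")}
was_generated_by = {("L1v1", "A1"), ("L1v2", "A2"), ("L1v1r1", "A3")}


def inputs_of(entity):
    """Return the entities used by whichever activity generated `entity`."""
    activities = {a for (e, a) in was_generated_by if e == entity}
    return {src for (a, src) in used if a in activities}


def share_inputs(e1, e2):
    """True when both entities were generated from the same non-empty inputs."""
    i1, i2 = inputs_of(e1), inputs_of(e2)
    return bool(i1) and i1 == i2


print(share_inputs("L1v1", "L1v1r1"))  # reproduction vs. original
print(share_inputs("L1v1", "L1v2"))    # redesigned version vs. original
```

Both calls return True: as recorded, the statements for the reproduction (L1v1r1) and for the redesigned version (L1v2) have the same shape, so the intent to reproduce would have to be asserted explicitly, whether through alternateOf or some other relation.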
Received on Monday, 2 April 2012 13:26:21 UTC