Re: Best practice for specialization from Tom De Nies on 2012-04-02 (public-prov-wg@w3.org from April 2012)

From: Tom De Nies <tom.denies@ugent.be>
Date: Mon, 2 Apr 2012 16:00:23 +0200
To: Curt Tilmes <Curt.Tilmes@nasa.gov>
Cc: public-prov-wg@w3.org
Message-ID: <CA+=hbbd5gjcr2=KykT=EC80KDWbFMQ-+sjpOWGMSRDDPRTcVQQ@mail.gmail.com>
> Is L1v1r1 alternateOf L1v1?
>

I could be wrong, but I would say: no, it is not, since they are not
*specializations* of the same entity.

In the way that you presented your view on the provenance, they are only
*derived* from the same entity, since you have the activities A1 and A3
using L0 and generating L1v1 and L1v1r1.
So I think you can assert wasDerivedFrom(L1v1, L0) and
wasDerivedFrom(L1v1r1, L0), but not alternateOf(L1v1, L1v1r1).

To support this further, Sam came up with a good counterexample to these
being alternates. If L0 consists of person-location information, and L1v1
only retains the persons, and L1v1r1 only retains the locations, both
datasets are derived from L0, but they are by no means alternates of each
other.
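
A tiny sketch of Sam's counterexample, in Python. The sample records are
made up purely for illustration; only the shape of the data (person-location
pairs split into two projections) comes from the example above.

```python
# Hypothetical sample data standing in for L0 (person-location records).
L0 = [("alice", "Ghent"), ("bob", "Greenbelt"), ("carol", "Ghent")]

# Two different projections, each derived from L0.
L1v1 = sorted({person for person, _ in L0})        # keeps only the persons
L1v1r1 = sorted({location for _, location in L0})  # keeps only the locations

# Both depend on L0, yet they share no content with each other.
shares_content = bool(set(L1v1) & set(L1v1r1))
```

Both results trace back to L0, which is all that wasDerivedFrom records;
nothing here suggests the two are views of one and the same thing.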

However, one could imagine providing a sort of "conceptual view" and
introducing an entity:
 entity(inputDataForAlgorithmA)
Then, together with
 specializationOf(L1v1, inputDataForAlgorithmA)
 specializationOf(L1v1r1, inputDataForAlgorithmA)
you could say that
 alternateOf(L1v1, L1v1r1)
because they are both specializations of the input data.
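
The reasoning above could be sketched in Python roughly as follows. This is
only an illustration of the rule suggested in this message (two entities
that specialize the same entity are treated as alternates); the helper
infer_alternates is a hypothetical name, not part of any PROV tooling, and
derivation statements are deliberately not an input to it.

```python
from itertools import combinations


def infer_alternates(specializations):
    """Return unordered pairs that specialize the same general entity.

    `specializations` is an iterable of (specific, general) pairs, one per
    specializationOf(specific, general) statement.
    """
    # Group the specific entities by the general entity they specialize.
    by_general = {}
    for specific, general in specializations:
        by_general.setdefault(general, set()).add(specific)

    # Every two entities in the same group are considered alternates.
    alternates = set()
    for specifics in by_general.values():
        for a, b in combinations(sorted(specifics), 2):
            alternates.add(frozenset((a, b)))
    return alternates


# Literal view: only derivations are asserted, so nothing is inferred.
literal_view = infer_alternates([])

# Conceptual view: both datasets specialize the same introduced entity.
conceptual_view = infer_alternates([
    ("L1v1", "inputDataForAlgorithmA"),
    ("L1v1r1", "inputDataForAlgorithmA"),
])
```

In the literal view the result is empty, while in the conceptual view it
holds the single pair {L1v1, L1v1r1}.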

So to sum up, I guess it depends on the granularity and the angle at which
you view the provenance. In the literal case, they do not seem alternates
to me, yet when looking at the conceptual representation, they are. Or is
this introducing too much semantics to the constraints?

Regards,
Tom
---
Tom De Nies
Ghent University - IBBT
Faculty of Engineering and Architecture
Department of Electronics and Information Systems - Multimedia Lab
Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent, Belgium

t: +32 9 331 49 59
e: tom.denies@ugent.be

URL:  http://multimedialab.elis.ugent.be



2012/4/2 Curt Tilmes <Curt.Tilmes@nasa.gov>

> On 04/02/2012 04:33 AM, Tom De Nies wrote:
>
>> I agree with Jim, that option 2 would be the safer option here.
>>
>> Since we are discussing best practices, I would assume that the best
>> practice would be to account for these "unexpected" events. If a
>> document is able to change, even when it is not expected to, one should
>> always provide the possibility to retain a correct provenance account.
>>
>> As you said, option 2 retains the correctness of the original account
>> provided with :doc, and increments it with the version-specific
>> provenance.
>> I think it is indeed a good idea to include this in the primer.
>>
>
> We've been working on a related use case concerning equivalence
> through reproducibility.
>
> From some input data L0, using activity A1, I derive a new dataset L1v1,
> then I do some work with L1v1, analyzing it, using it as model input,
> whatever.
>
> entity(L0)                 # The input level 0 data
> entity(L1v1)               # Version 1 of the level 1 data
> activity(A1)
> used(A1, L0)
> wasGeneratedBy(L1v1, A1)
>
> Then we discover a better way to create L1, so we make a new dataset
> L1v2 with a new activity A2.  L1v1 was really big, so we delete it.
>
> entity(L1v2)               # Version 2 of the level 1 data
> activity(A2)
> used(A2, L0)
> wasGeneratedBy(L1v2, A2)
>
> Some people like L1v2, but others question some of the published work
> and models that used L1v1, so they reproduce it.
>
> They try to follow all the inputs and remake it identically to the
> way they did before (not a trivial task), so we end up with L1v1r1
>
> entity(L1v1r1)             # Reproduction 1 of version 1 of the level 1 data
> activity(A3)
> used(A3, L0)
> wasGeneratedBy(L1v1r1, A3)
>
>
> While L1v2 is different from L1v1 by design (version 2 is a better way
> of making it), L1v1r1 is intended to be equivalent to L1v1 (difficult
> to prove in the general case, but if we have represented and conveyed
> sufficient information about A1, A3 should be our best reproduction of
> the generation process).
>
>
> While they are (should be) equivalent in content (assuming we got the
> reproduction right), they are certainly distinct entities.
>
>
> Now someone writes a paper describing work based on L1v1, and someone
> else writes a paper describing work based on L1v1r1.
>
>
> I want to examine assertions about the two papers to determine if they
> are writing about the 'same' dataset.
>
> In one sense, they are not.  L1v1 is not L1v1r1.  They were made at
> different times by different people, and we might have screwed up
> trying to reproduce A1 with A3 so they might actually be very
> different.  (Like a French translation of an English book might not be
> equivalent if the translator screwed up.)
>
> In another sense, L1v1r1 is intended to be equivalent to L1v1 (if we
> are claiming a process is reproducible, it should be possible to
> reproduce it.)
>
>
> Is L1v1r1 alternateOf L1v1?
>
>
> Curt
>
>
Received on Monday, 2 April 2012 14:00:59 UTC