Re: writing a simple example in prov-o, help from Simon Miles on 2011-10-25 (public-prov-wg@w3.org from October 2011)

From: Simon Miles <simon.miles@kcl.ac.uk>
Date: Tue, 25 Oct 2011 16:54:21 +0100
To: Provenance Working Group WG <public-prov-wg@w3.org>
Message-ID: <CAKc1nHcfjGyxk1gRj-bD4vndWBCH4EExvRhdMSF=bVdJA65F6w@mail.gmail.com>
Paul, all,

Just to properly understand why what is being discussed is important,
I wanted to expand your example to a larger use case.

At time T, you say something about a video on your blog and assert:
<http://thinklinks.wordpress.com/2011/07/31/why-provenance-is-fundamental-for-people/>
prov:wasDerivedFrom
<http://www.ted.com/talks/paul_bloom_the_origins_of_pleasure.html>.

At time T+1, the video is edited to introduce a previously missing
segment that undermines the message of your blog entry. The video URI
stays the same.

At time T+2, I say something about the (updated) video on my blog and assert:
<http://inkings.org/2011/10/08/why-provenance-is-pointless/>
prov:wasDerivedFrom
<http://www.ted.com/talks/paul_bloom_the_origins_of_pleasure.html>.
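
To make the scenario concrete, here is a minimal sketch of what a consumer sees after merging the two assertions. It assumes a deliberately simplified model in which an account is just a set of (subject, predicate, object) triples. The variable names and the helper `sources_of` are hypothetical and not part of PROV-O or any PROV tooling.

```python
# Simplified, hypothetical model: an account is a set of
# (subject, predicate, object) triples. This is not a real PROV library.
VIDEO = "http://www.ted.com/talks/paul_bloom_the_origins_of_pleasure.html"
PAUL_POST = "http://thinklinks.wordpress.com/2011/07/31/why-provenance-is-fundamental-for-people/"
SIMON_POST = "http://inkings.org/2011/10/08/why-provenance-is-pointless/"
DERIVED = "prov:wasDerivedFrom"

paul_account = {(PAUL_POST, DERIVED, VIDEO)}    # asserted at time T
simon_account = {(SIMON_POST, DERIVED, VIDEO)}  # asserted at time T+2

def sources_of(graph, post):
    """Return every object the post is asserted to be derived from."""
    return {o for (s, p, o) in graph if s == post and p == DERIVED}

# Merging the accounts is a plain set union of their triples.
merged = paul_account | simon_account

# Both posts resolve to the same identifier; the edit at T+1 is invisible.
same_source = sources_of(merged, PAUL_POST) == sources_of(merged, SIMON_POST)
print(same_source)  # True
```

Nothing in the merged triples distinguishes the video as it was at T from the video as it was at T+2.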

We could then observe:
 - Even if the above use case doesn't happen to you, by using the
simplest form of provenance you are opening the possibility of it
happening and you would not even know about it.
 - It doesn't help to say that the video owners shouldn't use the same
URL, because it is not under the control of either those creating or
consuming the provenance.
 - There is nothing apparently wrong with either of our assertions
(except the lack of characterisation), and I don't know anything about
your blog so don't take it into account in my blog's provenance.
 - It seems a reasonable criterion for interoperability that if you read
PROV-DM from two separate sources referring to the same entity, then
either there is an error in (at least) one or they are mutually
consistent. I couldn't see what this would correspond to in the
interoperability discussion [1] though.
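
One possible reading of the consistency criterion in the last point can be sketched as follows. This is only an illustration under stated assumptions: accounts are modelled as sets of (entity, attribute, value) triples, the attribute `ex:contentHash` and its values are invented for the example, and the functions `characterisation` and `mutually_consistent` are hypothetical names, not anything defined by PROV-DM.

```python
VIDEO = "http://www.ted.com/talks/paul_bloom_the_origins_of_pleasure.html"

def characterisation(account, entity):
    """Collect the attribute -> value pairs an account asserts about an entity."""
    return {attr: value for (ent, attr, value) in account if ent == entity}

def mutually_consistent(account_a, account_b, entity):
    """True if the two accounts agree on every attribute they both assert."""
    a = characterisation(account_a, entity)
    b = characterisation(account_b, entity)
    return all(a[key] == b[key] for key in a.keys() & b.keys())

# No characterising attributes at all: nothing can disagree.
bare_a, bare_b = set(), set()

# Hypothetical attribute recording what each asserter actually saw.
hashed_a = {(VIDEO, "ex:contentHash", "hash-before-edit")}
hashed_b = {(VIDEO, "ex:contentHash", "hash-after-edit")}

print(mutually_consistent(bare_a, bare_b, VIDEO))      # True (vacuously)
print(mutually_consistent(hashed_a, hashed_b, VIDEO))  # False
```

With no attributes the check passes vacuously, which is the sense in which nothing looks wrong with either assertion. Only once each account carries some characterisation can a consumer detect that one account is in error or that the two are not about the same entity.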

Thanks,
Simon

[1] http://www.w3.org/2011/prov/wiki/Interoperability


On 25 October 2011 10:02, Graham Klyne <GK@ninebynine.org> wrote:
> On 24/10/2011 13:43, Myers, Jim wrote:
>> A couple thoughts:
>>
>> When we say B wasderivedfrom A where both A and B are changing, I think the meaning we want is that some complement of A with content fixed as of the time of the derivation was used to produce a complement of B with content fixed as at the time of derivation. If that's the case, do we just need a shorthand to define such entities, i.e. to define an entity as one that characterized a URI (perhaps at a time) without then creating an identifier for it (a blank node) or explicitly stating it as a complement of the URI it characterizes? I think this is consistent with the model in the sense that how entities characterize things is defined in terms of both attributes and their provenance (Luc in Boston who arrived by train today) - saying that I'm defining an 'entity characterizing Luc' that I can then use to assert that that entity flew on a plane out of Boston is really just an alternate way of defining Luc-in-Boston. Allowing an optional timestamp for when the entity characterized the living URI just fixes where in the provenance graph an entity must be (given timestamps on other processes, etc.), e.g. a shorthand that would allow integration with another account that said Luc-in-Boston arrived by train at time X.
>
> I'm not sure that "B wasderivedfrom A where both A and B are changing" is a
> meaningful statement.  There's a tense mix-up there if nothing else :)
>
> But, more seriously, w.r.t. "to define an entity as one that characterized a URI
> (perhaps at a time) without then creating an identifier for it (a blank node)" -
> I feel fairly strongly that trying to avoid creating an RDF node doesn't really
> help - if it's possible to do.  RDF statements have to be associated with a node
> (actually 2 :), not counting the property) - whether that node is blank
> (existential) or identified with a URI is a separate consideration.  And the
> node, in RDF, *is* the identifier.
>
>> Regardless of how we identify/define such entities (whether as above or the other options in this thread), I think one can avoid having to document things like creation times that one does not know about - rather than affixing a timestamp to a generation process (prov:wasGeneratedAt examples in the thread), one could record when materials were viewed/accessed: I could say B wasderivedfrom A, A participatedIn an access process today, B participatedIn an access process today, which would mean that whenever in the past A and B were created, they are the same entities (same content) as when I accessed them today (i.e. I'm asserting that the B as I saw it today was derived from A as I saw it today at some point in the past). Making that slightly more generic - an asserter could report whatever process they did to characterize the entity - we wouldn't be limited to talking about generation.
>
> I think I agree with this bit.
>
> #g
> --
>
>>> -----Original Message-----
>>> From: public-prov-wg-request@w3.org [mailto:public-prov-wg-
>>> request@w3.org] On Behalf Of Paul Groth
>>> Sent: Friday, October 21, 2011 5:24 PM
>>> To: Luc Moreau
>>> Cc: public-prov-wg@w3.org
>>> Subject: Re: writing a simple example in prov-o, help
>>>
>>> Hi Luc, all:
>>>
>>> That's good. I think this gives the basis for writing some simple examples.
>>>
>>> With regards to Section 8, I wanted to clarify a couple things
>>>
>>> - It would be good to check the ramifications of the duality of identifiers in
>>> particular with respect to Semweb definitions. My thought is that this should
>>> be alright because of the open world assumption but does anybody see any
>>> problems?
>>>
>>> - The duality of reusing the identifier heavily relies on accounts. But in most
>>> cases people won't assert an account. Is there some default account? What's
>>> the policy? I think one could assume that every expression was in its own
>>> account unless otherwise specified. Or is everything in one general account?
>>>
>>> cheers,
>>> Paul
>>>
>>>
>>> Luc Moreau wrote:
>>>> Your suggestion, Paul, is indeed supported by DM. Look at section 8 and
>>>> imagine an empty list of attributes.
>>>>
>>>> But, as you say, it's weak characterisation.
>>>>
>>>> Professor Luc Moreau
>>>> Electronics and Computer Science
>>>> University of Southampton
>>>> Southampton SO17 1BJ
>>>> United Kingdom
>>>>
>>>> On 21 Oct 2011, at 18:16, "Paul Groth"<p.t.groth@vu.nl>   wrote:
>>>>
>>>>> Hi Stian, All:
>>>>>
>>>>> This is exactly what I was afraid of. At a minimum, we need really simple
>>>>> ways of describing the provenance of web pages. You shouldn't have to
>>>>> understand accounts or even the notion of characterized thing to use our
>>>>> vocabulary. It should just work and we should be able to interpret these
>>>>> statements with respect to the prov-dm world view.
>>>>>
>>>>> My perspective is that once you say something is of type provo:Entity
>>>>> then it should be "characterized" from that perspective (i.e. account). It may
>>>>> not be a "good" characterization but that shouldn't matter.
>>>>>
>>>>> It would be interesting if this suggested approach fits into the PROV-DM
>>>>> model. Luc, Paolo?
>>>>>
>>>>> cheers
>>>>> Paul
>>>>>
>>>>>
>>>>>
>>>>> Stian Soiland-Reyes wrote:
>>>>>> On Fri, Oct 21, 2011 at 15:41, Paul Groth<p.t.groth@vu.nl>    wrote:
>>>>>>
>>>>>>> I want to say that the post was derived from the video.
>>>>>>> Here's what I naturally wrote down:
>>>>>>> @prefix prov: <http://www.w3.org/ns/prov-o/> .
>>>>>>> <http://thinklinks.wordpress.com/2011/07/31/why-provenance-is-fundamental-for-people/>
>>>>>>>     prov:wasDerivedFrom
>>>>>>>     <http://www.ted.com/talks/paul_bloom_the_origins_of_pleasure.html> .
>>>>>>> This implies that both the post and the youtube video are of type
>>>>>>> prov:Entity.
>>>>>>> But that seems wrong because they are not characterized things.
>>>>>>> They could change. Or is the url enough of a characterization?
>>>>>> If you think the resource behind the URIs might change (as most
>>>>>> can), you should provide some attributes to help describe the
>>>>>> entity. I believe it COULD be valid for you to use the "real" URIs
>>>>>> here, as your simple account does not cover the earlier or later
>>>>>> versions of the two resources.
>>>>>>
>>>>>> You should however then include some attributes to help merge with
>>>>>> other accounts which might have a different view, as a minimum a
>>>>>> timestamp or description of the content.
>>>>>>
>>>>>> We don't really have a generic timestamp feature in PROV, but you
>>>>>> can say when an entity was generated:
>>>>>>
>>>>>>
>>>>>> <http://www.ted.com/talks/paul_bloom_the_origins_of_pleasure.html>
>>>>>>      prov:wasGeneratedAt [ time:inXSDDateTime "2011-10-17T18:25:00Z" ] .
>>>>>>
>>>>>> <http://thinklinks.wordpress.com/2011/07/31/why-provenance-is-fundamental-for-people/>
>>>>>>      prov:wasGeneratedAt [ time:inXSDDateTime "2011-10-17T18:30:00Z" ] .
>>>>>>
>>>>>> <http://thinklinks.wordpress.com/2011/07/31/why-provenance-is-fundamental-for-people/>
>>>>>>     prov:wasDerivedFrom
>>>>>>     <http://www.ted.com/talks/paul_bloom_the_origins_of_pleasure.html> .
>>>>>>
>>>>>>
>>>>>> I'm not too comfortable with this approach either - because the
>>>>>> asserter is in a way claiming that the TED talk HTML was created at
>>>>>> 18:25, which is probably not something you as the asserter know.  By
>>>>>> PROV-DM this should be kinda-OK, he is merely identifying an entity,
>>>>>> which describes a thing in the world - which in this case is a web
>>>>>> page. Different accounts don't need to agree on their entity
>>>>>> descriptions or provenance assertions even if they are using the
>>>>>> same identifiers (and somehow are talking about the same things).
>>>>>>
>>>>>> Of course, as pointed out by Satya "URIs have a global scope and are
>>>>>> interpreted consistently regardless of context" - so I should not
>>>>>> just make up a URI like
>>>>>> <http://thinklinks.wordpress.com/stian-stole-your-namespace> and claim
>>>>>> that this URI shows the location of my slippers - we should both
>>>>>> interpret this as identifying the resource
>>>>>> "stian-stole-your-namespace" on the HTTP server reachable by the DNS
>>>>>> name thinklinks.wordpress.com.
>>>>>>
>>>>>>
>>>>>> Approaches like the PAV ontology
>>>>>> (http://code.google.com/p/pav-ontology/) solve the timestamp issue
>>>>>> by introducing an intermediary:
>>>>>>
>>>>>> :doc a pav:Sourcedocument ;
>>>>>>     pav:retrievedFrom <http://www.ted.com/talks/paul_bloom_the_origins_of_pleasure.html> ;
>>>>>>     pav:sourceAccessedOn "2011-10-17T18:25:00Z" .
>>>>>>
>>>>>> However, here we have introduced an intermediary :doc (similar to our
>>>>>> prov:Entity) which you still need to mint a URI for.
>>>>>>
>>>>>>
>>>>>>
>>>>>> A different account which includes several revisions of the
>>>>>> resource, provided by Wordpress database, for instance, would need
>>>>>> to identify each of these using other identifiers, such as local IDs
>>>>>> in the RDF document:
>>>>>>
>>>>>> @prefix prov:<http://www.w3.org/ns/prov-o/>    .
>>>>>> @prefix time:<http://www.w3.org/2006/time#>    .
>>>>>>
>>>>>> <http://thinklinks.wordpress.com/2011/07/31/why-provenance-is-fundamental-for-people/>
>>>>>>     prov:wasGeneratedAt :creationTime .
>>>>>>
>>>>>> :creationTime a prov:Time ;
>>>>>>     time:inXSDDateTime "2011-10-15T15:00Z" .
>>>>>>
>>>>>> :blog1 a prov:Entity;
>>>>>>     prov:wasGeneratedAt :creationTime ;
>>>>>>     # i.e. generated at same time as:
>>>>>>     prov:wasComplementOf
>>>>>>     <http://thinklinks.wordpress.com/2011/07/31/why-provenance-is-fundamental-for-people/> .
>>>>>>
>>>>>>
>>>>>> :tedTalk a prov:Entity ;
>>>>>>    # So this is not the generation time of the talk HTML - but
>>>>>>    # the generation time of the overlapping entity description
>>>>>>    # (as the author saw it and embedded its video in :blog2)
>>>>>>    prov:wasGeneratedAt [ time:inXSDDateTime "2011-10-17T18:25:00Z" ] ;
>>>>>>    prov:wasComplementOf
>>>>>>    <http://www.ted.com/talks/paul_bloom_the_origins_of_pleasure.html> .
>>>>>>
>>>>>> :blog2 a prov:Entity ;
>>>>>>     prov:wasGeneratedAt [ time:inXSDDateTime "2011-10-17T18:30:00Z" ] ;
>>>>>>     prov:wasComplementOf
>>>>>>     <http://thinklinks.wordpress.com/2011/07/31/why-provenance-is-fundamental-for-people/> ;
>>>>>>     notYetInProv:wasRevisionOf :blog1 ;
>>>>>>     prov:wasDerivedFrom :blog1 ;
>>>>>>     # Embedded the video this time
>>>>>>     prov:wasDerivedFrom :tedTalk .
>>>>>>
>>>>>>
>>>>>> I much prefer this approach, but it does become more verbose. It still
>>>>>> makes <http://www.ted.com/talks/paul_bloom_the_origins_of_pleasure.html>
>>>>>> a prov:Entity - but we don't say anything more about it because
>>>>>> we simply don't know its provenance.
>>>>>>
>>>>>>
>>>>>> (I still believe that we need something stronger than
>>>>>> wasComplementOf above - we know for a fact that :blog2 is fully within
>>>>>> the timespan of
>>>>>> <http://thinklinks.wordpress.com/2011/07/31/why-provenance-is-fundamental-for-people/>
>>>>>> but I can't see how to express this in PROV)
>>>>>>
>>>>>>
>>>>> --
>>>>> Dr. Paul Groth (p.t.groth@vu.nl)
>>>>> http://www.few.vu.nl/~pgroth/
>>>>> Assistant Professor
>>>>> Knowledge Representation & Reasoning Group, Artificial Intelligence
>>>>> Section, Department of Computer Science, VU University Amsterdam
>>>>>
>>>>>
>>>
>>> --
>>> Dr. Paul Groth (p.t.groth@vu.nl)
>>> http://www.few.vu.nl/~pgroth/
>>> Assistant Professor
>>> Knowledge Representation & Reasoning Group, Artificial Intelligence Section,
>>> Department of Computer Science, VU University Amsterdam
>>>
>>
>>
>>
>
>



-- 
Dr Simon Miles
Lecturer, Department of Informatics
King's College London, WC2R 2LS, UK
+44 (0)20 7848 1166
Received on Tuesday, 25 October 2011 15:54:56 UTC