Re: simon:entity (or Identifiable) from Reza B'Far on 2011-07-16 (public-prov-wg@w3.org from July 2011)

From: Reza B'Far <reza.bfar@oracle.com>
Date: Sat, 16 Jul 2011 10:56:54 -0700
To: public-prov-wg@w3.org
Message-ID: <4E21D0E6.9040407@oracle.com>
Jim -

I definitely think we're getting closer to talking about the same thing, but my 
concerns remain.  I'll try to address your paragraphs 1-by-1 below, but before 
that, let me use some examples of existing and prevalent software such as source 
code control systems and content management systems which contain a subset of 
what I believe we define as provenance here.  A source code control system (such 
as Subversion) has notions of a file, a graph of modifications to the file, 
etc.  While *I know the core functionality of those systems lies outside of the 
charter of this WG,* it should be relatively straightforward for the Subversion 
team, for example, to generate a model for source code files that developers 
author with some reasonable effort and without completely rethinking and 
redesigning their system.  Otherwise, I would claim that the standard fails 
since there are so many systems out there that work the same way (versioning, 
etc.) and if they have a tough time adopting, there is a fundamental issue.  Now 
to your email -

[Jim]
I expect there will be cases where I would claim a document exists before a 
version (which has content) exists. 'My project report' has a due date and other 
aspects that processes may change before I have any version.
[Reza]
I believe that the concept of the document existing before it exists (the 
metadata about its formation, like your project due date, etc.) is completely 
different from the document itself.  From a practical perspective, in the cases 
that I know of (take, for example, legal documents, system configurations, etc.) 
this is captured as a separate notion, and I would hope we would learn from 
previously built systems.

[Jim]

You could claim that the document is still just a 0 length version to cover this, but I think when we say version, we're really trying to point at files (for example - it could be paper instead). That suggests to me that document is not really a root node as much as a different type of entity that at some point is associated with another type - files with 0 or more bytes in them representing the document content. I might need to describe how the document was created and aspects of it were changed or decided by processes (purpose, size, due date, scope, audience) before, at some point, someone creates an empty file that is now considered to be the document for editing purposes. I'm not sure how your DAG formulation would handle this.

[Reza]
It's not necessary for it to be a file.  When Monet painted Garden Path, there 
was an inception.  He may have thought of the painting before, had a deadline 
for himself, bought paint, whatever.  But there was no painting before there was 
a first stroke on a canvas after which you can call it a painting (even if it's 
incomplete).  Once there was a painting, there was some evolutionary process to 
completion of the first version, there were copies, there were modifications to 
the copies, etc. etc.  Root node is the painting Garden Path.  The actual 
thing.  Garden Path is Identifiable as Ryan calls it and can have an Identifier 
that points to the 0 length version.  That's the original.  The file 
representing that can then be an Identifier.  What you call "0 length version" 
is actually the Identifier.  The root node is the Identifiable.
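
To make the root-node point concrete, here is a minimal sketch in Python. The names StateNode, derive and roots are hypothetical, made up for illustration only; nothing here is part of Ryan's model or of any WG draft. It assumes every state records the state(s) it was derived from, and it shows that walking back from any copy or modification always ends at the one original.

```python
from dataclasses import dataclass, field


@dataclass
class StateNode:
    """One node in the history graph (hypothetical name, for illustration)."""
    label: str
    parents: list["StateNode"] = field(default_factory=list)

    def derive(self, label, *others):
        # A new state derived from this one (and optionally others, for merges).
        return StateNode(label, [self, *others])


def roots(node):
    """Return the labels of all parentless ancestors reachable from node."""
    seen, stack, found = set(), [node], set()
    while stack:
        current = stack.pop()
        if id(current) in seen:
            continue
        seen.add(id(current))
        if not current.parents:
            found.add(current.label)
        stack.extend(current.parents)
    return found


# The original painting is the single root; everything else is derived from it.
painting = StateNode("Garden Path (original)")
touched_up = painting.derive("touch-up of the original")
copy = painting.derive("copy")
copy_modified = copy.derive("modified copy")

print(roots(copy_modified))
```

Anything Monet did before the first stroke (the deadline, buying paint) would live outside this graph, as a separate notion.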

[Jim]

I could also think of versions having their own subgraphs - I might create a version which then undergoes some approval process after which it becomes public (published perhaps). This example is meant to convey that any 'state-like' thing - a version that looks like a state of a document, may itself look like something that has more internal state that we then have to reapply our mechanisms to. I think that's how your DAG model would have to become hierarchical - doc/version is one level, version/unapproved-approved-published version is another. I agree it's more complex than hierarchy since you have the time/processing chains/DAGs off of each thing as well, but I don't see how your model avoids hierarchy if you allow multiple levels of statefulness and allow each level to have independent provenance (i.e. the approval/publish processes that version1 goes through are not necessarily part of the history of version2 (which could have just come from the unapproved version1)).

[Reza]
I don't see how this causes an issue.  You can take a DAG and create 
subgraphs.  There is nothing that keeps you from creating subgraphs of the 
supergraph.  Again, this is commonplace in source code control systems.  Many people 
can modify the same file at the same time, create new files, etc.  This is not 
central to the argument.
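
As a rough illustration of the subgraph point, here is a sketch in Python. The revision names r1..r5 and the helper subgraph_from are made up for this example and do not come from Subversion or any other real system. It assumes the history is stored as a mapping from each state to the states it was derived from, and it pulls out the part of the graph that descends from one chosen state. The result is itself a DAG, which the standard-library graphlib module (Python 3.9+) confirms by producing a topological order.

```python
from graphlib import TopologicalSorter

# Hypothetical revision graph: each state maps to the states it was derived from.
derived_from = {
    "r1": [],            # inception point (single root)
    "r2": ["r1"],
    "r3": ["r1"],        # a second line of work branching off r1
    "r4": ["r2", "r3"],  # a merge
    "r5": ["r3"],
}


def subgraph_from(graph, start):
    """Keep start and every state derived, directly or indirectly, from it."""
    children = {node: [] for node in graph}
    for node, parents in graph.items():
        for parent in parents:
            children[parent].append(node)
    keep, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(children[node])
    # Drop edges that point outside the subgraph.
    return {n: [p for p in graph[n] if p in keep] for n in graph if n in keep}


sub = subgraph_from(derived_from, "r3")
# static_order() raises graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(sub).static_order())
print(sub)
print(order)
```

The subgraph rooted at r3 has its own single root and its own history, without any notion of hierarchy being added to the model.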

On 7/16/11 7:38 AM, Myers, Jim wrote:
> Reza - I think what you're talking about is a combination of the IVPof and the core inputs-process execution - outputs model (the OPM-like core that fits the immutable thing case). The latter is ~agreed to and just hasn't been talked about lately. If I understand your DAG, I think I would say that's a document that first appears through some 'creative' process with a first version that is an IVPof the document. The first version then goes through a series of 'edit' process executions to create a DAG of future versions. (We've talked about a 'derived from' link that would be a direct connection between the versions versus always linking through a process execution, though I think there's still some discussion as to whether 'derived from' can be inferred - that may be the more direct analog of the link in your DAG). In any case, each future version is also an IVPof the document - not sure if those links are in your DAG formulation or not.
>
> So one question is: are we talking about the same things yet?
>
> If so, I think there are some use cases/issues that make it hard to think of this as just a mutable thing and DAG of states (or first state is the mutable thing also) rather than a more general processing/derivation DAG along with a separate IVPof mechanism:
>
> I expect there will be cases where I would claim a document exists before a version (which has content) exists. 'My project report' has a due date and other aspects that processes may change before I have any version.
>
> You could claim that the document is still just a 0 length version to cover this, but I think when we say version, we're really trying to point at files (for example - it could be paper instead). That suggests to me that document is not really a root node as much as a different type of entity that at some point is associated with another type - files with 0 or more bytes in them representing the document content. I might need to describe how the document was created and aspects of it were changed or decided by processes (purpose, size, due date, scope, audience) before, at some point, someone creates an empty file that is now considered to be the document for editing purposes. I'm not sure how your DAG formulation would handle this.
>
> I could also think of one more edit to a file where the title is changed and we would now consider it to represent a different document - is that part of the same DAG? It seems cleaner to me to assert IVPof relationships for the versions/files that correspond to one document and to just assert that the next file in the processing chain is an IVPof a different document when that's true. Again, I'm not sure how that would look in the DAG formalization.
>
> I could also think of versions having their own subgraphs - I might create a version which then undergoes some approval process after which it becomes public (published perhaps). This example is meant to convey that any 'state-like' thing - a version that looks like a state of a document, may itself look like something that has more internal state that we then have to reapply our mechanisms to. I think that's how your DAG model would have to become hierarchical - doc/version is one level, version/unapproved-approved-published version is another. I agree it's more complex than hierarchy since you have the time/processing chains/DAGs off of each thing as well, but I don't see how your model avoids hierarchy if you allow multiple levels of statefulness and allow each level to have independent provenance (i.e. the approval/publish processes that version1 goes through are not necessarily part of the history of version2 (which could have just come from the unapproved version1)).
>
> Does that make sense?
>
>   Jim
>
>
> -----Original Message-----
> From: public-prov-wg-request@w3.org on behalf of Reza B'Far
> Sent: Sat 7/16/2011 2:37 AM
> To: public-prov-wg@w3.org
> Cc: public-prov-wg@w3.org
> Subject: Re: simon:entity (or Identifiable)
>
> Jim -
>
> I don't disagree with anything you're saying.  I think that I didn't state my
> point well.  Let me see if I can clarify and if you still think that this is
> something that has been deemed outside of the scope.  To align with your email,
> I'll use your statement:
>
> We're debating:
>    how to define this relationship
>    whether the document and its versions are the same type/class in the model
>
>
> What I'm suggesting is to augment Ryan's model so that:
>
>   1. The very first version of a document is defined by an Identifier and is
>      Identifiable.
>   2. The DAG that I mentioned is a graph of "state" relationships where each
>      state is a node.  It's directed because time is different than any other
>      dimension, you can only move forward -- well, for practical purposes.  And
>      it can't loop on itself -- once you modify something, it's modified, you
>      can't undo it with respect to time.  It's a graph because multiple versions
>      can be made of the same source without needing to merge, but merging is also
>      possible.
>   3. It's a DAG that has only one root because there is an inception point for
>      any Bob/simon:entity/whatever.  It's created at some point and that very
>      first version at the creation time is different than all the other future
>      versions.  The atoms that made that thing didn't have the semantic meaning
>      as a collection before it was made.
>
> Based on this, I'm proposing that a document and its versions are the same AFTER
> the inception point.  But that there is a unique concept at the root node which
> is the thing at the inception point.  So, if you take Ryan's example,
> Identifiable and Identifier define the entity at the inception point.  But the
> graph itself is not a concept that I see anywhere, neither the nodes in the
> graph which are the states of the entity as delta changes to the previous state,
> linearly lined up in time.  I can't tell if IVPof represents the edges in the
> graph, but I think it does per everything I've read on the wiki so far... but am
> unsure.
>
> There is no hierarchy in what I'm outlining above.  Only capturing temporal
> behavior and saying that temporal behavior is different from all the other
> dimensions since it gives rise to the notion of state and that it should be
> captured uniquely.
>
> Example
> -----------
>
> (Legal Contract [Identifiable] at Inception Time) --->  (Modification 1 [State])
> ---->  (Modification 2 [State]) --->  ....
>                                                                                   |
>
> |-->  (Modification 3 [State]) ----->  (Modification 4 [State]) ---->.....
>
> Regards.
>
> On 7/15/11 4:50 PM, Myers, Jim wrote:
>> This is going in the direction of a hierarchy of 'states' of an identifier? If so - I don't think we have a hierarchy. If not, then I'm not sure what the DAG represents.
>>
>> I remember Graham making a comment at one point about trying to write a page that talked more about the purpose of the model (as he wrote for access) - I wonder if that would help. Here's my attempt to describe the requirements and where we agree/disagree in this style. (My take could be wrong but perhaps we would make progress by identifying if we disagree on requirements or where some are debating something that others consider resolved. If so, perhaps trying to modify the text below would help before we dive back to specific points).
>>
>> In the following, I intend only the English meanings of words unless otherwise noted.
>>
>> There's a set of things we've agreed to/ignored for a while related to the basic 'inputs - process execution -outputs' where the purpose of the model is to describe the history in cases where inputs and outputs are clear and the effects of a process execution are  captured by the set of inputs and outputs (i.e. the process execution can't just change an input).
>>
>> We also want to be able to model cases where the process execution does change something versus just using input and generating outputs. A document with versions is one example. In that case we're making the choice to model both the document and its versions and are adding a relationship (IVPof) between them to signify that the object we consider to be changing could instead be thought of as distinct objects (document with content1 and document with content2) that can be handled by the base input-process execution-output model.
>>
>> We're debating:
>>    how to define this relationship
>>    whether the document and its versions are the same type/class in the model
>>
>> We also expect the model to cover a third case - where we have two different things - e.g. a document and a file - that may both have provenance, but at some point have a correspondence - the file bytes represent the document's content. This case causes problems for IVPof definitions that involve hierarchy since one can't really consider either a document or a file to be more stateful versions of the other.
>>
>> This again leads to debate about the definition of IVPof. So far formulation of this concept has been attempted in terms of properties and dimensions as well as in terms of 'perspective relative to processes'. Some of the debate here has been when these definitions start to include hierarchy (thus not fitting the third use case), but it may be possible to formulate all three in ways that don't require hierarchy.
>>
>> This last use case also makes it harder to see a difference in the types of thing like document and version. In particular, if we can imagine more than one level for the second use case (e.g. document-version-encodedVersion), or think about the third case with no hierarchy, a two class system of thing and thing-state does not appear workable.
>>
>> Another issue that has arisen in the discussions is how to refer to things outside the model. We have several reasons we want to do this -
>>     to allow discovery of things with provenance using descriptive metadata/behavior/other context outside the model
>>     to aid in the definition of IVPof, where multiple hierarchies ala TBL and the third, non-hierarchical use case make it hard not to talk about something 'real' that both things involved in an IVPof relationship are describing/representing.
>>
>> Throughout we have trouble with nomenclature thing/entity/stuff/etc., describe/represent/view of/etc. which helps obscure when we do/don't agree.
>>
>> We(I anyway) may be confusing what the model contains versus how the model will be implemented (in RDF or in other languages we think in).
>>
>> I don't know that this is complete, but perhaps I can stop and ask whether this is already controversial or if it captures some of the nature of our debates?
>>
>>    Jim
>>
>> -----Original Message-----
>> From: public-prov-wg-request@w3.org on behalf of Reza B'Far
>> Sent: Fri 7/15/2011 2:22 PM
>> To: public-prov-wg@w3.org
>> Subject: Re: simon:entity (or Identifiable)
>>
>> Folks -
>>
>> I realize that the "R" word has been banned and am fine with that.  Here is a
>> suggestion for reconciliation of proposals/suggestions by Ryan, Jim(s), and Luc -
>>
>>    1. That we specify that Identifier is some "base-line" temporally identified as
>>       zero point (there exists no entity to be identified before this point).
>>    2. That we have a new concept that encapsulates a single "state" (sorry, I know
>>       that's another dangerous word) of identifier from that point on.  I don't
>>       want to give it a name so I'll call it set S{}.
>>    3. An Identifier can have a DAG (Directed Acyclic Graph) of S{} nodes where the
>>       DAG has a single root node and that root node has equivalence with the
>>       identifier itself.
>>
>> Just trying to reconcile at this point.
>>
>>
>> On 7/15/11 10:46 AM, Jim McCusker wrote:
>>> On Fri, Jul 15, 2011 at 12:06 PM, Myers, Jim<MYERSJ4@rpi.edu>    wrote:
>>>>> Being able to describe what the entity "looks like" at the time the
>>>>> provenance was recorded.
>>>>>
>>>>> My understanding was that a BOB was something like a named graph,
>>>> graph
>>>>> literal (http://webr3.org/blog/semantic-web/rdf-named-graphs-vs-graph-
>>>>> literals/),
>>>>> or information artifact similar to iao:Dataset. The Bob would then
>>>> have
>>>>> content that described, in some way, the entity in question.
>>>>> Hence the Bob being a description of an entity's state.
>>>> Do you distinguish 'description of an entity' from 'description of an
>>>> entity's state'? I get the sense that you are not using state in the
>>>> same sense of 'a more stateful view of' that is driving the discussion
>>>> of entity versus entity-state in the IVPof debates.
>>> Any description of an entity will occur with an entity in a particular
>>> state, and so the two are the same.
>>>
>>>>> If it is possible to know, there should be assertions on the BOB
>>>> itself that say
>>>>> which entity the BOB is describing. Ideally, this is a URI of
>>>> something that's
>>>>> referenced within the BOB.
>>>> I'm hoping someone will chime in on this - I agree we need to connect
>>>> the idea of a bob with the entity, but I could see implementing that as
>>>> a link (as you say) or by saying that my entity's class is a subtype of
>>>> Bob (hence there's only one URL for the Bob and the entity).
>>> But that's clearly wrong, since Bobs only describe the state of an
>>> entity at one point/span of time and context. If the same entity is
>>> observed again, and a new Bob is created that describes the state
>>> differently, then there's nothing to tie it down. I'm guessing that by
>>> saying there is no referable entity outside of the Bob, then you can
>>> just make Bobs all the way down. But there would be no grounding to
>>> non-provenance resources in this case.
>>>
>>> The Bob is the description of something based on its state, the Entity
>>> is that something. A description of a thing is not the thing itself.
>>> Within the context of information systems, one can say that
>>> http://tw.rpi.edu/instances/JamesMcCusker is me. If you were to
>>> download the RDF from that URL that would contain a description of me
>>> within the context of RPI. The graph literal behind
>>> http://tw.rpi.edu/instances/JamesMcCusker is one description (that can
>>> change over time), and can be given an identifier using a graph digest
>>> [1], guaranteeing that we always talk about the same graph. But that
>>> graph is not me, even though the URI that returns it stands in for me
>>> in the semantic web.
>>>
>>> [1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.2187&rep=rep1&type=pdf
>>>
>>> Jim
>>> --
>>> Jim McCusker
>>> Programmer Analyst
>>> Krauthammer Lab, Pathology Informatics
>>> Yale School of Medicine
>>> james.mccusker@yale.edu | (203) 785-6330
>>> http://krauthammerlab.med.yale.edu
>>>
>>> PhD Student
>>> Tetherless World Constellation
>>> Rensselaer Polytechnic Institute
>>> mccusj@cs.rpi.edu
>>> http://tw.rpi.edu
>>>
Received on Saturday, 16 July 2011 17:57:55 UTC