- From: Myers, Jim <MYERSJ4@rpi.edu>
- Date: Mon, 18 Jul 2011 22:33:49 -0400
- To: <reza.bfar@oracle.com>, <public-prov-wg@w3.org>
- Message-ID: <B7376F3FB29F7E42A510EB5026D99EF205467F03@troy-be-ex2.win.rpi.edu>
Reza, The spam tag is getting added automatically somewhere - I'll try to keep removing it as well. My point about a view 'limited to one system/one witness' was not meant to imply that you were arguing based on a particular system, or that content systems are simple. I was just trying to make it clear that we don't usually model the world more than one way within a given system, so use cases based on content management, or workflow, or any other app in isolation don't usually capture the general case problem one has if the intent is to allow the integration of provenance across all the applications and processes that have caused a dataset to be. I think we have agreed that this case is of interest. Tracking the publication process in terms of approvals and licensing while also tracking the digital versions and physical file copies involves such multiple perspectives, particularly if one wants to catch problems, e.g. one of the copies was corrupted. If that's an incorrect assumption on my part, then we need to revisit. I don't like framing this as the provenance of XYZ beginning before XYZ exists - I think the question is whether we expect to track the provenance of legal, business, conceptual, digital, physical and other types of entities and need to link those provenance traces. (I.e. "do we need to connect the provenance of the conceptual painting with that of the physical painting", not "do we have to document the provenance of the potential physical painting before it exists"). Assuming that we do have this use case, we then have to make IVPof, or some other set of mechanisms clear. We are certainly struggling to do that to everyone's satisfaction. I'll suggest some variations on the definition and make some comments that might help move this forward. A and B can be any two things/bobs that we can record provenance about - whatever we call them, they are not a subset/subtype of provenance thing. 
For a definition, how about: An assertion that B is an IVPof A implies that, for the asserter, A and B are entities in different models of the world (different ontological types) and that, at the point of the assertion, B is a form, variation, or alternate view of A. Examples of IVPof relationships include versioning, entity-state relationships, logical-digital-physical correspondences, the FRBR work/expression/manifestation/item hierarchy, etc. This is about as well as the concept of version gets defined, so perhaps that is enough. To think about invariance, while the case where B is 'more invariant' than A is a common one, I think there are cases (in many previous emails) where the only real constraint on invariance is really that of the base derivation model - the input-process execution-output construct requires the inputs and outputs to be invariant with respect to the process execution. We could potentially point out that it is always possible, given an A and a process type that modifies A, to construct new things B and B' that are IVPof A and that are invariant to the process, i.e. B represents A-in-state-X and B' is A-in-state-Y. The consequence of this is that B inherits A's immutable properties and has additional ones, i.e. this is a consequence of this particular way of defining B and B', not a general consequence of the IVPof relationship. (Whether these are useful objects in a common sense sense is not clear - there may be clearer/better choices, perhaps ones where A is also fixed in ways B is not, but you can always create this type). This seems more like some guidance on use than a definition though. If it makes more sense to the group, I think one could also frame this guidance in terms of dimensions - B would be an n-1 dimensional projection of A when constructed this way. (Again - this is not generally true from IVPof, just from this particular construction). Does any of this help? 
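As an illustration only, the construction described above (given an A and a process that modifies A, build B and B' as A-in-state-X and A-in-state-Y) might be sketched in Python as follows. The names Thing, StateOf, freeze_state and edit are hypothetical and not part of any agreed model, and the sketch assumes properties can be represented as simple key/value pairs.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Mapping, Tuple


@dataclass
class Thing:
    """A: a thing with some fixed properties and some a process may change."""
    name: str
    immutable: Mapping[str, object]
    mutable: dict = field(default_factory=dict)


@dataclass(frozen=True)
class StateOf:
    """B: 'A-in-state-X'. Every property is fixed once B is built."""
    ivp_of: Thing                      # hypothetical IVPof link back to A
    properties: Mapping[str, object]   # read-only view of the pinned values


def freeze_state(a: Thing) -> StateOf:
    """Build B from A: inherit A's immutable properties, pin the mutable ones."""
    pinned = {**a.immutable, **a.mutable}   # a copy, unaffected by later edits
    return StateOf(ivp_of=a, properties=MappingProxyType(pinned))


def edit(a: Thing, **changes) -> Tuple[StateOf, StateOf]:
    """A process that modifies A, recorded as input B -> output B'."""
    before = freeze_state(a)
    a.mutable.update(changes)
    after = freeze_state(a)
    return before, after


doc = Thing("report", immutable={"author": "jim"}, mutable={"content": "v1"})
b, b_prime = edit(doc, content="v2")
```

Here b and b_prime both point back to doc, share its immutable author property, and differ only in the pinned content. That reflects this particular way of building B and B', not the IVPof relationship in general.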
Jim

[Reza] My "world" is not "limited" to one witness, one management system, etc. If you look at source code management systems that manage billions of lines of code modified by tens of thousands of developers at big commercial organizations, I think you'll find the problem is actually fairly complex. Anyways, my personal concern is not with source code management. I was trying to use it as an example of a very large system that needs provenance. So, let me retract my solution and instead, I'll say this. IVPof is too complicated and generic, IMO, for implementation as is documented. There are lots of threads that I've read on IVPof. Let me list my concerns/questions right now about IVPof as is posted on the F2F1 page -

Text Reads:
--------------
Let A and B be two entity states. An assertion "B is an IVP of A" indicates that, for its asserter, A and B represent the same entity in the world, and the entity states modelled by A and B are consistent. "B is an IVP of A" is valid only if, for its asserter, the following holds:
* the properties they share must have corresponding values
* some mutable properties of A correspond to some immutable properties of B
* B has invariant properties that have no correspondent for A
----------------

Concerns:
1. "Let A and B be two entity states". States of the same entity? What determines entity equivalence? If two entities are the same, does that mean they are mathematically the same or representationally (like URI) the same?
2. "An assertion "B is an IVP of A" indicates that, for its asserter, A and B represent the same entity in the world, and the entity states modelled by A and B are consistent." What does "consistent" mean? Can someone provide an exact definition of consistent, referenced from the general CS field, that relates to this, for my education purposes (as far as I know consistency has an exact definition that varies from field to field)?
3. ""B is an IVP of A" is valid". What is "valid"?
Is there a test for validity that's being defined?

[Jim] OK - what is that notion? Are you proposing an additional mechanism to be added to the provenance language to cover this beyond your root/ DAG model? It seems like we should be comparing the full set of things you think is needed against IVPof and derivation graph.

[Reza] Your argument on derivation is reasonable. Again, my intent being to expand on Ryan's Identifier and Identifiable, all we would need is a bridge between Identifiable and Derivation and my requirements would be met. Then I don't see the need for IVPof.

[Jim] BTW: One point of the discussion that you and Ryan may have missed was the idea of profiles from OPM. We haven't agreed that this should be in PIL, but there one use of the profile idea was to define a standard way to map external vocabularies to the provenance world. This was done with Dublin Core, but I would think could be done with a versioning ontology as well, or agent ones. The idea was to not reinvent the wheel and avoid multiple mappings from systems that used those vocabularies and perhaps even to promote use of those vocabularies with the provenance language to avoid balkanization while also avoiding limiting the provenance language to just the use cases and/or communities where those other ontologies fit. If in this case you're really arguing that there's a big set of use cases that could be dealt with via something that looks like versioning that is less general, a profile approach might be what you should argue for.

[Reza] I'm personally not a big fan of this. If you do this, you dilute the value of the standard. As you proliferate outside of the core standard with extensions, profiles, etc., you explode out the amount of code that has to be written and maintained to deal with it and at some point it just loses value. IMO, a good standard is something like HTML where they don't get it perfect generically, but they get enough right that they get a ton of adoption.
I've mentioned the examples several times already, <p>, <div>, etc. [Jim] I agree that it does not have to be a file and could be a painting, but I don't think anything that you're saying argues against the need for a mechanism to track the 'conceptual painting' - there are many activities going on once the painting is conceptualized - it might be commissioned (money flows) and as you say there are things being bought, there could be sketches, test paintings (not of the whole scene, not things that could be called the physical painting). My point is not to argue that this is somehow the same as the physical painting, but that we need a system that tracks the provenance of the conceptual thing as well as the physical thing and gives us a way to connect them. [Reza] I think this is a core disagreement and we'll leave it at that. I disagree that all the stuff that comes before the physical manifestation of the thing can be considered part of the thing. Anything that is a part of the "thought about the thing" is something completely different and should be tracked separately. Every real system I've seen built has had to deal with this problem at some point (again, examples are content management systems, source control systems, publishing systems -- you can't ignore these, they are simply too large of use-cases). At some point, you have to define a hard-and-fast rule that says, "Ok, whatever is before is not related, let's create another instance of an abstraction". Otherwise, you end up with a knowledge graph of the entire universe and good luck trying to reason over that. [Jim] My subgraphs are not sub-versions - the content doesn't change in the approval process. Branches in code control systems are still just one DAG of things that differ in content. 
If I own the approval app, I may write my provenance separately for each file sent to me - I'm going to treat that file as the root of a DAG that has states of unapproved/reviewed/approved and a branch for rejected/appealed/approved, etc. In your content system, you're going to tell me there's a document that is a root and this file is a leaf, I'll tell you that each file is a root and has leaves representing different state (probably not files since my approval app doesn't change content). I don't think it will make sense to either application to treat that as one graph - your content system won't want to diff two of my leaves, etc. So again - if we want provenance to allow traversal of a graph like this (and I think we do given our use cases), your model will have to include hierarchy of this sort and then some way to address the non-hierarchical third case I gave before your proposal will cover everything we need and be something we can compare with the IVPof mechanism.

[Reza] My proposal was actually just an amendment to Ryan's. I still don't see hierarchies as necessary, but, I'm not religious about use of DAG as I used in my initial response. If someone has a counter proposal as an amendment to Ryan's, I'm all for it. I just find that Identifiable and Identifier are much better than what's there right now (I think it successfully resolved the BOB thread), but that something else needs to be added per comments from others on the beginning of this thread to keep track of states if Identifiable represents manifestation of physical object at time t=0. Perhaps either you, someone else, or chairs can give me the answer to this question - Has it been decided that we need to track the stuff related to XXX before XXX physically exists for this WG? To me, that's a huge question you've brought up in my mind. At least in my small brain, it's kind of similar to Open-World vs. Closed-World assumptions.
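To make the two application views in Jim's reply above concrete, here is a minimal sketch, assuming each application keeps its own single-root DAG keyed by plain string identifiers. The class ProvGraph, its methods, and identifiers such as "file-v1#approved" are hypothetical illustrations, not anything proposed in the thread.

```python
from collections import defaultdict


class ProvGraph:
    """A single-root DAG; edges run from an earlier node to a later one."""

    def __init__(self, root):
        self.root = root
        self.successors = defaultdict(list)

    def derive(self, parent, child):
        """Record that `child` follows from `parent`."""
        self.successors[parent].append(child)
        return child

    def nodes(self):
        """Every node reachable from the root, root included."""
        seen, stack = set(), [self.root]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(self.successors.get(node, []))
        return seen

    def leaves(self):
        """Reachable nodes with no successors, sorted for stable output."""
        return sorted(n for n in self.nodes() if not self.successors.get(n))


# Content system's view: the document is the root, files are its versions.
content = ProvGraph("doc")
content.derive("doc", "file-v1")
content.derive("file-v1", "file-v2")

# Approval app's view: each file it receives is a root; leaves are states.
approval = ProvGraph("file-v1")
approval.derive("file-v1", "file-v1#unapproved")
approval.derive("file-v1#unapproved", "file-v1#reviewed")
approval.derive("file-v1#reviewed", "file-v1#approved")
approval.derive("file-v1#reviewed", "file-v1#rejected")

# The only thing the two graphs have in common is the file's identifier.
shared = content.nodes() & approval.nodes()
```

In this sketch the two graphs stay separate, and a traversal from the document to an approval state has to hop through the shared identifier. Whether that hop is called hierarchy, IVPof, or something else is exactly the open question in the exchange.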
On 7/16/11 12:38 PM, Myers, Jim wrote:

Reza, I'm certainly aware of how versioning is done in content systems and, if your world is limited to one witness, one management system, I think the world is as simple as you say. (Even then there are some versioning systems that give an ID to the document (root node) separately from that of the first version and they may have an ID for 'current version' as well because they recognize that the root node is not the same type of thing as the versions.) Regardless, in that case I think the idea of an IVPof and derivation graph does what you want (as in the last email - we'd have a 'root' document and versions with IVPof links from doc to versions and derivation links forming the graph of versions) and the question is really whether your model can address the other cases.

[Jim] I expect there will be cases where I would claim a document exists before a version (which has content) exists. 'My project report' has a due date and other aspects that processes may change before I have any version.

[Reza] I believe that the concept of the document existing before it exists (the meta-data about its formation like your project due date, etc.) is completely different from the document itself. From a practical perspective, in the cases that I know of (take for example, legal documents, system configurations, etc) this is captured as a separate notion and I would hope we would learn from previously built systems.

[Jim again] OK - what is that notion? Are you proposing an additional mechanism to be added to the provenance language to cover this beyond your root/ DAG model? It seems like we should be comparing the full set of things you think is needed against IVPof and derivation graph. BTW: One point of the discussion that you and Ryan may have missed was the idea of profiles from OPM.
We haven't agreed that this should be in PIL, but there one use of the profile idea was to define a standard way to map external vocabularies to the provenance world. This was done with Dublin Core, but I would think could be done with a versioning ontology as well, or agent ones. The idea was to not reinvent the wheel and avoid multiple mappings from systems that used those vocabularies and perhaps even to promote use of those vocabularies with the provenance language to avoid balkanization while also avoiding limiting the provenance language to just the use cases and/or communities where those other ontologies fit. If in this case you're really arguing that there's a big set of use cases that could be dealt with via something that looks like versioning that is less general, a profile approach might be what you should argue for. [Jim] You could claim that the document is still just a 0 length version to cover this, but I think when we say version, we're really trying to point at files (for example - it could be paper instead). That suggests to me that document is not really a root node as much as a different type of entity that at some point is associated with another type - files with 0 or more bytes in them representing the document content. I might need to describe how the document was created and aspects of it were changed or decided by processes (purpose, size, due date, scope, audience) before, at some point, someone creates an empty file that is now considered to be the document for editing purposes. I'm not sure how your DAG formulation would handle this. [Reza] It's not necessary for it to be a file. When Monet painted Garden Path, there was an inception. He may have thought of the painting before, have a deadline for himself, bought paint, whatever. But there was no painting before there was a first stroke on a canvas after which you can call it a painting (even if it's incomplete). 
Once there was a painting, there was some evolutionary process to completion of the first version, there were copies, there were modifications to the copies, etc. etc. Root node is the painting Garden Path. The actual thing. Garden Path is Identifiable as Ryan calls it and can have an Identifier that points to the 0 length version. That's the original. The file representing that can then be an Identifier. What you call "0 length version" is actually the Identifier. The root node is the Identifiable. [Jim again] I agree that it does not have to be a file and could be a painting, but I don't think anything that you're saying argues against the need for a mechanism to track the 'conceptual painting' - there are many activities going on once the painting is conceptualized - it might be commissioned (money flows) and as you say there are things being bought, there could be sketches, test paintings (not of the whole scene, not things that could be called the physical painting). My point is not to argue that this is somehow the same as the physical painting, but that we need a system that tracks the provenance of the conceptual thing as well as the physical thing and gives us a way to connect them. [Jim] I could also think of versions having their own subgraphs - I might create a version which then undergoes some approval process after which it becomes public (published perhaps). This example is meant to convey that any 'state-like' thing - a version that looks like a state of a document, may itself look like something that has more internal state that we then have to reapply our mechanisms to. I think that's how your DAG model would have to become hierarchical - doc/version is one level, version/unapproved-approved-published version is another. 
I agree it's more complex than hierarchy since you have the time/processing chains/DAGs off of each thing as well, but I don't see how your model avoids hierarchy if you allow multiple levels of statefulness and allow each level to have independent provenance (i.e. the approval/publish processes that version1 goes through are not necessarily part of the history of version2 (which could have just come from the unapproved version1)).

[Reza] I don't see how this causes an issue. You can take a DAG and create sub graphs. There is nothing that keeps you from creating sub graphs of the super graph. Again, this is common place in source code control systems. Many people can modify the same file at the same time, create new files, etc. This is not central to the argument.

[Jim again] My subgraphs are not sub-versions - the content doesn't change in the approval process. Branches in code control systems are still just one DAG of things that differ in content. If I own the approval app, I may write my provenance separately for each file sent to me - I'm going to treat that file as the root of a DAG that has states of unapproved/reviewed/approved and a branch for rejected/appealed/approved, etc. In your content system, you're going to tell me there's a document that is a root and this file is a leaf, I'll tell you that each file is a root and has leaves representing different state (probably not files since my approval app doesn't change content). I don't think it will make sense to either application to treat that as one graph - your content system won't want to diff two of my leaves, etc. So again - if we want provenance to allow traversal of a graph like this (and I think we do given our use cases), your model will have to include hierarchy of this sort and then some way to address the non-hierarchical third case I gave before your proposal will cover everything we need and be something we can compare with the IVPof mechanism.
Jim

On 7/16/11 7:38 AM, Myers, Jim wrote:

Reza - I think what you're talking about is a combination of the IVPof and the core inputs-process execution - outputs model (the OPM-like core that fits the immutable thing case). The latter is ~agreed to and just hasn't been talked about lately. If I understand your DAG, I think I would say that's a document that first appears through some 'creative' process with a first version that is an IVPof the document. The first version then goes through a series of 'edit' process executions to create a DAG of future versions. (We've talked about a 'derived from' link that would be a direct connection between the versions versus always linking through a process execution, though I think there's still some discussion as to whether 'derived from' can be inferred - that may be the more direct analog of the link in your DAG). In any case, each future version is also an IVPof the document - not sure if those links are in your DAG formulation or not. So one question is: are we talking about the same things yet?

If so, I think there are some use cases/issues that make it hard to think of this as just a mutable thing and DAG of states (or first state is the mutable thing also) rather than a more general processing/derivation DAG along with a separate IVPof mechanism:

I expect there will be cases where I would claim a document exists before a version (which has content) exists. 'My project report' has a due date and other aspects that processes may change before I have any version. You could claim that the document is still just a 0 length version to cover this, but I think when we say version, we're really trying to point at files (for example - it could be paper instead). That suggests to me that document is not really a root node as much as a different type of entity that at some point is associated with another type - files with 0 or more bytes in them representing the document content.
I might need to describe how the document was created and aspects of it were changed or decided by processes (purpose, size, due date, scope, audience) before, at some point, someone creates an empty file that is now considered to be the document for editing purposes. I'm not sure how your DAG formulation would handle this.

I could also think of one more edit to a file where the title is changed and we would now consider it to represent a different document - is that part of the same DAG? It seems cleaner to me to assert IVPof relationships for the versions/files that correspond to one document and to just assert that the next file in the processing chain is an IVPof a different document when that's true. Again, I'm not sure how that would look in the DAG formalization.

I could also think of versions having their own subgraphs - I might create a version which then undergoes some approval process after which it becomes public (published perhaps). This example is meant to convey that any 'state-like' thing - a version that looks like a state of a document, may itself look like something that has more internal state that we then have to reapply our mechanisms to. I think that's how your DAG model would have to become hierarchical - doc/version is one level, version/unapproved-approved-published version is another. I agree it's more complex than hierarchy since you have the time/processing chains/DAGs off of each thing as well, but I don't see how your model avoids hierarchy if you allow multiple levels of statefulness and allow each level to have independent provenance (i.e. the approval/publish processes that version1 goes through are not necessarily part of the history of version2 (which could have just come from the unapproved version1)).

Does that make sense?
Jim

-----Original Message-----
From: public-prov-wg-request@w3.org on behalf of Reza B'Far
Sent: Sat 7/16/2011 2:37 AM
To: public-prov-wg@w3.org
Cc: public-prov-wg@w3.org
Subject: Re: simon:entity (or Identifiable)

Jim - I don't disagree with anything you're saying. I think that I didn't state my point well. Let me see if I can clarify and if you still think that this is something that has been deemed outside of the scope. To align with your email, I'll use your statement:

We're debating:
* how to define this relationship
* whether the document and its versions are the same type/class in the model

What I'm suggesting is to augment Ryan's model so that:
1. The very first version of a document is defined by an Identifier and is Identifiable.
2. The DAG that I mentioned is a graph of "state" relationships where each state is a node. It's directed because time is different than any other dimension, you can only move forward -- well, for practical purposes. And it can't loop on itself -- once you modify something, it's modified, you can't undo it with respect to time. It's a graph because multiple versions can be made of the same source without needing to merge, but merging is also possible.
3. It's a DAG that has only one root because there is an inception point for any Bob/simon:entity/whatever. It's created at some point and that very first version at the creation time is different than all the other future versions. The atoms that made that thing didn't have the semantic meaning as a collection before it was made.

Based on this, I'm proposing that a document and its versions are the same AFTER the inception point. But that there is a unique concept at the root node which is the thing at the inception point. So, if you take Ryan's example, Identifiable and Identifier define the entity at the inception point.
But the graph itself is not a concept that I see anywhere, neither the nodes in the graph which are the states of the entity as delta changes to the previous state, linearly lined up in time. I can't tell if IVPof represents the edges in the graph, but I think it does per everything I've read on the wiki so far... but am unsure. There is no hierarchy in what I'm outlining above. Only capturing temporal behavior and saying that temporal behavior is different from all the other dimensions since it gives rise to the notion of state and that it should be captured uniquely. Example ----------- (Legal Contract [Identifiable] at Inception Time) ---> (Modification 1 [State]) ----> (Modification 2 [State]) ---> .... | |--> (Modification 3 [State]) -----> (Modification 4 [State]) ---->..... Regards. On 7/15/11 4:50 PM, Myers, Jim wrote: This is going in the direction of a hierarchy of 'states' of an identifier? If so - I don't think we have a hierarchy. If not, then I'm not sure what the DAG represents. I remember Graham making a comment at one point about trying to write a page that talked more about the purpose of the model (as he wrote for access) - I wonder if that would help. Here's my attempt to describe the requirements and where we agree/disagree in this style. (My take could be wrong but perhaps we would make progress by identifying if we disagree on requirements or where some are debating something that others consider resolved. If so, perhaps trying to modify the text below would help before we dive back to specific points). In the following, I intend only the English meanings of words unless otherwise noted. There's a set of things we've agreed to/ignored for a while related to the basic 'inputs - process execution -outputs' where the purpose of the model is to describe the history in cases where inputs and outputs are clear and the effects of a process execution are captured by the set of inputs and outputs (i.e. 
the process execution can't just change an input).

We also want to be able to model cases where the process execution does change something versus just using input and generating outputs. A document with versions is one example. In that case we're making the choice to model both the document and its versions and are adding a relationship (IVPof) between them to signify that the object we consider to be changing could instead be thought of as distinct objects (document with content1 and document with content2) that can be handled by the base input-process execution-output model. We're debating:
* how to define this relationship
* whether the document and its versions are the same type/class in the model

We also expect the model to cover a third case - where we have two different things - e.g. a document and a file - that may both have provenance, but at some point have a correspondence - the file bytes represent the document's content. This case causes problems for IVPof definitions that involve hierarchy since one can't really consider either a document or a file to be more stateful versions of the other. This again leads to debate about the definition of IVPof. So far formulation of this concept has been attempted in terms of properties and dimensions as well as in terms of 'perspective relative to processes'. Some of the debate here has been when these definitions start to include hierarchy (thus not fitting the third use case), but it may be possible to formulate all three in ways that don't require hierarchy. This last use case also makes it harder to see a difference in the types of thing like document and version. In particular, if we can imagine more than one level for the second use case (e.g. document-version-encodedVersion), or think about the third case with no hierarchy, a two class system of thing and thing-state does not appear workable.

Another issue that has arisen in the discussions is how to refer to things outside the model.
We have several reasons we want to do this -
* to allow discovery of things with provenance using descriptive metadata/behavior/other context outside the model
* to aid in the definition of IVPof, where multiple hierarchies ala TBL and the third, non-hierarchical use case make it hard not to talk about something 'real' that both things involved in an IVPof relationship are describing/representing.

Throughout we have trouble with nomenclature thing/entity/stuff/etc., describe/represent/view of/etc. which helps obscure when we do/don't agree. We (I anyway) may be confusing what the model contains versus how the model will be implemented (in RDF or in other languages we think in). I don't know that this is complete, but perhaps I can stop and ask whether this is already controversial or if it captures some of the nature of our debates?

Jim

-----Original Message-----
From: public-prov-wg-request@w3.org on behalf of Reza B'Far
Sent: Fri 7/15/2011 2:22 PM
To: public-prov-wg@w3.org
Subject: Re: simon:entity (or Identifiable)

Folks - I realize that the "R" word has been banned and am fine with that. Here is a suggestion for reconciliation of proposals/suggestions by Ryan, Jim(s), and Luc -
1. That we specify that Identifier is some "base-line" temporally identified as zero point (there exists no entity to be identified before this point).
2. That we have a new concept that encapsulates a single "state" (sorry, I know that's another dangerous word) of identifier from that point on. I don't want to give it a name so I'll call it set S{}.
3. An Identifier can have a DAG (Directed Acyclic Graph) of S{} nodes where the DAG has a single root node and that root node has equivalence with the identifier itself.

Just trying to reconcile at this point.

On 7/15/11 10:46 AM, Jim McCusker wrote:

On Fri, Jul 15, 2011 at 12:06 PM, Myers, Jim <MYERSJ4@rpi.edu> wrote:

Being able to describe what the entity "looks like" at the time the provenance was recorded.
My understanding was that a BOB was something like a named graph, graph literal (http://webr3.org/blog/semantic-web/rdf-named-graphs-vs-graph-literals/), or information artifact similar to iao:Dataset. The Bob would then have content that described, in some way, the entity in question. Hence the Bob being a description of an entity's state. Do you distinguish 'description of an entity' from 'description of an entity's state'? I get the sense that you are not using state in the same sense of 'a more stateful view of' that is driving the discussion of entity versus entity-state in the IVPof debates. Any description of an entity will occur with an entity in a particular state, and so the two are the same. If it is possible to know, there should be assertions on the BOB itself that say which entity the BOB is describing. Ideally, this is a URI of something that's referenced within the BOB. I'm hoping someone will chime in on this - I agree we need to connect the idea of a bob with the entity, but I could see implementing that as a link (as you say) or by saying that my entity's class is a subtype of Bob (hence there's only one URL for the Bob and the entity). But that's clearly wrong, since Bobs only describe the state of an entity at one point/span of time and context. If the same entity is observed again, and a new Bob is created that describes the state differently, then there's nothing to tie it down. I'm guessing that by saying there is no referable entity outside of the Bob, then you can just make Bobs all the way down. But there would be no grounding to non-provenance resources in this case. The Bob is the description of something based on its state, the Entity is that something. A description of a thing is not the thing itself. Within the context of information systems, one can say that http://tw.rpi.edu/instances/JamesMcCusker is me. If you were to download the RDF from that URL that would contain a description of me within the context of RPI.
The graph literal behind http://tw.rpi.edu/instances/JamesMcCusker is one description (that can change over time), and can be given an identifier using a graph digest [1], guaranteeing that we always talk about the same graph. But that graph is not me, even though the URI that returns it stands in for me in the semantic web.

[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.2187&rep=rep1&type=pdf

Jim
--
Jim McCusker
Programmer Analyst
Krauthammer Lab, Pathology Informatics
Yale School of Medicine
james.mccusker@yale.edu | (203) 785-6330
http://krauthammerlab.med.yale.edu

PhD Student
Tetherless World Constellation
Rensselaer Polytechnic Institute
mccusj@cs.rpi.edu
http://tw.rpi.edu
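As a rough illustration of the graph digest idea mentioned above, the following Python sketch derives an order-independent identifier for a set of triples. It is not the algorithm from reference [1]: it assumes the graph contains no blank nodes and that terms contain no tab or newline characters, and the function name graph_digest and the example triples are made up for illustration.

```python
import hashlib


def graph_digest(triples):
    """Return a hex digest that depends only on the set of triples.

    Each triple is a (subject, predicate, object) tuple of strings.
    Duplicates and ordering are ignored by de-duplicating and sorting
    the serialized triples before hashing.
    """
    lines = sorted("\t".join(triple) for triple in set(triples))
    hasher = hashlib.sha256()
    for line in lines:
        hasher.update(line.encode("utf-8"))
        hasher.update(b"\n")
    return hasher.hexdigest()


# Hypothetical description of a person at one point in time.
description = [
    ("http://tw.rpi.edu/instances/JamesMcCusker", "rdf:type", "foaf:Person"),
    ("http://tw.rpi.edu/instances/JamesMcCusker", "foaf:name", "Jim McCusker"),
]

same_graph_reordered = list(reversed(description))
later_description = description + [
    ("http://tw.rpi.edu/instances/JamesMcCusker", "ex:role", "PhD Student"),
]
```

Under these assumptions the reordered list yields the same digest as the original, while the later description yields a different one. The digest therefore names one particular description, while the URI keeps standing in for the person being described.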
Received on Tuesday, 19 July 2011 02:49:33 UTC