Re: gap analysis (input regarding PML) from Paul Groth on 2010-08-06 (public-xg-prov@w3.org from August 2010)

From: Paul Groth <pgroth@gmail.com>
Date: Fri, 06 Aug 2010 13:37:48 +0200
To: Paulo Pinheiro da Silva <paulo@utep.edu>
CC: Luc Moreau <L.Moreau@ecs.soton.ac.uk>, "public-xg-prov@w3.org" <public-xg-prov@w3.org>, "Arora, Jitin BTE" <jarora@miners.utep.edu>, Tim Lebo <lebot@rpi.edu>, "Deborah L. McGuinness" <dlm@cs.rpi.edu>
Message-ID: <4C5BF40C.4050106@gmail.com>
Hi Paulo,

Ok, now we're getting somewhere. The thing I would point out is that we 
as a group spent a significant amount of time doing both technical and 
user requirements gathering. So we've moved on to trying to identify the 
gaps in available solutions.

Obviously, I'm glad to answer any technical questions you might have. I
just want to make sure that technical questions do not cloud the point
of the exercise.

Cheers,
Paul



Paulo Pinheiro da Silva wrote:
> Hi Paul,
>
> I cannot see the misunderstanding. In requirements engineering,
> requirements are expected to be elicited from all possible sources of 
> information including users, documents, conventions, guidelines, etc. 
> Use cases are one way of doing such elicitation from users. The 
> inspection and analysis of existing technologies is a complementary 
> way of eliciting requirements.
>
> From your message, I am under the impression that you may be thinking 
> that I am trying to compare which technology is the best or something 
> like that – I am sorry if that is the impression that you got from my 
> message. Please note that I am just trying to understand the concerns 
> beyond the existing technologies and to use this understanding to 
> identify new provenance requirements as stated in the group’s charter.
>
> I should say that our group meetings have been a great opportunity for 
> me to better understand provenance technologies other than PML and 
> this is exactly what I was trying to do when I asked you the technical 
> questions about OPM and p-structure. For instance, I didn’t know that 
> OPM was partially derived from p-structure.
>
> Cheers,
> Paulo.
>
>> Hi Paulo,
>>
>> I think we may have a misunderstanding on the goals of the Incubator
>> Group. Let me say what I think it is and people can correct me if I'm 
>> wrong:
>>
>> Quoting the Incubator Group's charter:
>>
>> The goal of this incubator group is to provide a state-of-the art
>> understanding and develop a roadmap in the area of provenance for
>> Semantic Web technologies, development, and possible standardization.
>> This includes:
>> - Developing requirements for representing explicit provenance
>> information of Semantic Web resources
>> - Developing use cases for accessing and reasoning about provenance
>> information
>> - Identifying the issues in provenance that are a direct concern to the
>> Semantic Web
>> - Identifying starting points for provenance representations
>> - Articulating the relationships between provenance on the Semantic Web
>> and ongoing work on trust and provenance in other areas
>> - Identifying elements of a provenance architecture on the Semantic Web
>> that need and would benefit from Standardization (eg, at the W3C)
>>
>>
>> We are not standardizing anything in this group, we are saying if/where
>> standardization would be beneficial to have. We also may suggest
>> starting points for any suggested standardization.
>>
>> In the gap analysis work, we are trying to say, given the requirements
>> from the scenario, what gaps need to be closed for that scenario to be
>> made possible. In some cases, these gaps point to the need for
>> standardization, for example, the lack of a common provenance
>> representation and the lack of a common mechanism to access provenance.
>> In some cases, they point to the need for more research/technical
>> work, e.g. proven approaches for scalability.
>>
>> In the gap analysis, we are identifying gaps, not suggesting
>> technologies, and clearly we have identified gaps that are not in OPM.
>>
>> The mapping effort is a different line of work and I wouldn't want to
>> confuse the two pieces of work. If you have questions about that, I
>> think we should start a new email thread.
>>
>> Thanks,
>> Paul
>>
>>
>>
>> Paulo Pinheiro da Silva wrote:
>>> Dear Luc and Paul,
>>>
>>> Thank you very much for your comments.
>>>
>>> I consider the connection between derivation traces and information
>>> sources to be essential to support some of the identified requirements
>>> for a “common” representation of provenance. For instance, I cannot
>>> see how provenance can be used to support trust recommendations and
>>> result understanding if one cannot use provenance to trace back
>>> information sources used to derive a given result.
>>>
>>> In this case, I would like to refer back to my original message where
>>> I identified some aspects of PML that I believe are not covered in
>>> OPM, including in this case a supporting infrastructure.
>>>
>>> As a response to my message, Paul cited a European provenance project
>>> that according to Luc is based on p-structure, which pre-dates OPM.
>>> The information that p-structure pre-dates OPM, however, is not enough
>>> for me to know whether p-structure provides a solution for connecting
>>> derivation traces to information sources that I cannot see in OPM or,
>>> if it does, why the connection was not propagated to OPM. Also, it
>>> does not clarify the relation between p-structure/OPM and the
>>> technical issues in the original gap analysis.
>>>
>>> So, one may ask what the relevance of my questions above is,
>>> considering that we should focus on a common representation for
>>> provenance. I would like to remind the group of its effort to map other
>>> provenance notations to OPM, which somehow implies that OPM provides
>>> the basic constructs to support a common provenance notation. So, which
>>> other important provenance aspects are we leaving out in our mapping?
>>> Also, how can we create a common understanding of provenance aspects
>>> not covered in OPM?
>>>
>>> Unfortunately I will not be able to attend the meeting this week, but
>>> I am very much interested in answers to my questions.
>>>
>>> Many thanks,
>>> Paulo.
>>>
>>>> Paulo,
>>>>
>>>> The papers that Paul cited can be downloaded from www.pasoa.org (they
>>>> are also on Mendeley).
>>>> The p-structure model pre-dates OPM.
>>>>
>>>> Paul's point (I believe) is that many of the issues raised in the gap
>>>> analysis were
>>>> addressed by this model, but it is not widely deployed.
>>>>
>>>> Luc
>>>>
>>>> On 05/08/10 21:30, Paulo Pinheiro da Silva wrote:
>>>>> Hi Paul,
>>>>> Thank you very much for your prompt response.
>>>>>
>>>>> I am glad to see that we all agree that there is an urgent need for a
>>>>> common provenance standard. My understanding is that the incubator
>>>>> group is paving the way for the development of such a standard.
>>>>>
>>>>> Regarding your message, I would like to better understand the
>>>>> technical aspects of your gap analysis and to learn from it. So,
>>>>> following your mention of the European provenance project, please let
>>>>> me know the following:
>>>>>
>>>>> 1) Are you saying that one cannot see the technical issues of your
>>>>> gap analysis in the European provenance project?
>>>>>
>>>>> 2) If the answer for (1) is yes, how can we learn from this
>>>>> project?
>>>>>
>>>>> 3) Is OPM used in the European project?
>>>>>
>>>>> 4) If the answer for (3) is yes, I would like to understand how OPM
>>>>> artifacts are tied to sources, how sources are identified, and how
>>>>> provenance information about sources is represented;
>>>>>
>>>>> 5) If the answer for (3) is no, which provenance representation
>>>>> language is used?
>>>>>
>>>>> In other words, I need to better understand the gap analysis (viz.,
>>>>> the points behind the analysis) and I believe we should not start
>>>>> from the assumption that we don’t know anything about provenance
>>>>> (that appears to be the motivation for us to write a related work
>>>>> section).
>>>>>
>>>>> Many thanks,
>>>>> Paulo.
>>>>>
>>>>> On 8/5/2010 1:08 PM, Paul Groth wrote:
>>>>>> Hi Paulo,
>>>>>>
>>>>>> Thanks for the message. I think the important thing here is the word
>>>>>> "common" in what I wrote. By way of illustration...
>>>>>>
>>>>>> As part of the EU Provenance Project [1], we also designed and
>>>>>> implemented an Architecture for Provenance Systems [2, 3]. This
>>>>>> architecture included a data model, the p-structure [4], that
>>>>>> allowed for the distributed linking and storage of provenance. It
>>>>>> specified protocols for querying provenance information [5, 6] and
>>>>>> for recording it as well. This was designed to work in a scalable
>>>>>> setting [7].
>>>>>>
>>>>>> Obviously, I could go into more detail; this little description is
>>>>>> just to point out that I _agree_ with you that there are solutions
>>>>>> for many of these problems. However, these solutions are _not_
>>>>>> common and widely deployed, where widely deployed = things like
>>>>>> trackbacks, HTML, and probably Dublin Core and RDFa. The point is
>>>>>> that while solutions exist within the research community (and some
>>>>>> in business), they are by no means common or standard.
>>>>>>
>>>>>> This is exactly why, personally, I think the W3C should have a
>>>>>> standards committee devoted to provenance. There are enough
>>>>>> commonalities between provenance technologies that having a standard
>>>>>> would help push adoption of provenance on the Web. Furthermore,
>>>>>> without a standard, it is difficult to effectively implement
>>>>>> something like the News Aggregator Scenario over the whole of the
>>>>>> web.
>>>>>>
>>>>>> Cheers,
>>>>>> Paul
>>>>>>
>>>>>>
>>>>>> [1] http://www.gridprovenance.org
>>>>>> [2] http://eprints.ecs.soton.ac.uk/13216/
>>>>>> [3] Moreau, Luc and Groth, Paul and Miles, Simon and Vazquez,
>>>>>> Javier and Jiang, Sheng and Munroe, Steve and Rana, Omer and
>>>>>> Schreiber, Andreas and Tan, Victor and Varga, Laszlo (2007) The
>>>>>> Provenance of Electronic Data. Communications of the ACM, 51 (4).
>>>>>> pp. 52-58.
>>>>>> [4] Paul Groth, Simon Miles, and Luc Moreau. A Model of Process
>>>>>> Documentation to Determine Provenance in Mash-ups. Transactions on
>>>>>> Internet Technology (TOIT), 9(1):1-31, 2009.
>>>>>> [5] Simon Miles, Paul Groth, Steve Munroe, Sheng Jiang, Thibaut
>>>>>> Assandri, and Luc Moreau. Extracting Causal Graphs from an Open
>>>>>> Provenance Data Model. Concurrency and Computation: Practice and
>>>>>> Experience, 2007.
>>>>>> [6] Miles, Simon (2006) Electronically Querying for the Provenance
>>>>>> of Entities. In: Proceedings of the International Provenance and
>>>>>> Annotation Workshop, May 2006, Chicago, USA.
>>>>>> [7] Groth, Paul and Miles, Simon and Fang, Weijian and Wong,
>>>>>> Sylvia C. and Moreau, Luc (2005) Recording and Using Provenance in a
>>>>>> Protein Compressibility Experiment. In: Proceedings of the 14th IEEE
>>>>>> International Symposium on High Performance Distributed Computing
>>>>>> (HPDC 2005). Item not available online.
>>>>>>
>>>>>>
>>>>>> Paulo Pinheiro da Silva wrote:
>>>>>>> Paul-- Thank you very much for your message.
>>>>>>>
>>>>>>> All-- I agree with Paul’s statement that there are no
>>>>>>> well-established guidelines for using/adopting provenance solutions,
>>>>>>> and this is a part of his message that I would like to see discussed
>>>>>>> further. I like Luc’s suggestion of discussing these gaps in terms
>>>>>>> of queries.
>>>>>>> For instance, if you go to
>>>>>>>
>>>>>>> http://trust.utep.edu/sparql-pml/query/example
>>>>>>>
>>>>>>> you will see a large collection of sparql-pml queries answering many
>>>>>>> of the questions that require bridging the gaps identified in Paul’s
>>>>>>> message. Please note that the queries in the URL above are standard
>>>>>>> SPARQL queries based on the use of the PML vocabulary. The results
>>>>>>> used in the URL come from a repository of PML provenance knowledge in
>>>>>>> the domains of earth science (using actual NSF Earthscope and IRIS
>>>>>>> data in support of seismology and USGS data in support of earth
>>>>>>> magnetism), astronomy (using actual NCAR data in support of space
>>>>>>> weather), and logical proofs in support of TPTP. Anyone can go to
>>>>>>> http://trust.utep.edu/sparql-pml/query/index and write their own
>>>>>>> queries or use the basic or advanced use interface
>>>>>>> (http://trust.utep.edu/sparql-pml/search/index). [the SPARQL-PML
>>>>>>> queries have been developed by Jitin Arora
>>>>>>> (http://trust.utep.edu/~jarora/)]
>>>>>>>
>>>>>>> I would like to emphasize two aspects of PML that may need to be
>>>>>>> highlighted so that the group can further appreciate our work and
>>>>>>> understand how PML bridges the technical gaps in Paul’s message:
>>>>>>>
>>>>>>> 1) PML is a collection of three ontologies: PML-Provenance (or
>>>>>>> PML-P), PML-Justification (PML-J) and PML-Trust (PML-T). In this
>>>>>>> case, most of the provenance concepts in OPM map into concepts
>>>>>>> described in the PML-J ontology. This means that most of the
>>>>>>> elements in PML-P are concepts not covered in OPM. I will go further
>>>>>>> and say that many of these concepts have the role of tying artifacts
>>>>>>> to sources as identified in Paul’s message;
>>>>>>>
>>>>>>> 2) If you revisit our publications, for instance [1], you will see
>>>>>>> that PML is just a component (the language component) of a bigger
>>>>>>> infrastructure called Inference Web
>>>>>>> http://inference-web.org
>>>>>>>
>>>>>>> In fact, most of the concerns highlighted by Luc in his message
>>>>>>> about having a well-defined API, services and other infrastructural
>>>>>>> features in support of provenance are exactly the kinds of things
>>>>>>> that one should be able to see in the Inference Web.
>>>>>>>
>>>>>>> With (1) and (2) in mind, I would like to stress one point: most of
>>>>>>> the provenance infrastructure mentioned in (2) is in support of
>>>>>>> PML-P. In fact, PML-P is the part of the provenance that gets reused
>>>>>>> across multiple justification traces and as such needs to be
>>>>>>> discovered, aligned, augmented, etc. Further, one of our major
>>>>>>> mistakes was to put a lot of effort into trying to come up with a
>>>>>>> registration mechanism for PML-P documents called IW-Base [2]. Later
>>>>>>> on, after a meeting with Tim Berners-Lee and his W3C team, we
>>>>>>> learned that we would need to distribute this approach, which is why
>>>>>>> we developed an Inference Web search mechanism for provenance called
>>>>>>> IWSearch [3]. Again, anyone can try IWSearch at
>>>>>>> http://onto.rpi.edu/iwsearch/
>>>>>>>
>>>>>>> I would like to say that PML and Inference Web were developed from
>>>>>>> day 1 to support "linking provenance between sites (i.e. trackback
>>>>>>> but for the whole web)." That is the reason why PML has always had
>>>>>>> the following properties:
>>>>>>>
>>>>>>> a) PML identifiers are URIs
>>>>>>> b) PML content is in RDF/OWL (used to be in DAML+OIL before OWL)
>>>>>>> c) PML justifications are combinable/decomposable [4]
>>>>>>> d) RDF/OWL links are used to connect PML documents
>>>>>>>
>>>>>>> Another point that I would like to make is that the PML-P part of
>>>>>>> PML is the one where we connect to many other well-known pieces of
>>>>>>> information that we have discussed in this group. For instance, when
>>>>>>> it comes to publications, PML-P defines a publication as a kind of
>>>>>>> information source and is where we connect PML to Dublin Core
>>>>>>> attributes for publications.
>>>>>>>
>>>>>>> As you see, I have reasons to be uncomfortable with statements that
>>>>>>> there is no language “for expressing provenance information that
>>>>>>> captures processes as well as the other content dimensions” or "API
>>>>>>> for obtaining/querying provenance information" or "for linking
>>>>>>> provenance between sites (i.e. trackback but for the whole web)".
>>>>>>>
>>>>>>> Regarding this month of August, I am unfortunately unable to attend
>>>>>>> the meetings this week and next week (I will be flying during the
>>>>>>> time of the meetings). Also, I believe Deborah will not attend
>>>>>>> either due to personal reasons. Thus, I am asking Tim Lebo from RPI
>>>>>>> to represent us and to collect any requests you may have regarding
>>>>>>> PML so that we can address them later in case Tim cannot answer your
>>>>>>> questions right away.
>>>>>>>
>>>>>>> Many thanks,
>>>>>>> Paulo.
>>>>>>>
>>>>>>> [the publications below are part of the provenance collection in
>>>>>>> Mendeley]
>>>>>>>
>>>>>>> [1] Deborah L. McGuinness and Paulo Pinheiro da Silva. Explaining
>>>>>>> Answers from the Semantic Web: The Inference Web Approach. Journal
>>>>>>> of Web Semantics, Vol. 1 No. 4. October 2004, pages 397-413.
>>>>>>>
>>>>>>> [2] Deborah L. McGuinness, Paulo Pinheiro da Silva, Cynthia Chang.
>>>>>>> IWBase: Provenance Metadata Infrastructure for Explaining and
>>>>>>> Trusting Answers from the Web. Technical Report KSL-04-07, Knowledge
>>>>>>> Systems Laboratory, Stanford University, USA, 2004.
>>>>>>>
>>>>>>> [3] Paulo Pinheiro da Silva, Geoff Sutcliffe, Cynthia Chang, Li
>>>>>>> Ding, Nick del Rio and Deborah McGuinness. Presenting TSTP Proofs
>>>>>>> with Inference Web Tools. In Proceedings of IJCAR '08 Workshop on
>>>>>>> Practical Aspects of Automated Reasoning (PAAR-2008), August 2008,
>>>>>>> Sydney, Australia.
>>>>>>>
>>>>>>> [4] Paulo Pinheiro da Silva and Deborah L. McGuinness. Combinable
>>>>>>> Proof Fragments for the Web. Technical Report KSL-03-04, Knowledge
>>>>>>> Systems Laboratory, Stanford University, USA, 2003.
>>>>>>>
>>>>>>> ([3] is not a paper specifically about IWSearch although it briefly
>>>>>>> describes the tool)
>>>>>>>
>>>>>>>> Thanks Paul for this proposal for the gap analysis.
>>>>>>>> Twice you mention 'exposing' and I thought we could introduce
>>>>>>>> 'querying' provenance too.
>>>>>>>>
>>>>>>>> Also, maybe the gaps could be structured in content vs APIs.
>>>>>>>> Like this, maybe.
>>>>>>>>
>>>>>>>>
>>>>>>>> Content:
>>>>>>>> - No common standard for expressing provenance information that
>>>>>>>> captures processes as well as the other content dimensions.
>>>>>>>> - No guidance for how existing standards can be put together to
>>>>>>>> provide provenance (e.g. linking to identity).
>>>>>>>>
>>>>>>>> APIs (or protocols):
>>>>>>>> - No common API for obtaining/querying provenance information
>>>>>>>> - No guidance for how application developers should go about
>>>>>>>> exposing provenance in their web systems.
>>>>>>>> - No well-defined standard for linking provenance between sites
>>>>>>>> (i.e. trackback but for the whole web).
>>>>>>>>
>>>>>>>>
>>>>>>>> I also wondered whether they should be structured according to the
>>>>>>>> provenance dimensions (so instead of API, break
>>>>>>>> this into Use/Management).
>>>>>>>>
>>>>>>>> Luc
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 08/02/2010 12:04 PM, Paul Groth wrote:
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> As discussed at last week's telecon, I came up with some ideas
>>>>>>>>> about the gaps necessary to realize the News Aggregator Scenario.
>>>>>>>>> I've put these in the wiki and I append them below to help start
>>>>>>>>> the discussion. Let me know what you think.
>>>>>>>>>
>>>>>>>>> Gap Analysis- News Aggregator
>>>>>>>>>
>>>>>>>>> For each step within the News Aggregator scenario, there are
>>>>>>>>> existing technologies or relevant research that could solve that
>>>>>>>>> step. For example, one can properly insert licensing information
>>>>>>>>> into a photo using a Creative Commons license and the Extensible
>>>>>>>>> Metadata Platform. One can track the origin of tweets either
>>>>>>>>> through retweets or using some extraction technologies within
>>>>>>>>> Twitter. However, the problem is that across multiple sites there
>>>>>>>>> is no common format and API to access and understand provenance
>>>>>>>>> information, whether it is explicitly or implicitly determined. To
>>>>>>>>> inquire about retweets or inquire about trackbacks, one needs to
>>>>>>>>> use different APIs and understand different formats. Furthermore,
>>>>>>>>> there is no (widely deployed) mechanism to point to provenance
>>>>>>>>> information on another site. For example, once a tweet is traced
>>>>>>>>> to the end of Twitter, there is no way to follow where that tweet
>>>>>>>>> came from.
>>>>>>>>>
>>>>>>>>> Systems largely do not document the software by which changes were
>>>>>>>>> made to data and what those pieces of software did to data.
>>>>>>>>> However, there are existing technologies that allow this to be
>>>>>>>>> done. For example, in a domain specific setting, XMP allows the
>>>>>>>>> transformations of images to be documented. More general formats
>>>>>>>>> such as OPM and PML allow this to be expressed but are not
>>>>>>>>> currently widely deployed.
>>>>>>>>>
>>>>>>>>> Finally, while many sites provide for identity and there are
>>>>>>>>> several widely deployed standards for identity (OpenID), there are
>>>>>>>>> no existing mechanisms for tying identity to objects or provenance
>>>>>>>>> traces. This directly ties to the attribution of objects and
>>>>>>>>> provenance.
>>>>>>>>>
>>>>>>>>> Summing up, there are 4 existing gaps to realizing the News
>>>>>>>>> Aggregator scenario:
>>>>>>>>>
>>>>>>>>> - No common standard to target for exposing and expressing
>>>>>>>>> provenance information that captures processes as well as the
>>>>>>>>> other content dimensions.
>>>>>>>>> - No well-defined standard for linking provenance between sites
>>>>>>>>> (i.e. trackback but for the whole web).
>>>>>>>>> - No guidance for how existing standards can be put together to
>>>>>>>>> provide provenance (e.g. linking to identity).
>>>>>>>>> - No guidance for how application developers should go about
>>>>>>>>> exposing provenance in their web systems.
Received on Friday, 6 August 2010 11:42:46 UTC