Re: gap analysis (input regarding PML) from Paulo Pinheiro da Silva on 2010-08-06 (public-xg-prov@w3.org from August 2010)

From: Paulo Pinheiro da Silva <paulo@utep.edu>
Date: Fri, 6 Aug 2010 03:13:10 -0600
To: Luc Moreau <L.Moreau@ecs.soton.ac.uk>
CC: Paul Groth <pgroth@gmail.com>, "public-xg-prov@w3.org" <public-xg-prov@w3.org>, "Arora, Jitin BTE" <jarora@miners.utep.edu>, Tim Lebo <lebot@rpi.edu>, "Deborah L. McGuinness" <dlm@cs.rpi.edu>
Message-ID: <4C5BD226.9040602@utep.edu>
Dear Luc and Paul,

Thank you very much for your comments.

I consider the connection between derivation traces and information 
sources to be essential to support some of the identified requirements 
for a “common” representation of provenance. For instance, I cannot see 
how provenance can be used to support trust recommendations and result 
understanding if one cannot use provenance to trace back information 
sources used to derive a given result.

In this case, I would like to refer back to my original message where I 
identified some aspects of PML that I believe are not covered in OPM, 
including in this case a supporting infrastructure.

As a response to my message, Paul cited a European provenance project 
that according to Luc is based on p-structure, which pre-dates OPM. The 
information that p-structure pre-dates OPM, however, is not enough for 
me to know whether p-structure provides a solution for connecting 
derivation traces to information sources that I cannot see in OPM or, if 
it does, why the connection was not propagated to OPM. Also, it does not 
clarify the relation between p-structure/OPM and the technical issues in 
the original gap analysis.

So, one may ask what the relevance of my questions above is, 
considering that we should focus on a common representation for 
provenance. I would like to remind the group of its effort to map other 
provenance notations to OPM, which somehow implies that OPM provides the 
basic constructs to support a common provenance notation. So, which 
other important provenance aspects are we leaving out in our mapping? 
Also, how can we create a common understanding of provenance aspects not 
covered in OPM?

Unfortunately, I will not be able to attend the meeting this week. Still, 
I am very much interested in the answers to my questions.

Many thanks,
Paulo.

> Paulo,
>
> The papers that Paul cited can be downloaded from www.pasoa.org (they
> are also on Mendeley).
> The p-structure model pre-dates OPM.
>
> Paul's point (I believe) is that many of the issues raised in the gap
> analysis were
> addressed by this model, but it is not widely deployed.
>
> Luc
>
> On 05/08/10 21:30, Paulo Pinheiro da Silva wrote:
>> Hi Paul,
>> Thank you very much for your prompt response.
>>
>> I am glad to see that we all agree that there is an urgent need for a
>> common provenance standard. My understanding is that the incubator
>> group is paving the way for the development of such a standard.
>>
>> Regarding your message, I would like to better understand the technical
>> aspects of your gap analysis and to learn from it.  So, following your
>> mention of the European provenance project please let me know the
>> following:
>>
>>     1) Are you saying that one cannot see the technical issues of your
>> gap analysis in the European provenance project?
>>
>>     2) If the answer to (1) is yes, how can we learn from this project?
>>
>>     3) Is OPM used in the European project?
>>
>>     4) If the answer to (3) is yes, I would like to understand how OPM
>> artifacts are tied to sources, how sources are identified, and how
>> provenance information about sources is represented;
>>
>>     5) If the answer to (3) is no, which provenance representation
>> language is used?
>>
>> In other words, I need to better understand the gap analysis (viz., the
>> points behind the analysis) and I believe we should not start from the
>> assumption that we don’t know anything about provenance (that appears to
>> be the motivation for us to write a related work section).
>>
>> Many thanks,
>> Paulo.
>>
>> On 8/5/2010 1:08 PM, Paul Groth wrote:
>>> Hi Paulo,
>>>
>>> Thanks for the message. I think the important thing here is the word
>>> "common" in what I wrote. By way of illustration...
>>>
>>> As part of the EU Provenance Project [1], we also designed and
>>> implemented an Architecture for Provenance Systems [2, 3]. This
>>> architecture included a data model, the p-structure [4] that allowed for
>>> the distributed linking and storage of provenance. It specified
>>> protocols for querying provenance information [5,6] and recording it as
>>> well. This was designed to work in a scalable setting [7].
>>>
>>> Obviously, I could go into more detail; this little description is just
>>> to point out that I _agree_ with you that there are solutions for many
>>> of these problems. However, these solutions are _not_ common and widely
>>> deployed. Where widely deployed = things like trackbacks, HTML, and
>>> probably Dublin Core and RDFa. The point is that while solutions exist
>>> within the research community (and some in business), they are by no
>>> means common or standard.
>>>
>>> This is exactly why, personally, I think the W3C should have a standards
>>> committee devoted to provenance. There are enough commonalities between
>>> provenance technologies that having a standard would help push adoption
>>> of provenance on the Web. Furthermore, without a standard it is
>>> difficult to effectively implement something like the News Aggregator
>>> Scenario over the whole of the web.
>>>
>>> Cheers,
>>> Paul
>>>
>>>
>>> [1] http://www.gridprovenance.org
>>> [2] http://eprints.ecs.soton.ac.uk/13216/
>>> [3] Moreau, Luc and Groth, Paul and Miles, Simon and Vazquez, Javier and
>>> Jiang, Sheng and Munroe, Steve and Rana, Omer and Schreiber, Andreas and
>>> Tan, Victor and Varga, Laszlo (2008) The Provenance of Electronic Data.
>>> Communications of the ACM, 51 (4). pp. 52-58.
>>> [4] Paul Groth, Simon Miles, and Luc Moreau. A Model of Process
>>> Documentation to Determine Provenance in Mash-ups. Transactions on
>>> Internet Technology (TOIT), 9(1):1-31, 2009.
>>> [5] Simon Miles, Paul Groth, Steve Munroe, Sheng Jiang, Thibaut
>>> Assandri, and Luc Moreau. Extracting Causal Graphs from an Open
>>> Provenance Data Model. Concurrency and Computation: Practice and
>>> Experience, 2007.
>>> [6] Miles, Simon (2006) Electronically Querying for the Provenance of
>>> Entities. In: Proceedings of the International Provenance and Annotation
>>> Workshop, May 2006, Chicago, USA.
>>> [7] Groth, Paul and Miles, Simon and Fang, Weijian and Wong, Sylvia C.
>>> and Moreau, Luc (2005) Recording and Using Provenance in a Protein
>>> Compressibility Experiment. In: Proceedings of the 14th IEEE
>>> International Symposium on High Performance Distributed Computing (HPDC
>>> 2005). Item not available online.
>>>
>>>
>>> Paulo Pinheiro da Silva wrote:
>>>> Paul-- Thank you very much for your message.
>>>>
>>>> All-- I agree with Paul’s statement that there are no well-established
>>>> guidelines for using/adopting provenance solutions, and this is a part
>>>> of his message on which I would like to see further discussion.
>>>> I like Luc’s suggestion of discussing these gaps in terms of queries.
>>>> For instance, if you go to
>>>>
>>>>       http://trust.utep.edu/sparql-pml/query/example
>>>>
>>>> you will see a large collection of sparql-pml queries answering many
>>>> of the questions that require bridging the gaps identified in Paul’s
>>>> message. Please note that the queries in the URL above are standard
>>>> SPARQL queries based on the use of PML vocabulary. The results used in
>>>> the URL come from a repository of PML provenance knowledge in the
>>>> domains of earth science (using actual NSF Earthscope and IRIS data in
>>>> support of seismology and USGS data in support of earth magnetism),
>>>> astronomy (using actual NCAR data in support of space weather), and
>>>> logical proofs in support of TPTP. Anyone can actually go to
>>>> http://trust.utep.edu/sparql-pml/query/index and write their own
>>>> queries or use the basic or advanced user interface
>>>> (http://trust.utep.edu/sparql-pml/search/index). [the SPARQL-PML
>>>> queries have been developed by Jitin Arora
>>>> (http://trust.utep.edu/~jarora/)]
>>>>
>>>> I would like to emphasize two aspects of PML that may need to be
>>>> highlighted so that the group can further appreciate our work and
>>>> understand how PML bridges the technical gaps in Paul’s message:
>>>>
>>>>         1) PML is a collection of three ontologies: PML-Provenance (or
>>>> PML-P), PML-Justification (PML-J) and PML-Trust (PML-T). In this
>>>> case, most of the provenance concepts in OPM map into concepts
>>>> described in the PML-J ontology. This means that most of the elements in
>>>> PML-P are concepts not covered in OPM. I will go further and say that
>>>> many of these concepts have the role of tying artifacts to sources as
>>>> identified in Paul’s message;
>>>>
>>>>         2) If you revisit our publications, for instance [1], you will
>>>> see that PML is just a component (the language component) of a bigger
>>>> infrastructure called Inference Web
>>>> http://inference-web.org
>>>>
>>>> In fact, most of the concerns highlighted by Luc in his message about
>>>> having a well-defined API, services and other infrastructural features
>>>> in support of provenance are exactly the kinds of things that one
>>>> should be able to see in the Inference Web.
>>>>
>>>> With (1) and (2) in mind, I would like to stress one point: most of
>>>> the provenance infrastructure mentioned in (2) is in support of PML-P.
>>>> In fact, PML-P is the part of the provenance that gets reused across
>>>> multiple justification traces and as such needs to be discovered,
>>>> aligned, augmented, etc. Further, one of our major mistakes was to
>>>> put a lot of effort into trying to come up with a registration mechanism
>>>> for PML-P documents called IW-Base [2]. Later on, after a meeting with
>>>> Tim Berners-Lee and his W3C team, we learned that we would need to
>>>> distribute this approach, which is why we developed an Inference Web
>>>> search mechanism for provenance called IWSearch [3]. Again, anyone
>>>> can try IWSearch at http://onto.rpi.edu/iwsearch/
>>>>
>>>> I would like to say that PML and Inference Web were developed from day
>>>> 1 to support "linking provenance between sites (i.e. trackback but for
>>>> the whole web)". That is the reason why PML has always had the
>>>> following properties:
>>>>
>>>> a) PML identifiers are URIs
>>>> b) PML content is in RDF/OWL (used to be in DAML+OIL before OWL)
>>>> c) PML justifications are combinable/decomposable [4]
>>>> d) RDF/OWL links are used to connect PML documents
>>>>
>>>> Another point that I would like to make is that the PML-P part of PML
>>>> is the one where we connect to many other well-known pieces of
>>>> information that we have discussed in this group. For instance, when
>>>> it comes to publications, PML-P defines a publication as a kind of
>>>> information source and is where we connect PML to Dublin Core
>>>> attributes for publications.
>>>>
>>>> As you see, I have reasons to be uncomfortable with statements that
>>>> there is no language “for expressing provenance information that
>>>> captures processes as well as the other content dimensions” or "API
>>>> for obtaining/querying provenance information" or "for linking
>>>> provenance between sites (i.e. trackback but for the whole web)".
>>>>
>>>> Regarding this month of August, I am unfortunately unable to attend
>>>> the meetings this week and next week (I will be flying during the time of
>>>> the meetings). Also, I believe Deborah will not attend either, due to
>>>> personal reasons. Thus, I am asking Tim Lebo from RPI to represent us
>>>> and to collect any requests you may have regarding PML so that we can
>>>> address them later in case Tim cannot answer your questions right away.
>>>>
>>>> Many thanks,
>>>> Paulo.
>>>>
>>>> [the publications below are part of the provenance collection in
>>>> Mendeley]
>>>>
>>>> [1] Deborah L. McGuinness and Paulo Pinheiro da Silva. Explaining
>>>> Answers from the Semantic Web: The Inference Web Approach. Journal of
>>>> Web Semantics, Vol. 1 No. 4. October 2004, pages 397-413.
>>>>
>>>> [2] Deborah L. McGuinness, Paulo Pinheiro da Silva, Cynthia Chang.
>>>> IWBase: Provenance Metadata Infrastructure for Explaining and Trusting
>>>> Answers from the Web. Technical Report KSL-04-07, Knowledge Systems
>>>> Laboratory, Stanford University, USA, 2004.
>>>>
>>>> [3] Paulo Pinheiro da Silva, Geoff Sutcliffe, Cynthia Chang, Li Ding,
>>>> Nick del Rio and Deborah McGuinness. Presenting TSTP Proofs with
>>>> Inference Web Tools. In Proceedings of IJCAR '08 Workshop on Practical
>>>> Aspects of Automated Reasoning (PAAR-2008), August 2008, Sydney,
>>>> Australia.
>>>>
>>>> [4] Paulo Pinheiro da Silva and Deborah L. McGuinness. Combinable
>>>> Proof Fragments for the Web. Technical Report KSL-03-04, Knowledge
>>>> Systems Laboratory, Stanford University, USA, 2003.
>>>>
>>>> ([3] is not a paper specifically about IWSearch, although it briefly
>>>> describes the tool)
>>>>
>>>>> Thanks Paul for this proposal for the gap analysis.
>>>>> Twice you mention 'exposing' and I thought we could introduce
>>>>> 'querying' provenance too.
>>>>>
>>>>> Also, maybe the gaps could be structured in content vs apis.
>>>>> Like this, maybe.
>>>>>
>>>>>
>>>>> Content:
>>>>> - No common standard for expressing provenance information that
>>>>> captures processes as well as the other content dimensions.
>>>>> - No guidance for how existing standards can be put together to
>>>>> provide provenance (e.g. linking to identity).
>>>>>
>>>>> APIs (or protocols):
>>>>> - No common API for obtaining/querying provenance information
>>>>> - No guidance for how application developers should go about exposing
>>>>> provenance in their web systems.
>>>>> - No well-defined standard for linking provenance between sites (i.e.
>>>>> trackback but for the whole web).
>>>>>
>>>>>
>>>>> I also wondered whether they should be structured according to the
>>>>> provenance dimensions (so instead of API, break
>>>>> this into Use/Management).
>>>>>
>>>>> Luc
>>>>>
>>>>>
>>>>>
>>>>> On 08/02/2010 12:04 PM, Paul Groth wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> As discussed at last week's telecon, I came up with some ideas about
>>>>>> the gaps necessary to realize the News Aggregator Scenario. I've put
>>>>>> these in the wiki and I append them below to help start the
>>>>>> discussion. Let me know what you think.
>>>>>>
>>>>>> Gap Analysis- News Aggregator
>>>>>>
>>>>>> For each step within the News Aggregator scenario, there are existing
>>>>>> technologies or relevant research that could solve that step. For
>>>>>> example, one can properly insert licensing information into a photo
>>>>>> using a Creative Commons license and the Extensible Metadata
>>>>>> Platform. One can track the origin of tweets either through retweets
>>>>>> or using some extraction technologies within Twitter. However, the
>>>>>> problem is that across multiple sites there is no common format and
>>>>>> API to access and understand provenance information, whether it is
>>>>>> explicitly or implicitly determined. To inquire about retweets or
>>>>>> inquire about trackbacks one needs to use different APIs and
>>>>>> understand different formats. Furthermore, there is no (widely
>>>>>> deployed) mechanism to point to provenance information on another
>>>>>> site. For example, once a tweet is traced to the end of Twitter there
>>>>>> is no way to follow where that tweet came from.
>>>>>>
>>>>>> Systems largely do not document the software by which changes were
>>>>>> made to data and what those pieces of software did to data. However,
>>>>>> there are existing technologies that allow this to be done. For
>>>>>> example, in a domain specific setting, XMP allows the transformations
>>>>>> of images to be documented. More general formats such as OPM, and PML
>>>>>> allow this to be expressed but are not currently widely deployed.
>>>>>>
>>>>>> Finally, while many sites provide for identity and there are several
>>>>>> widely deployed standards for identity (OpenID), there are no
>>>>>> existing mechanisms for tying identity to objects or provenance
>>>>>> traces. This directly ties to the attribution of objects and
>>>>>> provenance.
>>>>>>
>>>>>> Summing up there are 4 existing gaps to realizing the News Aggregator
>>>>>> scenario:
>>>>>>
>>>>>> - No common standard to target for exposing and expressing provenance
>>>>>> information that captures processes as well as the other content
>>>>>> dimensions.
>>>>>> - No well-defined standard for linking provenance between sites (i.e.
>>>>>> trackback but for the whole web).
>>>>>> - No guidance for how existing standards can be put together to
>>>>>> provide provenance (e.g. linking to identity).
>>>>>> - No guidance for how application developers should go about exposing
>>>>>> provenance in their web systems.
Received on Friday, 6 August 2010 09:13:44 UTC