Re: gap analysis (input regarding PML) from Paulo Pinheiro da Silva on 2010-08-05 (public-xg-prov@w3.org from August 2010)

From: Paulo Pinheiro da Silva <paulo@utep.edu>
Date: Thu, 5 Aug 2010 12:30:14 -0600
To: Luc Moreau <L.Moreau@ecs.soton.ac.uk>, Paul Groth <pgroth@gmail.com>
CC: "public-xg-prov@w3.org" <public-xg-prov@w3.org>, Jitin Arora <jarora@miners.utep.edu>, Tim Lebo <lebot@rpi.edu>, "Deborah L. McGuinness" <dlm@cs.rpi.edu>
Message-ID: <4C5B0336.80003@utep.edu>
Paul-- Thank you very much for your message.

All-- I agree with Paul’s statement that there are no well-established 
guidelines for using/adopting provenance solutions, and this is a part of 
his message on which I would like to see further discussion.
I like Luc’s suggestion of discussing these gaps in terms of queries. 
For instance, if you go to

	http://trust.utep.edu/sparql-pml/query/example

you will see a large collection of sparql-pml queries answering many of 
the questions that require bridging the gaps identified in Paul’s 
message. Please note that the queries at the URL above are standard 
SPARQL queries based on the use of the PML vocabulary. The results shown at 
that URL come from a repository of PML provenance knowledge in the 
domains of earth science (using actual NSF Earthscope and IRIS data in 
support of seismology and USGS data in support of earth magnetism), 
astronomy (using actual NCAR data in support of space weather), and 
logical proofs in support of TPTP.  Anyone can go to 
http://trust.utep.edu/sparql-pml/query/index and write their own queries 
or use the basic or advanced user interface 
(http://trust.utep.edu/sparql-pml/search/index). [The SPARQL-PML queries 
have been developed by Jitin Arora (http://trust.utep.edu/~jarora/).]
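
To give a flavour of what a query of this kind looks like, here is a 
minimal Python sketch that assembles a SPARQL query over PML terms and 
encodes it as a SPARQL Protocol GET request. It is not taken from the 
examples page. The endpoint URL is hypothetical, and the namespace URI 
and property names (hasConclusion, isConsequentOf, hasInferenceRule) are 
assumptions that should be checked against the PML ontologies. Nothing 
is sent over the network.

```python
from urllib.parse import urlencode

# Assumed namespace for PML-Justification; verify against the published ontology.
PMLJ = "http://inference-web.org/2.0/pml-justification.owl#"

# Hypothetical endpoint: the message gives a query form, not a protocol endpoint.
ENDPOINT = "http://example.org/sparql-pml/endpoint"


def build_query(limit=10):
    """Return a SPARQL query listing conclusions and the rules that derived them."""
    return (
        f"PREFIX pmlj: <{PMLJ}>\n"
        "SELECT ?nodeset ?conclusion ?rule WHERE {\n"
        "  ?nodeset pmlj:hasConclusion ?conclusion ;\n"
        "           pmlj:isConsequentOf ?step .\n"
        "  ?step pmlj:hasInferenceRule ?rule .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def request_url(endpoint, query):
    """Encode the query as a SPARQL Protocol GET request (the 'query' parameter)."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    print(request_url(ENDPOINT, build_query(limit=5)))
```

The same query text could equally be pasted into the query form 
mentioned above; only the vocabulary is PML-specific.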

I would like to emphasize two aspects of PML that may need to be 
highlighted so that the group can further appreciate our work and 
understand how PML bridges the technical gaps in Paul’s message:

       1) PML is a collection of three ontologies: PML-Provenance (or 
PML-P), PML-Justification (PML-J) and PML-Trust (PML-T). Most of the 
provenance concepts in OPM map into concepts described in the PML-J 
ontology. This means that most of the elements in PML-P are 
concepts not covered in OPM. I will go further and say that many of 
these concepts have the role of tying artifacts to sources as identified 
in Paul’s message;

       2) If you revisit our publications, for instance [1], you will 
see that PML is just one component (the language component) of a bigger 
infrastructure called Inference Web (http://inference-web.org).

In fact, most of the concerns highlighted by Luc in his message about 
having a well-defined API, services and other infrastructural features 
in support of provenance are exactly the kinds of things that one should 
be able to see in the Inference Web.

With (1) and (2) in mind, I would like to stress one point: most of the 
provenance infrastructure mentioned in (2) is in support of PML-P. In 
fact, PML-P is the part of the provenance that gets reused across 
multiple justification traces and as such needs to be discovered, 
aligned, augmented, etc.  Further, one of our major mistakes was to put 
a lot of effort into trying to come up with a registration mechanism for 
PML-P documents called IW-Base [2]. Later on, after a meeting with Tim 
Berners-Lee and his W3C team, we learned that we would need to 
distribute this approach, which is why we developed an Inference Web 
search mechanism for provenance called IWSearch [3].  Again, anyone can 
try IWSearch at http://onto.rpi.edu/iwsearch/

I would like to say that PML and Inference Web were developed from day 1 
to support "linking provenance between sites (i.e. trackback but for the 
whole web)". That is the reason why PML has always had the following 
properties:

a) PML identifiers are URIs
b) PML content is in RDF/OWL (used to be in DAML+OIL before OWL)
c) PML justifications are combinable/decomposable [4]
d) RDF/OWL links are used to connect PML documents
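
Properties (a) to (d) are what make cross-site traversal possible: 
because every node has a URI and documents link to one another, a client 
can follow a justification from one site to the next and merge what it 
finds. The Python sketch below mimics this with an in-memory stand-in 
for the Web. The document URIs, the isDerivedFrom link name and the 
fetch helper are all hypothetical and are not part of PML.

```python
# In-memory stand-in for the Web: document URI -> list of (subject, predicate, object).
# All URIs and the predicate name are hypothetical, for illustration only.
LINK = "http://example.org/vocab#isDerivedFrom"

WEB = {
    "http://site-a.example/doc1": [
        ("http://site-a.example/doc1#answer", LINK, "http://site-b.example/doc2#result"),
    ],
    "http://site-b.example/doc2": [
        ("http://site-b.example/doc2#result", LINK, "http://site-c.example/doc3#raw"),
    ],
    "http://site-c.example/doc3": [],
}


def fetch(doc_uri):
    """Pretend to dereference a document URI; a real client would do an HTTP GET."""
    return WEB.get(doc_uri, [])


def trace(node_uri):
    """Follow derivation links across documents, returning the node URIs visited."""
    chain, seen = [node_uri], {node_uri}
    frontier = [node_uri]
    while frontier:
        current = frontier.pop()
        doc_uri = current.split("#", 1)[0]  # the document that describes this node
        for subject, predicate, obj in fetch(doc_uri):
            if subject == current and predicate == LINK and obj not in seen:
                seen.add(obj)
                chain.append(obj)
                frontier.append(obj)
    return chain


def combine(*doc_uris):
    """Merge the triples of several documents into one set (duplicates collapse)."""
    return {triple for uri in doc_uris for triple in fetch(uri)}
```

Calling trace("http://site-a.example/doc1#answer") walks from site A 
through site B to site C, which is the "trackback for the whole web" 
behaviour in miniature; combine illustrates property (c) as a plain set 
union of triples.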

Another point that I would like to make is that the PML-P part of PML is 
the one where we connect to many other well-known pieces of information 
that we have discussed in this group. For instance, when it comes to 
publications, PML-P defines a publication as a kind of information 
source and is where we connect PML to Dublin Core attributes for 
publications.
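
As a rough illustration of that connection, the Python sketch below 
emits N-Triples that type a resource as a publication and attach Dublin 
Core title, creator and date values to it. The Dublin Core element 
namespace is the standard one; the PML-P namespace URI, the class name 
Publication and the subject URI are assumptions for illustration and 
should be checked against the PML-P ontology.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DC = "http://purl.org/dc/elements/1.1/"
# Assumed PML-P namespace and class name; check the ontology before relying on them.
PMLP = "http://inference-web.org/2.0/pml-provenance.owl#"


def literal(text):
    """Serialize a plain string as an N-Triples literal."""
    escaped = (
        text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"'


def describe_publication(uri, title, creators, date):
    """Return N-Triples lines typing `uri` as a publication with Dublin Core metadata."""
    lines = [f"<{uri}> <{RDF_TYPE}> <{PMLP}Publication> ."]
    lines.append(f"<{uri}> <{DC}title> {literal(title)} .")
    for creator in creators:
        lines.append(f"<{uri}> <{DC}creator> {literal(creator)} .")
    lines.append(f"<{uri}> <{DC}date> {literal(date)} .")
    return lines


if __name__ == "__main__":
    # Metadata taken from reference [1]; the subject URI is hypothetical.
    for line in describe_publication(
        "http://example.org/pubs/inference-web-approach",
        "Explaining Answers from the Semantic Web: The Inference Web Approach",
        ["Deborah L. McGuinness", "Paulo Pinheiro da Silva"],
        "2004",
    ):
        print(line)
```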

As you can see, I have reasons to be uncomfortable with statements that 
there is no language "for expressing provenance information that 
captures processes as well as the other content dimensions", no "API for 
obtaining/querying provenance information", and no standard "for linking 
provenance between sites (i.e. trackback but for the whole web)".

Regarding this month of August, I am unfortunately unable to attend the 
meetings this week and next week (I will be flying during the time of the 
meetings). Also, I believe Deborah will not be able to attend either, for 
personal reasons. Thus, I am asking Tim Lebo from RPI to represent us 
and to collect any requests you may have regarding PML so that we can 
address them later in case Tim cannot answer your questions right away.

Many thanks,
Paulo.

[the publications below are part of the provenance collection in Mendeley]

[1] Deborah L. McGuinness and Paulo Pinheiro da Silva. Explaining 
Answers from the Semantic Web: The Inference Web Approach. Journal of 
Web Semantics, Vol. 1 No. 4. October 2004, pages 397-413.

[2] Deborah L. McGuinness, Paulo Pinheiro da Silva, Cynthia Chang. 
IWBase: Provenance Metadata Infrastructure for Explaining and Trusting 
Answers from the Web. Technical Report KSL-04-07, Knowledge Systems 
Laboratory, Stanford University, USA, 2004.

[3] Paulo Pinheiro da Silva, Geoff Sutcliffe, Cynthia Chang, Li Ding, 
Nick del Rio and Deborah McGuinness. Presenting TSTP Proofs with 
Inference Web Tools. In Proceedings of IJCAR '08 Workshop on Practical 
Aspects of Automated Reasoning (PAAR-2008), August 2008, Sydney, Australia.

[4] Paulo Pinheiro da Silva and Deborah L. McGuinness. Combinable Proof 
Fragments for the Web. Technical Report KSL-03-04, Knowledge Systems 
Laboratory, Stanford University, USA, 2003.

([3] is not a paper specifically about IWSearch, although it briefly 
describes the tool.)

> Thanks Paul for this proposal for the gap analysis.
> Twice you mention 'exposing' and I thought we could introduce 'querying'
> provenance too.
>
> Also, maybe the gaps could be structured in content vs apis.
> Like this, maybe.
>
>
> Content:
> - No common standard for expressing provenance information that captures
> processes as well as the other content dimensions.
> - No guidance for how existing standards can be put together to provide
> provenance (e.g. linking to identity).
>
> APIs (or protocols):
> - No common API for obtaining/querying provenance information
> - No guidance for how application developers should go about exposing
> provenance in their web systems.
> - No well-defined standard for linking provenance between sites (i.e.
> trackback but for the whole web).
>
>
> I also wondered whether they should be structured according to the
> provenance dimensions (so instead of API, break
> this into Use/Management).
>
> Luc
>
>
>
> On 08/02/2010 12:04 PM, Paul Groth wrote:
>> Hi All,
>>
>> As discussed at last week's telecon, I came up with some ideas about
>> the gaps necessary to realize the News Aggregator Scenario. I've put
>> these in the wiki and I append them below to help start the
>> discussion. Let me know what you think.
>>
>> Gap Analysis- News Aggregator
>>
>> For each step within the News Aggregator scenario, there are existing
>> technologies or relevant research that could solve that step. For
>> example, one can properly insert licensing information into a photo
>> using a creative commons license and the Extensible Metadata Platform.
>> One can track the origin of tweets either through retweets or using
>> some extraction technologies within twitter. However, the problem is
>> that across multiple sites there is no common format and api to access
>> and understand provenance information whether it is explicitly or
>> implicitly determined. To inquire about retweets or inquire about
>> trackbacks one needs to use different apis and understand different
>> formats. Furthermore, there is no (widely deployed) mechanism to point
>> to provenance information on another site. For example, once a tweet
>> is traced to the end of twitter there is no way to follow where that
>> tweet came from.
>>
>> Systems largely do not document the software by which changes were
>> made to data and what those pieces of software did to data. However,
>> there are existing technologies that allow this to be done. For
>> example, in a domain specific setting, XMP allows the transformations
>> of images to be documented. More general formats such as OPM, and PML
>> allow this to be expressed but are not currently widely deployed.
>>
>> Finally, while many sites provide for identity and there are several
>> widely deployed standards for identity (OpenId), there are no existing
>> mechanisms for tying identity to objects or provenance traces. This
>> directly ties to the attribution of objects and provenance.
>>
>> Summing up there are 4 existing gaps to realizing the News Aggregator
>> scenario:
>>
>> - No common standard to target for exposing and expressing provenance
>> information that captures processes as well as the other content
>> dimensions.
>> - No well-defined standard for linking provenance between sites (i.e.
>> trackback but for the whole web).
>> - No guidance for how existing standards can be put together to
>> provide provenance (e.g. linking to identity).
>> - No guidance for how application developers should go about exposing
>> provenance in their web systems.
>>
>>
>>
>
Received on Thursday, 5 August 2010 18:30:50 UTC