Re: news aggregation scenario and PML from Paul Groth on 2010-07-26 (public-xg-prov@w3.org from July 2010)

From: Paul Groth <pgroth@gmail.com>
Date: Mon, 26 Jul 2010 10:28:10 +0200
To: Paulo Pinheiro da Silva <paulo@utep.edu>
CC: "public-xg-prov@w3.org" <public-xg-prov@w3.org>
Message-ID: <4C4D471A.9000800@gmail.com>

Hi Paulo,

Thanks for the explanation and the pointers. This will be useful in 
doing the gap analysis. In general, I have the feeling that what makes 
the news aggregator scenario hard is that the systems involved are not 
all controlled by the same user/organization and don't all use the same 
technology. So I guess the problem is not necessarily one of technology 
but more one of interoperability. (I may be wrong on that, but you 
provide some evidence that I may be right.)

Anyway, thanks for taking the time,
Paul


Paulo Pinheiro da Silva wrote:
> Hi All,
>
> Sorry for the long message below but I would like to move beyond the 
> tags and to further describe the connection of PML and PML-related 
> publications to the news aggregation scenario.
>
> The news aggregation scenario says that “many web users would like to
> have mechanisms to automatically determine whether a web document or 
> resource can be used, based on the original source of the content.” 
> This exact claim is part of the collaborative research that Stanford, 
> IBM and Pacific Northwest National Lab developed in the period between 
> 2003 and 2005 as briefly described in this IBM report:
>
>       http://www.research.ibm.com/UIMA/SUKI/index.html
>
> This project was about aggregation of news from a given domain (e.g., 
> news about a “#panda being moved from Chicago Zoo to Florida”) but 
> also about extracting knowledge from this corpus of news articles and 
> using the extracted knowledge to answer complex questions in the 
> domain of discussion.
>
> In terms of PML, our goal was to encode the provenance of every piece 
> of extracted, derived knowledge and to be able to always track back to 
> the original news articles. This approach is in line with the news 
> aggregation scenario that “wants to ensure that the news that it 
> aggregates are correctly attributed to the right person so that they 
> may receive credit.”
>
> PML was used to capture the following provenance information:
>   1) How spans of text were extracted from sources on the web;
>   2) How knowledge was extracted from the spans of text;
>   3) How knowledge was aggregated (for example, dealing with 
> co-resolution of identified entities within documents and across 
> documents);
>   4) How knowledge was used to derive answers for complex questions 
> (for example, explaining the decision of moving the panda from Chicago 
> Zoo to Florida);
>   5) More importantly, how information flowed from unstructured, 
> asserted text to structured, derived data.
>
> For example, using another corpus in another domain, we asked ‘Who is 
> the manager of the Mississippi Automated System Project?’ and it was 
> answered that ‘Julian Allen is the director of the project’.  The 
> provenance of the answer is encoded in PML and presented in IWBrowser, 
> a web-based PML browser. The link below is going to show you the 
> provenance trace (you may need to scroll around to see the entire trace).
>
> http://browser.inference-web.org/iwbrowser/NodeSetBrowser?w=1600&mg=999&st=Dag&fm=Raw&url=http%3A%2F%2Finference-web.org%2Fproofs%2FMississippiAutomatedSystem%2Fns36.owl%23ns36 
>
>
> The provenance shows how the answer that ‘Julian Allen was the 
> director’ was derived step by step through a bunch of information 
> extraction and integration tools. It is relevant to mention that this 
> provenance was captured from IBM UIMA where different information 
> extraction technologies would compete to produce the best answer for a 
> given question. Furthermore, the information extraction technologies 
> would sometimes produce different and even conflicting answers for 
> the question. This was one of the interesting aspects of the process 
> for the intelligence community, and it would have been a challenge 
> for a provenance language had PML not been ready to accommodate 
> alternative explanations.
>
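The kind of provenance described above (derived answers that can be traced back to the original articles, with room for competing derivations of the same answer) can be sketched roughly as follows. This is a minimal illustration only and is not PML: the class names `Statement` and `Derivation`, the tool names, the example text, and the example URL are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Derivation:
    """One way a statement was obtained: a tool applied to antecedents."""
    tool: str  # hypothetical tool name
    antecedents: List["Statement"] = field(default_factory=list)


@dataclass
class Statement:
    """A piece of asserted or derived content."""
    text: str
    source_url: str = ""  # set only for text asserted by a source
    # Several entries here mean several alternative explanations.
    derivations: List[Derivation] = field(default_factory=list)


def original_sources(stmt: Statement) -> Set[str]:
    """Walk every alternative derivation back to the asserting sources."""
    found: Set[str] = set()
    if stmt.source_url:
        found.add(stmt.source_url)
    for derivation in stmt.derivations:
        for antecedent in derivation.antecedents:
            found |= original_sources(antecedent)
    return found


# Hypothetical example data, loosely modeled on the question in the message.
article = Statement("... Julian Allen, director of the project ...",
                    source_url="http://example.org/news/1")
span = Statement("Julian Allen, director of the project",
                 derivations=[Derivation("span-extractor", [article])])
answer = Statement("Julian Allen is the director of the project",
                   derivations=[Derivation("relation-extractor-A", [span]),
                                Derivation("relation-extractor-B", [span])])

print(sorted(original_sources(answer)))
```

Keeping a list of derivations per statement, rather than a single one, is what lets two competing extractors both be recorded as explanations for the same answer.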
> The following ISWC 2006 paper provides an overview of the scenario above:
>
> J. William Murdock, Deborah McGuinness, Paulo Pinheiro da Silva, Chris 
> Welty, and David Ferrucci. Explaining Conclusions from Diverse 
> Knowledge Sources. In Proceedings of the 5th International Semantic 
> Web Conference (ISWC2006), Athens, GA, USA, p. 861-872, November 2006. 
> http://www.cs.utep.edu/paulo/papers/Murdock_ISWC_2006.pdf
>
> An explanation concerning the alignment of multiple processes to 
> explain the common goal of extracting information from unstructured 
> data is available in the paper below:
>
> J. William Murdock, Paulo Pinheiro da Silva, David Ferrucci, 
> Christopher Welty and Deborah L. McGuinness. Encoding Extraction as 
> Inferences. In Proceedings of AAAI Spring Symposium on Metacognition 
> in Computation, AAAI Press, Stanford University, USA, pages 92-97, 
> 2005. http://www.ksl.stanford.edu/people/pp/papers/Murdock_SSS_2005.pdf
>
> The scenario also mentions that “unfortunately for BlogAgg, the source 
> of the information is not often apparent from the data that it 
> aggregates from the web. In particular, it must employ teams of people 
> to check that selected content is both high-quality and can be used 
> legally. The site would like this quality control process to be 
> handled automatically.” Regarding trust, we first point to the 
> following paper that described how one may compute trust based on 
> provenance encoded in PML:
>
> Ilya Zaihrayeu, Paulo Pinheiro da Silva and Deborah L. McGuinness. 
> IWTrust: Improving User Trust in Answers from the Web. In Proceedings 
> of 3rd International Conference on Trust Management (iTrust2005), 
> Springer, Rocquencourt, France, pages 384-392, 2005. 
> http://www.cs.utep.edu/paulo/papers/Zaihrayeu_iTrust_2005.pdf
>
> Now, the exact representation and dimensions of trust to be considered 
> vary; on that point we would refer to the following paper:
>
> Patricia Victor, Chris Cornelis, Martine De Cock, Paulo Pinheiro da 
> Silva. Gradual Trust and Distrust in Recommender Systems. In Fuzzy 
> Sets and Systems 160(10): 1367-1382, 2009. 
> http://www.cs.utep.edu/paulo/papers/Victor_FSS_2007.pdf
>
> I hope you all can see the connections between our work and the news 
> aggregation scenario.
>
> Many thanks,
> Paulo.
Received on Monday, 26 July 2010 08:33:01 UTC