- From: Paulo Pinheiro da Silva <paulo@utep.edu>
- Date: Fri, 23 Jul 2010 17:54:45 -0600
- To: "public-xg-prov@w3.org" <public-xg-prov@w3.org>
Hi All,

Sorry for the long message below, but I would like to move beyond the tags and further describe the connection of PML and PML-related publications to the news aggregation scenario.

The news aggregation scenario says that "many web users would like to have mechanisms to automatically determine whether a web document or resource can be used, based on the original source of the content." This is exactly the concern addressed by the collaborative research that Stanford, IBM, and Pacific Northwest National Lab conducted between 2003 and 2005, briefly described in this IBM report:

http://www.research.ibm.com/UIMA/SUKI/index.html

The project was about aggregating news from a given domain (e.g., news about a panda being moved from the Chicago Zoo to Florida), but also about extracting knowledge from this corpus of news articles and using the extracted knowledge to answer complex questions in the domain of discussion. In terms of PML, our goal was to encode the provenance of every piece of extracted, derived knowledge and to always be able to trace back to the original news articles. This approach is in line with the news aggregation scenario, which "wants to ensure that the news that it aggregates are correctly attributed to the right person so that they may receive credit."

PML was used to capture the following provenance information:

1) How spans of text were extracted from sources on the web;
2) How knowledge was extracted from the spans of text;
3) How knowledge was aggregated (for example, dealing with coreference resolution of identified entities within and across documents);
4) How knowledge was used to derive answers to complex questions (for example, explaining the decision to move the panda from the Chicago Zoo to Florida);
5) Most importantly, how information flowed from unstructured, asserted text to structured, derived data.

For example, using another corpus in another domain, we asked "Who is the manager of the Mississippi Automated System Project?" and the system answered that "Julian Allen is the director of the project." The provenance of the answer is encoded in PML and presented in IWBrowser, a web-based PML browser. The link below shows the provenance trace (you may need to scroll around to see it in full):

http://browser.inference-web.org/iwbrowser/NodeSetBrowser?w=1600&mg=999&st=Dag&fm=Raw&url=http%3A%2F%2Finference-web.org%2Fproofs%2FMississippiAutomatedSystem%2Fns36.owl%23ns36

The provenance shows how the answer that "Julian Allen was the director" was derived step by step by a series of information extraction and integration tools. It is worth mentioning that this provenance was captured from IBM UIMA, where different information extraction technologies competed to produce the best answer for a given question. Furthermore, the extraction technologies sometimes produced different, even conflicting, answers to the same question. This was one of the aspects of the process most interesting to the intelligence community, and it would have been a challenge for a provenance language had PML not been ready to accommodate alternative explanations.
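To make items 1-5 concrete, below is a small, self-contained sketch (in Python, using rdflib) of how one step of the Julian Allen trace might be written down as PML node sets and inference steps. I have simplified the vocabulary here (for example, a single hasAntecedent property instead of PML's antecedent lists, and shortened namespace URIs), so treat the identifiers as illustrative rather than as the exact terms from the PML OWL files:

    from rdflib import Graph, Literal, Namespace, RDF

    # Simplified namespaces; the actual PML OWL files define the exact URIs.
    PMLJ = Namespace("http://inference-web.org/2.0/pml-justification.owl#")
    PMLP = Namespace("http://inference-web.org/2.0/pml-provenance.owl#")
    EX = Namespace("http://example.org/mississippi/")

    g = Graph()
    g.bind("pmlj", PMLJ)
    g.bind("pmlp", PMLP)
    g.bind("ex", EX)

    # The original news article that the span of text came from.
    article = EX.article17
    g.add((article, RDF.type, PMLP.Document))

    # Node set 1: a span of text extracted directly from the article.
    span = EX.ns1
    g.add((span, RDF.type, PMLJ.NodeSet))
    g.add((span, PMLJ.hasConclusion, Literal(
        "... Julian Allen, director of the Mississippi Automated System Project ...")))
    step1 = EX.ns1_step
    g.add((span, PMLJ.isConsequentOf, step1))
    g.add((step1, RDF.type, PMLJ.InferenceStep))
    g.add((step1, PMLJ.hasSourceUsage, article))

    # Node set 2: structured knowledge derived from the span by an
    # extraction engine running inside UIMA.
    answer = EX.ns36
    g.add((answer, RDF.type, PMLJ.NodeSet))
    g.add((answer, PMLJ.hasConclusion,
           Literal("directorOf(JulianAllen, MississippiAutomatedSystemProject)")))
    step2 = EX.ns36_step
    g.add((answer, PMLJ.isConsequentOf, step2))
    g.add((step2, RDF.type, PMLJ.InferenceStep))
    g.add((step2, PMLJ.hasInferenceEngine, EX.uima_relation_extractor))
    g.add((step2, PMLJ.hasAntecedent, span))

    print(g.serialize(format="turtle"))

Note that a node set holds a conclusion together with a set of inference steps, so several alternative justifications can be attached to the same conclusion; this is how the alternative (and possibly conflicting) answers from competing extractors are represented side by side.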
The following ISWC 2006 paper provides an overview of the scenario above:

J. William Murdock, Deborah McGuinness, Paulo Pinheiro da Silva, Chris Welty, and David Ferrucci. Explaining Conclusions from Diverse Knowledge Sources. In Proceedings of the 5th International Semantic Web Conference (ISWC 2006), Athens, GA, USA, pages 861-872, November 2006.
http://www.cs.utep.edu/paulo/papers/Murdock_ISWC_2006.pdf

A discussion of how multiple processes are aligned toward the common goal of extracting information from unstructured data is available in the paper below:

J. William Murdock, Paulo Pinheiro da Silva, David Ferrucci, Christopher Welty, and Deborah L. McGuinness. Encoding Extraction as Inferences. In Proceedings of the AAAI Spring Symposium on Metacognition in Computation, AAAI Press, Stanford University, USA, pages 92-97, 2005.
http://www.ksl.stanford.edu/people/pp/papers/Murdock_SSS_2005.pdf

The scenario also mentions that "unfortunately for BlogAgg, the source of the information is not often apparent from the data that it aggregates from the web. In particular, it must employ teams of people to check that selected content is both high-quality and can be used legally. The site would like this quality control process to be handled automatically."

Regarding trust, we first point to the following paper, which describes how one may compute trust based on provenance encoded in PML:

Ilya Zaihrayeu, Paulo Pinheiro da Silva, and Deborah L. McGuinness. IWTrust: Improving User Trust in Answers from the Web. In Proceedings of the 3rd International Conference on Trust Management (iTrust 2005), Springer, Rocquencourt, France, pages 384-392, 2005.
http://www.cs.utep.edu/paulo/papers/Zaihrayeu_iTrust_2005.pdf

The exact representation and dimensions of trust to be considered vary; for those, we refer to the following paper:

Patricia Victor, Chris Cornelis, Martine De Cock, and Paulo Pinheiro da Silva. Gradual Trust and Distrust in Recommender Systems. Fuzzy Sets and Systems 160(10): 1367-1382, 2009.
http://www.cs.utep.edu/paulo/papers/Victor_FSS_2007.pdf
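As a back-of-the-envelope illustration of the kind of computation this enables (a deliberately naive Python sketch, not the IWTrust algorithm itself; see the papers above for the real thing), one can propagate trust values through a provenance DAG by discounting each conclusion by the trust in the engine that produced it and by its weakest antecedent:

    from functools import lru_cache

    # Hypothetical provenance DAG: node -> (engine that produced it,
    # antecedent nodes it was derived from).
    DAG = {
        "span":   ("crawler", ()),
        "answer": ("relation_extractor", ("span",)),
    }
    SOURCE_TRUST = {"span": 0.9}   # trust in the original news article
    ENGINE_TRUST = {"crawler": 1.0, "relation_extractor": 0.8}

    @lru_cache(maxsize=None)
    def trust(node):
        # Discount by the engine's trust, the source's trust (if the node
        # comes straight from a web source), and the weakest antecedent.
        engine, antecedents = DAG[node]
        weakest = min((trust(a) for a in antecedents), default=1.0)
        return ENGINE_TRUST[engine] * SOURCE_TRUST.get(node, 1.0) * weakest

    print(round(trust("answer"), 3))   # 0.8 * 0.9 = 0.72

The point of the sketch is only that, once provenance is encoded in PML, trust values can be computed mechanically over the justification graph instead of by teams of human checkers.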
I hope you all can see the connections between our work and the news aggregation scenario.

Many thanks,
Paulo

Received on Saturday, 24 July 2010 00:12:55 UTC