
news aggregation scenario and PML

From: Paulo Pinheiro da Silva <paulo@utep.edu>
Date: Fri, 23 Jul 2010 17:54:45 -0600
Message-ID: <4C4A2BC5.3050003@utep.edu>
To: "public-xg-prov@w3.org" <public-xg-prov@w3.org>

Hi All,

Sorry for the long message below, but I would like to move beyond the 
tags and further describe how PML and PML-related publications connect 
to the news aggregation scenario.

The news aggregation scenario says that “many web users would like to
have mechanisms to automatically determine whether a web document or 
resource can be used, based on the original source of the content.” This 
exact claim was central to the collaborative research that Stanford, IBM, 
and Pacific Northwest National Laboratory conducted between 2003 and 
2005, as briefly described in this IBM report:

       http://www.research.ibm.com/UIMA/SUKI/index.html

This project was about aggregating news from a given domain (e.g., news 
about “a panda being moved from Chicago Zoo to Florida”), but also about 
extracting knowledge from this corpus of news articles and using the 
extracted knowledge to answer complex questions in the domain of 
discussion.

In terms of PML, our goal was to encode the provenance of every piece 
of extracted or derived knowledge and to be able to trace it back, at 
any point, to the original news articles. This approach is in line with 
the news aggregation scenario, which “wants to ensure that the news that 
it aggregates are correctly attributed to the right person so that they 
may receive credit.”

PML was used to capture the following provenance information:
   1) How spans of text were extracted from sources on the web;
   2) How knowledge was extracted from the spans of text;
   3) How knowledge was aggregated (for example, dealing with 
co-resolution of identified entities within documents and across documents);
   4) How knowledge was used to derive answers for complex questions 
(for example, explaining the decision of moving the panda from Chicago 
Zoo to Florida);
   5) Most importantly, how information flowed from unstructured, 
asserted text to structured, derived data.
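The tracking described in items 1-5 can be sketched, very loosely, in 
Python. The class and field names below are illustrative, only loosely 
inspired by PML's NodeSet/InferenceStep vocabulary (they are not the 
actual PML schema), and the document URL is made up:

```python
from dataclasses import dataclass, field

# Toy model loosely inspired by PML's NodeSet / InferenceStep concepts.
# All names here are illustrative, not actual PML terms.

@dataclass
class InferenceStep:
    rule: str                      # e.g. "TextSpanExtraction"
    antecedents: list = field(default_factory=list)  # upstream NodeSets

@dataclass
class NodeSet:
    conclusion: str                # the extracted/derived piece of knowledge
    source: str = None             # original document URL, for leaf nodes
    steps: list = field(default_factory=list)  # alternative justifications

def original_sources(node):
    """Trace a conclusion back to the original news articles."""
    if node.source is not None:
        return {node.source}
    found = set()
    for step in node.steps:
        for antecedent in step.antecedents:
            found |= original_sources(antecedent)
    return found

# A minimal chain: article -> text span -> extracted fact.
article = NodeSet("raw article text", source="http://example.org/news/123")
span = NodeSet("'the panda will move to Florida'",
               steps=[InferenceStep("TextSpanExtraction", [article])])
fact = NodeSet("movedTo(panda, Florida)",
               steps=[InferenceStep("RelationExtraction", [span])])

print(original_sources(fact))  # {'http://example.org/news/123'}
```

Note that `steps` is a list: a single conclusion may carry several 
alternative justifications, which matters when extraction engines disagree.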

For example, using another corpus in another domain, we asked ‘Who is 
the manager of the Mississippi Automated System Project?’ and the system 
answered that ‘Julian Allen is the director of the project’. The 
provenance of the answer is encoded in PML and presented in IWBrowser, a 
web-based PML browser. The link below shows the provenance trace (you 
may need to scroll around to see the entire trace).

http://browser.inference-web.org/iwbrowser/NodeSetBrowser?w=1600&mg=999&st=Dag&fm=Raw&url=http%3A%2F%2Finference-web.org%2Fproofs%2FMississippiAutomatedSystem%2Fns36.owl%23ns36

The provenance shows how the answer that ‘Julian Allen was the director’ 
was derived, step by step, through a series of information extraction 
and integration tools. It is relevant to mention that this provenance 
was captured from IBM UIMA, where different information extraction 
technologies would compete to produce the best answer for a given 
question. Furthermore, the information extraction technologies would 
sometimes produce different and even conflicting answers for the same 
question. This was actually one of the interesting aspects of the 
process for the intelligence community, and it would have been a 
challenge for a provenance language had PML not been ready to 
accommodate alternative explanations.
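A toy illustration of this point follows. Everything here is 
hypothetical (it is not UIMA's API); it only sketches the idea of 
keeping competing answers side by side, each with a pointer to its own 
provenance trace, rather than collapsing them into one:

```python
from collections import defaultdict

# Hypothetical sketch: several extraction engines propose answers for the
# same question, and each candidate keeps a pointer to its own provenance
# trace so that conflicts remain explainable.
candidates = defaultdict(list)

def propose(question, answer, extractor, trace_url):
    candidates[question].append(
        {"answer": answer, "extractor": extractor, "trace": trace_url})

propose("Who manages the project?", "Julian Allen", "ExtractorA",
        "ns36.owl#ns36")
propose("Who manages the project?", "J. Allen", "ExtractorB",
        "ns41.owl#ns41")

# Conflicting answers are not collapsed: each alternative justification
# remains available for inspection.
for c in candidates["Who manages the project?"]:
    print(c["answer"], "via", c["extractor"], "->", c["trace"])
```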

The following ISWC 2006 paper provides an overview of the scenario above:

J. William Murdock, Deborah McGuinness, Paulo Pinheiro da Silva, Chris 
Welty, and David Ferrucci. Explaining Conclusions from Diverse Knowledge 
Sources. In Proceedings of the 5th International Semantic Web Conference 
(ISWC2006), Athens, GA, USA, p. 861-872, November 2006. 
http://www.cs.utep.edu/paulo/papers/Murdock_ISWC_2006.pdf

An explanation concerning the alignment of multiple processes to explain 
the common goal of extracting information from unstructured data is 
available in the paper below:

J. William Murdock, Paulo Pinheiro da Silva, David Ferrucci, Christopher 
Welty and Deborah L. McGuinness. Encoding Extraction as Inferences. In 
Proceedings of AAAI Spring Symposium on Metacognition on Computation, 
AAAI Press, Stanford University, USA, pages 92-97, 2005. 
http://www.ksl.stanford.edu/people/pp/papers/Murdock_SSS_2005.pdf

The scenario also mentions that “unfortunately for BlogAgg, the source 
of the information is not often apparent from the data that it 
aggregates from the web. In particular, it must employ teams of people 
to check that selected content is both high-quality and can be used 
legally. The site would like this quality control process to be handled 
automatically.” Regarding trust, we first point to the following paper, 
which describes how one may compute trust based on provenance encoded in 
PML:

Ilya Zaihrayeu, Paulo Pinheiro da Silva and Deborah L. McGuinness. 
IWTrust: Improving User Trust in Answers from the Web. In Proceedings of 
3rd International Conference on Trust Management (iTrust2005), Springer, 
Rocquencourt, France, pages 384-392, 2005. 
http://www.cs.utep.edu/paulo/papers/Zaihrayeu_iTrust_2005.pdf
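To make the idea concrete, here is a deliberately simplified sketch of 
trust propagation along a provenance trace. This is not the IWTrust 
algorithm; the trust values, the names, and the multiplicative 
combination rule are all assumptions made purely for illustration:

```python
# Illustrative only: NOT the IWTrust algorithm, just the general idea
# that an answer's trustworthiness can be computed by propagating
# per-source and per-engine trust values along its provenance trace.

# trust assigned to original sources and to extraction/inference engines
source_trust = {"http://example.org/news/123": 0.9}
engine_trust = {"TextSpanExtraction": 0.8, "RelationExtraction": 0.7}

# a provenance trace as the ordered steps that produced the answer
trace = [
    ("source", "http://example.org/news/123"),
    ("engine", "TextSpanExtraction"),
    ("engine", "RelationExtraction"),
]

def answer_trust(trace):
    """Combine trust multiplicatively along the trace (one possible choice)."""
    t = 1.0
    for kind, name in trace:
        table = source_trust if kind == "source" else engine_trust
        t *= table[name]
    return t

print(round(answer_trust(trace), 3))  # 0.504
```

Other combination rules (e.g., taking the minimum along the trace) are 
equally plausible; which dimensions of trust to use is exactly the 
question the papers above address.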

The exact representation and dimensions of trust to be considered vary; 
for those, we refer to the following paper:

Patricia Victor, Chris Cornelis, Martine De Cock, Paulo Pinheiro da 
Silva. Gradual Trust and Distrust in Recommender Systems. In Fuzzy Sets 
and Systems 160(10): 1367-1382, 2009. 
http://www.cs.utep.edu/paulo/papers/Victor_FSS_2007.pdf

I hope you all can see the connections between our work and the news 
aggregation scenario.

Many thanks,
Paulo.
Received on Saturday, 24 July 2010 00:12:55 GMT
