
news aggregation scenario and PML

From: Paulo Pinheiro da Silva <paulo@utep.edu>
Date: Fri, 23 Jul 2010 17:54:45 -0600
Message-ID: <4C4A2BC5.3050003@utep.edu>
To: "public-xg-prov@w3.org" <public-xg-prov@w3.org>

Hi All,

Sorry for the long message below, but I would like to move beyond the 
tags and further describe how PML and PML-related publications connect 
to the news aggregation scenario.

The news aggregation scenario says that “many web users would like to
have mechanisms to automatically determine whether a web document or 
resource can be used, based on the original source of the content.” This 
exact claim was central to the collaborative research that Stanford, IBM, 
and Pacific Northwest National Laboratory conducted between 2003 and 
2005, as briefly described in this IBM report:

       http://www.research.ibm.com/UIMA/SUKI/index.html

This project was about aggregating news from a given domain (e.g., news 
about “a panda being moved from Chicago Zoo to Florida”), but also about 
extracting knowledge from this corpus of news articles and using the 
extracted knowledge to answer complex questions in the domain of 
discussion.

In terms of PML, our goal was to encode the provenance of every piece 
of extracted or derived knowledge and to be able to trace it back, at 
any point, to the original news articles. This approach is in line with 
the news aggregation scenario, which “wants to ensure that the news that 
it aggregates are correctly attributed to the right person so that they 
may receive credit.”

PML was used to capture the following provenance information:
   1) How spans of text were extracted from sources on the web;
   2) How knowledge was extracted from the spans of text;
   3) How knowledge was aggregated (for example, dealing with 
co-resolution of identified entities within documents and across documents);
   4) How knowledge was used to derive answers for complex questions 
(for example, explaining the decision of moving the panda from Chicago 
Zoo to Florida);
   5) Most importantly, how information flowed from unstructured, 
asserted text to structured, derived data.
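The tracking described in items 1-5 can be sketched, very loosely, in 
Python. The class and field names below are illustrative, only loosely 
inspired by PML's NodeSet/InferenceStep vocabulary (they are not the 
actual PML schema), and the document URL is made up:

```python
from dataclasses import dataclass, field

# Toy model loosely inspired by PML's NodeSet / InferenceStep concepts.
# All names here are illustrative, not actual PML terms.

@dataclass
class InferenceStep:
    rule: str                      # e.g. "TextSpanExtraction"
    antecedents: list = field(default_factory=list)  # upstream NodeSets

@dataclass
class NodeSet:
    conclusion: str                # the extracted/derived piece of knowledge
    source: str = None             # original document URL, for leaf nodes
    steps: list = field(default_factory=list)  # alternative justifications

def original_sources(node):
    """Trace a conclusion back to the original news articles."""
    if node.source is not None:
        return {node.source}
    found = set()
    for step in node.steps:
        for antecedent in step.antecedents:
            found |= original_sources(antecedent)
    return found

# A minimal chain: article -> text span -> extracted fact.
article = NodeSet("raw article text", source="http://example.org/news/123")
span = NodeSet("'the panda will move to Florida'",
               steps=[InferenceStep("TextSpanExtraction", [article])])
fact = NodeSet("movedTo(panda, Florida)",
               steps=[InferenceStep("RelationExtraction", [span])])

print(original_sources(fact))  # {'http://example.org/news/123'}
```

Note that `steps` is a list: a single conclusion may carry several 
alternative justifications, which matters when extraction engines disagree.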

For example, using another corpus in another domain, we asked ‘Who is 
the manager of the Mississippi Automated System Project?’ and the system 
answered that ‘Julian Allen is the director of the project’. The 
provenance of the answer is encoded in PML and presented in IWBrowser, a 
web-based PML browser. The link below shows the provenance trace (you 
may need to scroll around to see the entire trace).

http://browser.inference-web.org/iwbrowser/NodeSetBrowser?w=1600&mg=999&st=Dag&fm=Raw&url=http%3A%2F%2Finference-web.org%2Fproofs%2FMississippiAutomatedSystem%2Fns36.owl%23ns36

The provenance shows how the answer that ‘Julian Allen was the director’ 
was derived, step by step, through a series of information extraction 
and integration tools. It is relevant to mention that this provenance 
was captured from IBM UIMA, where different information extraction 
technologies would compete to produce the best answer for a given 
question. Furthermore, the information extraction technologies would 
sometimes produce different and even conflicting answers for the same 
question. This was actually one of the interesting aspects of the 
process for the intelligence community, and it would have been a 
challenge for a provenance language had PML not been ready to 
accommodate alternative explanations.
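A toy illustration of this point follows. Everything here is 
hypothetical (it is not UIMA's API); it only sketches the idea of 
keeping competing answers side by side, each with a pointer to its own 
provenance trace, rather than collapsing them into one:

```python
from collections import defaultdict

# Hypothetical sketch: several extraction engines propose answers for the
# same question, and each candidate keeps a pointer to its own provenance
# trace so that conflicts remain explainable.
candidates = defaultdict(list)

def propose(question, answer, extractor, trace_url):
    candidates[question].append(
        {"answer": answer, "extractor": extractor, "trace": trace_url})

propose("Who manages the project?", "Julian Allen", "ExtractorA",
        "ns36.owl#ns36")
propose("Who manages the project?", "J. Allen", "ExtractorB",
        "ns41.owl#ns41")

# Conflicting answers are not collapsed: each alternative justification
# remains available for inspection.
for c in candidates["Who manages the project?"]:
    print(c["answer"], "via", c["extractor"], "->", c["trace"])
```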

The following ISWC 2006 paper provides an overview of the scenario above:

J. William Murdock, Deborah McGuinness, Paulo Pinheiro da Silva, Chris 
Welty, and David Ferrucci. Explaining Conclusions from Diverse Knowledge 
Sources. In Proceedings of the 5th International Semantic Web Conference 
(ISWC2006), Athens, GA, USA, p. 861-872, November 2006. 
http://www.cs.utep.edu/paulo/papers/Murdock_ISWC_2006.pdf

An explanation concerning the alignment of multiple processes to explain 
the common goal of extracting information from unstructured data is 
available in the paper below:

J. William Murdock, Paulo Pinheiro da Silva, David Ferrucci, Christopher 
Welty and Deborah L. McGuinness. Encoding Extraction as Inferences. In 
Proceedings of AAAI Spring Symposium on Metacognition on Computation, 
AAAI Press, Stanford University, USA, pages 92-97, 2005. 
http://www.ksl.stanford.edu/people/pp/papers/Murdock_SSS_2005.pdf

The scenario also mentions that “unfortunately for BlogAgg, the source 
of the information is not often apparent from the data that it 
aggregates from the web. In particular, it must employ teams of people 
to check that selected content is both high-quality and can be used 
legally. The site would like this quality control process to be handled 
automatically.” Regarding trust, we first point to the following paper, 
which describes how one may compute trust based on provenance encoded in 
PML:

Ilya Zaihrayeu, Paulo Pinheiro da Silva and Deborah L. McGuinness. 
IWTrust: Improving User Trust in Answers from the Web. In Proceedings of 
3rd International Conference on Trust Management (iTrust2005), Springer, 
Rocquencourt, France, pages 384-392, 2005. 
http://www.cs.utep.edu/paulo/papers/Zaihrayeu_iTrust_2005.pdf
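To make the idea concrete, here is a deliberately simplified sketch of 
trust propagation along a provenance trace. This is not the IWTrust 
algorithm; the trust values, the names, and the multiplicative 
combination rule are all assumptions made purely for illustration:

```python
# Illustrative only: NOT the IWTrust algorithm, just the general idea
# that an answer's trustworthiness can be computed by propagating
# per-source and per-engine trust values along its provenance trace.

# trust assigned to original sources and to extraction/inference engines
source_trust = {"http://example.org/news/123": 0.9}
engine_trust = {"TextSpanExtraction": 0.8, "RelationExtraction": 0.7}

# a provenance trace as the ordered steps that produced the answer
trace = [
    ("source", "http://example.org/news/123"),
    ("engine", "TextSpanExtraction"),
    ("engine", "RelationExtraction"),
]

def answer_trust(trace):
    """Combine trust multiplicatively along the trace (one possible choice)."""
    t = 1.0
    for kind, name in trace:
        table = source_trust if kind == "source" else engine_trust
        t *= table[name]
    return t

print(round(answer_trust(trace), 3))  # 0.504
```

Other combination rules (e.g., taking the minimum along the trace) are 
equally plausible; which dimensions of trust to use is exactly the 
question the papers above address.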

The exact representation and dimensions of trust to be considered vary; 
for those, we refer to the following paper:

Patricia Victor, Chris Cornelis, Martine De Cock, Paulo Pinheiro da 
Silva. Gradual Trust and Distrust in Recommender Systems. In Fuzzy Sets 
and Systems 160(10): 1367-1382, 2009. 
http://www.cs.utep.edu/paulo/papers/Victor_FSS_2007.pdf

I hope you all can see the connections between our work and the news 
aggregation scenario.

Many thanks,
Paulo.
Received on Saturday, 24 July 2010 00:12:55 GMT
