NLG challenge on RDF triple selection (Announcement and Call for Expressions of Interest) from Gerard Casamayor on 2012-07-24 (public-lod@w3.org from July 2012)

From: Gerard Casamayor <gerard.casamayor@upf.edu>
Date: Tue, 24 Jul 2012 08:54:16 +0200
To: public-lod@w3.org
Message-ID: <CAHyK_cUiZJDB34JvWBmEQXbegtC_yH8-w5S0g299nD=srutd4w@mail.gmail.com>
Dear colleagues,

Please find below an announcement and call for expressions of interest for
the upcoming 2013 Natural Language Generation Challenge on RDF triple
selection using freely available Semantic Web data and associated texts.

We believe that the challenge may be of interest not only to researchers in
Natural Language Generation but also to the Semantic Web community at large.

Thanks and best regards,

The Content Selection GenChal'13 team:
Nadjet Bouayad-Agha, Gerard Casamayor, Chris Mellish and Leo Wanner
---

Announcement and Call for Expressions of Interest

FIRST RDF TRIPLE SELECTION CHALLENGE
European Workshop on Natural Language Generation, 2013.

We seek expressions of interest to participate in a challenge on content
selection using freely available Semantic Web data and associated texts.
Please read on and, if you are interested, please contact us (see contact
details at the end of this call).

-----------
Motivation:
-----------

In the context of the Semantic Web, Natural Language Generation (NLG)
technologies offer a promising mechanism for the production and
publication of documents from data. End users find Natural
Language more accessible and easier to understand than data encoded
in Semantic Web standards like RDF and OWL. Furthermore, NLG systems
are capable of producing multilingual and multimodal documents tailored
to specific contexts (e.g. a user profile) that communicate relevant
information in fluent natural language.

Traditionally, NLG systems start off with a content selection step that
mimics the assessment of contents carried out by human authors. Content
selection takes as input a data source and produces a subset of contents
to be included in the text. When applied to Linked Data published on the
Semantic Web, content selection faces new challenges due to the large
size and heterogeneous nature of the datasets. New methods for the
selection of contents are needed that scale up NLG systems to the
Semantic Web.

These content selection methods can prove useful not only for the
publication of contents in the Semantic Web but also for any Semantic Web
task that requires judging the relative importance of data within one or
multiple linked datasets, from search and information retrieval supported
by semantic data to the summarization of large datasets to their most
relevant parts. Likewise, methods used for these tasks can also help in
improving the selection of contents for NLG-based publication of Semantic
Web data.

For these reasons, we believe that the time has come to bring together
researchers working on (or interested in working on) content selection
from Semantic Web data to participate in a challenge for this task.

This initial challenge presents a relatively simple content selection
task from a single dataset so that people are encouraged to take part and
motivated to stay on for later challenges, in which the task will be
progressively extended in light of the experience gained.

A content determination challenge will be a chance to (i) directly
compare the performance of different types of content selection strategies;
(ii) contribute towards developing a standard "off-the-shelf" content
selection module; (iii) contribute towards a standard interface between
content selection and other tasks involved in linguistic generation; and
(iv) exchange knowledge and methods between different communities working
with Semantic Web data.

--------------------
Outline of the task:
--------------------

The core of the task to be addressed can be formulated as follows:

"Build a system which, given a set of RDF triples containing facts
about a celebrity, selects those triples that are reflected in a corpus
of biographical texts and associated semantic data."

----------------
Domain and Data:
----------------

The domain will be short biographies of famous people, due to the
availability of biography texts in Wikipedia and of rich data
representations in the DBpedia or Freebase repositories.

The data will consist of a corpus in which, for each famous person, an
RDF-triple set is associated with one or more texts. For each pair, the
RDF data will include both information that is communicated in the text
and information that is excluded from it. The text may convey information
not present in the RDF-triples, but this will be kept to a minimum, always
subject to using naturally occurring texts. All pairs should contain
enough RDF-triples and text to make the pair interesting for the content
selection task.
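
As an illustration only, one data-text pair of the kind described above
could be held in memory as follows. This is a minimal sketch, not the
release format (which this call does not specify): the class name
DataTextPair, its fields, the ex: identifiers and the example person are
all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# One RDF triple as (subject, predicate, object); plain strings for simplicity.
Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class DataTextPair:
    """Hypothetical container for one person's triples and biography text."""
    person: str
    triples: FrozenSet[Triple]                 # all triples known about the person
    text: str                                  # the associated biography text
    selected: FrozenSet[Triple] = frozenset()  # triples marked as conveyed by the text

    def excluded(self) -> FrozenSet[Triple]:
        """Triples present in the data but not conveyed by the text."""
        return self.triples - self.selected


# Invented example data; "ex:" is a placeholder prefix, not a real vocabulary.
born = ("ex:Jane_Doe", "ex:birthPlace", "ex:Springfield")
height = ("ex:Jane_Doe", "ex:height", "1.70")

pair = DataTextPair(
    person="ex:Jane_Doe",
    triples=frozenset({born, height}),
    text="Jane Doe was born in Springfield.",
    selected=frozenset({born}),
)
```

Here pair.excluded() contains only the height triple, i.e. the information
that is in the data but left out of the text.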

-----------------------------
Data Preparation and Release:
-----------------------------

The task of data preparation consists of 1) downloading the data and
texts, pairing them and preprocessing them into a suitable format, and
2) selecting and annotating a working dataset.

The annotation task, in which participants are encouraged to take part and
which could be supported by automatic anchoring techniques and tools,
consists of marking, for each data-text pair of the working dataset, which
triples are included in the text. Annotation guidelines will be provided,
with examples and with descriptions of ambiguities and other issues and of
how to resolve them.
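
The call does not say which anchoring techniques would be used. Purely as
an assumption, one very simple possibility is to pre-mark a triple as a
candidate when its object string occurs in the text, and leave the final
decision to the human annotator. The function names normalize and
anchor_candidates, and the example data, are hypothetical.

```python
import re


def normalize(s):
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()


def anchor_candidates(triples, text):
    """Return the triples whose object appears in the text as whole words.

    This is plain string matching, so it misses paraphrases and can
    over-match short values; it only proposes candidates for annotators.
    """
    padded_text = f" {normalize(text)} "
    hits = []
    for subj, pred, obj in triples:
        norm_obj = normalize(obj)
        if norm_obj and f" {norm_obj} " in padded_text:
            hits.append((subj, pred, obj))
    return hits


# Invented example data with literal objects.
triples = [
    ("ex:Jane_Doe", "ex:birthPlace", "Springfield"),
    ("ex:Jane_Doe", "ex:spouse", "John Roe"),
]
text = "Jane Doe was born in Springfield in 1950."
candidates = anchor_candidates(triples, text)
```

In this example only the birthPlace triple is proposed, because
"Springfield" occurs in the text and "John Roe" does not.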

The resulting annotated working dataset will be provided to the
participants as a common set of "correct answers" to exploit in their
approach.

The participants will also be free to exploit a large portion of the
non-marked paired corpus, as well as the data semantics (i.e., ontologies
and the like).

-----------
Evaluation:
-----------

Once all participants have submitted their executables for the task, the
evaluation set will be prepared: a subset of the data will be randomly
selected and annotated by the participants with the selected triples. If
timing is tight, however, this could be done whilst the participants are
still working on the task, or extra effort (for instance, from the
organizers) could be brought in. This two-stage approach to triple
selection annotation is proposed in order to avoid any bias in the
evaluation data.

Each executable will be run against the test corpus and the selected
triples evaluated against the gold triple selection set. Since this is
formally a relatively simple task of selecting a subset of a given
set, we will use the standard precision, recall and F measures for
evaluation. In addition, other appropriate metrics will be explored,
for instance certain metrics for extractive summarisation (which is to
some extent a similar task).
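
For reference, the standard set-based measures mentioned above can be
computed for one data-text pair as sketched below. The function name
precision_recall_f and the example triples are hypothetical, and the call
does not state how scores would be averaged over pairs or which F-measure
weighting would be used; beta=1 (balanced F) is an assumption.

```python
def precision_recall_f(selected, gold, beta=1.0):
    """Set-based precision, recall and F-beta for one triple selection.

    selected: triples chosen by a system; gold: triples marked by annotators.
    Empty inputs yield 0.0 instead of raising ZeroDivisionError.
    """
    selected, gold = set(selected), set(gold)
    true_positives = len(selected & gold)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Invented example: the system picks 3 triples, 2 of which are among 4 gold ones.
a, b, c, d, e = [("ex:Jane_Doe", f"ex:p{i}", f"v{i}") for i in range(5)]
p, r, f1 = precision_recall_f({a, b, c}, {b, c, d, e})
```

With these inputs precision is 2/3, recall is 1/2 and the balanced F
measure is 4/7.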

The organizers will explore whether it will be feasible to select and
annotate some test examples from a different corpus and have the
systems evaluated on these as a separate task.

------------------
Proposed Timeline:
------------------

Preparation of the working dataset will start in the summer of 2012, once
we have gathered sufficient interest from would-be participants.

The challenge proper will take place between November 2012 and May/June
2013, as detailed below.

Data gathering and preparation                  Aug 2012
Working dataset selection and annotation        Sept/Oct 2012
Data Release                                    November 2012
Evaluation dataset selection and annotation     May 2013
Evaluation                                      June 2013
Publication @EWNLG                              August 2013

------------------------
Expressions of Interest:
------------------------

In order to reach a quorum, we ask people interested in participating to
send us an email expressing their interest as early as possible (i.e., by
the 7th of August).

The challenge is open to any approach, be it template-, rule- or
heuristic-based, or empirical.

---------------------
Organizing committee:
---------------------

Nadjet Bouayad-Agha, Gerard Casamayor and Leo Wanner
  TALN Group, University Pompeu Fabra, Barcelona (Spain).

Chris Mellish
  NLG Group, University of Aberdeen, Scotland (UK).

--------
Contact:
--------

nadjet.bouayad@upf.edu
Received on Tuesday, 24 July 2012 06:58:13 UTC