WebNLG+: Bi-lingual, bi-directional Shared Task on converting text from and to RDF

=============================================== 
The Second WebNLG Challenge: First announcement 
=============================================== 

WebNLG goes bi-lingual (English, Russian) and bi-directional (generation and parsing)! 

It is our pleasure to announce that three years after the first edition, the second WebNLG challenge will take place in 2020. 

TASKS 
The challenge will comprise two main tasks: 
1. RDF-to-text generation, as in WebNLG 2017 but with new data and two target languages;
2. Text-to-RDF semantic parsing: converting a text into the corresponding set of RDF triples. 

For Task 1, given the four RDF triples shown in (a), the aim is to generate a text such as (b) or (c). For Task 2, the direction is reversed: the aim is to produce the triples in (a) from a text such as (b) or (c).

EXAMPLE 
(a) Set of RDF triples 
<entry category="Company" eid="Id21" size="4"> 
<modifiedtripleset> 
<mtriple>Trane | foundingDate | 1913-01-01</mtriple> 
<mtriple>Trane | location | Ireland</mtriple> 
<mtriple>Trane | foundationPlace | La_Crosse,_Wisconsin</mtriple> 
<mtriple>Trane | numberOfEmployees | 29000</mtriple> 
</modifiedtripleset> 
</entry> 

(b) English text 
Trane, which was founded on January 1st 1913 in La Crosse, Wisconsin, is based in Ireland. It has 29,000 employees. 

(c) Russian text 
Компания "Тране", основанная 1 января 1913 года в Ла-Кроссе в штате Висконсин, находится в Ирландии. В компании работают 29 тысяч человек. 
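
As a rough illustration of the data format in (a), the following Python sketch reads one entry and recovers its triples. It assumes only what the example shows (an <entry> element holding <mtriple> elements whose three fields are separated by " | "); the function names parse_entry and serialise_triples are hypothetical and not part of any official WebNLG tooling.

```python
import xml.etree.ElementTree as ET

# The entry from example (a), copied verbatim.
ENTRY_XML = """
<entry category="Company" eid="Id21" size="4">
<modifiedtripleset>
<mtriple>Trane | foundingDate | 1913-01-01</mtriple>
<mtriple>Trane | location | Ireland</mtriple>
<mtriple>Trane | foundationPlace | La_Crosse,_Wisconsin</mtriple>
<mtriple>Trane | numberOfEmployees | 29000</mtriple>
</modifiedtripleset>
</entry>
"""


def parse_entry(xml_text):
    """Return (category, [(subject, property, object), ...]) for one <entry>."""
    entry = ET.fromstring(xml_text.strip())
    triples = []
    for mtriple in entry.iter("mtriple"):
        # Split on "|" at most twice so that a "|" inside the object survives.
        subject, prop, obj = (part.strip() for part in mtriple.text.split("|", 2))
        triples.append((subject, prop, obj))
    return entry.get("category"), triples


def serialise_triples(triples):
    """Render triples back into the 'subject | property | object' form."""
    return [" | ".join(triple) for triple in triples]


if __name__ == "__main__":
    category, triples = parse_entry(ENTRY_XML)
    print(category)
    for line in serialise_triples(triples):
        print(line)
```

In this reading, Task 1 takes the parsed triples as input and produces a text such as (b) or (c), while Task 2 takes the text and produces the triples.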


INDICATIVE DATES 
The submission and results dates may be extended to remain aligned with the INLG conference.
- 15 April 2020: Release of Training and Development Data 
- 17 July 2020: Release of Test Data 
- 31 July 2020: Entry submission deadline 
- 7 September 2020: Results of automatic evaluation and system presentations at INLG 2020 
- October-November 2020: Results of human evaluation

DATA 

The English WebNLG 2020 dataset for training will comprise data-text pairs for 16 distinct DBpedia categories: 
* The 9 seen categories used in 2017: Airport, Astronaut, Building, ComicsCharacter, Food, Monument, SportsTeam, University, and WrittenWork. 
* ~5,600 texts were cleaned of misspellings, and missing triple verbalisations were added to some texts.
* The 6 unseen categories of 2017, which will now be part of the seen data: Athlete, Artist, City, CelestialBody, MeanOfTransportation, Politician. 
* 1 new category: Company. 

The new Russian dataset will comprise around 8,000 data inputs and 20,800 data-text pairs for 9 distinct categories: 
* Airport, Astronaut, Building, CelestialBody, ComicsCharacter, Food, Monument, SportsTeam, and University. 

For every input triple set, at least two references in each language (English, Russian) will be provided. New test sets will be released for all categories seen in the training data (see above), and for several new unseen categories (categories not included in the training data). The data specifications will be the same as for WebNLG 2017. 
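
The category lists above can be written down as sets to check the counts and to see which training categories have English data only. This is a small sketch based solely on the lists in this announcement; the variable names and the is_seen helper are hypothetical, and the unseen test categories are not modelled because they have not been announced.

```python
# Category lists copied from the DATA section of this announcement.
SEEN_2017 = {
    "Airport", "Astronaut", "Building", "ComicsCharacter", "Food",
    "Monument", "SportsTeam", "University", "WrittenWork",
}
UNSEEN_2017 = {
    "Athlete", "Artist", "City", "CelestialBody",
    "MeanOfTransportation", "Politician",
}
NEW_2020 = {"Company"}

# Training categories per language.
TRAIN_CATEGORIES = {
    "en": SEEN_2017 | UNSEEN_2017 | NEW_2020,
    "ru": {
        "Airport", "Astronaut", "Building", "CelestialBody",
        "ComicsCharacter", "Food", "Monument", "SportsTeam", "University",
    },
}


def is_seen(category, language):
    """True if the category appears in the training data for that language."""
    return category in TRAIN_CATEGORIES[language]


if __name__ == "__main__":
    print(len(TRAIN_CATEGORIES["en"]), len(TRAIN_CATEGORIES["ru"]))
    # Categories with English training data but no Russian training data.
    print(sorted(TRAIN_CATEGORIES["en"] - TRAIN_CATEGORIES["ru"]))
```

Every Russian training category also appears in the English training set, so a category that is seen for Russian is also seen for English, but not the other way round.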

The evaluation procedure and the instructions for registering for the task and downloading the training and development data will be specified on April 15th. In the meantime, the data, evaluation scripts and system outputs of WebNLG 2017 can be downloaded here: https://webnlg-challenge.loria.fr/challenge_2017/.

MOTIVATION 
The WebNLG data was originally created to promote the development of RDF verbalisers able to generate short texts and to handle micro-planning (i.e., sentence segmentation and ordering, referring expression generation, aggregation). The data for the first challenge included a total of 15 DBpedia categories. The 2020 challenge aims first of all at enlarging the datasets (and hence the coverage of the verbalisers) by covering more categories and an additional language. The other main objective of the 2020 edition is to promote the development of knowledge extraction tools, with a task that mirrors the verbalisation task.


[RDF Verbalisers] The RDF language, in which DBpedia is encoded, is widely used within the Linked Data framework. Many large-scale datasets are encoded in this language (e.g., MusicBrainz, FOAF, LinkedGeoData) and official institutions increasingly publish their data in this format. Being able to generate good-quality text from RDF data would open the way to many new applications, such as making linked data more accessible to lay users, enriching existing text with information drawn from knowledge bases, or describing, comparing and relating entities present in these knowledge bases.

[Multilinguality] By providing a bilingual corpus (English and Russian), we aim to promote the development of tools for languages other than English and to allow for experimentation with pre-training and transfer approaches (do the English verbalisations of RDF triples help to verbalise the triples better in Russian?).

[Knowledge extraction] The new semantic parsing task opens up new lines of research in several directions. Can it be used to bootstrap entity linkers? How does RDF-based semantic parsing relate to other semantic parsing tasks where the output semantic representations are lambda terms or KB queries? Can semantic parsing be used to improve generation in ways similar to the back translation approaches proposed in machine translation? 

ORGANISING COMMITTEE 
* Thiago Castro Ferreira, Federal University of Minas Gerais, Brazil 
* Claire Gardent, CNRS/LORIA, Nancy, France 
* Nikolai Ilinykh, University of Gothenburg, Sweden 
* Chris van der Lee, Tilburg University, The Netherlands 
* Simon Mille, Universitat Pompeu Fabra, Barcelona, Spain 
* Diego Moussallem, Paderborn University, Germany
* Anastasia Shimorina, Université de Lorraine/LORIA, Nancy, France 

CONTACT 
webnlg-challenge@inria.fr 

REFERENCES 
* Creating Training Corpora for NLG Micro-Planners. C. Gardent, A. Shimorina, S. Narayan and L. Perez-Beltrachini. Proceedings of ACL 2017. Vancouver (Canada). 
https://www.aclweb.org/anthology/P17-1017.pdf 
* The WebNLG challenge: Generating text from RDF data. C. Gardent, A. Shimorina, S. Narayan and L. Perez-Beltrachini. Proceedings of INLG 2017. Santiago de Compostela (Spain).
https://www.aclweb.org/anthology/W17-3518.pdf 
* Building RDF Content for Data-to-Text Generation. L. Perez-Beltrachini, R. Sayed and C. Gardent. Proceedings of COLING 2016. Osaka (Japan). 
https://www.aclweb.org/anthology/C16-1141.pdf 
* Enriching the WebNLG corpus. T. Castro Ferreira, D. Moussallem, E. Krahmer and S. Wubben. Proceedings of INLG 2018. Tilburg (The Netherlands).
https://www.aclweb.org/anthology/W18-6521.pdf 
* Creating a corpus for Russian data-to-text generation using neural machine translation and post-editing. A. Shimorina, E. Khasanova and C. Gardent. Proceedings of the BSNLP Workshop 2019. Florence (Italy).
https://www.aclweb.org/anthology/W19-3706.pdf 


-- 
CNRS 
Equipe SYNALP, LORIA 
Nancy, France 

Received on Tuesday, 7 April 2020 13:17:48 UTC