PhD position (Toulouse, France): Knowledge extraction from semi-structured documents – enrichment of DBpedia in French

*PhD position: Knowledge extraction from semi-structured documents – 
enrichment of DBpedia in French*

*Context*

We are seeking a candidate for a PhD position in the context of a 
collaboration between the MELODI group
(http://www.irit.fr/-Equipe-MELODI-) of the Research Institute in
Informatics of Toulouse (IRIT, CNRS UMR 5505) and the CLLE-ERSS team
(http://w3.erss.univ-tlse2.fr/) of the Cognition, Languages, Ergonomics
laboratory (CLLE, UMR 5263 CNRS). These laboratories are among the
strongest research centres in France in Informatics and Linguistics,
respectively. The teams have
been collaborating for 20 years and are recognized experts in natural 
language processing, linguistic analysis of corpora, and knowledge 
engineering. One of their research areas concerns the linguistic
characterization of semantic relations in corpora and the
operationalization of these characterizations in order to facilitate the
construction of knowledge models. Methods have been developed for
analyzing both written text, using lexico-syntactic patterns
(Aussenac-Gilles and Jacques, 2008) or distributional analysis (Fabre et
al., 2014), and text structure (Kamel et al., 2014).
Methods have also been proposed for integrating different fragments of
knowledge within a single model, by means of ontology alignments (Euzenat
et al., 2013). Hence, this thesis aims at adapting and combining these
methods and proposing novel ones, with a special focus on enriching the 
Web of data. The candidate will be co-supervised by Cécile Fabre,
Professor of Linguistics at the University of Toulouse 2, and Mouna Kamel,
Assistant Professor at IRIT. The thesis will be funded as part of
a « Communauté d’Universités et d’Établissements Toulouse –
Région Midi-Pyrénées » (COMUE-Région) project.

*Object*

This thesis addresses the problem of building semantic resources from 
semi-structured text. The attributes of the text layout, which organise
the text and contribute significantly to its semantics, are
underexploited by most classical NLP methods. The first aim of this
thesis is to study the interaction between visual structure and
discourse analysis, and thus to specify how the analysis of natural
language and the analysis of text structure can be combined. The
second aim is to evaluate the contribution of linguistic
information to automated processes for the construction of semantic
resources, for the identification of semantic relations, and for their
integration into a knowledge model.

The theoretical results will help to develop different knowledge
extractors (in particular, semantic relation extractors) for
semi-structured texts in French, in order to enrich a knowledge base.
Each extractor will apply one particular technique (whether or not
inspired by the methods developed by the teams) and will exploit the
different properties (content and structure) of these texts. The
experimental scenario will concern the enrichment of the French DBpedia
resource (http://fr.dbpedia.org/) by extracting knowledge from Wikipedia
pages in French. These pages are semi-structured and rich in knowledge
expressing concepts (domain-specific or general), relations, and rules
associating them and giving them meaning. However, as with the English
DBpedia, this resource is currently constructed from very specific
structured data (infoboxes, categories, links, etc.) in Wikipedia pages.


*Profile*

We are looking for a candidate with an MSc in Computer
Engineering/Science or a related field. The candidate must have
taken courses in natural language processing. She/he is required to
have an interest in both linguistic aspects (corpus analysis, study and
description of linguistic phenomena, etc.) and statistical aspects that
will allow her/him to develop learning-based approaches and
distributional analysis techniques. Interest in the Semantic Web in
general, and ontologies in particular, would also be appreciated. The
student must be fluent in French and have a very good command of
English.

We are currently offering a 3-year fully-funded studentship
<http://kmi.open.ac.uk/studentships/vacancies/> commencing in
Autumn 2015, thanks to funding from the Toulouse COMUE and the
Midi-Pyrénées Region. Income will be about 20,000 euros per year.


*Contact*

If you are interested in the above, please contact:

Cécile Fabre: cecile.fabre@univ-tlse2.fr

Mouna Kamel: mouna.kamel@irit.fr

*References*

(Aussenac-Gilles and Jacques, 2008) Aussenac-Gilles, N., Jacques, M.-P.:
Designing and Evaluating Patterns for Relation Acquisition from Texts
with Caméléon. Terminology 14(1), 145-173 (2008).

(Euzenat et al., 2013) Euzenat, J., Rosoiu, M., Trojahn dos Santos, C.:
Ontology matching benchmarks: Generation, stability, and
discriminability. Journal of Web Semantics 21, 30-48 (2013).

(Fabre et al., 2014) Fabre, C., Hathout, N., Ho-Dac, L. M.,
Morlane-Hondère, F., Muller, P., Sajous, F., Tanguy, L., Van de Cruys,
T.: Présentation de l'atelier SemDis 2014: sémantique distributionnelle
pour la substitution lexicale et l'exploration de corpus spécialisés.
Actes de l'atelier SemDis 2014, 21e Conférence sur le Traitement
Automatique des Langues Naturelles (TALN 2014), pp. 196-205 (2014).

(Kamel et al., 2014) Kamel, M., Rothenburger, B., Fauconnier, J.-P.:
Identification de relations sémantiques portées par les structures
énumératives paradigmatiques : une approche symbolique et une approche
par apprentissage supervisé. Revue d'Intelligence Artificielle, Hermès
Science, Numéro spécial Ingénierie des Connaissances. Nouvelles
évolutions, 28(2-3), 271-296 (2014).

Received on Friday, 24 April 2015 07:20:56 UTC