Release: JRC-Names multilingual named entity resource in Linked Data format

From: Guillaume Jacquet <guillaume.jacquet@jrc.ec.europa.eu>
Date: Thu, 08 Sep 2016 15:20:44 +0200
To: LN@cines.fr, corpora@uib.no, public-lod@w3.org
Cc: clef@mail.dei.unipd.it, Ralf Steinberger <ralf.steinberger@jrc.ec.europa.eu>, Maud Ehrmann <maud.ehrmann@gmail.com>
Message-id: <cf64021a-7d04-f4c0-51b7-746a46083d60@jrc.ec.europa.eu>
Dear all,

we are pleased to announce a new release of the *JRC-Names* multilingual 
name resource, containing *more information* and now available as 
*Linked Data*.

JRC-Names is a *highly multilingual named entity resource* for person 
and organisation names (called 'entities') developed by the European 
Commission’s Joint Research Centre (JRC). JRC-Names consists of large 
lists of names and their many spelling variants (up to hundreds for a 
single person), including across scripts (Latin, Greek, Arabic, 
Cyrillic, Japanese, Chinese, etc.). For example, the spellings 
Jean-Claude Juncker, Jean Cloud Junker, Jean-Claude Juencker, Жан-Клод 
Юнкер, جان كلود جونكر, Ζαν Κλοντ Γιούνκερ, 让-克洛德•容克, and many others 
have all been identified as referring to the 12th President of the 
European Commission.

The resource is a by-product of the Europe Media Monitor (EMM) family 
of applications, which has been analysing up to 300,000 news reports per 
day since 2004. EMM recognises names mentioned in the news in over 
twenty languages and decides automatically for each newly found name 
whether it belongs to a new entity or whether it is a spelling variant 
of a previously known entity. This resource allows EMM users to display 
news about people or organisations even if their names are spelt 
differently or if the news articles are written in different languages 
and scripts.

JRC-Names has been available for download since September 2011, 
consisting of name variant lists and accompanying software (JRC-Names 
text version).

The new Linked Data resource 
<https://data.europa.eu/euodp/en/data/dataset/jrc-names>, accessible 
through the European Union’s Open Data Portal 
<http://data.europa.eu/euodp/en/data>, offers more information compared 
to the previously released resource and tool, including:

  * titles and function names that have been historically found next to
    the person mentions;
  * information about the time period during which name variants and
    their titles were found;
  * various frequency counts;
  * links to other linked datasets such as DBpedia, New York Times Open
    Data and Talk of Europe.

The JRC-Names RDF representation is based on /lemon/ (Lexicon Model for 
Ontologies), a model developed by the W3C Ontology-Lexica Community group 
which allows the expression of lexical information relative to ontologies. A 
detailed description of JRC-Names Linked Data representation is given in 
the reference paper mentioned below.

Examples of usage of the resource include, among others:

  * entity linking, e.g. to deal with entity surface form variations;
  * cross-lingual linked data-set query and mapping;
  * search query expansion;
  * machine translation;
  * learning of transliteration rules;
  * named entity recognition and disambiguation;
  * cross-lingual document clustering.
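
As an illustration of the search query expansion use case, here is a minimal sketch. It assumes the name variants have already been loaded into an in-memory mapping from an entity identifier to its spellings; the identifier `entity:juncker` and all function names are hypothetical and not part of JRC-Names, while the spellings are those quoted earlier in this message.

```python
def build_variant_index(variants_by_entity):
    """Map each case-folded spelling to the entity it belongs to."""
    index = {}
    for entity_id, spellings in variants_by_entity.items():
        for spelling in spellings:
            index[spelling.casefold()] = entity_id
    return index


def expand_query(term, variants_by_entity, index):
    """Return all known spellings for the term, or the term alone if unknown."""
    entity_id = index.get(term.casefold())
    if entity_id is None:
        return [term]
    return list(variants_by_entity[entity_id])


def to_boolean_query(spellings):
    """Join spellings into a quoted OR query, a common search-engine syntax."""
    return " OR ".join(f'"{s}"' for s in spellings)


# Hypothetical identifier; the spellings are taken from the announcement.
VARIANTS = {
    "entity:juncker": [
        "Jean-Claude Juncker",
        "Jean Cloud Junker",
        "Jean-Claude Juencker",
        "Жан-Клод Юнкер",
    ],
}

if __name__ == "__main__":
    index = build_variant_index(VARIANTS)
    print(to_boolean_query(expand_query("jean cloud junker", VARIANTS, index)))
```

A query for any one spelling is thus rewritten into a disjunction over all spellings of the same entity, across scripts.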

This new Linked Data edition is available through a SPARQL 
endpoint and via an RDF dump. 
It is registered on the datahub.io portal as JRC-Names 
<https://datahub.io/dataset/jrc-names-ec>. Additional information is 
available on this page of the EU Open Data Portal 
<http://data.europa.eu/euodp/en/data/dataset/jrc-names>.

Examples of queries against the data-set include:

  * Given a person's name, retrieve all of its name variants;
  * Given a person's name, retrieve all of its name variants in a
    certain language;
  * Given a person's name, retrieve all of its titles/function names in
    a certain language;
  * Given a variant and a language, retrieve the corresponding entity;
  * Given a title and a language, retrieve all of the persons with this
    same title.
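
The first two queries above could be sketched roughly as follows. This is only an illustration under stated assumptions: the graph pattern is guessed from the generic lemon vocabulary (forms carrying a `lemon:writtenRep`, entries linked to an entity via `lemon:sense`/`lemon:reference`), and the namespace IRI is the one commonly used for lemon; the actual JRC-Names modelling is described in the reference paper and may differ. The helper names are hypothetical, and no request is sent to any endpoint.

```python
import re

# Assumption: namespace commonly used for the lemon vocabulary.
LEMON_NS = "http://lemon-model.net/lemon#"
# Assumption: a name variant is the written representation of a lemon form.
FORM_PATH = "(lemon:canonicalForm|lemon:otherForm)/lemon:writtenRep"


def escape_literal(text):
    """Escape a string for use inside a double-quoted SPARQL literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def build_variant_query(name, lang=None):
    """Build a SPARQL query for the variants of a name, optionally per language."""
    if lang is not None and not re.fullmatch(r"[A-Za-z]+(-[A-Za-z0-9]+)*", lang):
        raise ValueError(f"not a plausible language tag: {lang!r}")
    lines = [
        f"PREFIX lemon: <{LEMON_NS}>",
        "SELECT DISTINCT ?variant WHERE {",
        # Find the entity behind the known spelling (assumed graph shape).
        f"  ?knownEntry {FORM_PATH} ?known ;",
        "              lemon:sense/lemon:reference ?entity .",
        # Collect every spelling attached to the same entity.
        f"  ?entry {FORM_PATH} ?variant ;",
        "         lemon:sense/lemon:reference ?entity .",
        f'  FILTER (STR(?known) = "{escape_literal(name)}")',
    ]
    if lang is not None:
        lines.append(f'  FILTER (LANGMATCHES(LANG(?variant), "{lang}"))')
    lines.append("}")
    return "\n".join(lines)


def extract_values(results, var):
    """Read one variable from a parsed SPARQL 1.1 JSON results document."""
    return [b[var]["value"] for b in results["results"]["bindings"] if var in b]


if __name__ == "__main__":
    print(build_variant_query("Jean-Claude Juncker", lang="ru"))
```

The resulting string can be submitted to a SPARQL endpoint with any HTTP client; `extract_values` then pulls the variant strings out of a response in the standard SPARQL JSON results format.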

Reference paper:

Maud Ehrmann, Guillaume Jacquet and Ralf Steinberger (to appear). 
JRC-Names: Multilingual Entity Name variants and titles as Linked Data 
<http://www.semantic-web-journal.net/system/files/swj1307.pdf>, Semantic 
Web Journal (available online since 04/20/2016)

Guillaume Jacquet, Maud Ehrmann, Ralf Steinberger
European Commission
Joint Research Centre
Text and Data Mining Unit
Received on Thursday, 8 September 2016 15:51:54 UTC
