ANN: DBpedia version 2016-10 released

This release took us longer than expected, as we had to deal with
multiple issues and included new data. Most notable is the addition of
the NIF
<http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/nif-core.html>
annotation datasets for each language, recording the whole wiki text, its
basic structure (sections, titles, paragraphs, etc.) and the included text
links. We hope that researchers and developers working on NLP-related
tasks will find this addition most rewarding. The DBpedia Open Text
Extraction Challenge <http://wiki.dbpedia.org/textext> (next deadline:
Monday, 17 July, for SEMANTiCS 2017 <https://2017.semantics.cc/>) was
introduced to instigate new fact extraction based on these datasets.

We want to thank everyone who contributed to this release by adding
mappings, new datasets, extractors or issue reports, helping us to
increase the coverage and correctness of the released data. We also thank
the European Commission and the ALIGNED H2020 project
<http://aligned-project.eu/> for funding and general support.
This release is based on updated Wikipedia dumps dating from October 2016.

You can download the new DBpedia datasets in N3 / Turtle serialisation
from http://wiki.dbpedia.org/downloads-2016-10 or directly from
http://downloads.dbpedia.org/2016-10/.
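
For a quick look at the dumps without a full RDF toolkit, statements can
be inspected line by line. The following is only a rough sketch: the
regex deliberately simplifies the N-Triples grammar (no datatype or
language tags, no escape handling), and the sample triple is illustrative
rather than taken from the release.

```python
import re

# Simplified N-Triples pattern: <subject> <predicate> <object> .
# A sketch only -- it ignores typed/language-tagged literals, escapes,
# and blank nodes, all of which occur in the real dumps.
TRIPLE_RE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(<[^>]*>|"[^"]*")\s*\.\s*$')

def parse_ntriples_line(line):
    """Return (subject, predicate, object) or None for non-matching lines."""
    m = TRIPLE_RE.match(line)
    if not m:
        return None
    return m.group(1), m.group(2), m.group(3)

# Illustrative statement in the style of the English DBpedia label dataset.
sample = ('<http://dbpedia.org/resource/Leipzig> '
          '<http://www.w3.org/2000/01/rdf-schema#label> "Leipzig" .')
print(parse_ntriples_line(sample))
```

For anything beyond a quick inspection, a proper RDF parser should of
course be used instead.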
Join and support DBpedia

The active community of developers and engineers comes together in the
DBpedia Community Committee. We will extend this Committee with the help of
Pablo Mendes and Magnus Knuth. Students wishing to join should be or become
a member of the DBpedia Association
<http://wiki.dbpedia.org/dbpedia-association>. Please check all benefits
and details on our website <http://wiki.dbpedia.org/membership>.

Every first Wednesday of the month we organise regular development online
meetings. You can join the next DBpedia dev telco on Wednesday, 5th of July
(@ 2 pm CET). All info regarding the telco can be found here:
http://tinyurl.com/DBpediaDevMinutes.

How can you contribute links to DBpedia? Links are the key enabler for
retrieval of related information on the Web of Data, and DBpedia is one
of the central interlinking hubs in the LOD cloud. If you're interested
in contributing links and want to learn more about the project, please
visit https://github.com/dbpedia/links. Contributing to our mappings
between Wikipedia and the DBpedia ontology
<http://mappings.dbpedia.org/index.php/Main_Page> is also valuable input
for future releases.

Do you have any questions concerning DBpedia and Linked Data? You can ask
us on our support page <https://dbpedia.atlassian.net/wiki/questions>
(sign-up required for posting). If you are already a DBpedia user, you
can help us by answering some DBpedia-related questions:
http://support.dbpedia.org.


Statistics

The English version of the DBpedia knowledge base currently describes 6.6M
entities of which 4.9M have abstracts, 1.9M have geo coordinates and 1.7M
depictions. In total, 5.5M resources are classified in a consistent
ontology, consisting of 1.5M persons, 840K places (including 513K populated
places), 496K works (including 139K music albums, 111K films and 21K video
games), 286K organizations (including 70K companies and 55K educational
institutions), 306K species, 58K plants and 6K diseases. The total number
of resources in the English DBpedia is 18M; besides the 6.6M resources
above, this includes 1.7M SKOS concepts (categories), 7.7M redirect
pages, 269K disambiguation pages and 1.7M intermediate nodes.

Altogether the DBpedia 2016-10 release consists of 13 billion (2016-04:
11.5 billion) pieces of information (RDF triples) out of which 1.7 billion
(2016-04: 1.6 billion) were extracted from the English edition of
Wikipedia, 6.6 billion (2016-04: 6 billion) were extracted from other
language editions and 4.8 billion (2016-04: 4 billion) from Wikipedia
Commons and Wikidata.

In addition, the large NIF datasets for each language edition (see
details below) increase the number of triples further by over 9 billion,
bringing the overall count up to 23 billion triples.
(Breaking) Changes

   -

   The NLP Interchange Format (NIF)
   <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/nif-core.html>
   aims to achieve interoperability between Natural Language Processing (NLP)
   tools, language resources and annotations. To extend the versatility of
   DBpedia, furthering many NLP-related tasks, we decided to extract the
   complete human-readable text of every Wikipedia page (‘nif_context’),
   annotated with NIF tags. For this first iteration, we restricted the extent
   of the annotations to the structural text elements directly inferable from
   the HTML (‘nif_page_structure’). In addition, all contained text links are
   recorded in a dedicated dataset (‘nif_text_links’).
   The DBpedia Association started the Open Extraction Challenge
   <http://wiki.dbpedia.org/textext> on the basis of these datasets. With
   this effort we aim to spur knowledge extraction from Wikipedia article
   texts, in order to dramatically broaden and deepen the amount of
   structured DBpedia/Wikipedia data, and to provide a platform for
   benchmarking various extraction tools.
   If you want to participate with your own NLP extraction engine, the next
   deadline for SEMANTiCS 2017 is July 17.
   We included an example of these structures in section five of the
   download-page <http://wiki.dbpedia.org/downloads-2016-10#p10608-2> of
   this release.
   -

   A considerable amount of work has been done to streamline the DBpedia
   extraction process, converting many of the extraction tasks into an ETL
   setting (using Apache Spark <https://spark.apache.org>). We are working in
   concert with the Semantic Web Company <https://semantic-web.com> to
   further enhance these results by introducing a workflow management
   environment to increase the frequency of our releases.
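
As a side note on how the NIF datasets above fit together: the
annotations are anchored by character offsets into the article's context
string. The following sketch uses hypothetical data; the dictionary keys
mirror nif-core property names (nif:beginIndex, nif:endIndex,
itsrdf:taIdentRef), and this is not code from the extraction framework.

```python
# Sketch of how a NIF text link relates to its context string.
# The context is the full article text ('nif_context'); links and
# structural elements refer to it via character offsets.

context = "Leipzig is a city in Saxony, Germany."

# A hypothetical entry from 'nif_text_links': the substring between
# beginIndex and endIndex is the anchor text of a wiki link, and
# taIdentRef points at the linked DBpedia resource.
link = {
    "taIdentRef": "http://dbpedia.org/resource/Saxony",
    "beginIndex": 21,
    "endIndex": 27,
}

# Recover the surface form the annotation points at.
anchor = context[link["beginIndex"]:link["endIndex"]]
print(anchor)  # prints "Saxony"
```

The offset scheme keeps the text itself stored once (in ‘nif_context’)
while any number of annotation layers can refer into it.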

In case you missed it, what we changed in the previous release (2016-04)

   -

   We added a new extractor for citation data that provides two files:
   -

      citation links: linking resources to citations
      -

      citation data: additional data extracted from citations. This is
      quite an interesting dataset, but we need help cleaning it up
      -

   In addition to datasets normalised to English DBpedia URIs (en-uris),
   we now provide normalised datasets based on the DBpedia Wikidata (DBw)
   datasets (wkd-uris). These sorted datasets will form the foundation for
   the upcoming fusion process with Wikidata. From the following releases
   on, the DBw-based URIs will be the only ones provided.
   -

   We now filter out triples from the Raw Infobox Extractor that are
   already covered by the mappings, e.g. no more “<x> dbo:birthPlace <z>”
   and “<x> dbp:birthPlace|dbp:placeOfBirth|... <z>” in the same resource.
   These triples are now moved to the “infobox-properties-mapped” datasets
   and are not loaded on the main endpoint. See issue 22
   <https://github.com/dbpedia/extraction-framework/issues/22> for more
   details.
   -

   Major improvements in our citation extraction. See here
   <http://www.mail-archive.com/dbpedia-discussion@lists.sourceforge.net/msg07762.html>
   for more details.
   -

   We incorporated the statistical distribution approach
   <http://www.heikopaulheim.com/docs/iswc2013.pdf> of Heiko Paulheim for
   creating type statements automatically, providing them as additional
   datasets (instance_types_sdtyped_dbo).
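
The raw-infobox filtering mentioned in the list above can be illustrated
with a small sketch. The property equivalences and helper names here are
hypothetical examples, not the extraction framework's actual mapping
tables or code.

```python
# Sketch: separate raw infobox (dbp:) triples whose subject already has
# a mapped (dbo:) triple for an equivalent property. The equivalence
# table below is a made-up example of what the mappings wiki provides.

RAW_TO_MAPPED = {
    "dbp:birthPlace": "dbo:birthPlace",
    "dbp:placeOfBirth": "dbo:birthPlace",
}

def split_raw_triples(mapped_triples, raw_triples):
    """Return (kept_raw, moved_raw): raw triples already covered by a
    mapped triple go to the 'infobox-properties-mapped' bucket."""
    covered = {(s, p) for (s, p, _o) in mapped_triples}
    kept, moved = [], []
    for s, p, o in raw_triples:
        mapped_p = RAW_TO_MAPPED.get(p)
        if mapped_p and (s, mapped_p) in covered:
            moved.append((s, p, o))   # duplicate of an already-mapped fact
        else:
            kept.append((s, p, o))    # still only available as raw data
    return kept, moved

mapped = [("dbr:Goethe", "dbo:birthPlace", "dbr:Frankfurt")]
raw = [
    ("dbr:Goethe", "dbp:placeOfBirth", "dbr:Frankfurt"),
    ("dbr:Goethe", "dbp:occupation", "poet"),
]
kept, moved = split_raw_triples(mapped, raw)
```

In this sketch the duplicate dbp:placeOfBirth triple is moved aside,
while dbp:occupation survives because no mapped equivalent exists.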

Upcoming Changes

   -

   DBpedia Fusion: We finally started working again on fusing DBpedia
   language editions. Johannes Frey is taking the lead in this project. The
   next release will feature intermediate results.
   -

   Id Management: Closely related to the DBpedia Fusion project is our
   effort to introduce our own Id/IRI management, to become independent of
   Wikimedia-created IRIs. This will not entail changing our domain or
   entity naming regime, but will provide the possibility of adding
   entities from any source or scope.
   -

   RML Integration: Wouter Maroy has already provided the necessary
   groundwork for switching the mappings wiki to an RML-based approach
   <https://drive.google.com/file/d/0B7je1jgVmCgISXBPOHc3NDktblU/view?usp=sharing>
   on GitHub. Wouter started working exclusively on implementing the
   Git-based wiki and the conversion of existing mappings last week. We
   are looking forward to the results of this process.
   -

   Further development of the Spark integration and workflow-based DBpedia
   extraction, to increase the release frequency.


New Datasets

   -

   New languages extracted from Wikipedia:

South Azerbaijani (azb), Upper Sorbian (hsb), Limburgan (li), Minangkabau
(min), Western Mari (mrj), Oriya (or), Ossetian (os)

   -

   SDTypes: We extended the coverage of the automatically created type
   statements (instance_types_sdtyped_dbo) to English, German and Dutch.
   -

   Extensions: In the extension folder (2016-10/ext
   <http://downloads.dbpedia.org/2016-10/ext/>) we provide two new datasets
   (both should be considered experimental):
   -

      DBpedia World Facts: This dataset is authored by the DBpedia
      Association itself. It lists all countries, all currencies in use and
      (most) languages spoken in the world, as well as how these concepts
      relate to each other (spoken in, primary language, etc.) and useful
      properties like ISO codes (ontology diagram
      <https://raw.githubusercontent.com/dbpedia/WorldFacts/master/DBpediaWorldFactsOntology.png>).
      This dataset extends the very useful Lexvo <http://www.lexvo.org/>
      dataset with facts from DBpedia and the CIA World Factbook
      <https://www.cia.gov/library/publications/the-world-factbook/>.
      Please report any errors or suggestions regarding this dataset to
      Markus.
      -

      JRC-Alternative-Names: This resource is a link-based complementary
      repository of spelling variants for person and organisation names.
      The data is multilingual and contains up to hundreds of variations
      per entity. It was extracted from the analysis of news reports by the
      Europe Media Monitor (EMM), as available on JRC-Names
      <https://data.europa.eu/euodp/en/data/dataset/jrc-names>.

Community

The DBpedia community added new classes and properties to the DBpedia
ontology via the mappings wiki. The DBpedia 2016-10 ontology encompasses:

   -

   760 classes
   -

   1,105 object properties
   -

   1,622 datatype properties
   -

   132 specialised datatype properties
   -

   414 owl:equivalentClass and 220 owl:equivalentProperty mappings to
   external vocabularies

The editor community of the mappings wiki also defined many new mappings
from Wikipedia templates to DBpedia classes. For the DBpedia 2016-10
extraction, we used a total of 5887 template mappings (DBpedia 2015-10:
5800 mappings). The top language, gauged by the number of mappings, is
Dutch (648 mappings), followed by the English community (606 mappings).

Credits to

   -

   Markus Freudenberg (University of Leipzig / DBpedia Association) for
   taking over the whole release process and creating the revamped download &
   statistics pages.
   -

   Dimitris Kontokostas (University of Leipzig / DBpedia Association) for
   conveying his considerable knowledge of the extraction and release process.
   -

   All editors that contributed to the DBpedia ontology mappings via the
   Mappings Wiki.
   -

   The whole DBpedia Internationalization Committee for pushing the DBpedia
   internationalization forward.
   -

   Václav Zeman and the whole LHD team (University of Prague) for their
   contribution of additional DBpedia types.
   -

   Alan Meehan (TCD) for performing a big external link cleanup.
   -

   Aldo Gangemi (LIPN University, France & ISTC-CNR, Italy) for providing
   the links from DOLCE to the DBpedia ontology.
   -

   Robert Belinski for helping with the development in general and the
   debugging of the UriToIri script in particular.
   -

   SpringerNature for offering a co-internship to a bright student and
   developing a closer relationship with DBpedia on multiple issues, as
   well as links to their SciGraph
   <https://github.com/springernature/scigraph/wiki> subjects.
   -

   Kingsley Idehen, Patrick van Kleef, and Mitko Iliev (all OpenLink
   Software) for loading the new data set into the Virtuoso instance that
   provides 5-Star Linked Open Data publication and SPARQL Query Services.
   -

   OpenLink Software (http://www.openlinksw.com/) collectively for
   providing the SPARQL Query Services and Linked Open Data publishing
   infrastructure for DBpedia in addition to their continuous infrastructure
   support.
   -

   Ruben Verborgh from Ghent University – imec for publishing the dataset
   as Triple Pattern Fragments <http://fragments.dbpedia.org/>, and imec
   for sponsoring DBpedia’s Triple Pattern Fragments server.
   -

   Ali Ismayilov (University of Bonn) for extending and cleaning the
   DBpedia Wikidata dataset.
   -

   All the GSoC students and mentors who have worked directly or
   indirectly on the DBpedia release.
   -

   Special thanks to members of the DBpedia Association
   <http://dbpedia.org/dbpedia-association>, the AKSW
   <http://aksw.org/About.html> and the Department for Business Information
   Systems <http://bis.informatik.uni-leipzig.de/en/Welcome> of the
   University of Leipzig.

The work on the DBpedia 2016-10 release was financially supported by the
European Commission through the project ALIGNED <http://aligned-project.eu/>
– quality-centric software and data engineering.

More information about DBpedia can be found at http://dbpedia.org as well
as in the new overview article about the project available at
http://wiki.dbpedia.org/Publications
<http://wiki.dbpedia.org/publications/publications-about-dbpedia/curated-publications>.


Have fun with the new DBpedia 2016-10 release!

Markus Freudenberg

Release Manager, DBpedia <http://wiki.dbpedia.org>

Received on Tuesday, 4 July 2017 20:34:04 UTC