DBpedia 3.8 released, including enlarged Ontology and additional localized Versions

Hi all,

We are happy to announce the release of DBpedia 3.8.

The most important improvements of the new release compared to DBpedia 3.7
are:

1. the DBpedia 3.8 release is based on updated Wikipedia dumps dating from
late May/early June 2012.
2. the DBpedia ontology is enlarged and the number of infobox to ontology
mappings has risen.
3. the DBpedia internationalization has progressed and we now provide
localized versions of DBpedia in even more languages.

The English version of the DBpedia 3.8 knowledge base describes 3.77 million
things, out of which 2.35 million are classified in a consistent Ontology,
including 764,000 persons, 573,000 places (including 387,000 populated
places), 333,000 creative works (including 112,000 music albums, 72,000
films and 18,000 video games), 192,000 organizations (including 45,000
companies and 42,000 educational institutions), 202,000 species and 5,500
diseases.

We provide localized versions of DBpedia in 111 languages. All these
versions together describe 20.8 million things, out of which 10.5 million
overlap (are interlinked) with concepts from the English DBpedia. The full
DBpedia data set features labels and abstracts for 10.3 million unique
things in 111 different languages; 8.0 million links to images and 24.4
million HTML links to external web pages; 27.2 million data links into
external RDF data sets, 55.8 million links to Wikipedia categories, and 8.2
million YAGO categories. The dataset consists of 1.89 billion pieces of
information (RDF triples) out of which 400 million were extracted from the
English edition of Wikipedia, 1.46 billion were extracted from other
language editions, and about 27 million are data links into external RDF
data sets.
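
As a quick sanity check, the three partial counts above can be added up
and compared with the stated total. The sketch below uses only the rounded
figures from this announcement; the variable and function names are
illustrative.

```python
# Triple counts as stated in the announcement (rounded figures).
english = 400_000_000            # extracted from the English Wikipedia
other_languages = 1_460_000_000  # extracted from other language editions
external_links = 27_000_000      # data links into external RDF data sets

total = english + other_languages + external_links


def in_billions(n: int) -> float:
    """Express a count in billions, rounded to two decimal places."""
    return round(n / 1e9, 2)
```

With these inputs, in_billions(total) gives 1.89, which matches the
1.89 billion triples quoted above.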

The main changes between DBpedia 3.7 and 3.8 are described below:

1. Enlarged Ontology

The DBpedia community added many new classes and properties on the mappings
wiki. The DBpedia 3.8 ontology encompasses
• 359 classes (DBpedia 3.7: 319)
• 800 object properties (DBpedia 3.7: 750)
• 859 datatype properties (DBpedia 3.7: 791)
• 116 specialized datatype properties (DBpedia 3.7: 102)
• 45 owl:equivalentClass and 31 owl:equivalentProperty mappings to
http://schema.org

2. Additional Infobox to Ontology Mappings

The editors of the mappings wiki also defined many new mappings from
Wikipedia templates to DBpedia classes. For the DBpedia 3.8 extraction, we
used 2347 mappings, among them

• Polish: 382 mappings
• English: 345 mappings
• German: 211 mappings
• Portuguese: 207 mappings
• Greek: 180 mappings
• Slovenian: 170 mappings
• Korean: 146 mappings
• Hungarian: 111 mappings
• Spanish: 107 mappings
• Turkish: 91 mappings
• Czech: 66 mappings
• Bulgarian: 61 mappings
• Catalan: 52 mappings
• Arabic: 51 mappings

3. New local DBpedia Chapters

We are also happy to see the number of local DBpedia chapters in different
countries rising. Since the 3.7 DBpedia release we welcomed the French,
Italian and Japanese Chapters. In addition, we expect the Dutch DBpedia
chapter to go online during the next months (in cooperation with
http://bibliotheek.nl/). The DBpedia chapters provide local SPARQL endpoints
and dereferenceable URIs for the DBpedia data in their corresponding
language. The DBpedia Internationalization page provides an overview of the
current state of the DBpedia Internationalization effort. 

4. New and updated RDF Links into External Data Sources

We have added new RDF links pointing at resources in the following Linked
Data sources: Amsterdam Museum, BBC Wildlife Finder, CORDIS, DBTune,
Eurostat (Linked Statistics), GADM, LinkedGeoData, OpenEI (Open Energy
Info). In addition, we have updated many of the existing RDF links pointing
at other Linked Data sources.

5. New Wiktionary2RDF Extractor

We developed a DBpedia extractor that is configurable for any Wiktionary
edition. It generates a comprehensive ontology about languages for use as a
semantic lexical resource in linguistics. The data currently includes
language, part of speech, senses with definitions, synonyms, taxonomies
(hyponyms, hyperonyms, synonyms, antonyms) and translations for each lexical
word. It is also hosted as Linked Data and can serve as a central
linking hub for LOD in linguistics. Currently available languages are
English, German, French and Russian. In the coming weeks we plan to add
Vietnamese and Arabic. The goal is to allow the addition of languages just
by configuration without the need of programming skills, enabling
collaboration as in the Mappings Wiki. For more information visit
http://wiktionary.dbpedia.org/

6. Improvements to the Data Extraction Framework

• In addition to N-Triples and N-Quads, the framework can now write
triple files in Turtle format
• Extraction steps that looked for links between different Wikipedia
editions were replaced by more powerful post-processing scripts
• Preparation time and effort for abstract extraction have been minimized,
and extraction time has been reduced to a few milliseconds per page
• To save file system space, the framework can compress DBpedia triple files
while writing and decompress Wikipedia XML dump files while reading
• Using some bit twiddling, we can now load all ~200 million inter-language
links into a few GB of RAM and analyze them
• Users can download ontology and mappings from the mappings wiki and store them
in files to avoid downloading them for each extraction, which takes a lot of
time and makes extraction results less reproducible
• We now use IRIs for all languages except English, which uses URIs for
backwards compatibility
• We now resolve redirects in all datasets where the object URIs are
DBpedia resources
• We check that extracted dates are valid (e.g. February never has 30 days)
and that their format is valid according to their XML Schema type, e.g.
xsd:gYearMonth
• We improved the removal of HTML character references from the abstracts
• When extracting raw infobox properties, we make sure that the predicate URI
can be used in RDF/XML by appending an underscore if necessary
• Page IDs and Revision IDs datasets now use the DBpedia resource as subject
URI, not the Wikipedia page URL 
• We use foaf:isPrimaryTopicOf instead of foaf:page for the link from
DBpedia resource to Wikipedia page
• New inter-language link datasets for all languages



Accessing the DBpedia 3.8 Release

You can download the new DBpedia dataset from
http://dbpedia.org/Downloads38.

As usual, the dataset is also available as Linked Data and via the DBpedia
SPARQL endpoint at http://dbpedia.org/sparql
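
For those who want to try the endpoint from a script, here is a minimal
sketch that builds an HTTP GET URL for a SPARQL query. It assumes the
endpoint accepts the standard "query" parameter of the SPARQL protocol and
that persons are typed with the class http://dbpedia.org/ontology/Person;
the helper name build_query_url is made up for this example.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint URL from this announcement

# Illustrative query: ten resources typed as persons (class URI is an
# assumption, see the note above).
TEN_PERSONS = """
SELECT ?person
WHERE { ?person a <http://dbpedia.org/ontology/Person> }
LIMIT 10
"""


def build_query_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Encode a SPARQL query as a GET URL using the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})
```

The resulting URL, e.g. build_query_url(TEN_PERSONS), can be opened in a
browser or fetched with any HTTP client; no request is sent by the sketch
itself.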


Credits

Many thanks to

• Jona Christopher Sahnwaldt (Freie Universität Berlin, Germany) for
improving the DBpedia extraction framework and for extracting the DBpedia
3.8 data sets.
• Dimitris Kontokostas (Aristotle University of Thessaloniki, Greece) for
implementing the language generalizations to the extraction framework.
• Uli Zellbeck and Anja Jentzsch (Freie Universität Berlin, Germany) for
generating the new and updated RDF links to external datasets using the Silk
interlinking framework.
• Jonas Brekle (Universität Leipzig, Germany) and Sebastian Hellmann
(Universität Leipzig, Germany) for their work on the new Wiktionary2RDF
extractor.
• All editors that contributed to the DBpedia ontology mappings via the
Mappings Wiki.
• The whole Internationalization Committee for pushing the DBpedia
internationalization forward.
• Kingsley Idehen and Patrick van Kleef (both OpenLink Software) for loading
the dataset into the Virtuoso instance that serves the Linked Data view and
SPARQL endpoint. OpenLink Software (http://www.openlinksw.com/) altogether
for providing the server infrastructure for DBpedia.

The work on the DBpedia 3.8 release was financially supported by the
European Commission through the projects LOD2 - Creating Knowledge out of
Linked Data (http://lod2.eu/, improvements to the extraction framework) and
LATC - LOD Around the Clock (http://latc-project.eu/, creation of external
RDF links).


More information about DBpedia can be found at http://dbpedia.org/About


Have fun with the new DBpedia release!

Cheers,

Chris Bizer

Received on Monday, 6 August 2012 14:02:02 UTC