DBpedia 3.7 released, including 15 localized Editions

From: <bizer@zedat.fu-berlin.de>
Date: Sun, 11 Sep 2011 11:22:01 +0200
Message-ID: <52735.88.73.88.80.1315732921.webmail@portal.zedat.fu-berlin.de>
To: dbpedia-discussion@lists.sourceforge.net, public-lod@w3.org, semanticweb@yahoogroups.com, semantic-web@w3.org

Hi all,

We are happy to announce the release of DBpedia 3.7. The new release is
based on Wikipedia dumps dating from late July 2011.

The new DBpedia data set describes more than 3.64 million things, of which
1.83 million are classified in a consistent ontology, including 416,000
persons, 526,000 places, 106,000 music albums, 60,000 films, 17,500 video
games, 169,000 organizations, 183,000 species and 5,400 diseases.

The DBpedia data set features labels and abstracts for 3.64 million things
in up to 97 different languages; 2,724,000 links to images and 6,300,000
links to external web pages; 6,200,000 external links into other RDF
datasets, and 740,000 Wikipedia categories. The dataset consists of 1
billion pieces of information (RDF triples) out of which 385 million were
extracted from the English edition of Wikipedia and roughly 665 million
were extracted from other language editions and links to external
datasets.

Localized Editions

Until now, we extracted data from non-English Wikipedia pages only if an
equivalent English page existed, as we wanted a single URI to identify a
resource across all 97 languages. However, many pages in the non-English
Wikipedia editions have no equivalent English page (especially small towns
in different countries, e.g. the Austrian village Endach, or legal and
administrative terms that are relevant only to a single country). Relying
on English URIs alone therefore meant that DBpedia contained no data for
these entities, and many DBpedia users have complained about this
shortcoming.

As part of the DBpedia 3.7 release, we now provide 15 localized DBpedia
editions for download that contain data from all Wikipedia pages in a
specific language. These localized editions cover the following languages:
ca, de, el, es, fr, ga, hr, hu, it, nl, pl, pt, ru, sl, tr. The URIs
identifying entities in these i18n data sets are constructed directly from
the non-English title and a language-specific URI namespace (e.g.
http://ru.dbpedia.org/resource/Berlin), so there are now 16 different URIs
in DBpedia that refer to Berlin. We also extract the inter-language links
from the different Wikipedia editions. Thus, whenever an inter-language
link between a non-English Wikipedia page and its English equivalent
exists, the resulting owl:sameAs link can be used to relate the localized
DBpedia URI to the equivalent in the main (English) DBpedia edition. The
localized DBpedia editions are provided for download on the DBpedia
download page (http://wiki.dbpedia.org/Downloads37). Note that we do not
provide public SPARQL endpoints for the localized editions, nor do the
localized URIs dereference. This might change in the future, as more local
DBpedia chapters are set up in different countries as part of the DBpedia
internationalization effort (http://dbpedia.org/Internationalization).
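
The URI scheme and the owl:sameAs links described above can be sketched
in a few lines of Python. This is only an illustration: the function names
are hypothetical, and the exact character-encoding rules used by the
extraction framework are not stated here, so the percent-encoding below is
an assumption.

```python
from urllib.parse import quote

MAIN_NAMESPACE = "http://dbpedia.org/resource/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def local_name(title: str) -> str:
    """Turn a page title into a URI local name.

    Assumption: spaces become underscores (as in Wikipedia page URLs) and
    other unsafe characters are percent-encoded as UTF-8.
    """
    return quote(title.replace(" ", "_"), safe="_(),'")


def localized_uri(lang: str, title: str) -> str:
    """Build a URI in the language-specific namespace, e.g. ru.dbpedia.org."""
    return f"http://{lang}.dbpedia.org/resource/{local_name(title)}"


def same_as_triple(lang: str, local_title: str, english_title: str) -> str:
    """Emit one N-Triples line linking a localized URI to the main edition."""
    subject = localized_uri(lang, local_title)
    obj = MAIN_NAMESPACE + local_name(english_title)
    return f"<{subject}> <{OWL_SAME_AS}> <{obj}> ."


print(localized_uri("ru", "Berlin"))
print(same_as_triple("de", "Berlin", "Berlin"))
```

A consumer of the localized dumps could follow such owl:sameAs triples to
merge localized data with the main (English) edition.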

Other Changes

Besides the new localized editions, the DBpedia 3.7 release provides the
following improvements and changes compared to the last release:

1. Framework

+ Redirects are resolved in a post-processing step, increasing
inter-connectivity by 13% (applied to the English data sets)
+ Extractor configuration using the dependency injection principle
+ Simple threaded loading of mappings in server
+ Improved international language parsing support thanks to the members of
the Internationalization Committee:
http://dbpedia.org/Internationalization
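
To illustrate the redirect post-processing mentioned in the first item,
here is a minimal sketch. It is not the framework's actual implementation:
the data structures (a list of triples and a dictionary of redirects) and
the function name are assumptions made for the example.

```python
def resolve_redirects(triples, redirects):
    """Rewrite triple objects that point at redirect pages.

    triples   -- iterable of (subject, predicate, object) tuples
    redirects -- dict mapping a redirect URI to its target URI
    """
    def final_target(uri):
        seen = set()
        # Follow chains of redirects, stopping if a cycle is detected.
        while uri in redirects and uri not in seen:
            seen.add(uri)
            uri = redirects[uri]
        return uri

    return [(s, p, final_target(o)) for (s, p, o) in triples]


# Hypothetical example data using shortened URIs.
redirects = {"dbr:NYC": "dbr:New_York_City"}
triples = [("dbr:Some_Person", "dbo:birthPlace", "dbr:NYC")]
print(resolve_redirects(triples, redirects))
```

After this step, links that previously ended at a redirect page point at the
resource that actually carries the data, which is what raises
inter-connectivity.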

2. Bugfixes

+ Encode homepage URLs to conform with N-Triples spec
+ Correct reference parsing
+ Recognize MediaWiki parser functions
+ Raw infobox extraction produces more object properties again
+ skos:related for category links starting with “:” and having an anchor
text
+ Restrict objects to Main namespace in MappingExtractor
+ Double rounding (e.g. a person’s height should not be 1800.00000001 cm)
+ Start position in abstract extractor
+ Server can handle template names containing a slash
+ Encoding issues in YAGO dumps
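
The double-rounding item can be illustrated with a small sketch. Unit
conversion in binary floating point can leave a tiny residue after the
decimal point; rounding the converted value to a fixed number of digits
removes it. The function name and the choice of six digits are assumptions
for this example, not the framework's actual code.

```python
def convert_and_round(value: float, factor: float, digits: int = 6) -> float:
    """Multiply by a unit-conversion factor and round away float residue."""
    return round(value * factor, digits)


# Converting metres to centimetres: the rounded result is a clean value.
print(convert_and_round(1.1, 100))
```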

3. Ontology

+ 320 ontology classes
+ 750 object properties
+ 893 datatype properties
+ owl:equivalentClass and owl:equivalentProperty mappings to
http://schema.org

Note that the ontology is now a directed acyclic graph. Classes can have
multiple superclasses, which was important for the mappings to schema.org.
A taxonomy can still be constructed by ignoring all superclasses but the
one that is specified first in the list and is considered the most important.
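
The following sketch shows how such a taxonomy could be derived from the
directed acyclic graph by keeping only the first-listed superclass. The
class names and the dictionary representation are hypothetical and chosen
only for illustration.

```python
def to_taxonomy(superclasses):
    """Keep only the first-listed superclass of every class.

    superclasses -- dict mapping a class name to its ordered list of
                    superclasses (empty list for a root class)
    """
    return {cls: (parents[0] if parents else None)
            for cls, parents in superclasses.items()}


def ancestors(cls, taxonomy):
    """Walk up the single-parent taxonomy and return the chain of ancestors."""
    chain = []
    parent = taxonomy.get(cls)
    while parent is not None:
        chain.append(parent)
        parent = taxonomy.get(parent)
    return chain


# Hypothetical excerpt: "School" has two superclasses in the DAG.
dag = {
    "Thing": [],
    "Place": ["Thing"],
    "Organisation": ["Thing"],
    "School": ["Organisation", "Place"],
}
taxonomy = to_taxonomy(dag)
print(ancestors("School", taxonomy))
```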

4. Mappings

+ Dynamic statistics for infobox mappings showing the overall and
individual coverage of the mappings in each language:
http://mappings.dbpedia.org/index.php/Mapping_Statistics
+ Improved DBpedia Ontology as well as improved Infobox mappings using
http://mappings.dbpedia.org/. These improvements are largely due to
collective work by the community before and during the DBpedia Mapping
Creation Sprint. For English, there are 17.5 million RDF statements based
on mappings, up from 13.8 million in version 3.6 (see also
http://dbpedia.org/Downloads37#ontologyinfoboxproperties).
+ ConstantProperty mappings to capture information from the template title
(e.g. Infobox_Australian_Road {{TemplateMapping | mapToClass = Road |
mappings = {{ConstantMapping | ontologyProperty = country | value =
Australia }}}})
+ Language specification for string properties in PropertyMappings (e.g.
Infobox_japan_station: {{PropertyMapping | templateProperty = name |
ontologyProperty = foaf:name | language = ja}} )
+ Multiplication factor in PropertyMappings (e.g. Infobox_GB_station:
{{PropertyMapping | templateProperty = usage0910 | ontologyProperty =
passengersPerYear | factor = 1000000}}, because it’s always specified in
millions)
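
To make the effect of the new mapping options concrete, here is a rough
Python sketch of how a constant value, a language tag, and a multiplication
factor might be applied to infobox values. The function names and the
dictionary-based infobox are assumptions for this example; the real mappings
are written in the wiki template syntax shown above and are evaluated by the
extraction framework.

```python
def apply_constant_mapping(ontology_property, value):
    """A constant mapping ignores the infobox and always emits the same value."""
    return (ontology_property, value)


def apply_property_mapping(infobox, template_property, ontology_property,
                           factor=None, language=None):
    """Map one infobox property to an ontology property.

    factor   -- optional multiplier for numeric values
    language -- optional language tag for string values
    """
    raw = infobox.get(template_property)
    if raw is None:
        return None  # the infobox does not use this template property
    if factor is not None:
        return (ontology_property, float(raw) * factor)
    if language is not None:
        return (ontology_property, f'"{raw}"@{language}')
    return (ontology_property, raw)


# Hypothetical infobox values.
print(apply_constant_mapping("country", "Australia"))
print(apply_property_mapping({"usage0910": "2.5"}, "usage0910",
                             "passengersPerYear", factor=1000000))
print(apply_property_mapping({"name": "東京"}, "name", "foaf:name",
                             language="ja"))
```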

5. RDF Links to External Data Sources

+ New RDF links pointing at resources in the following Linked Data
sources: Umbel, EUnis, LinkedMDB, Geospecies
+ Updated RDF links pointing at resources in the following Linked Data
sources: Freebase, WordNet, Opencyc, New York Times, Drugbank, Diseasome,
Flickrwrapper, Sider, Factbook, DBLP, Eurostat, Dailymed, Revyu

Accessing the new DBpedia Release

You can download the new DBpedia dataset from http://dbpedia.org/Downloads37.

As usual, the dataset is also available as Linked Data and via the DBpedia
SPARQL endpoint (http://dbpedia.org/sparql).
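
As a starting point, a query against the endpoint could be prepared as
sketched below. The query itself, the dbo:Film class, and the function name
are assumptions for illustration; the request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://dbpedia.org/sparql"

# Assumed example query: five resources typed as films in the ontology.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?film WHERE { ?film a dbo:Film } LIMIT 5
"""


def build_sparql_request(query: str) -> Request:
    """Prepare an HTTP GET request following the SPARQL protocol."""
    url = ENDPOINT + "?" + urlencode({"query": query})
    # Ask for JSON results via content negotiation.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(QUERY)
print(request.full_url)
# To execute it, pass the request to urllib.request.urlopen() and parse
# the JSON response body.
```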

Credits

Lots of thanks to

+ All editors that contributed to the DBpedia ontology mappings via the
Mappings Wiki.
+ Max Jakob (Freie Universität Berlin, Germany) for improving the DBpedia
extraction framework and for extracting the new datasets.
+ Dimitris Kontokostas (Aristotle University of Thessaloniki, Greece) for
providing language generalizations to the extraction framework.
+ Paul Kreis (Freie Universität Berlin, Germany) for administering the
ontology and for delivering the mapping statistics and schema.org
mappings.
+ Uli Zellbeck (Freie Universität Berlin, Germany) for providing the links
to external datasets using the Silk framework.
+ The whole Internationalization Committee for expanding some DBpedia
extractors to a number of languages:
http://dbpedia.org/Internationalization.
+ Kingsley Idehen and Mitko Iliev (both OpenLink Software) for loading the
dataset into the Virtuoso instance that serves the Linked Data view and
SPARQL endpoint. OpenLink Software (http://www.openlinksw.com/) altogether
for providing the server infrastructure for DBpedia.

The work on the new release was financially supported by:

+ The European Commission through the project LOD2 - Creating Knowledge
out of Linked Data (http://lod2.eu/, improvements to the extraction
framework).
+ The European Commission through the project LATC - LOD Around the Clock
(http://latc-project.eu/, creation of external RDF links).
+ Vulcan Inc. as part of its Project Halo (http://www.projecthalo.com/).

More information about DBpedia can be found at http://dbpedia.org/About.

Have fun with the new data set!

Cheers,

Chris Bizer
Received on Sunday, 11 September 2011 09:22:38 UTC
