ANN: DBpedia 3.5 released from Chris Bizer on 2010-04-12 (public-lod@w3.org from April 2010)

From: Chris Bizer <chris@bizer.de>
Date: Mon, 12 Apr 2010 11:06:01 +0200
To: <dbpedia-discussion@lists.sourceforge.net>, <dbpedia-announcements@lists.sourceforge.net>
Cc: <public-lod@w3.org>, "'SW-forum'" <semantic-web@w3.org>, <semanticweb@yahoogroups.com>
Message-ID: <014f01cada1f$5f63ae50$1e2b0af0$@de>

Hi all,

we are happy to announce the release of DBpedia 3.5. 

The new release is based on Wikipedia dumps dating from March 2010. Compared
to the 3.4 release, we were able to increase the quality of the DBpedia
knowledge base by employing a new data extraction framework which applies
various data cleansing heuristics as well as by extending the
infobox-to-ontology mappings that guide the data extraction process.

The new DBpedia knowledge base describes more than 3.4 million things, out
of which 1.47 million are classified in a consistent ontology, including
312,000 persons, 413,000 places, 94,000 music albums, 49,000 films, 15,000
video games, 140,000 organizations, 146,000 species and 4,600 diseases. The
DBpedia data set features labels and abstracts for these 3.2 million things
in up to 92 different languages; 1,460,000 links to images and 5,543,000
links to external web pages; 4,887,000 external links into other RDF
datasets, 565,000 Wikipedia categories, and 75,000 YAGO categories. The
DBpedia knowledge base altogether consists of over 1 billion pieces of
information (RDF triples) out of which 257 million were extracted from the
English edition of Wikipedia and 766 million were extracted from other
language editions.

The new release provides the following improvements and changes compared to
the DBpedia 3.4 release:

1. The DBpedia extraction framework has been completely rewritten in Scala.
The new framework dramatically reduces the extraction time of a single
Wikipedia article from over 200 to about 13 milliseconds. All features of
the previous PHP framework have been ported. In addition, the new framework
can extract data from Wikipedia tables based on table-to-ontology mappings
and is able to extract multiple infoboxes out of a single Wikipedia article.
The data from each infobox is represented as a separate RDF resource. All
resources that are extracted from a single page can be connected using
custom RDF properties which are also defined in the mappings. A lot of work
also went into the value parsers and the DBpedia 3.5 dataset should
therefore be much cleaner than its predecessors. In addition, units of
measurement are normalized to their respective SI unit, which makes querying
DBpedia easier. 
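
To illustrate what normalizing units of measurement to SI units means for
a consumer of the data, here is a minimal Python sketch. It is not the code
of the Scala framework: the conversion table and the function name
normalize are assumptions made up for this example.

```python
from typing import Dict, Tuple

# Hypothetical conversion table: source unit -> (SI unit, multiplication factor).
# The unit tables actually used by the extraction framework are not shown here.
TO_SI: Dict[str, Tuple[str, float]] = {
    "m": ("m", 1.0),
    "cm": ("m", 0.01),
    "km": ("m", 1000.0),
    "ft": ("m", 0.3048),
    "kg": ("kg", 1.0),
    "g": ("kg", 0.001),
}


def normalize(value: float, unit: str) -> Tuple[float, str]:
    """Convert a parsed infobox value to the SI unit of its dimension."""
    si_unit, factor = TO_SI[unit]
    return value * factor, si_unit


# Two infoboxes stating a length in different units end up comparable.
print(normalize(2.5, "km"))
print(normalize(8202, "ft"))
```

Once every length is stored in meters, a query can filter on a single
numeric range instead of handling each source unit separately.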

2. The mapping language that is used to map Wikipedia infoboxes to the
DBpedia Ontology has been redesigned. The documentation of the new mapping
language is found at
http://dbpedia.svn.sourceforge.net/viewvc/dbpedia/trunk/extraction/core/doc/mapping%20language/

3. In order to enable the DBpedia user community to extend and refine the
infobox to ontology mappings, the mappings can be edited on the newly
created wiki hosted on http://mappings.dbpedia.org. At the moment, 303
template mappings are defined, which cover (including redirects) 1,055
templates. On the wiki, the DBpedia Ontology can be edited by the community
as well. At the moment, the ontology consists of 259 classes and about 1,200
properties.
 
4. The ontology properties extracted from infoboxes are now split into two
data sets: 1. The Ontology Infobox Properties dataset contains the
properties as they are defined in the ontology (e.g. length). The range of a
property is either an xsd schema type or a dimension of measurement, in
which case the value is normalized to the respective SI unit. 2. The
Ontology Infobox Properties (Specific) dataset contains properties which
have been specialized for a specific class using a specific unit. For example, the
property height is specialized on the class Person using the unit
centimeters instead of meters. For further details please refer to
http://wiki.dbpedia.org/Datasets#h18-11.
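
The relationship between the two data sets can be sketched in Python as
follows. This is only an illustration under assumptions: the lookup table
SPECIFIC_UNITS, the function specialize and the "Class/property" naming are
hypothetical and are not taken from the framework.

```python
from typing import Dict, Optional, Tuple

# Hypothetical table: (class, property) -> (specific unit, factor from the SI value).
SPECIFIC_UNITS: Dict[Tuple[str, str], Tuple[str, float]] = {
    ("Person", "height"): ("centimeter", 100.0),
}


def specialize(cls: str, prop: str, si_value: float) -> Optional[Tuple[str, float, str]]:
    """Return (property name, value, unit) for the class-specific data set,
    or None if the property is not specialized for this class."""
    entry = SPECIFIC_UNITS.get((cls, prop))
    if entry is None:
        return None
    unit, factor = entry
    return (cls + "/" + prop, si_value * factor, unit)


# The generic data set would keep height in meters (the SI unit);
# the specific data set restates it in centimeters for persons.
print(specialize("Person", "height", 1.85))
print(specialize("Place", "height", 10.0))
```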
 
5. The framework now resolves template redirects, making it possible to
cover all redirects to an infobox on Wikipedia with a single mapping. 
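
The idea of resolving template redirects before looking up a mapping can be
sketched in a few lines of Python. The template names, the redirect table
and the function resolve below are illustrative assumptions, not data or
code from the framework.

```python
from typing import Dict

# Hypothetical redirect table: template name -> template it redirects to.
REDIRECTS: Dict[str, str] = {
    "Infobox Band": "Infobox Musician",
    "Infobox Musician": "Infobox musical artist",
}

# Hypothetical mappings: one entry per canonical template only.
MAPPINGS: Dict[str, str] = {
    "Infobox musical artist": "MusicalArtist",
}


def resolve(template: str, redirects: Dict[str, str]) -> str:
    """Follow redirects until a template without a redirect is reached.
    The seen set guards against redirect cycles."""
    seen = set()
    while template in redirects and template not in seen:
        seen.add(template)
        template = redirects[template]
    return template


# A page using a redirecting template is covered by the single mapping.
print(MAPPINGS.get(resolve("Infobox Band", REDIRECTS)))
```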

6. Three new extractors have been implemented: 1. PageIdExtractor, which
extracts the Wikipedia page ID of each page. 2. RevisionExtractor, which
extracts the latest revision of a page. 3. PNDExtractor, which extracts PND
(Personennamendatei) identifiers.

7. The data set now provides labels, abstracts, page links and infobox data
in 92 different languages, which have been extracted from recent Wikipedia
dumps as of March 2010. 

8. In addition to the N-Triples datasets, N-Quads datasets are provided which
include a provenance URI for each statement. The provenance URI denotes the
origin of the extracted triple in Wikipedia (for details see
http://wiki.dbpedia.org/Datasets#h18-18).
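
As a rough illustration of how the fourth element of an N-Quads statement
can be read, here is a minimal Python sketch. It only handles simple
statements whose subject, predicate and provenance are written as URIs in
angle brackets; it is not a full N-Quads parser. The sample line, including
the form of its provenance URI, is an assumption made up for this example.

```python
import re

# subject, predicate, object (URI or literal), provenance URI, final dot.
QUAD = re.compile(r'^(<[^>]*>)\s+(<[^>]*>)\s+(.+?)\s+(<[^>]*>)\s*\.\s*$')


def split_quad(line: str):
    """Split one simple N-Quads line into (subject, predicate, object, provenance)."""
    match = QUAD.match(line)
    if match is None:
        raise ValueError("not a simple N-Quads statement: " + line)
    return match.groups()


# Hypothetical statement; real provenance URIs may look different.
sample = ('<http://dbpedia.org/resource/Berlin> '
          '<http://www.w3.org/2000/01/rdf-schema#label> '
          '"Berlin"@en '
          '<http://en.wikipedia.org/wiki/Berlin> .')

subject, predicate, obj, provenance = split_quad(sample)
print(provenance)
```

Dropping the fourth element and keeping the first three yields the
corresponding N-Triples statement.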

You can download the new DBpedia dataset from
http://wiki.dbpedia.org/Downloads35. As usual, the data set is also
available as Linked Data and via the DBpedia SPARQL endpoint. 
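
For readers who want to try the SPARQL endpoint, the following Python
sketch builds a request URL using the standard "query" parameter of the
SPARQL protocol. The endpoint address http://dbpedia.org/sparql, the
ontology namespace and the class name Film are assumptions that are not
stated in this announcement. The sketch only constructs the URL and does
not send a request.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"  # assumed endpoint address

# Assumed namespace and class name; adjust to the published ontology.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?film WHERE { ?film a dbo:Film } LIMIT 10
"""


def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as the 'query' parameter of a SPARQL protocol GET request."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL can be opened with any HTTP client.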

Lots of thanks to: 

* Robert Isele, Anja Jentzsch, Christopher Sahnwaldt, and Paul Kreis (all
Freie Universität Berlin) for reimplementing the DBpedia extraction
framework in Scala, for extending the infobox-to-ontology mappings and for
extracting the new DBpedia 3.5 knowledge base. 

* Jens Lehmann and Sören Auer (both Universität Leipzig) for providing the
knowledge base via the DBpedia download server at Universität Leipzig. 

* Kingsley Idehen and Mitko Iliev (both OpenLink Software) for loading the
knowledge base into the Virtuoso instance that serves the Linked Data view
and SPARQL endpoint. 

The whole DBpedia team is very thankful to three companies which enabled us
to do all this by supporting and sponsoring the DBpedia project:

* Neofonie GmbH (http://www.neofonie.de/index.jsp), a Berlin-based company
offering leading technologies in the area of Web search, social media and
mobile applications.

* Vulcan Inc. as part of its Project Halo (www.projecthalo.com). Vulcan Inc.
creates and advances a variety of world-class endeavors and high impact
initiatives that change and improve the way we live, learn, do business
(http://www.vulcan.com/).

* OpenLink Software (http://www.openlinksw.com/). OpenLink Software develops
the Virtuoso Universal Server, an innovative enterprise grade server that
cost-effectively delivers an unrivaled platform for Data Access, Integration
and Management. 

More information about DBpedia is found at http://dbpedia.org/About

Have fun with the new DBpedia knowledge base! 

Cheers, 

Chris Bizer


--
Prof. Dr. Christian Bizer
Web-based Systems Group
Freie Universität Berlin
+49 30 838 55509
http://www.bizer.de
chris@bizer.de
Received on Monday, 12 April 2010 09:04:43 UTC