ANN: DBpedia - New version of the DBpedia dataset released. from Chris Bizer on 2007-09-05 (semantic-web@w3.org from September 2007)

From: Chris Bizer <chris@bizer.de>
Date: Wed, 5 Sep 2007 17:48:29 +0200
To: <semantic-web@w3.org>, <dbpedia-discussion@lists.sourceforge.net>, "Linking Open Data" <linking-open-data@simile.mit.edu>
Message-ID: <004301c7efd4$3467edc0$418d2da0@wrz03715>

Hi all,

After quite some work on improving the DBpedia information 
extraction framework, we have released a new version of the DBpedia 
dataset today.

DBpedia is a community effort to extract structured information from 
Wikipedia and to make this information available on the Web. DBpedia 
allows you to ask sophisticated queries against Wikipedia and to link 
other datasets on the Web to Wikipedia data.

The DBpedia dataset describes 1,950,000 "things", including at least 
80,000 persons, 70,000 places, 35,000 music albums and 12,000 films. It 
contains 657,000 links to images, 1,600,000 links to relevant external 
web pages and 440,000 external links into other RDF datasets. 
Altogether, the DBpedia dataset consists of around 103 million RDF 
triples.

The dataset has been extracted from the July 2007 Wikipedia dumps of 
English, German, French, Spanish, Italian, Portuguese, Polish, 
Swedish, Dutch, Japanese, Chinese, Russian, Finnish and Norwegian 
versions of Wikipedia. It contains descriptions in all these 
languages.

Compared to the last version, we did the following:

1. Improved the Data Quality

We increased the quality of the data by improving the DBpedia 
information extraction algorithms. So if you have decided that the old 
version of the dataset was too dirty for your application, please look 
again; you will be surprised :-)

2. Third Classification Schema Added

We have added a third classification schema to the dataset. Besides 
the Wikipedia categorization and the YAGO classification, concepts are 
now also classified by associating them with WordNet synsets.

3. Geo-Coordinates

The dataset contains geo-coordinates for geographic locations. 
Geo-coordinates are expressed using the W3C Basic Geo Vocabulary. This 
enables location-based SPARQL queries.
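
As a rough illustration of what such a location-based query could look 
like, the following Python sketch builds a bounding-box SPARQL query 
over the geo:lat and geo:long properties of the W3C Basic Geo 
Vocabulary. The helper name build_bbox_query, the bounding-box approach 
and the example coordinates are assumptions made for illustration; they 
are not taken from the DBpedia documentation.

```python
# Namespace of the W3C Basic Geo (WGS84) vocabulary.
GEO_NS = "http://www.w3.org/2003/01/geo/wgs84_pos#"


def build_bbox_query(min_lat, max_lat, min_long, max_long, limit=20):
    """Return a SPARQL query for resources inside a lat/long bounding box.

    Hypothetical helper: only the geo:lat / geo:long properties come from
    the W3C Basic Geo Vocabulary; everything else is illustrative.
    """
    return (
        f"PREFIX geo: <{GEO_NS}>\n"
        "SELECT ?place ?lat ?long WHERE {\n"
        "  ?place geo:lat ?lat ;\n"
        "         geo:long ?long .\n"
        f"  FILTER (?lat >= {min_lat} && ?lat <= {max_lat} &&\n"
        f"          ?long >= {min_long} && ?long <= {max_long})\n"
        "}\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Roughly the area around Berlin (coordinates chosen for illustration).
    print(build_bbox_query(52.3, 52.7, 13.1, 13.8))
```

The resulting string could then be submitted to any SPARQL endpoint 
that serves the dataset.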

4. RDF Links to other Open Datasets

We interlinked DBpedia with further open datasets and ontologies. The 
dataset now contains 440,000 external RDF links into the Geonames, 
Musicbrainz, WordNet, World Factbook, EuroStat, Book Mashup, DBLP 
Bibliography and Project Gutenberg datasets. Altogether, the network 
of interlinked datasources around DBpedia currently amounts to around 
2 billion RDF triples which are accessible as Linked Data on the Web.

The DBpedia dataset is licensed under the terms of the GNU Free 
Documentation License. The dataset can be accessed online via a SPARQL 
endpoint and 
as Linked Data. It can also be downloaded in the form of RDF dumps.
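
For readers who want to try the SPARQL endpoint from a script, here is 
a minimal Python sketch that builds a request URL in the style of the 
SPARQL Protocol, where the query travels in the "query" parameter of a 
GET request. The endpoint address http://dbpedia.org/sparql and the 
helper name sparql_get_url are assumptions, since this announcement 
does not state the endpoint address; please check the DBpedia webpage 
for the actual one.

```python
from urllib.parse import urlencode

# Assumed endpoint address; the announcement itself does not state it.
ENDPOINT = "http://dbpedia.org/sparql"


def sparql_get_url(query, endpoint=ENDPOINT):
    """Build a GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    example_query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"
    print(sparql_get_url(example_query))
    # Sending the request is left out so the sketch runs offline;
    # urllib.request.urlopen(url) would be one way to fetch the result.
```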

Please refer to the DBpedia webpage for more information about the 
dataset and its use cases:

http://dbpedia.org/

Many thanks for their excellent work to:

1. Georgi Kobilarov (Freie Universität Berlin) who redesigned and 
improved the extraction framework and implemented many of the 
interlinking algorithms.
2. Piet Hensel (Freie Universität Berlin) who improved the infobox 
extraction code and wrote the unit test suite.
3. Richard Cyganiak (Freie Universität Berlin) for his advice on 
redesigning the architecture of the extraction framework and for 
helping to solve many annoying Unicode and URI problems.
4. Zdravko Tashev (OpenLink Software) for his patience in trying 
several times to import buggy versions of the dataset into Virtuoso.
5. OpenLink Software altogether for providing the server that hosts 
the DBpedia SPARQL endpoint.
6. Sören Auer, Jens Lehmann and Jörg Schüppel (Universität Leipzig) 
for the original version of the infobox extraction code.
7. Tom Heath and Peter Coetzee (Open University) for the RDFS version 
of the YAGO class hierarchy.
8. Fabian M. Suchanek, Gjergji Kasneci (Max-Planck-Institut 
Saarbrücken) for allowing us to integrate the YAGO classification.
9. Christian Becker (Freie Universität Berlin) for writing the 
geo-coordinates and the homepage extractor.
10. Ivan Herman, Tim Berners-Lee, Rich Knopman and many others for 
their bug reports.

Have fun exploring the new dataset :-)

Cheers

Chris

--
Chris Bizer
Freie Universität Berlin
Phone: +49 30 838 54057
Mail: chris@bizer.de
Web: www.bizer.de
Received on Wednesday, 5 September 2007 15:52:48 UTC