Wikipedia as POI database (dbpedia geo), sparql and extensibility

Hi folks

I'm sure this is old news to many of you, but for those who haven't seen it yet:

The DBpedia project extracts structured data from Wikipedia pages,
drawing mostly on infoboxes. The result is a giant 'graph'
structure, linking entities with relationships, properties etc. They
make it available via various data dumps, as well as providing an
online query service. The querying (and data model) is in terms of
RDF, i.e. 'thing1 -property- thing2' triples, and RDF's 'SPARQL' query
language.
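To make the triple idea concrete, here is a minimal Python sketch. It is not how DBpedia stores or serves its data; it just holds a graph as a list of (subject, property, object) tuples, with a hypothetical `match` helper in which `None` plays the role of a SPARQL variable. The identifiers and values are illustrative only.

```python
# Toy triple store: each fact is a (subject, property, object) tuple.
# Names and values are illustrative, not taken from a DBpedia dump.
TRIPLES = [
    ("Eiffel_Tower", "architect", "Stephen_Sauvestre"),
    ("Eiffel_Tower", "lat", 48.8583),
    ("Eiffel_Tower", "long", 2.2945),
    ("Louvre", "lat", 48.8606),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None is a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "What do we know about the Eiffel Tower?"
for _, prop, val in match(TRIPLES, s="Eiffel_Tower"):
    print(prop, val)
```

A SPARQL query is essentially several such patterns evaluated together, with shared variables tying them to each other.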


The Geo section of the datasets page describes more,

Static data dumps,

"4.5. Geo-Coordinates

The DBpedia data set contains geo-coordinates for 392,000 geographic
locations. Geo-coordinates are expressed using the W3C Basic Geo
Vocabulary."

...but also richer structures via their SPARQL query server:

"Besides simple listings of geo-coordinates (e.g., German soccer
stadiums), the new geo-coordinates allow sophisticated queries, like
“show me all things next to the [...]”.

The above page has hyperlinks into live queries for these examples.
I'll copy some here to give a sense for the way it works:

German Soccer stadiums (directly from a Wikipedia category),

PREFIX geo: <>
SELECT ?subject ?lat ?long WHERE {
?subject skos:subject
?subject geo:lat ?lat.
?subject geo:long ?long.
} LIMIT 20

Fancier query:

Things 'near' Eiffel Tower,

PREFIX geo: <>
SELECT ?subject ?label ?lat ?long WHERE {
<> geo:lat ?eiffelLat.
<> geo:long ?eiffelLong.
?subject geo:lat ?lat.
?subject geo:long ?long.
?subject rdfs:label ?label.
FILTER(xsd:double(?lat) - xsd:double(?eiffelLat) <= 0.05 &&
xsd:double(?eiffelLat) - xsd:double(?lat) <= 0.05 &&
xsd:double(?long) - xsd:double(?eiffelLong) <= 0.05 &&
xsd:double(?eiffelLong) - xsd:double(?long) <= 0.05 &&
lang(?label) = "en")
} LIMIT 20
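The four numeric comparisons in that FILTER amount to a bounding-box test: latitude and longitude must each lie within 0.05 degrees of the tower's. A rough Python equivalent, as a sketch only (the function name `near` and the sample coordinates are my own, and the coordinates are approximate rather than query results):

```python
def near(lat, lon, center_lat, center_lon, delta=0.05):
    """True if (lat, lon) lies within +/- delta degrees of the centre.

    Same test as the four FILTER comparisons: a square box in degrees,
    not a true distance on the ground.
    """
    return abs(lat - center_lat) <= delta and abs(lon - center_lon) <= delta


# Illustrative, approximate coordinates.
EIFFEL = (48.8583, 2.2945)
print(near(48.8606, 2.3376, *EIFFEL))   # somewhere in central Paris
print(near(51.5007, -0.1246, *EIFFEL))  # somewhere in London
```

Note that 0.05 degrees of latitude is roughly 5.5 km, while the same span of longitude covers less ground at Paris's latitude, so the 'near' region is a rectangle rather than a circle.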

So far, so geographical. Let's dig into one of these POIs, Eiffel
Tower, to see what's in there.

We can do that by querying, or by looking at the associated Web page
for that item.

Here's a simple dumb query, asking for properties of the tower:

?prop ?val . }

You can run the query at (it's often very
fast, but sometimes overloaded; Amazon EC2 snapshots are available
somewhere if you want your own, or just download the data + Virtuoso
RDF db).
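If you'd rather script it than paste queries into a form, the standard SPARQL protocol lets you send the query text as a `query` parameter on a GET request. A minimal sketch of building such a URL in Python; the endpoint address below is a placeholder, not the real service, and `query_url` is just a name I've made up:

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the real query service address.
ENDPOINT = "https://example.org/sparql"


def query_url(endpoint, sparql):
    """Build a GET URL carrying the query in the standard 'query' parameter."""
    return endpoint + "?" + urlencode({"query": sparql})


url = query_url(ENDPOINT, "SELECT * WHERE { ?s ?p ?o } LIMIT 5")
print(url)
```

Fetching that URL (with `urllib.request` or similar) should return the result table; which result formats you can ask for depends on the server.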

The results are property/value pairs. You can also see them directly
by going to in a browser
(note that it will redirect from the object's ID to the URI of an
associated page, ).

One property is dbpprop:architect, with the value dbpedia:Stephen_Sauvestre.

Let's refine our query to find other things with that architect:

# Assume these declarations from now on.
PREFIX dbpprop: <>
PREFIX dbpedia: <>
PREFIX dbpedia-owl: <>

SELECT DISTINCT * WHERE { ?x dbpprop:architect  dbpedia:Stephen_Sauvestre . }

-> both a success and a failure. It finds just one result: the Eiffel Tower.
This shows the data is sparse, or inaccurate, or describes things
inconsistently.  So if I backtrack and look at I see the guy's name is
there but in red ink, i.e. missing; he has no page in the English
Wikipedia, and finding
nearby, it's clear that his works there are described in a different
way, so don't show up in dbpedia yet.

Let's try another property: dbpedia-owl:engineer ...

SELECT DISTINCT * WHERE { ?x dbpedia-owl:engineer ?eng . }

This finds a good number of constructions, around 100, plus their
associated engineers.

What can dbpedia tell us about those engineers? The quickest, hackiest
check is just to run another query: we ask for properties ?p and
values ?v of things (?eng) that are the engineer of some other thing:

SELECT DISTINCT * WHERE {
  ?x dbpedia-owl:engineer ?eng .
  ?eng ?p ?v .
}
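What makes this query a two-hop one is the shared ?eng variable: it is the object of the first pattern and the subject of the second. Here is a small Python sketch of that join over toy data (the names are placeholders I've invented, not DBpedia results, and `engineer_facts` is hypothetical):

```python
# Toy data; names are placeholders, not DBpedia results.
TRIPLES = [
    ("Bridge_A", "engineer", "Engineer_1"),
    ("Tower_B", "engineer", "Firm_2"),
    ("Engineer_1", "nationality", "USA"),
    ("Firm_2", "numberOfStaff", 1200),
    ("Tower_B", "lat", 40.0),
]


def engineer_facts(triples):
    """Yield (x, eng, p, v): x has engineer eng, and eng has property p = v."""
    for x, prop, eng in triples:
        if prop != "engineer":
            continue
        for s, p, v in triples:
            if s == eng:  # shared ?eng variable joins the two patterns
                yield (x, eng, p, v)


for row in engineer_facts(TRIPLES):
    print(row)
```

A real SPARQL engine would use indexes rather than nested loops, but the result has the same shape: one row per combination of construction, engineer, property and value.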

Running this, we see constructions like

(...there are also properties relating to 'number of staff' etc., which
show that Wikipedia/DBpedia's notion of an engineer is a bit messy:
sometimes a person, sometimes a company.)

So let's try one of these,

Running that as another SPARQL query, ...

PREFIX dbpprop: <>
PREFIX dbpedia: <>
PREFIX dbpedia-owl: <>
SELECT DISTINCT * WHERE {
  ?x dbpedia-owl:engineer ?eng .
  ?eng dbpprop:nationality dbpedia:USA .
}

...results are a 2 column table of hits,

x	eng

...same guy every time.

So, the lessons here? DBpedia has 300k+ basic POIs, plus a jumble of
other information about the places and associated entities. Nice
things: you can query it with a standard language. Problems: the data
can be very sparse, or described inconsistently, so queries can very
easily bring back far fewer results than you'd expect.

As an extensibility model, it has a lot to recommend it; we can find
POIs based on arbitrary properties (date of birth, schooling,
nationality) of entities that are related to the POI by arbitrary
properties (owner, manager, architect, ... whatever). The downside is
that you're suddenly exploring unknown and rather chaotic data two
clicks from the objects you're actually interested in, so it can be so
gappy you have difficulty finding anything. Perhaps if we also turn to
other related open/public datasets, the sparseness problem can be
addressed? And of course it's also a business opportunity; the lack of
perfect public data isn't directly a problem this WG is supposed to
solve. Commercial datasets could fill many of the gaps in DBpedia,
especially if they linked themselves in by using common identifiers
for the real world entities...



Received on Monday, 8 November 2010 13:11:08 UTC