- From: Dan Brickley <danbri@danbri.org>
- Date: Mon, 8 Nov 2010 14:10:35 +0100
- To: public-poiwg@w3.org
Hi folks,

I'm sure this is old news to many of you, but for those who haven't seen it yet:

The DBpedia project extracts structured data from Wikipedia pages, mostly from infoboxes. The result is a giant 'graph' structure, linking entities with relationships, properties etc. They make it available via various data dumps, as well as providing an online query service. The querying (and data model) is in terms of RDF, i.e. 'thing1 -property- thing2' triples, and RDF's 'SPARQL' query language.

From http://wiki.dbpedia.org/Datasets

The Geo section of the datasets page describes more, http://wiki.dbpedia.org/Datasets#h18-17

Static data dumps,

"4.5. Geo-Coordinates
The DBpedia data set contains geo-coordinates for 392,000 geographic locations. Geo-coordinates are expressed using the W3C Basic Geo Vocabulary."

...but also richer structures via their SPARQL query server:

"Besides simple listings of geo-coordinates (e.g., German soccer stadiums), the new geo-coordinates allow sophisticated queries, like “show me all things next to the [...]”."

The above page has hyperlinks into live queries for these examples. I'll copy some here to give a sense of the way it works:

German soccer stadiums (directly from a Wikipedia category),

    PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?subject ?lat ?long WHERE {
      ?subject skos:subject <http://dbpedia.org/resource/Category:Football_venues_in_Germany>.
      ?subject geo:lat ?lat.
      ?subject geo:long ?long.
    } LIMIT 20

Fancier query: things 'near' the Eiffel Tower,

    PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    SELECT ?subject ?label ?lat ?long WHERE {
      <http://dbpedia.org/resource/Eiffel_Tower> geo:lat ?eiffelLat.
      <http://dbpedia.org/resource/Eiffel_Tower> geo:long ?eiffelLong.
      ?subject geo:lat ?lat.
      ?subject geo:long ?long.
      ?subject rdfs:label ?label.
      FILTER(xsd:double(?lat) - xsd:double(?eiffelLat) <= 0.05 &&
             xsd:double(?eiffelLat) - xsd:double(?lat) <= 0.05 &&
             xsd:double(?long) - xsd:double(?eiffelLong) <= 0.05 &&
             xsd:double(?eiffelLong) - xsd:double(?long) <= 0.05 &&
             lang(?label) = "en"
      ).
    } LIMIT 20

So far, so geographical. Let's dig into one of these POIs, the Eiffel Tower, to see what's in there. We can do that by querying, or by looking at the associated Web page for that item. Here's a simple dumb query, asking for properties of the tower:

    SELECT DISTINCT * WHERE {
      <http://dbpedia.org/resource/Eiffel_Tower> ?prop ?val .
    }

You can run the query at http://dbpedia.org/sparql (it's often very fast, but sometimes overloaded; Amazon EC2 snapshots are available somewhere if you want your own, or just download the data + Virtuoso RDF db).

The results are property/value pairs. You can also see them directly by going to http://dbpedia.org/resource/Eiffel_Tower in a browser (note that it will redirect from the object's ID to the URI of an associated page, http://dbpedia.org/page/Eiffel_Tower ).

One property is dbpprop:architect and the value is dbpedia:Stephen_Sauvestre

Let's refine our query to find other things with that architect:

    PREFIX dbpprop: <http://dbpedia.org/property/>
    PREFIX dbpedia: <http://dbpedia.org/resource/>
    PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
    # assume these declarations from now on

    SELECT DISTINCT * WHERE {
      ?x dbpprop:architect dbpedia:Stephen_Sauvestre .
    }

-> a success / failure. It finds just one result: the Eiffel Tower. This shows the data is sparse, or inaccurate, or describes things inconsistently.

So if I backtrack and look at http://en.wikipedia.org/wiki/Eiffel_Tower I see the guy's name is there but in red ink, i.e. missing; he has no page in the English Wikipedia. Finding http://fr.wikipedia.org/wiki/Stephen_Sauvestre nearby, it's clear that his works there are described in a different way, so they don't show up in DBpedia yet.

Let's try another property: dbpedia-owl:engineer ...
    SELECT DISTINCT * WHERE {
      ?x dbpedia-owl:engineer ?eng .
    }

This finds a good number of constructions, around 100, plus their associated engineers.

What can DBpedia tell us about those engineers? The quickest, hackiest check is just to run another query: we ask for properties ?p and values ?v of things (?eng) that are the engineer of some other thing ?x:

    SELECT DISTINCT ?p ?v ?eng WHERE {
      ?x dbpedia-owl:engineer ?eng .
      ?eng ?p ?v .
    }

Running this, we see properties like

    http://dbpedia.org/property/placeOfBirth
    http://dbpedia.org/property/placeOfDeath
    http://dbpedia.org/property/significantProjects
    http://dbpedia.org/property/nationality
    http://dbpedia.org/property/education

(...there are also properties relating to 'number of staff' etc., which shows that Wikipedia/DBpedia's notion of an engineer is a bit messy: sometimes a person, sometimes a company.)

So let's try one of these,

    http://dbpedia.org/property/nationality  http://dbpedia.org/resource/USA

Running that as another SPARQL query, ...

    PREFIX dbpprop: <http://dbpedia.org/property/>
    PREFIX dbpedia: <http://dbpedia.org/resource/>
    PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>

    SELECT DISTINCT * WHERE {
      ?x dbpedia-owl:engineer ?eng .
      ?eng dbpprop:nationality dbpedia:USA .
    }

...results are a 2 column table of hits,

    x                                                            eng
    http://dbpedia.org/resource/Gate_of_Europe                   http://dbpedia.org/resource/Leslie_E._Robertson
    http://dbpedia.org/resource/Shanghai_World_Financial_Center  http://dbpedia.org/resource/Leslie_E._Robertson
    http://dbpedia.org/resource/DST_Group_Building               http://dbpedia.org/resource/Leslie_E._Robertson
    http://dbpedia.org/resource/PGGMB_Building                   http://dbpedia.org/resource/Leslie_E._Robertson
    http://dbpedia.org/resource/Ministry_of_Finance_Brunei       http://dbpedia.org/resource/Leslie_E._Robertson

...same guy every time.

So the lessons here? DBpedia has 300k+ basic POIs, plus a jumble of other information about the places and associated entities.

Nice things: you can query it with a standard language.
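
Because both the query language and the HTTP protocol around it are standardised, the queries above don't have to go through the web form. As a rough sketch only: the Python below sends the engineer/nationality query to http://dbpedia.org/sparql using the "query" parameter from the SPARQL Protocol, and asks for SPARQL JSON results via an Accept header. The function names (build_request, rows_from_json, run_query) and the 30 second timeout are my own choices for illustration, and I'm assuming the endpoint honours that Accept header.

```python
import json
import urllib.parse
import urllib.request

# Endpoint named earlier in this mail.
ENDPOINT = "http://dbpedia.org/sparql"

# The engineer/nationality query from above, with its prefixes spelled out.
ENGINEER_QUERY = """
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT DISTINCT * WHERE {
  ?x dbpedia-owl:engineer ?eng .
  ?eng dbpprop:nationality dbpedia:USA .
}
"""


def build_request(query, endpoint=ENDPOINT):
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows_from_json(text):
    """Flatten a SPARQL JSON result document into {variable: value} dicts."""
    doc = json.loads(text)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in doc["results"]["bindings"]
    ]


def run_query(query, endpoint=ENDPOINT, timeout=30):
    """Send the query over HTTP and return the rows (needs network access)."""
    request = build_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return rows_from_json(response.read().decode("utf-8"))
```

Calling run_query(ENGINEER_QUERY) needs network access and returns one dict per row, keyed by "x" and "eng"; what comes back depends on the state of the data at the time. The other two functions can be tried offline.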

Problems: the data can be very sparse, or varied, so queries can very easily bring back far fewer results than you'd expect.

As an extensibility model, it has a lot to recommend it; we can find POIs based on arbitrary properties (date of birth, schooling, nationality) of entities that are related to the POI by arbitrary properties (owner, manager, architect, ... whatever). The downside is that you're suddenly exploring unknown and rather chaotic data two clicks from the objects you're actually interested in, and it can be so gappy that you have difficulty finding anything.

Perhaps if we also turn to other related open/public datasets, the sparseness problem can be addressed? And of course it's also a business opportunity; the lack of perfect public data isn't directly a problem this WG is supposed to solve. Commercial datasets could fill many of the gaps in DBpedia, especially if they linked themselves in by using common identifiers for the real world entities...

cheers,

Dan
Received on Monday, 8 November 2010 13:11:08 UTC