Wikipedia and Geonames. was: AW: ANN: RDF Book Mashup - Integrating Web 2.0 data sources like Amazon and Google into the Semantic Web

>>> I wish that wikipedia had a fully exportable database
>>> http://en.wikipedia.org/wiki/Lists_of_films
>>>
>>> For example, being able to export all data of this movie as RDF,
>>> maybe a templating issue at least for the box on the right.
>>> http://en.wikipedia.org/wiki/2046_%28film%29
>>
>> Should be an easy job for a SIMILE-like screen scraper.
>>
>> If you start scraping down from the Wikipedia film list, you  should get 
>> a fair amount of data.

Some further ideas along these lines: what about scraping information about 
geographic places like countries and cities from Wikipedia and linking the 
data to geonames (http://www.geonames.org/ontology/)?

Something like http://XXX/wikipedia/Embrun owl:sameAs 
http://sws.geonames.org/3020251/

The Wikipedia articles about countries and cities all follow relatively 
similar structures (for instance http://en.wikipedia.org/wiki/Berlin), so it 
should be easy to scrape them. They already contain links to other places, 
like the boroughs and localities of Berlin, which could easily be 
transformed into RDF links.

Many places have geo-coordinates which, together with the place name, allow 
a scraper to automatically create links to the corresponding geonames localities.
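Just to make this concrete, here is a quick Python sketch of that matching 
step. It assumes the geonames searchJSON web service; the geonames username, 
the 20 km distance threshold and the base URI for the scraped Wikipedia data 
(the http://XXX/wikipedia/ placeholder from above) are placeholders, and the 
Embrun coordinates are only approximate.

# Sketch: link a scraped Wikipedia place to a geonames URI via owl:sameAs.
# Assumes the geonames "searchJSON" web service; the username and the base
# URI for the scraped Wikipedia data are placeholders.
import json
import math
import urllib.parse
import urllib.request

GEONAMES_SEARCH = "http://api.geonames.org/searchJSON"
WIKIPEDIA_BASE = "http://XXX/wikipedia/"   # placeholder, as in the example above
GEONAMES_USER = "demo"                     # replace with a real geonames account

def find_geonames_uri(name, lat, lng, max_km=20.0):
    """Look up a place by name and keep the hit closest to the scraped coordinates."""
    params = urllib.parse.urlencode({
        "q": name, "maxRows": 5, "username": GEONAMES_USER,
    })
    with urllib.request.urlopen(f"{GEONAMES_SEARCH}?{params}") as resp:
        hits = json.load(resp).get("geonames", [])
    best = None
    for hit in hits:
        # rough equirectangular distance in km between scraped and geonames coordinates
        d = 111.2 * math.hypot(lat - float(hit["lat"]),
                               (lng - float(hit["lng"])) * math.cos(math.radians(lat)))
        if d <= max_km and (best is None or d < best[0]):
            best = (d, hit["geonameId"])
    return f"http://sws.geonames.org/{best[1]}/" if best else None

def same_as_triple(article_title, lat, lng):
    """Emit an owl:sameAs statement in N-Triples syntax, or None if no match."""
    geonames_uri = find_geonames_uri(article_title, lat, lng)
    if geonames_uri is None:
        return None
    wikipedia_uri = WIKIPEDIA_BASE + urllib.parse.quote(article_title)
    return (f"<{wikipedia_uri}> "
            f"<http://www.w3.org/2002/07/owl#sameAs> <{geonames_uri}> .")

if __name__ == "__main__":
    # e.g. the Embrun example from above (approximate coordinates)
    print(same_as_triple("Embrun", 44.567, 6.495))

A real implementation would of course also have to handle ambiguous place 
names and articles without coordinates.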

Wikipedia content is published under the GNU Free Documentation License, so 
there aren't the licensing problems we have with the Google and Amazon data.

As most articles follow the same structure, an approach to implementing such 
an information service could be to:

- Use a crawling/screen-scraping framework that fills a relational database 
with the information from Wikipedia (a rough sketch of this step follows 
after the list).
- Use D2R Server (http://sites.wiwiss.fu-berlin.de/suhl/bizer/d2r-server/) 
to publish the database on the Web and to provide a SPARQL end-point for 
querying.
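
To sketch the first of these two steps, something along the following lines 
might already get usable data into a table that D2R Server could then map. 
It uses Wikipedia's Special:Export interface to fetch the raw wikitext; the 
infobox field names, the table layout and the naive regular-expression 
parsing are purely illustrative assumptions, and a real scraper would want a 
proper wikitext parser.

# Sketch of the scraping step: pull the wikitext of an article via
# Special:Export, pick a few infobox fields with naive regular expressions,
# and store them in a relational table that D2R Server could later map to RDF.
# The field names and table layout are illustrative assumptions.
import re
import sqlite3
import urllib.parse
import urllib.request

EXPORT_URL = "http://en.wikipedia.org/wiki/Special:Export/"

def fetch_wikitext(title):
    """Fetch the raw wikitext of an article from Special:Export (XML-wrapped)."""
    req = urllib.request.Request(EXPORT_URL + urllib.parse.quote(title),
                                 headers={"User-Agent": "wikipedia-scraper-sketch"})
    xml = urllib.request.urlopen(req).read().decode("utf-8")
    match = re.search(r"<text[^>]*>(.*?)</text>", xml, re.DOTALL)
    return match.group(1) if match else ""

def infobox_field(wikitext, field):
    """Very naive: grab the value of '| field = ...' up to the end of the line."""
    m = re.search(r"\|\s*%s\s*=\s*(.+)" % re.escape(field), wikitext)
    return m.group(1).strip() if m else None

def scrape_place(conn, title):
    text = fetch_wikitext(title)
    conn.execute(
        "INSERT OR REPLACE INTO places (title, population, area) VALUES (?, ?, ?)",
        (title, infobox_field(text, "population"), infobox_field(text, "area")),
    )

if __name__ == "__main__":
    conn = sqlite3.connect("wikipedia_places.db")
    conn.execute("CREATE TABLE IF NOT EXISTS places "
                 "(title TEXT PRIMARY KEY, population TEXT, area TEXT)")
    for place in ["Berlin", "Embrun"]:
        scrape_place(conn, place)
    conn.commit()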

I once read about some pretty sophisticated screen-scraping frameworks that 
fill relational databases with data from websites, but I have forgotten the 
exact links. Does anybody know of any?

Cheers,

Chris

----- Original Message ----- 
From: "Richard Cyganiak" <richard@cyganiak.de>
To: "Richard Newman" <r.newman@reading.ac.uk>
Cc: "Chris Bizer" <chris@bizer.de>; "'Karl Dubost'" <karl@w3.org>; "'Damian 
Steer'" <damian.steer@hp.com>; <semantic-web@w3.org>
Sent: Friday, December 01, 2006 7:19 PM
Subject: Re: AW: ANN: RDF Book Mashup - Integrating Web 2.0 data sources 
like Amazon and Google into the Semantic Web


>
> On 1 Dec 2006, at 18:27, Richard Newman wrote:
>> Systemone have Wikipedia dumped monthly into RDF:
>>
>> http://labs.systemone.at/wikipedia3
>>
>> A public SPARQL endpoint is on their roadmap, but it's only 47  million 
>> triples, so you should be able to load it in a few minutes  on your 
>> machine and run queries locally.
>
> Unfortunately this only represents the hyperlink structure and basic 
> article metadata in RDF. It does no scraping of data from info boxes  or 
> article content. Might be interesting for analyzing Wikipedia's  link 
> structure or social dynamics, but not for content extraction.
>
> Richard
>
>
>
>>
>> -R
>>
>>
>> On  1 Dec 2006, at 4:30 AM, Chris Bizer wrote:
>>
>>>> I wish that wikipedia had a fully exportable database
>>>> http://en.wikipedia.org/wiki/Lists_of_films
>>>>
>>>> For example, being able to export all data of this movie as RDF,
>>>> maybe a templating issue at least for the box on the right.
>>>> http://en.wikipedia.org/wiki/2046_%28film%29
>>>
>>> Should be an easy job for a SIMILE-like screen scraper.
>>>
>>> If you start scraping down from the Wikipedia film list, you  should get 
>>> a
>>> fair amount of data.
>>>
>>> To all the Semantic Wiki guys: Has anybody already done something  like 
>>> this?
>>> Are there SPARQL end-points/repositories for Wikipedia-scraped data?
>>
>>
>>
>
> 

Received on Saturday, 2 December 2006 09:35:59 UTC