Wikipedia incremental updates from Nicolas Torzec on 2010-01-22 (public-lod@w3.org from January 2010)

From: Nicolas Torzec <torzecn@yahoo-inc.com>
Date: Thu, 21 Jan 2010 19:35:24 -0800
To: <public-lod@w3.org>
Message-ID: <C77E5CFC.25C5%torzecn@yahoo-inc.com>

Hi there,

I am using open data sets such as Wikipedia for data mining and knowledge
acquisition purposes; the extracted entities and relations are exposed and
consumed via indices.

I am already retrieving and processing new Wikipedia static dumps whenever
they become available, but I would like to go beyond this and use
incremental/live updates to stay more closely in sync with Wikipedia content.

I know that I could use some web services and IRC channels for tracking
changes in Wikipedia. However, besides the fact that the web service was
designed more for tracking individual changes than for monitoring Wikipedia
changes continuously, both methods still require parsing the update messages
(to extract the URLs of the new/modified/deleted pages) and then retrieving
the actual pages.
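
To make that extra work concrete, here is a minimal sketch of the web-service
route, assuming the MediaWiki API's list=recentchanges query on the English
Wikipedia endpoint. The helper names (build_recentchanges_url,
classify_changes, page_url) are made up for illustration, and the response
fields used (type, title, logtype, logaction) should be checked against the
API documentation before relying on them. It only builds the request URL and
sorts an already-fetched JSON response into new/modified/deleted titles;
fetching the pages themselves is still a separate step.

```python
import json
from urllib.parse import urlencode, quote

# Assumed endpoint: the English Wikipedia MediaWiki API.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_recentchanges_url(rcstart=None, limit=50):
    """Build a recentchanges query URL, optionally starting at a timestamp."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|ids|timestamp|loginfo",
        "rctype": "edit|new|log",
        "rcdir": "newer",  # oldest first, so polling can resume from rcstart
        "rclimit": limit,
        "format": "json",
    }
    if rcstart is not None:
        params["rcstart"] = rcstart
    return API_URL + "?" + urlencode(params)


def classify_changes(response_text):
    """Sort the titles in a recentchanges JSON response into three buckets."""
    data = json.loads(response_text)
    buckets = {"new": [], "modified": [], "deleted": []}
    for change in data.get("query", {}).get("recentchanges", []):
        title = change.get("title")
        if title is None:
            continue
        kind = change.get("type")
        if kind == "new":
            buckets["new"].append(title)
        elif kind == "edit":
            buckets["modified"].append(title)
        elif (kind == "log"
              and change.get("logtype") == "delete"
              and change.get("logaction") == "delete"):
            # Assumption: deletions show up as log entries of this shape.
            buckets["deleted"].append(title)
    return buckets


def page_url(title):
    """Turn a page title into the URL of the page to retrieve."""
    return "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_"))
```

A polling loop would fetch build_recentchanges_url(rcstart=last_seen), pass
the body to classify_changes, and then retrieve page_url(title) for each new
or modified title, which is exactly the parse-then-fetch overhead described
above.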

Does anyone have experience with that?

Is there any other way to retrieve incremental updates reliably and
continuously, especially in the same format as the one provided for the
static dumps? (MySQL replication, incremental dumps, ...)

I have also read that DBpedia was trying to be more in sync with Wikipedia
content. How do they plan to stay in sync with Wikipedia updates?


Thanks for your help.

Best,
Nicolas Torzec.

Received on Friday, 22 January 2010 03:45:04 UTC