Guessing geographical locations for Wikipedia article subjects

Hi,

  Many Wikipedia articles have associated geographical coordinates, but
these are usually limited to stationary objects, even though many more
subjects have a very strong connection to a particular region, a regional
tradition for instance. To associate a location with more articles, some
method to derive one is needed.

The method that comes to mind is network analysis: how one article
relates to other articles by linking to them, by being linked from them,
by being in the same category as them, by being edited by authors who
focus on a particular region, and so on.

To get an idea of how well a location can be predicted, I've analyzed
the incoming and outgoing links of articles that already have
coordinates associated with them, and compared my projected locations to
the pre-defined locations. To project the location, I've simply
determined the median center of the linked or linking articles'
coordinates using an iterated approximation algorithm, and measured how
far the result is from the pre-defined location as the great circle
(haversine) distance.
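
To make the two building blocks concrete, here is a minimal sketch in
Python. The haversine formula is standard. The median center part is
only an assumption about what an "iterated approximation" could look
like: it runs a Weiszfeld-style iteration on 3D unit vectors and projects
the result back onto the sphere. The names haversine_km, median_center
and EARTH_RADIUS_KM are made up for this illustration and are not taken
from the original analysis.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def median_center(points, iterations=100, eps=1e-12):
    """Approximate the median center of (lat, lon) points in degrees.

    Assumption: Weiszfeld-style iteration in 3D (chord distances), with
    the final estimate projected back onto the unit sphere.
    """
    if not points:
        raise ValueError("need at least one point")
    vecs = []
    for lat, lon in points:
        phi, lmb = math.radians(lat), math.radians(lon)
        vecs.append((math.cos(phi) * math.cos(lmb),
                     math.cos(phi) * math.sin(lmb),
                     math.sin(phi)))
    # Start from the centroid of the unit vectors.
    est = [sum(v[i] for v in vecs) / len(vecs) for i in range(3)]
    for _ in range(iterations):
        num = [0.0, 0.0, 0.0]
        den = 0.0
        for v in vecs:
            # Weight each point by the inverse of its distance to the
            # current estimate; clamp to avoid division by zero.
            w = 1.0 / max(math.dist(est, v), eps)
            for i in range(3):
                num[i] += w * v[i]
            den += w
        new = [n / den for n in num]
        moved = math.dist(new, est)
        est = new
        if moved < eps:
            break
    norm = math.sqrt(sum(c * c for c in est))
    if norm < eps:
        raise ValueError("median center is undefined for these points")
    x, y, z = (c / norm for c in est)
    lat = math.degrees(math.asin(max(-1.0, min(1.0, z))))
    return lat, math.degrees(math.atan2(y, x))
```

The error for one article would then be haversine_km() between the
result of median_center() and the article's pre-defined coordinates.
Unlike a plain average, the median center is not dragged far away by a
few distant links.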

For the German language version of Wikipedia, the distances between the
projected and the pre-defined location are distributed as follows (each
row gives the distance within which that share of articles falls):

  +---------------+----------+----------+
  | % of Articles | Outgoing | Incoming |
  +---------------+----------+----------+
  |           10% |      1km |      1km |
  |           20% |      2km |      2km |
  |           30% |      4km |      3km |
  |           40% |      7km |      5km |
  |           50% |     11km |      7km |
  |           60% |     18km |     12km |
  |           70% |     31km |     20km |
  |           80% |     69km |     39km |
  |           90% |    203km |    135km |
  |          100% |  19759km |  20016km |
  +---------------+----------+----------+
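
A table of this shape can be produced from a list of per-article errors
as sketched below. The nearest-rank method is an assumption; the exact
percentile definition behind the numbers above is not stated, and
decile_table is a hypothetical name.

```python
def decile_table(errors_km):
    """Return [(percent, error_km)] rows using the nearest-rank method."""
    ordered = sorted(errors_km)
    if not ordered:
        raise ValueError("need at least one error value")
    n = len(ordered)
    rows = []
    for percent in range(10, 101, 10):
        # Integer ceiling of percent * n / 100, avoiding float rounding.
        rank = (percent * n + 99) // 100
        rows.append((percent, ordered[rank - 1]))
    return rows
```

The 100% row is simply the largest error in the data set, which is why
it is so far from the rest of the distribution.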

The median great circle distances between the projected location and the
coordinates of linked and linking articles are distributed as follows:

  +---------------+----------+----------+
  | % of Articles | Outgoing | Incoming |
  +---------------+----------+----------+
  |           10% |      2km |      0km |
  |           20% |      4km |      1km |
  |           30% |      7km |      3km |
  |           40% |     12km |      4km |
  |           50% |     16km |      7km |
  |           60% |     26km |     12km |
  |           70% |     48km |     18km |
  |           80% |    106km |     35km |
  |           90% |    259km |    128km |
  |          100% |  19350km |  18879km |
  +---------------+----------+----------+

This is based on 181796 articles for the analysis of outgoing links, and
165567 articles for the analysis of the incoming links (articles that do
not link to or are not linked from an article with coordinates are not
considered for the analysis).

For 98% of the articles we can predict the location with an error of at
most around 1000km in both cases, at least for articles that are
suitable to have some coordinate associated with them, and so long as we
are only relying on pre-defined coordinates.

A next step would be to use this to associate coordinates with articles
that do not yet have any, then use those projections as additional input
when projecting the location of articles that do have coordinates, and
see whether that improves or worsens the error rate.
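
One round of that step could look roughly like the following sketch. It
is untested against real data; propagate_once and its parameters are
hypothetical, and the projection function is passed in so that any
median center implementation can be plugged in.

```python
def propagate_once(known, links, project):
    """Project coordinates for articles that have none yet.

    known:   dict mapping article title -> (lat, lon)
    links:   dict mapping article title -> iterable of linked titles
    project: callable taking a list of (lat, lon) and returning (lat, lon)
    """
    projected = {}
    for title, targets in links.items():
        if title in known:
            continue  # already has pre-defined coordinates
        coords = [known[t] for t in targets if t in known]
        if coords:
            projected[title] = project(coords)
    return projected
```

Merging the result with the pre-defined coordinates, for example as
{**projected, **known}, gives the larger input set for the second pass,
with pre-defined coordinates taking precedence.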

There seem to be no obvious correlations between the error rate and
other factors, like how many links there are, what percentage of them
have coordinates, or the median distances in the second table above.
It's likely though that there should be some threshold so that we do not
associate coordinates with, say, articles about algorithms.
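
Such a threshold might combine the factors just listed, as in this
sketch. The function name and all cut-off values are invented
placeholders, not results of the analysis; suitable values would have to
be found experimentally.

```python
def should_assign_coordinates(total_links, geocoded_links, spread_km,
                              min_geocoded=3, min_share=0.25,
                              max_spread_km=250.0):
    """Decide whether a projected location is worth keeping.

    spread_km is the median distance between the projected location and
    the coordinates of the linked articles. All defaults are assumed,
    illustrative values.
    """
    if total_links <= 0 or geocoded_links < min_geocoded:
        return False
    if geocoded_links / total_links < min_share:
        return False
    return spread_km <= max_spread_km
```

An article about an algorithm would typically fail the share test,
because few of its links lead to articles with coordinates.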

What may make a difference is considering where links come from. I have
simply taken all the links, but sometimes links come from transclusions,
such as navigation bars that link to related articles, and in some cases
those account for the vast majority of links. That's good when the links
are to articles that are "nearby" or concern "large" objects, but not so
good when they concern small objects that are far apart (consider having
a navigation bar for "Capitals of European states and territories" but
only having stubs for each of the capitals).
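
If the origin of each link were known, separating the two kinds would be
straightforward. The sketch below assumes each link has already been
tagged with the template it came from (or None for links written in the
article text itself); how to obtain that tagging is left open, and
split_links is a hypothetical name.

```python
def split_links(links):
    """Split links into direct ones and transclusion-only ones.

    links: iterable of (target_title, source), where source is None for
    a link in the article text and a template name otherwise.
    """
    direct, transcluded = set(), set()
    for target, source in links:
        if source is None:
            direct.add(target)
        else:
            transcluded.add(target)
    # A target linked both ways counts as direct.
    return direct, transcluded - direct
```

The projection could then be run on the direct links only, or on both
sets separately to see how much the navigation bars shift the result.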
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

Received on Friday, 18 June 2010 20:59:24 UTC