- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Fri, 18 Jun 2010 22:58:46 +0200
- To: www-archive@w3.org
Hi,

Many Wikipedia articles have associated geographical coordinates. That is, however, usually limited to stationary objects, even though many more things have a very strong connection to a particular region, a regional tradition for instance. In order to associate some location with more of the articles, some method to derive them is needed. The method that comes to mind is network analysis: how one article relates to other articles by linking to other articles, by being linked from other articles, by being in the same category as other articles, by being edited by authors who focus on a particular region, and so on.

To get an idea of how well a location can be predicted, I've analyzed the incoming and outgoing links of articles that already have coordinates associated with them, and compared my projected locations to the pre-defined location. To project the location, I've simply determined the median center of the linked or linking articles' coordinates using an iterated approximation algorithm, and compared the result to the pre-defined location using the great circle (haversine) distance.
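To make the projection step concrete, here is a minimal sketch of a haversine distance and an iterated median-center approximation. The post does not say which iteration was used, so the Weiszfeld-style update below is an assumption, as are all function names, the Earth radius constant, and the stopping threshold.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(a, b):
    """Great circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def _to_vec(p):
    """Convert (lat, lon) in degrees to a 3D unit vector."""
    lat, lon = map(math.radians, p)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def _to_latlon(v):
    """Project a 3D vector back onto the sphere as (lat, lon) in degrees."""
    x, y, z = v
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("points cancel out; no well-defined center")
    return (math.degrees(math.asin(z / norm)), math.degrees(math.atan2(y, x)))


def median_center(points, iterations=100, eps_km=1e-6):
    """Approximate the point minimizing the summed great circle distance.

    Weiszfeld-style iteration (an assumption, not necessarily the method
    used in the post): each point is weighted by the inverse of its
    distance to the current estimate.
    """
    vecs = [_to_vec(p) for p in points]
    # Start from the normalized centroid of the unit vectors.
    est = _to_latlon(tuple(sum(c) for c in zip(*vecs)))
    for _ in range(iterations):
        sx = sy = sz = 0.0
        for p, (x, y, z) in zip(points, vecs):
            # Clamp the distance to avoid dividing by zero at a data point.
            w = 1.0 / max(haversine_km(est, p), eps_km)
            sx += w * x
            sy += w * y
            sz += w * z
        new = _to_latlon((sx, sy, sz))
        if haversine_km(est, new) < eps_km:
            return new
        est = new
    return est
```

The prediction error for one article would then be `haversine_km(median_center(linked_coords), known_coord)`, where `linked_coords` and `known_coord` are hypothetical names for the coordinates of the linked articles and the article's pre-defined coordinate. Unlike a plain mean, the median center is not pulled far away by a single distant link.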
For the German language version of Wikipedia the distances between the projected and the pre-defined location are distributed as follows:

  +---------------+----------+----------+
  | % of Articles | Outgoing | Incoming |
  +---------------+----------+----------+
  |           10% |      1km |      1km |
  |           20% |      2km |      2km |
  |           30% |      4km |      3km |
  |           40% |      7km |      5km |
  |           50% |     11km |      7km |
  |           60% |     18km |     12km |
  |           70% |     31km |     20km |
  |           80% |     69km |     39km |
  |           90% |    203km |    135km |
  |          100% |  19759km |  20016km |
  +---------------+----------+----------+

The median great circle distances between the projected location and the coordinates of linked and linking articles are distributed as follows:

  +---------------+----------+----------+
  | % of Articles | Outgoing | Incoming |
  +---------------+----------+----------+
  |           10% |      2km |      0km |
  |           20% |      4km |      1km |
  |           30% |      7km |      3km |
  |           40% |     12km |      4km |
  |           50% |     16km |      7km |
  |           60% |     26km |     12km |
  |           70% |     48km |     18km |
  |           80% |    106km |     35km |
  |           90% |    259km |    128km |
  |          100% |  19350km |  18879km |
  +---------------+----------+----------+

This is based on 181796 articles for the analysis of outgoing links, and 165567 articles for the analysis of incoming links (articles that do not link to, or are not linked from, an article with coordinates are not considered for the analysis). For 98% of the articles we can predict the location with an error of at most around 1000km in both cases, at least as far as articles go that are suitable to have some coordinate associated with them, and so long as we are only relying on pre-defined coordinates.

A next step would be to use this to associate coordinates with articles that do not yet have any, then use those projections as input to project the location of the articles that do have coordinates, and see whether that improves or worsens the error rate.
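For reference, distributions like the ones in the tables above can be derived from a list of per-article distances roughly as follows. This is only a sketch: the nearest-rank percentile definition and the name `decile_table` are assumptions, since the post does not say how the percentiles were computed.

```python
def decile_table(errors_km):
    """Return (percent, distance) rows for 10%, 20%, ..., 100%.

    Uses the nearest-rank definition: the value at P% is the smallest
    distance such that at least P% of the articles are at or below it.
    """
    if not errors_km:
        raise ValueError("need at least one distance")
    ordered = sorted(errors_km)
    n = len(ordered)
    rows = []
    for percent in range(10, 101, 10):
        rank = -(-percent * n // 100)  # ceiling of percent * n / 100
        rows.append((percent, ordered[rank - 1]))
    return rows
```

With this definition the 100% row is simply the largest distance observed, which matches how the last row of each table reads.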
There seem to be no obvious correlations between the error rate and the other factors, like how many links there are, what percentage of them have coordinates, or the median distances as in the second table above. It's likely though that there should be some threshold to rule out that we associate coordinates with, say, articles about algorithms.

What may make a difference is considering where links come from. I have simply taken all the links, but sometimes links come from transclusions, such as navigation bars that link to related articles, and in some cases that accounts for the vast majority of links. That's good when the links are to articles that are "nearby" or concern "large" objects, but not so good when they are small but far-apart objects (consider having a navigation bar for "Capitals of European states and territories" but only having stubs for each of the capitals).

-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Friday, 18 June 2010 20:59:24 UTC