
Re: Guessing geographical locations for Wikipedia article subjects

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Sat, 19 Jun 2010 20:53:51 +0200
To: Toby Inkster <tai@g5n.co.uk>
Cc: www-archive@w3.org
Message-ID: <av2q169evs3lacl1f0bhgmmvk6csr696bs@hive.bjoern.hoehrmann.de>
* Toby Inkster wrote:
>On Fri, 18 Jun 2010 22:58:46 +0200
>Bjoern Hoehrmann <derhoermi@gmx.net> wrote:
>> For 98% of the articles we can predict the location with an error of
>> at around 1000km in both cases
>Interesting; 1000km is quite a wide margin of error though. It's about
>the distance from where I live, just outside Brighton near the English
>south coast, to Prague in the Czech Republic. There are two whole
>countries in between - one of them being Germany (not a small place by
>any means) - and a fairly big stretch of water too.

There is some inherent uncertainty in the location of some objects, yet
the German Wikipedia associates coordinates with those articles: Russia,
for instance, or the Pacific Ocean. The coordinate template does allow
specifying a "dimension", but that is usually omitted. The location in
those cases is rather arbitrary (de.wp and en.wp put the Pacific Ocean
at rather different positions, for instance), so large distances there
are not really errors.

Without knowing the uncertainty of the positions, however, I cannot tell
how many of the far-off guesses are rather natural and how many really
are bad guesses with a large error. Besides, for 80% of the cases the
guess is within 40-70km, depending on which kind of link you use; that
is very good for this rather simple approach.
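
To make the error figures concrete, here is a minimal sketch of how such
numbers could be computed, assuming the error is the great-circle
distance between the guessed and the recorded coordinates. The function
names and the nearest-rank rule are assumptions for illustration, not
the code that produced the figures above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def error_at_fraction(errors_km, fraction):
    """Smallest error bound covering `fraction` of the cases (nearest rank)."""
    ordered = sorted(errors_km)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


# Rough city-centre coordinates for Brighton and Prague (assumed values),
# to check the "about 1000km" comparison in the quoted text.
brighton_to_prague = haversine_km(50.82, -0.14, 50.08, 14.44)
```

Given one error per article, error_at_fraction(errors, 0.8) and
error_at_fraction(errors, 0.98) would give the kind of bounds quoted in
this thread.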

A better approach would probably give links different weights depending
on where they are on the page: a lot of weight if at the beginning, and
a lot less if transcluded. To do that, however, one would have to parse
the pages, which is rather expensive compared to simply going through
the db table dumps.
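
A rough sketch of what such weighting might look like, assuming the
guess is a weighted mean of the coordinates of the linked articles. The
Link fields, the weight formula and the transclusion penalty are all
hypothetical; this message does not say how the link coordinates are
combined, and the inputs would have to come from parsed pages.

```python
import math
from dataclasses import dataclass


@dataclass
class Link:
    lat: float         # latitude of the linked article, in degrees
    lon: float         # longitude of the linked article, in degrees
    offset: float      # position on the page: 0.0 = beginning, 1.0 = end
    transcluded: bool  # True if the link comes from a transcluded template


def link_weight(link, transclusion_penalty=0.25):
    """Hypothetical weighting: early links count more, transcluded ones less."""
    weight = 1.0 - 0.5 * link.offset
    if link.transcluded:
        weight *= transclusion_penalty
    return weight


def weighted_guess(links):
    """Weighted mean position on the sphere, or None if it is undefined."""
    x = y = z = 0.0
    for link in links:
        w = link_weight(link)
        phi, lam = math.radians(link.lat), math.radians(link.lon)
        # Sum weighted unit vectors so that averaging behaves sensibly
        # across the 180 degree meridian.
        x += w * math.cos(phi) * math.cos(lam)
        y += w * math.cos(phi) * math.sin(lam)
        z += w * math.sin(phi)
    if math.sqrt(x * x + y * y + z * z) < 1e-12:
        return None  # no links, or the weighted vectors cancel out
    lat = math.degrees(math.atan2(z, math.hypot(x, y)))
    lon = math.degrees(math.atan2(y, x))
    return lat, lon
```

With all weights equal this reduces to an unweighted mean of the linked
positions, so the two variants could be compared on the same data.
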
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
Received on Saturday, 19 June 2010 18:54:28 UTC
