- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Tue, 06 Sep 2011 03:48:01 +0200
- To: Sai <w3c@saizai.com>
- Cc: www-international@w3.org
* Sai wrote:
>Does anyone know where I can find some large, computer parsable,
>republishable databases of names from around the world, and/or would
>any of you be interested in helping with this?

Well, there are various problems. For your purposes you would need many relationships, such as linking names to geographical regions and to time, even for seemingly simple things like gender: what may be a distinctly female name at a given time and place might well be used for males elsewhere. Obviously this is commercially valuable data, so you don't get sophisticated republishable databases for free, if at all.

An easily usable source would be, for instance, the German Wikipedia, but its data is distorted by selection bias: for every female it has six males, due to its focus on public figures. Plus, obviously, a Central European bias. It does have the benefit of being easy to use, though (all males are in the category Mann, all females in Frau, and virtually all biographies have a standardized hidden template with details like place of birth and separation of certain name parts).

I made a script that collects the first word of the article title as an approximation for the "first" name, and what percentage of the articles under each name are in the Mann category (as opposed to the Frau category). Some numbers:

  * 400 000 biographies
  * Males 6 : 1 Females
  * 37 000 distinct names
  * top ten names make for 10%
  * top 200 names make for 50%
  * 2 in 3 names occur only once
  * 75% average maleness
  * 1100 names 10%-90% maleness
  * all 5 "Sai" are male (see caveats)

Where 100% maleness means all biographies under the first name are of men, and 0% means all are of women. Ambiguous names include Sasha, Kim, Taylor.

I also looked at "last names", which my method treats a bit worse than the approximation for first names (the last word in the title might be "junior", for instance).
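A minimal sketch of the kind of count described above, for illustration only; it is not the script that produced these numbers. It assumes the biographies are already available as (article title, category) pairs, with the category being either Mann or Frau. The function names, the sample titles, and the 10%-90% thresholds as parameters are all hypothetical.

```python
from collections import Counter, defaultdict


def first_word(title):
    """Approximate the "first" name as the first word of the article title."""
    parts = title.split()
    return parts[0] if parts else None


def maleness_by_name(biographies):
    """Map each name to (number of biographies, share of them in 'Mann')."""
    per_name = defaultdict(Counter)
    for title, category in biographies:
        name = first_word(title)
        if name is not None and category in ("Mann", "Frau"):
            per_name[name][category] += 1
    stats = {}
    for name, c in per_name.items():
        total = c["Mann"] + c["Frau"]
        stats[name] = (total, c["Mann"] / total)
    return stats


def ambiguous(stats, low=0.10, high=0.90):
    """Names whose maleness falls between the two thresholds, inclusive."""
    return sorted(n for n, (_, m) in stats.items() if low <= m <= high)


def coverage(stats, top_n):
    """Share of all biographies covered by the top_n most frequent names."""
    counts = sorted((n for n, _ in stats.values()), reverse=True)
    total = sum(counts)
    return sum(counts[:top_n]) / total if total else 0.0


# Made-up article titles, purely for illustration.
sample = [
    ("Kim Alpha", "Frau"),
    ("Kim Beta", "Mann"),
    ("Anna Gamma", "Frau"),
    ("Otto Delta", "Mann"),
    ("Otto Epsilon", "Mann"),
]

stats = maleness_by_name(sample)
print(ambiguous(stats))    # ['Kim']
print(coverage(stats, 1))  # 0.4
```

Swapping `first_word` for a function that returns the last word of the title gives the corresponding approximation for last names, with the weaknesses already mentioned (a trailing "junior" would be counted as a name).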
The numbers for last names are:

  * 160 000 distinct names
  * 110 000 names occur only once
  * top 500 names make for 50%
  * More Smiths than Müllers
  * 143 000 have >= 50% maleness
  * all 2 "Sai" are male (see caveats)

There are some interesting ones when you look at maleness. There are, for instance, seven 'Anguissola', six of them 16th-century female Italian painters; the one male is also an Italian, who lived a century after the painters. All Romanowa, Iwanowa, and Jónsdóttir are female, so you can recover some patterns there.

As for first names, this data source would give around 5000 names for which you could reasonably make a solid guess at the gender; that's just 13.5% of the names in the set (but covers 88% of the listed people).

So this is something that can be done in half an hour. I'd have added some information about dates and places, but the data extraction tool (I looked at the "templatetiger" tool in particular) seems broken, as it only has details for 160 000 biographies.

For something like last name frequency there are repositories with crawled data (implying they are limited to the few folks online), but as I understand it they typically lack geographic and other details. For individual countries, aggregated census data is often available.

So there is data for a proof of concept, but I also note that such services already exist. Using public and free-of-charge services, broken and spammy as they may be, you'd have no trouble finding out (provided I didn't make up the name I go by around here) that I'm a Frisian male carrying a last name that is uncharacteristic in that it does not follow the patronymic naming convention typical for the region.

-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Tuesday, 6 September 2011 01:48:24 UTC