Re: Draft for review: Personal names around the world

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Tue, 06 Sep 2011 03:48:01 +0200
To: Sai <w3c@saizai.com>
Cc: www-international@w3.org
Message-ID: <lhpa67d5c8j5lntcevj25pjrjgplj0mmgj@hive.bjoern.hoehrmann.de>
* Sai wrote:
>Does anyone know where I can find some large, computer parsable,
>republishable databases of names from around the world, and/or would
>any of you be interested in helping with this?

Well, there are various problems. For your purposes you would need
many relationships, such as links from names to geographical regions
and to time, even for seemingly simple things like gender: what is a
distinctly female name at a given time and place might well be used
for males elsewhere. Obviously this is commercially valuable data, so
you don't get sophisticated republishable databases for free, if at
all.

An easily usable source would be, for instance, the German Wikipedia,
but its data is distorted by selection bias: because of its focus on
public figures, it has six males for every female. It obviously also
has a Central European bias. It does have the benefit of being easy to
use, though (all males are in the category Mann, all females in Frau,
and virtually all biographies have a standardized hidden template with
details like place of birth and separation of certain name parts).

I made a script that collects the first word of the article title as
an approximation for the "first" name and computes what percentage of
the articles under that name are in the Mann category (as opposed to
the Frau category). Some numbers:

  * 400 000 biographies
  * Males 6 : 1 Females
  * 37 000 distinct names
  * top ten names account for 10%
  * top 200 names account for 50%
  * 2 in 3 names occur only once
  * 75% average maleness
  * 1100 names have 10%-90% maleness
  * all 5 "Sai" are male (see caveats)

Here 100% maleness means that all biographies under the first name are
of men, and 0% means that all are of women. Ambiguous names include
Sasha, Kim, Taylor. I also looked at "last names", which my method
handles a bit worse than first names (the last word in the title might
be "junior", for instance). Numbers there are

  * 160 000 distinct names
  * 110 000 names occur only once
  * top 500 names account for 50%
  * More Smiths than Müllers
  * 143 000 have >= 50% maleness
  * all 2 "Sai" are male (see caveats)
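
The counting described above can be sketched roughly as follows. This
is not the script I used, just a minimal illustration; it assumes the
input is already available as (article title, category) pairs, and the
function name name_stats and the sample titles are made up.

```python
from collections import defaultdict

def name_stats(biographies, part="first"):
    """Return {name: (count, maleness in percent)}.

    biographies: iterable of (title, category) pairs, where category is
    "Mann" or "Frau" (assumed input shape, not a Wikipedia API).
    part: "first" uses the first word of the title, anything else the
    last word; both are only approximations of the name parts.
    """
    counts = defaultdict(lambda: [0, 0])  # name -> [males, total]
    for title, category in biographies:
        words = title.split()
        if not words:
            continue
        name = words[0] if part == "first" else words[-1]
        counts[name][1] += 1
        if category == "Mann":
            counts[name][0] += 1
    return {name: (total, 100.0 * males / total)
            for name, (males, total) in counts.items()}

# Invented sample titles, for illustration only.
sample = [
    ("Kim Alpha", "Mann"),
    ("Kim Beta", "Frau"),
    ("Anna Iwanowa", "Frau"),
    ("Olga Iwanowa", "Frau"),
    ("John Smith junior", "Mann"),
]
first_names = name_stats(sample, "first")
last_names = name_stats(sample, "last")
```

Note how the last-word approximation files the final sample entry
under "junior" rather than "Smith", which is the weakness mentioned
above.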

There are some interesting ones when you look at maleness. There are,
for instance, seven 'Anguissola', six of whom are 16th century female
Italian painters; the one male is also an Italian, who lived a century
after the painters. All Romanowa, Iwanowa, and Jónsdóttir are female,
so you can recover some patterns there.

As for first names, this data source would give around 5000 names for
which you could reasonably make a solid guess at the gender; that's
just 13.5% of the names in the set (but covers 88% of the listed
people).
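
To illustrate how the share of names (5000 of 37 000 is about 13.5%)
and the share of people can differ so much, here is a small sketch. The
thresholds (at least 5 biographies, maleness at most 10% or at least
90%) are assumptions chosen for the example, not the criteria behind
the numbers above, and the function name guessable is made up.

```python
def guessable(stats, min_count=5, low=10.0, high=90.0):
    """Pick names whose gender can be guessed with some confidence.

    stats: {name: (count, maleness in percent)}.
    min_count, low, high: assumed cut-offs, purely illustrative.
    Returns (guesses, share of names, share of people covered).
    """
    total_names = len(stats)
    total_people = sum(count for count, _ in stats.values())
    if total_names == 0 or total_people == 0:
        return {}, 0.0, 0.0
    guesses = {}
    for name, (count, maleness) in stats.items():
        if count < min_count:
            continue  # too few biographies to say anything
        if maleness >= high:
            guesses[name] = "male"
        elif maleness <= low:
            guesses[name] = "female"
    covered = sum(stats[name][0] for name in guesses)
    return guesses, len(guesses) / total_names, covered / total_people
```

Because frequent names carry most of the biographies, a small fraction
of the names can cover a large fraction of the people.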

So this is something that can be done in half an hour. (I'd have added
some information about dates and places, but the data extraction tool
seems broken; I looked at the "templatetiger" tool in particular, and
it only has details for 160 000 biographies.)

For something like last name frequency there are repositories with
crawled data (implying they are limited to the few folks online), but
as I understand it they typically lack geographic and other details.
For individual countries aggregated census data is often available.
So there is data for a proof of concept, but I also note that such
services already exist: using public and free of charge services,
broken and spammy as they may be, you'd have no trouble finding out, if
I didn't make up the name I go by around here, that I'm a Frisian male
carrying a last name that is uncharacteristic in that it does not
follow the patronymic naming convention typical for the region.
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
Received on Tuesday, 6 September 2011 01:48:24 GMT
