- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Mon, 02 Apr 2012 13:54:19 +0200
- To: www-archive@w3.org
Now, in a better world we would have software where you can load some data and the software takes a look at it and gives you options for what to do with it: analyzing it, visualizing it, things that make sense for the data. Similarly, in a better world data would come self-describing, so you don't have to spend a lot of time explaining the data. In this world, however, there is no such software, and data has to be explained to the computer all the time. It's unbelievable, but it's 2012 and I can't load delimiter-separated data files from the Anglosphere into the German version of Excel if they use highly advanced features such as numbers with a fractional part, unless perhaps I change the operating system configuration. And don't get me started on charting applications that cannot draw labels so they do not overlap.

Anyway, being interested in languages, I tried to have some fun with the Google N-Gram corpus. In a better world, I would simply have downloaded the German 1-gram data, fired up some application, loaded the file, been asked what I want to do with it, picked Clustering, and perhaps given a couple of parameters, like that I want to cluster the words by their curves in the N-Gram Viewer (which shows the frequency of a 1-gram as a percentage of the total gram count over time) and how many clusters I want, or whatever. Then it would work for a while and I'd get some pretty report. Instead I had to write my own tools for this. Clustering time series data apparently falls into some "hard" category: as far as I can tell there aren't, say, modules on CPAN that do it for you, and there are no neatly organized Wikipedia articles discussing available methods, listing relevant algorithms, and linking to educative implementations; instead you have to buy poorly formatted PDFs for horrendous sums of money and then give up eventually. I make it a point to skip that step usually...

Since "like in the N-Gram Viewer" was a goal, I figured I should reimplement what it does first. In general that's pretty trivial: read in the numbers, divide by the totals, plot. The viewer, however, also has a smoothing parameter, which is kind of needed because the curves for many grams are pretty ugly and uninformative when not smoothed at all. That turns out to be simple central moving average smoothing (replace each value by the arithmetic mean of the value plus some values before and after it). Easy enough: just install the relevant CPAN module, Math::CMA, which has the not so terribly well named `central_moving_averages` function that does just that. Except that's only possible now because I wrote that module and published it on CPAN. I could not find any implementation at all anywhere, and while the easy way to implement it is very slow, in the faster alternative you have to juggle many indices to make sure you stay within the bounds of the array, which is usually annoying, so it would have been better to simply copy the code... Anyway, I implemented both the slow and the slightly faster version; the slow version is used in the test suite, which simply generates random lists and random smoothing parameters and compares the results of both. Given the number of CPAN testers, if there is any bug, they should find it soon.
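To illustrate, here is a minimal sketch of the slow, straightforward variant; the function and variable names are mine rather than the Math::CMA interface, and the averaging window is simply clipped at the ends of the array:

  # Slow central moving average: replace each value by the arithmetic
  # mean of itself and up to $distance neighbours on either side,
  # clipping the window at the array boundaries. Illustration only,
  # not the Math::CMA implementation.
  use strict;
  use warnings;
  use List::Util qw(sum);

  sub naive_central_moving_averages {
    my ($distance, @values) = @_;
    my @smoothed;
    for my $ix (0 .. $#values) {
      my $lo = $ix - $distance; $lo = 0        if $lo < 0;
      my $hi = $ix + $distance; $hi = $#values if $hi > $#values;
      push @smoothed, sum(@values[$lo .. $hi]) / ($hi - $lo + 1);
    }
    return @smoothed;
  }

  # Smoothing with three values before and after, as used below:
  my @smooth = naive_central_moving_averages(3, 1, 2, 9, 2, 1, 1, 8, 1);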
So that gives smoothing. But how to go about the similarity measure? A key problem is that some words are used much more frequently than other words, yet they have very similar curves in the N-Gram Viewer. That can be addressed by normalizing the time series values so they express a percentage of the maximum. Another problem is that "similarity" is subjective, and I don't have an entirely clear idea of when two words are very similar and when they are not. That's in part because I can't tell what effect slight adjustments to a similarity formula might have. Example: take a word whose curve forms a triangle with the x-axis. If you have another word that also forms such a triangle, how similar are the words? Does it matter when the words emerged? How long they had been used? How frequently they were used at the peak? Well, I do not even know whether there are such words to begin with; most probably they would be names.

So as a first order approximation, something simple had to do. Simple here ends up meaning Euclidean distance, the square root of the sum of squared differences. But there are 2.5 million "words" in the corpus, so it's not really feasible to compare them all O(n²) with a Perl script on cheap hardware. I probably don't even have the drive space to store the data, now that I think about it. I figured I already know some words with interesting and representative curves: "Internet" is quite related to "640x480", "CDU" is related to "CSU" (political "sister" parties formed shortly after WWII), and "roth" and "Theil" are related in that they are obsolete spellings of "rot" (red) and "Teil" (part) and fell out of use around 1900. So I, largely randomly, made a list of words that might be interesting, and computed the similarity of these words to all others. The idea in part was that this should at least allow finding better such candidates in the next run (words that are not very similar to any of my choices).

To summarize: for all words in the corpus, compute the distance to a small selection, where the distance is the Euclidean distance between normalized time series. The input data is organized as tab-delimited values with the columns 1-gram, year, instances, pages with an instance, and books with an instance, and there is a secondary file with the yearly sums. So, take those values, divide by the relevant sum, smooth with 3 years before and after, and divide the resulting values by the maximum value for each 1-gram. As it turns out I used the "pages with an instance" figure. I kind of meant to use the raw instance count, as that is what the N-Gram Viewer uses, but there is usually not much of a difference between the two...
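Here is a minimal sketch of the last two steps, assuming the per-year values for each word have already been divided by the yearly sums and smoothed as above; the names and sample numbers are made up for illustration:

  use strict;
  use warnings;
  use List::Util qw(max);

  # Scale a smoothed series so its peak value becomes 1.
  sub normalize_to_max {
    my @series = @_;
    my $peak = max(@series) || 1;   # guard against an all-zero series
    return map { $_ / $peak } @series;
  }

  # Euclidean distance: square root of the sum of squared differences.
  sub euclidean_distance {
    my ($x, $y) = @_;
    my $sum = 0;
    $sum += ($x->[$_] - $y->[$_]) ** 2 for 0 .. $#$x;
    return sqrt $sum;
  }

  # Two words with the same curve shape but different overall frequency
  # end up at distance 0 once each is scaled to its own maximum:
  my @word_a = normalize_to_max(0.1, 0.4, 0.9, 0.4, 0.1);
  my @word_b = normalize_to_max(0.2, 0.8, 1.8, 0.8, 0.2);
  printf "%.3f\n", euclidean_distance(\@word_a, \@word_b);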
I note that the data is problematic: as far as I am aware there is no claim that the selection of books forming the basis of the corpus is somehow representative, and when it comes to details there are many errors, like anachronisms where words appear to have been used before they actually were, due to dating errors. Of particular note are the book counts, the last column in this table:

  year   instances   pages   books
  1908   297450300   853517   3351
  ...
  1917    80570838   240465   1104
  1918    80524121   236012   1058
  ...
  1934   142340686   407973   2513
  ...
  1942    81297327   253600   1080
  1943    97453844   297964   1163
  1944    63297515   190531    804
  1945    48633796   137414    600
  ...

While it would be interesting to study whether and how language changes in war times, 600 books is not a lot to go on; currently around 100,000 new books are published in German-speaking countries each year, and even at the time much higher numbers were usual. It may be that the selection just so happens to be representative, but without some evidence to that end, I would, when in doubt, attribute "interesting" changes to the small sample.

So what did I find? Well, nothing much so far; I wrote this while I was waiting for some scripts to tell me stuff. But so far the approach does seem fairly reasonable as a first order measure. As an example, consider <http://tinyurl.com/bspre4t>. The reference word that I picked earlier was "Sexwelle", and the chart shows:

  - Sexwelle (sex wave)
  - Pulsare (plural of pulsar)
  - Mondlandungen (plural of moon landing)
  - Forschungsökonomie (could mean various things)
  - Waffentests (plural of weapon test)

Their curves are obviously quite similar, except that some of the words are used more frequently than others. "Weltwährungssystem", world currency system, also falls into this cluster, but if you add it to the chart it becomes difficult to see that "Waffentests" is similar. An open question is how to account for that.

The words most unlike Sexwelle in my system here would be the most common old ones, the most common being "der", which is essentially a flat line over the entire period (I use, as an aside, 1800-2006; the data from before 1800 isn't very good, and there are various problems with the data after 2000, which I guess is why the viewer defaults to 2000; the data goes to 2008, but 2007 and 2008 have far more books than any previous year). Similarly, 99% of the 1-grams are more similar to Sexwelle than "roth" (see above) in this period. If I had used the period 1930-2000 instead, they would be much more similar.

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/