- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Mon, 02 Apr 2012 13:54:19 +0200
- To: www-archive@w3.org
Now, in a better world we would have software where you can load some data and the software takes a look at it and gives you options for what to do with it: analyzing it, visualizing it, things that make sense for the data. Similarly, in a better world data would come self-describing, so you don't have to spend a lot of time explaining the data. In this world, however, there is no such software, and data has to be explained to the computer all the time. It's unbelievable, but it's 2012 and I can't load delimiter-separated data files from the Anglosphere into the German version of Excel if they use highly advanced features such as numbers with a fractional part, unless perhaps I change the operating system configuration. And don't get me started on charting applications that cannot draw labels so they do not overlap.

Anyway, being interested in languages, I tried to have some fun with the Google N-Gram corpus. In a better world, I would simply have downloaded the German 1-gram data, fired up some application, loaded the file, been asked what I want to do with it, picked Clustering, and perhaps given a couple of parameters, like that I want to cluster the words by their curves in the N-Gram Viewer (which shows the frequency of a 1-gram as a percentage of the total gram count over time) and how many clusters I want, or whatever. Then it would work for a while and I'd get some pretty report. Instead I had to write my own tools for this. Clustering time series data apparently falls into some "hard" category: as far as I can tell there aren't, say, modules on CPAN that do it for you, and there are no neatly organized Wikipedia articles discussing available methods, listing relevant algorithms, and linking to educative implementations; instead you have to buy poorly formatted PDFs for horrendous sums of money and then give up eventually. I make it a point to skip that step usually...

Since "like in the N-Gram Viewer" was a goal, I figured I should reimplement what it does first. In general that's pretty trivial: read in the numbers, divide by the totals, plot. The viewer, however, also has a smoothing parameter, which is kind of needed because the curves for many grams are pretty ugly and uninformative when not smoothed at all. That turns out to be simple central moving average smoothing (replace each value by the arithmetic mean of the value plus some values before and after it). Easy enough: just install the relevant CPAN module, Math::CMA, which has the not so terribly well named `central_moving_averages` function that does just that. Except that's only possible now because I wrote that module and published it on CPAN. I could not find any implementation at all anywhere, and while the easy way to implement it is very slow, in the faster alternative you have to juggle many indices to make sure you stay within the bounds of the array, which is usually annoying, so it would have been better to simply copy the code... Anyway, I implemented both the slow and the slightly faster version; the slow version is used in the test suite, which simply generates random lists and random smoothing parameters and compares the results of both. Given the number of CPAN testers, if there is any bug, they should find it soon.
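To illustrate, here is a minimal sketch of the slow, straightforward variant; the function and variable names are mine rather than the Math::CMA interface, and the averaging window is simply clipped at the ends of the array:

  # Slow central moving average: replace each value by the arithmetic
  # mean of itself and up to $distance neighbours on either side,
  # clipping the window at the array boundaries. Illustration only,
  # not the Math::CMA implementation.
  use strict;
  use warnings;
  use List::Util qw(sum);

  sub naive_central_moving_averages {
    my ($distance, @values) = @_;
    my @smoothed;
    for my $ix (0 .. $#values) {
      my $lo = $ix - $distance; $lo = 0        if $lo < 0;
      my $hi = $ix + $distance; $hi = $#values if $hi > $#values;
      push @smoothed, sum(@values[$lo .. $hi]) / ($hi - $lo + 1);
    }
    return @smoothed;
  }

  # Smoothing with three values before and after, as used below:
  my @smooth = naive_central_moving_averages(3, 1, 2, 9, 2, 1, 1, 8, 1);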
So that gives smoothing. But how to go about the similarity measure? A key problem is that some words are used much more frequently than other words, yet they have very similar curves in the N-Gram Viewer. That can be addressed by normalizing the time series values so they express a percentage of the maximum. Another problem is that "similarity" is subjective, and I don't have an entirely clear idea of when two words are very similar and when they are not. That's in part because I can't tell what effect slight adjustments to a similarity formula might have. Example: take a word whose curve forms a triangle with the x-axis. If you have another word that also forms such a triangle, how similar are the words? Does it matter when the words emerged? How long they had been used? How frequently they were used at the peak? Well, I do not even know whether there are such words to begin with; most probably they would be names.

So as a first order approximation, something simple had to do. Simple here ends up meaning Euclidean distance, the square root of the sum of squared differences. But there are 2.5 million "words" in the corpus, so it's not really feasible to compare them all O(n²) with a Perl script on cheap hardware. I probably don't even have the drive space to store the data, now that I think about it. I figured I already know some words with interesting and representative curves: "Internet" is quite related to "640x480", "CDU" is related to "CSU" (political "sister" parties formed shortly after WWII), and "roth" and "Theil" are related in that they are obsolete spellings of "rot" (red) and "Teil" (part) and fell out of use around 1900. So I, largely randomly, made a list of words that might be interesting, and computed the similarity of these words to all others. The idea in part was that this should at least allow finding better such candidates in the next run (words that are not very similar to any of my choices).

To summarize: for all words in the corpus, compute the distance to a small selection, where the distance is the Euclidean distance between normalized time series. The input data is organized as tab-delimited values with the columns 1-gram, year, instances, pages with an instance, and books with an instance, and there is a secondary file with the yearly sums. So, take those values, divide by the relevant sum, smooth with 3 years before and after, and divide the resulting values by the maximum value for each 1-gram. As it turns out I used the "pages with an instance" figure. I kind of meant to use the raw instance count, as that is what the N-Gram Viewer uses, but there is usually not much of a difference between the two...
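Here is a minimal sketch of the last two steps, assuming the per-year values for each word have already been divided by the yearly sums and smoothed as above; the names and sample numbers are made up for illustration:

  use strict;
  use warnings;
  use List::Util qw(max);

  # Scale a smoothed series so its peak value becomes 1.
  sub normalize_to_max {
    my @series = @_;
    my $peak = max(@series) || 1;   # guard against an all-zero series
    return map { $_ / $peak } @series;
  }

  # Euclidean distance: square root of the sum of squared differences.
  sub euclidean_distance {
    my ($x, $y) = @_;
    my $sum = 0;
    $sum += ($x->[$_] - $y->[$_]) ** 2 for 0 .. $#$x;
    return sqrt $sum;
  }

  # Two words with the same curve shape but different overall frequency
  # end up at distance 0 once each is scaled to its own maximum:
  my @word_a = normalize_to_max(0.1, 0.4, 0.9, 0.4, 0.1);
  my @word_b = normalize_to_max(0.2, 0.8, 1.8, 0.8, 0.2);
  printf "%.3f\n", euclidean_distance(\@word_a, \@word_b);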
I note that the data is problematic: as far as I am aware there is no claim that the selection of books forming the basis of the corpus is somehow representative, and when it comes to details there are many errors, like anachronisms where words appear to have been used before they actually were, due to dating errors. Of particular note are the book counts, the last column in this table:

  year   instances   pages   books
  1908   297450300   853517   3351
  ...
  1917    80570838   240465   1104
  1918    80524121   236012   1058
  ...
  1934   142340686   407973   2513
  ...
  1942    81297327   253600   1080
  1943    97453844   297964   1163
  1944    63297515   190531    804
  1945    48633796   137414    600
  ...

While it would be interesting to study whether and how language changes in war times, 600 books is not a lot to go on; currently around 100,000 new books are published in German-speaking countries each year, and even at the time much higher numbers were usual. It may be that the selection just so happens to be representative, but without some evidence to that end, I would, when in doubt, attribute "interesting" changes to the small sample.

So what did I find? Well, nothing much so far; I wrote this while I was waiting for some scripts to tell me stuff. But so far the approach does seem fairly reasonable as a first order measure. As an example, consider <http://tinyurl.com/bspre4t>. The reference word that I picked earlier was "Sexwelle", and the chart shows:

  - Sexwelle (sex wave)
  - Pulsare (plural of pulsar)
  - Mondlandungen (plural of moon landing)
  - Forschungsökonomie (could mean various things)
  - Waffentests (plural of weapon test)

Their curves are obviously quite similar, except that some of the words are used more frequently than others. "Weltwährungssystem", world currency system, also falls into this cluster, but if you add it to the chart it becomes difficult to see that "Waffentests" is similar. An open question is how to account for that.

The words most unlike Sexwelle in my system here would be the most common old ones, the most common being "der", which is essentially a flat line over the entire period (I use, as an aside, 1800-2006; the data from before 1800 isn't very good, and there are various problems with the data after 2000, which I guess is why the viewer defaults to 2000; the data goes to 2008, but 2007 and 2008 have far more books than any previous year). Similarly, 99% of the 1-grams are more similar to Sexwelle than "roth" (see above) in this period. If I had used the period 1930-2000 instead, they would be much more similar.

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/