- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Tue, 17 Apr 2012 03:23:45 +0200
- To: www-archive@w3.org
- Message-ID: <ubepo79ajkh27ibghbhj8h44lp73ddtm3b@hive.bjoern.hoehrmann.de>
So, I took the Google n-gram data for German and computed average frequen- cies of all 1-, 2-, 3-, and 4-grams (adding a space before and after the individual ngrams to capture word boundaries). The n-gram count for each year is normalized by the total ngram count for the year, the fractions are summed up and emitted for each n-gram, then for each sub-word n-gram the values are summed up, and then the values are normalized by the fre- quency of the letter "e", the most common letter in german texts, so you get a table like e 1.00 n 1.59 i 2.08 r 2.22 s 2.53 ... meaning, on average, there are two "e"s for every "i", which round about matches Wikipedia figures. I then made a quick script that looks up the values for individual words at each position (looking backwards for up to a 4-gram), to pick an earlier example: frieeicehe [ f ] 82.11 [ fr ] 606.49 [ fri] 7755.55 [frie] 2720.52 [riee] 43385.54 [ieei] 116176.62 [eeic] 1247642.96 [eice] 617849.22 [iceh] 2910353.91 [cehe] 3791287.61 [ehe ] 2631.34 For every "f" at the beginning of a word you get 82 "e"s on average, for "fr" at the beginning of a word 606 "e"s and so on. You get 3.8 million "e"s before you see a "cehe" anywhere in a german text; at least that is what Google's data suggests. Properly recognized the word would be this: friesische [ f ] 82.11 [ fr ] 606.49 [ fri] 7755.55 [frie] 2720.52 [ries] 3492.13 [iesi] 16797.64 [esis] 5968.66 [sisc] 1534.35 [isch] 74.83 [sche] 59.24 [che ] 91.95 Much lower numbers, meaning you are far more likely to encounter this word than the alphabet soup before. So, as a simple test I computed the arithmetic mean of these frequency values for each word, compared them for words where the Internet Archive and Google disagree, and picked the one with the lower value (on a per-word level, it would also be possible to compare letters directly, and simply fix the letters instead of re- placing whole words, but that is a bit more complicated, like, it needs a good diff algorithm, and as you make replacements, the average fre- quency values would change, which might have to be taken into account). The result is attached, showing the scan data, the winning word, and the guesses from the Internet Archive and Google respectively. There is some noise in the result around hyphenation and punctuation as before, but in the vast majority of cases, this simple metric picked the best possible choice. Exceptions include the second word for instance, "friesisch", but it picked "friedlich". That is not <http://tinyurl.com/czklr7h> very surprising, but in case of picking "theile" over the correct "theils" it might be, as "theils" used to be much more common than "theile". Well, it doesn't look at whole words, but it might be worth studying the case. The case of "Aehatichneit" versus "Aehnlichkctt" is a bit unfortunate, it's "Aehnlichkeit", so the latter is closer on the whole, but the com- bination "chkctt", six consonants in a row - there are only a few three- consonants combinations in german that appear relatively often (e.g., "sch", "cht", "rst", "str", "spr", "tst") - is quite damning. In total I count 8 cases where this picked the worse option, 1 in 20, 5 per cent. Cases were both don't have it quite right are probably around 20% with this sample page. This basically confirms my initial suspicion that you can correct many of the errors simply by picking the "more german" word where the Inter- net Archive and Google disagree. My derived n-gram frequencies are a- vailable from <https://github.com/hoehrmann/ngrams-data>. I note that I only picked the 1-gram data for this, there are no n-grams across words, no -- to pick something english -- "I am" even though that's a 4-gram. regards, -- Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de 25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Attachments
- text/html attachment: ngram-based-error-correction.html
Received on Tuesday, 17 April 2012 01:24:10 UTC