Using n-gram frequencies for error correction from Bjoern Hoehrmann on 2012-04-17 (www-archive@w3.org from April 2012)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Tue, 17 Apr 2012 03:23:45 +0200
To: www-archive@w3.org
Message-ID: <ubepo79ajkh27ibghbhj8h44lp73ddtm3b@hive.bjoern.hoehrmann.de>
So,

  I took the Google n-gram data for German and computed average frequen-
cies of all 1-, 2-, 3-, and 4-grams (adding a space before and after the
individual ngrams to capture word boundaries). The n-gram count for each
year is normalized by the total ngram count for the year, the fractions
are summed up and emitted for each n-gram, then for each sub-word n-gram
the values are summed up, and then the values are normalized by the fre-
quency of the letter "e", the most common letter in german texts, so you
get a table like

  e       1.00
  n       1.59
  i       2.08
  r       2.22
  s       2.53
  ...

meaning, on average, there are two "e"s for every "i", which round about
matches Wikipedia figures. I then made a quick script that looks up the
values for individual words at each position (looking backwards for up
to a 4-gram), to pick an earlier example:

  frieeicehe
    [ f  ]              82.11
    [ fr ]             606.49
    [ fri]            7755.55
    [frie]            2720.52
    [riee]           43385.54
    [ieei]          116176.62
    [eeic]         1247642.96
    [eice]          617849.22
    [iceh]         2910353.91
    [cehe]         3791287.61
    [ehe ]            2631.34

For every "f" at the beginning of a word you get 82 "e"s on average, for
"fr" at the beginning of a word 606 "e"s and so on. You get 3.8 million
"e"s before you see a "cehe" anywhere in a german text; at least that is
what Google's data suggests. Properly recognized the word would be this:

  friesische
    [ f  ]              82.11
    [ fr ]             606.49
    [ fri]            7755.55
    [frie]            2720.52
    [ries]            3492.13
    [iesi]           16797.64
    [esis]            5968.66
    [sisc]            1534.35
    [isch]              74.83
    [sche]              59.24
    [che ]              91.95

Much lower numbers, meaning you are far more likely to encounter this
word than the alphabet soup before. So, as a simple test I computed the
arithmetic mean of these frequency values for each word, compared them
for words where the Internet Archive and Google disagree, and picked the
one with the lower value (on a per-word level, it would also be possible
to compare letters directly, and simply fix the letters instead of re-
placing whole words, but that is a bit more complicated, like, it needs
a good diff algorithm, and as you make replacements, the average fre-
quency values would change, which might have to be taken into account).

The result is attached, showing the scan data, the winning word, and the
guesses from the Internet Archive and Google respectively. There is some
noise in the result around hyphenation and punctuation as before, but in
the vast majority of cases, this simple metric picked the best possible
choice. Exceptions include the second word for instance, "friesisch",
but it picked "friedlich". That is not <http://tinyurl.com/czklr7h> very
surprising, but in case of picking "theile" over the correct "theils" it
might be, as "theils" used to be much more common than "theile". Well,
it doesn't look at whole words, but it might be worth studying the case.

The case of "Aehatichneit" versus "Aehnlichkctt" is a bit unfortunate,
it's "Aehnlichkeit", so the latter is closer on the whole, but the com-
bination "chkctt", six consonants in a row - there are only a few three-
consonants combinations in german that appear relatively often (e.g.,
"sch", "cht", "rst", "str", "spr", "tst") - is quite damning. In total I
count 8 cases where this picked the worse option, 1 in 20, 5 per cent.
Cases were both don't have it quite right are probably around 20% with
this sample page.

This basically confirms my initial suspicion that you can correct many
of the errors simply by picking the "more german" word where the Inter-
net Archive and Google disagree. My derived n-gram frequencies are a-
vailable from <https://github.com/hoehrmann/ngrams-data>. I note that I
only picked the 1-gram data for this, there are no n-grams across words,
no -- to pick something english -- "I am" even though that's a 4-gram.

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Attachments

text/html attachment: ngram-based-error-correction.html
Received on Tuesday, 17 April 2012 01:24:10 UTC