Some more OCR correction notes

Hi,

  Some weeks ago I looked into correcting OCR errors in scanned public
domain books, with http://archive.org/details/dienordfriesisc01bendgoog
being of particular interest to me. The basic idea stemmed from the fact
that there are many easily detectable errors, like confusing `d` and
`á`, and it seemed like clustering the glyphs with a simple similarity
function should do the trick, so you could fix misrecognized `á`s en
masse. As it turns out, that does indeed work well. Some notes about
problems I've found:

  * The OCR software likes to drop geometry, so the grapheme boundary
    data is sometimes wrong; this is particularly a problem with
    umlauts and other non-ASCII characters, but also with `i`s.

  * There are more segmentation errors than I originally thought from
    cursory inspection. Problem cases are things like "rn" being read
    as "m", while "m" itself might end up as "in" even if there is no
    dot (which is then hard to discover because the dotless `i`s you
    get due to the previous problem look very similar).

    This is quite bad because you cannot actually see this error if
    you just look at individual graphemes, so in some cases you can't
    mass-confirm or mass-change clusters.

  * Some characters are extremely hard to tell apart with simple
    comparison functions. My approach largely ignores "where" the main
    error region is, but in the fonts used in the book, "c" and "e"
    often differ only by a few pixels, albeit in a very consistent
    region. The same goes for "h" and "b", for "u" and "n", and also
    for some badly segmented grapheme parts.

  * My comparison function (compute absolute differences, erode, then
    count the remaining bad pixels; a rough sketch follows after this
    list) aligns graphemes at the top left edge. Since there is
    sometimes overlap, as with cursive text, this is not the best
    choice: an "a" may be pushed far down due to some noise at the
    edges. Originally this seemed to make sense since I was after "a"
    vs. "á", "à", "ä", and a-macron confusions, which should differ
    enough to make telling them apart easy; but since the segmentation
    often drops umlauts and diacritical marks, knowing that something
    is "a"-ish under bottom alignment of the two images is more
    valuable, since you can then redo the segmentation and might even
    be able to use other error correction methods more easily.
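
To make that comparison concrete, here is a minimal sketch of roughly
the kind of function I mean, using numpy and scipy; the erosion
structure and the padding details are assumptions, not necessarily
exactly what I use:

    import numpy as np
    from scipy.ndimage import binary_erosion

    def glyph_difference(a, b):
        """Compare two binary glyph images (2D bool arrays), aligned at
        the top left edge: take absolute differences, erode, and count
        the pixels that survive the erosion."""
        # Pad both images to a common size; area covered by only one
        # of the two images counts as difference.
        h = max(a.shape[0], b.shape[0])
        w = max(a.shape[1], b.shape[1])
        pa = np.zeros((h, w), dtype=bool)
        pb = np.zeros((h, w), dtype=bool)
        pa[:a.shape[0], :a.shape[1]] = a
        pb[:b.shape[0], :b.shape[1]] = b

        # For binary images the absolute difference is just XOR.
        diff = pa ^ pb

        # Erode to drop thin outlines caused by slightly different
        # stroke widths; only "thick" error regions remain.
        diff = binary_erosion(diff)

        # The number of remaining bad pixels is the dissimilarity.
        return int(diff.sum())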

http://www.websitedev.de/temp/dienordfriesisc01bendgoog-01.html.gz is a
snapshot of the text with some error markers. To detect errors I
extract and use some external features, in particular, for each
character:

  * A book-internal n-gram score. Basically I look at how many "sch"
    there are in the book, assuming everything is properly recognized,
    and use the count as one value for any "h" following "sc".

  * The cluster size, which is roughly "how many 'c's look very
    similar to this 'c'", with some caveats: the clustering code
    assigns glyph images to the most recently used cluster whose
    representative image is similar beyond some threshold, so a glyph
    is not necessarily assigned to the "most" similar cluster (a rough
    sketch of this assignment follows after this list).

  * The book-internal word count of the word the character appears
    in: if there are very many "der" in the book and something else
    also looks like "der", then the characters in it are probably
    properly recognized. If, on the other hand, the word does not
    occur anywhere else in the book, odds are that some of its
    characters, possibly "this one", are bad.

  * An external "German" word count; specifically, I used the German
    Google N-Gram data for the period 1800-1899 inclusive. There were
    a number of spelling reforms towards the end of that century, so
    you don't want to use more modern data; the catch is that the
    Google scans themselves have very many OCR errors in the older
    books. A good example is "stelien", which is not a German word,
    but "stellen" and "stehlen" are, and "li" is often confused with
    "h".

I essentially normalize these values to the range 0..1 and then take
the average. So in the example above, red letters indicate errors that
are "somewhere around here". If you ignore the English text at the top,
which is in a tiny font and entirely uninteresting to me, this works
extremely well: if you take the words where one of the characters has
"one of the worst" scores, those are basically guaranteed to be wrong,
for unfortunately thousands of words.
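
In case the scoring is unclear, here is a minimal sketch of the kind of
computation involved, showing only two of the features and assuming
each feature is simply scaled by its maximum over the book; the actual
weighting details differ:

    from collections import Counter

    def character_scores(words):
        """Score each character of each word by averaging normalized
        features; low scores suggest errors. Only the book-internal
        trigram count and word count features are shown here."""
        word_count = Counter(words)
        trigram_count = Counter(w[i:i+3] for w in words
                                for i in range(len(w) - 2))
        max_word = max(word_count.values())
        max_trigram = max(trigram_count.values())

        scores = []
        for word in words:
            per_char = []
            for i, ch in enumerate(word):
                # Trigram ending at this character, e.g. "sch" for the
                # "h" following "sc"; characters too close to the word
                # start just get a zero trigram feature in this sketch.
                tri = word[max(0, i - 2):i + 1]
                f_trigram = trigram_count.get(tri, 0) / max_trigram
                f_word = word_count[word] / max_word
                per_char.append((f_trigram + f_word) / 2)
            scores.append((word, per_char))
        return scores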

There are many other features you can extract to find errors. One
thing I wanted to do, for instance, was to restore italics, which the
OCR software does not seem to detect. That turns out to be very easy:
since I put similar characters into clusters, I can make a graph that
encodes "grapheme images in cluster X appear in words that also feature
images in cluster Y"; assuming intra-word font changes are uncommon,
you can then cluster the clusters into different "fonts". This works
extremely well; the HTML rendering of the book above features the
result of that.
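
That "cluster the clusters" step can be as simple as connected
components over the co-occurrence graph. Below is a sketch of the
idea; the minimum co-occurrence count is an arbitrary guard against
the occasional intra-word font change, not a value I have tuned:

    from collections import Counter
    from itertools import combinations

    def group_clusters_into_fonts(words_as_cluster_ids, min_cooccur=3):
        """Each word is given as the list of cluster ids of its glyphs.
        Clusters whose glyphs appear together in words often enough are
        merged into the same "font" (connected components via a small
        union-find)."""
        cooccur = Counter()
        for word in words_as_cluster_ids:
            for a, b in combinations(sorted(set(word)), 2):
                cooccur[(a, b)] += 1

        parent = {}
        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x
        def union(a, b):
            parent[find(a)] = find(b)

        for (a, b), n in cooccur.items():
            if n >= min_cooccur:
                union(a, b)

        fonts = {}
        for word in words_as_cluster_ids:
            for c in word:
                fonts.setdefault(find(c), set()).add(c)
        return list(fonts.values())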

So, once you have "fonts", you know that "a 'J' in this font is usually
so and so many pixels high and wide", and if there is a "J" with very
different dimensions even though it's in the same "font", well, that
probably indicates an error nearby. The same goes for placement
relative to the baseline (or rather relative to the word boundaries; I
have not tried to detect baselines), which is how I mass-corrected much
of the punctuation.
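
As a sketch of that dimension check: collect widths and heights per
(font, character) pair and flag glyphs that deviate a lot from the
median; the deviation threshold and the minimum sample size are just
placeholders:

    from statistics import median

    def flag_size_outliers(glyphs, max_deviation=0.25):
        """glyphs: list of dicts with "font", "char", "width", "height".
        Returns the glyphs whose size deviates too much from the median
        size of the same character in the same font."""
        groups = {}
        for g in glyphs:
            groups.setdefault((g["font"], g["char"]), []).append(g)

        suspicious = []
        for members in groups.values():
            if len(members) < 5:   # too few samples to judge
                continue
            mw = median(g["width"] for g in members)
            mh = median(g["height"] for g in members)
            for g in members:
                if (abs(g["width"] - mw) > max_deviation * mw or
                        abs(g["height"] - mh) > max_deviation * mh):
                    suspicious.append(g)
        return suspicious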

It's easy to come up with plenty of heuristic rules for this. Consider
that, as you actually correct things, you would accumulate a list of
common errors; so if "li" is often confused with "h", and a word with
"li" or "h" looks wrong, you could try to recompute the scores assuming
you replaced one with the other, or you could actually extract the "li"
combination and compare its image to other "h"s... I don't have a good
logic system that would allow me to easily specify such rules and tests
and use them to guide the error correction, and I'm afraid that without
some statistical information, automatically correcting errors is quite
difficult with my simple approach so far.
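
The "recompute the scores with the substitution applied" idea could
look something like this sketch; score_word here stands for whatever
combination of the features above you use, and the confusion list and
margin are of course just examples:

    # Hypothetical helper: score_word(word) returns a 0..1 score for a
    # word using the features described above (higher = more plausible).
    CONFUSIONS = [("li", "h"), ("h", "li"), ("rn", "m"), ("m", "rn"),
                  ("e", "c"), ("c", "e")]

    def suggest_corrections(word, score_word, margin=0.1):
        """Try each known confusion on a suspicious word and suggest
        substitutions that improve its score by more than a margin."""
        base = score_word(word)
        suggestions = []
        for wrong, right in CONFUSIONS:
            pos = word.find(wrong)
            while pos != -1:
                candidate = word[:pos] + right + word[pos + len(wrong):]
                if score_word(candidate) > base + margin:
                    suggestions.append(candidate)
                pos = word.find(wrong, pos + 1)
        return suggestions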

Consider "eb" versus "ch", there are plenty of words where this could go
either way, say german "lieben", to love, and english "lichen". Now, for
mostly german text the former would be more likely, but it also features
english, and the difference between the two might be something like four
pixels in very specific locations, that's just not enough to go on with-
out risking to make things worse. Actually, some of my corrections made
things worse, so there are errors in the HTML version above that are not
in the original.

One thing I've tried, with much pain, is to train Tesseract 3.01 to
make things better. I manually corrected some of the pages, so there is
plenty of "good" data, and I also tried to correct segmentation errors
in all of that data, though not very thoroughly. Well, Tesseract is,
well, not very good software. Plenty of crashes, real ones and assert()
ones, very inconsistent input formats, bizarre limitations, and
naturally poor documentation; I actually made a long list of complaints
on IRC... But I don't give up so easily, so I made a script that
extracts box files and a couple of scripts that invoke the relevant
Tesseract training programs based on them, and, well, it did produce
usable results. The hard things, however, it could not deal with
either: despite being given some word lists and other optional data,
like typical confusions, it too stumbled over e<>c and the other common
issues. I can't say whether it's actually better or worse overall, as I
only looked at a couple of examples, and it's kind of hard to judge if
you do not have verified reference data, but it did not look promising.

Still, as the text stands right now, without much effort, you can at
least grep for things and usually find words you know are in there;
that did not work very well with the original input. Anyway, just so I
don't forget things,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
