- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Fri, 15 Jun 2012 04:25:32 +0200
- To: www-archive@w3.org
Hi,

Some weeks ago I looked into correcting OCR errors in scanned public domain books, with http://archive.org/details/dienordfriesisc01bendgoog being of particular interest to me. The basic idea stemmed from the fact that there are many easily detectable errors, like confusing `d` and `á`, and it seemed like clustering the glyphs with a simple similarity function should do the trick, so you could fix misrecognized `á`s en masse. As it turns out, that does indeed work well. Some notes about problems I've found:

* The OCR software likes to drop geometry, so the grapheme boundary data is sometimes wrong; this is particularly a problem with umlauts and other non-ASCII characters, but also with `i`s.

* There are more segmentation errors than I originally thought from cursory inspection. Problem cases are things like "rn" being read as "m", while "m" itself might end up as "in", even if there is no dot (which is then hard to discover because the dotless `i`s you get due to the previous problem look very similar). This is quite bad because you cannot actually see this error if you just look at individual graphemes, so in some cases you can't mass-confirm or mass-change clusters.

* Some characters are extremely hard to tell apart with simple comparison functions. My approach largely ignores "where" the main error region is, and in the fonts in the book, "c" and "e" often differ only by a few pixels, though in a very consistent region. The same goes for "h" and "b", for "u" and "n", and for some badly segmented grapheme parts.

* My comparison function (compute absolute differences, erode, then count bad pixels) aligns graphemes at the top left edge. Since there is sometimes overlap, as with cursive text, this is not the best choice: an "a", say, may be pushed far down due to some noise at the edges. Originally this seemed to make sense since I was after "a" vs. "á", "à", "ä" and a-macron confusions, which should look different enough under top alignment to be easy to tell apart, but since the segmentation often drops umlauts and diacritical marks, knowing that something is "a"-ish under bottom alignment of the two images is more valuable: you can then redo the segmentation and might even be able to use other error correction methods more easily.
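For reference, here is a rough sketch of the kind of comparison and clustering I mean, assuming the glyphs are already cut out as binary images; the function names and the threshold value are just made up for illustration:

  import numpy as np
  from scipy import ndimage

  def glyph_distance(a, b):
      """Align two binary glyph images at the top-left corner, take the
      absolute difference, erode it once to drop thin edge noise, and
      count the remaining 'bad' pixels."""
      h = max(a.shape[0], b.shape[0])
      w = max(a.shape[1], b.shape[1])
      pa = np.zeros((h, w), dtype=bool)
      pb = np.zeros((h, w), dtype=bool)
      pa[:a.shape[0], :a.shape[1]] = a
      pb[:b.shape[0], :b.shape[1]] = b
      diff = pa ^ pb                       # absolute difference of binary images
      diff = ndimage.binary_erosion(diff)  # ignore isolated one-pixel disagreements
      return int(diff.sum())

  def cluster_glyphs(glyphs, threshold=20):
      """Greedy clustering: each glyph goes into the most recently used
      cluster whose representative image is within the threshold, which
      is not necessarily the most similar cluster; otherwise a new
      cluster is started."""
      clusters = []  # each cluster: [representative image, list of glyph indices]
      for i, g in enumerate(glyphs):
          for k in range(len(clusters) - 1, -1, -1):   # most recently used first
              rep, members = clusters[k]
              if glyph_distance(rep, g) <= threshold:
                  members.append(i)
                  clusters.append(clusters.pop(k))     # mark as most recently used
                  break
          else:
              clusters.append([g, [i]])
      return clusters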
http://www.websitedev.de/temp/dienordfriesisc01bendgoog-01.html.gz is a snapshot of the text with some error markers. To detect errors I extract and use some external features, in particular, for each character:

* A book-internal n-gram score. Basically I look at how many "sch" there are in the book, assuming everything is properly recognized, and use the count as one value for any "h" following "sc".

* The cluster size, which is roughly "how many 'c's look very similar to this 'c'", with some caveats (the clustering code assigns glyph images to the most recently used cluster whose representative image is similar beyond some threshold, so a glyph is not necessarily assigned to the "most" similar cluster).

* The book-internal word count of the word the character appears in: if there are very many "der" and something else also looks like "der", then the characters in it are probably properly recognized. If, on the other hand, the word does not occur anywhere else in the book, odds are that some character, possibly "this one", is bad.

* An external "German" word count; specifically, I used the German Google N-Gram data for the period 1800-1899 inclusive. There were a number of spelling reforms towards the end of the century, so you don't want to use more modern data; the downside is that the Google scans have very many OCR errors in the older books. A good example is "stelien", which is not a German word, but "stellen" and "stehlen" are, and "li" is often confused with "h".

I normalize these values to the range 0..1 and then take the average, essentially. So in the example above, red letters indicate errors that are "somewhere around here". If you ignore the English text at the top, which is in a tiny font and entirely uninteresting to me, this works extremely well: if you take the words where one of the characters has "one of the worst" scores, the word is basically guaranteed to be wrong, for unfortunately thousands of words.

There are many other features you can extract to find errors. One thing I wanted to do, for instance, was to restore italics, which the OCR software does not seem to detect. That turns out to be very easy: since I put similar characters into clusters, I can make a graph that encodes "grapheme images in cluster X appear in words that also feature images in cluster Y", meaning that, if intra-word font changes are uncommon, you can cluster the clusters into different "fonts". That works extremely well; the HTML rendering of the book above features the result of that.
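Roughly, that cluster co-occurrence idea can be sketched like so, assuming each word is available as the list of cluster ids its glyph images were assigned to; the names and the cutoff are placeholders:

  from collections import Counter
  from itertools import combinations

  def group_clusters_into_fonts(words, min_cooccurrence=3):
      """Count how often two clusters appear in the same word, keep the
      sufficiently frequent pairs as graph edges, and take connected
      components of that graph as 'fonts'."""
      edges = Counter()
      for ids in words:
          for a, b in combinations(sorted(set(ids)), 2):
              edges[(a, b)] += 1

      parent = {}                          # union-find over cluster ids
      def find(x):
          parent.setdefault(x, x)
          while parent[x] != x:
              parent[x] = parent[parent[x]]
              x = parent[x]
          return x

      for ids in words:
          for c in ids:
              find(c)                      # register every cluster, even isolated ones
      for (a, b), n in edges.items():
          if n >= min_cooccurrence:
              parent[find(a)] = find(b)    # union the two components

      fonts = {}
      for c in list(parent):
          fonts.setdefault(find(c), []).append(c)
      return list(fonts.values())

With plain connected components like this, clusters that are shared across all fonts, punctuation say, would collapse everything into one component, so in practice you would have to exclude such clusters or do something smarter than a fixed cutoff.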
So, once you have "fonts", you have things like "a 'J' in this font is usually so and so many pixels high and wide", and if there is a "J" with very different dimensions even though it's in the same "font", well, that's probably an error nearby. The same goes for placement relative to the baseline (or rather the word boundaries; I have not tried to detect baselines), which is how I mass-corrected much of the punctuation. It's easy to come up with plenty of heuristic rules for this. Consider that if you actually correct things, you get a list of common errors, so if "li" is often confused with "h", and a word with "li" or "h" looks wrong, you could try to recompute the scores assuming you replaced one with the other, or you could actually extract the "li" combination and compare that image to other "h"s...

I don't have a good logic system that would allow me to easily specify rules and tests and so on, and to guide some of the error correction, and I'm afraid that without some statistical information, automatically correcting errors is quite difficult with my simple approach so far. Consider "eb" versus "ch": there are plenty of words where this could go either way, say German "lieben", to love, and English "lichen". For mostly German text the former would be more likely, but the book also features English, and the difference between the two might be something like four pixels in very specific locations; that's just not enough to go on without risking making things worse. Actually, some of my corrections did make things worse, so there are errors in the HTML version above that are not in the original.

One thing I've tried, with much pain, is to train Tesseract 3.01 to make things better. I manually corrected some of the pages, so there is plenty of "good" data, and I also tried to correct segmentation errors in all of that data, though not very thoroughly. Well, Tesseract is, well, not very good software. Plenty of crashes, real ones and assert() ones, very inconsistent input formats, bizarre limitations, and naturally poor documentation; I actually made a long list of complaints on IRC... But I don't give up so easily, so I made a script that extracts box files and a couple of scripts that invoke the relevant Tesseract programs for the training based on them, and it did produce usable results. But despite my providing some word lists and other optional data, like typical confusions, it too could not deal with the hard cases, e<>c and the other common confusions. I can't say whether it's actually better or worse than the original OCR; I only looked at a couple of examples, and it's rather hard to judge if you do not have verified reference data, but it did not look promising.

Still, as the text stands right now, without much effort, you can at least grep it and usually find words you know are in there, which did not work very well with the original input.

Anyway, just so I don't forget things,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/