Random notes on things, and optical character recognition

So,

  How hard can it be? I tend to be driven by that question, which may
not be very surprising (http://tinyurl.com/6rrjskn) as I seem to have
grown up with it; http://cutycapt.sf.net/ for instance exists mainly
because people kept asking me for a http://iecapt.sf.net for Linux. So
I sat down some weekend, looked through the Qt documentation, confirmed
that it seems to expose all the needed functionality, configured and
built Qt, encountering various bugs in Qt and reporting them, coded up
something that seemed to work, did a little cleanup, registered a
SourceForge project for it, made a brief web page for it, installed a
recent Ubuntu version in a virtual machine, tried to build it there,
fixed the bugs that showed up, uploaded it all, and then I could answer
the question: despite having no experience with Qt, it's not really
hard.

That isn't without problems. Even though I have found CutyCapt useful in
a number of situations since, my interest really ended there, and I've
not done much to maintain CutyCapt past that proof-of-concept stage,
which probably discouraged others for some time from rolling their own
and maintaining it with more interest than I do. Well, I am not so
worried about that actually; what I find more unfortunate is that I did
not really think or care about how interested people were going to be in
CutyCapt. If I had, I would have done a number of things differently: in
pseudo-commercial terms, I would have cared more about branding, about
setting up some community, about recruiting some development help by way
of such a community, about paying more attention to bug reports and the
various things people sent me... You know, back in school, people were
turning 18 years old or whatever, and friends would want to do something
special, like presenting a custom drink mixed according to some "family
recipe", and I'd say this needs a proper custom brand name, you need
some logo design on the bottle, a good slogan, some custom story on the
backside, make it an event, and then we sat down and did just that. So
instead of "Do you remember when we made that custom drink and made it a
present to someone or other?" "Oh yeah, that was nice, and someone or
other liked it." you actually got something resembling a story about the
whole thing.

Right, the idea being that you actually made a finished product, not
just a proof-of-concept, not something purely ad-hoc, not just the pure
technicality of "we mixed you a drink", but a bottle with a backside
explaining

  Nach alter nordfriesischer Tradition wird zwischen Dagebüll und
  Fahretoft das DAGEBÜLLER WASSER täglich frisch der Nordsee entnommen...

  (Following old North Frisian tradition, DAGEBÜLLER WASSER is drawn
  fresh from the North Sea daily between Dagebüll and Fahretoft...)

Anyway, optical character recognition has been a recent "how hard can it
be" thing for me, and I wanted to record some notes on that. The art, in
its digital form, is something like 80 years old, so one would think it
is a solved problem by now, but given that neither Google nor the Inter-
net Archive can provide plain text versions of public domain books they
distribute that are in any way readable, at least not the ones I care a-
bout, someone has to answer the question.

Actually, my underlying motivation is somewhat different. Let's say you
want to preserve a dying language, what would you do? Well, death would
be having no speakers anymore (and no writers really, otherwise you'd
have to consider Modern High German, sometimes misleadingly called
Standard German, to have been dead through much of its history, since it
has been a written language rather than a spoken one for most of that
time), so you would want more speakers, more writers, more readers, and
to get that, you would need education, and education depends on having
access to relevant educational information; on a very basic level that
would mean things like a dictionary.

Right, if you ask your favourite search engine about a Latin word you do
not know, it will probably come up with texts that use it, link to
dictionaries, maybe even offer to translate it, and so on. It might find
tools that analyze words morphologically, perhaps even spell checkers; a
whole ecosystem of useful resources.

That's how it should work, right, but not all languages are so lucky. In
my particular case your favourite search engine would usually turn up no
results unless it's an extremely basic word or if some other language is
using the "same" word. There is no dictionary online for my semi-native
language, there are only exactly two "wordlists" with poor coverage and
unclear "licensing terms", and the bigger of the of the two maps between
my language and a related one with similar problems. That language in
turn does have online dictionaries, but they are somewhere between very
insufficient and very unfree, making them not very useful. The most
useful one maps to a language with around 30 million speakers, so if you
wanted to map the wordlist into, say, English, you would have to go over
three hops, which would likely produce horrible results, and you could
not publish the result either, at least not without a serious talk with
a qualified lawyer.

Now there are "dictionaries" in wood media form, and I use the plural
lightly here, but they are not online with proper licensing terms, so
either nobody knows how to do that, nobody had the idea, or there are
knowledge hoarding laws that stand in the way. I don't care which it is,
but there is one exception in the form of a public domain book. It does
not use modern orthography, but how hard can it be to map between its
style and more modern forms? I mean to answer that, so...

Optical character recognition. As noted, the Internet Archive makes its
scanning data available in forms that allow for post-processing, and as
there are many errors in the plain text versions that should not really
happen, given that different instances of the same grapheme in the book
look very much alike, I figured it might be possible to use some
supervised technique to
fix most of the errors.

The general idea is that you would compare the graphemes to each other
and use some sort of clustering algorithm to group them, so supervisors
can correct many errors with little effort. For that I came up with a
simple similarity measure. Basically you make two scan images the same
size, then run something like ImageMagick's `convert` à la

  % convert a.png b.png -compose difference -composite \
      -morphology erode diamond:1 out.png

and then determine the ratio of black pixels in the output image. This
would compute the absolute difference between all the pixels (`xor` the
pixels as the images are monochrome) and then "erode" the result. With
the "diamond:1" kernel, that means a pixel is made black if any of the
pixels directly above, below, right, or left are black. To illustrate,
if the difference between two images is black with a single white line,
which might happen if you compare two "I"s, one offset by one pixel, or
"thicker" by one pixel, the entire line would disappear, because all the
pixels have a black neighbour. On the other hand, if you, say, compare a
dotless "i" with an "i", the dot in the difference would probably be
more like a small circle, let's say a 3x3 white square, and erosion
would retain only the white pixel in the middle, as all of its
neighbours are also white.
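
In Python terms, a rough equivalent of this measure might look like the
sketch below; Pillow and scipy are assumed, and padding with white to a
common size is just one way of making the two images the same size.

  # Rough Python sketch of the difference-and-erode measure above;
  # assumes black glyphs on a white background, as in the scans.
  import numpy as np
  from PIL import Image
  from scipy.ndimage import binary_erosion

  def similarity(path_a, path_b):
      """Share of black pixels left after differencing and eroding."""
      a = np.array(Image.open(path_a).convert("1"), dtype=bool)
      b = np.array(Image.open(path_b).convert("1"), dtype=bool)
      h = max(a.shape[0], b.shape[0])
      w = max(a.shape[1], b.shape[1])
      pa = np.ones((h, w), dtype=bool)      # pad with white background
      pb = np.ones((h, w), dtype=bool)
      pa[:a.shape[0], :a.shape[1]] = a
      pb[:b.shape[0], :b.shape[1]] = b
      diff = pa ^ pb                        # white where the images differ
      diamond = np.array([[0, 1, 0],
                          [1, 1, 1],
                          [0, 1, 0]], dtype=bool)
      eroded = binary_erosion(diff, structure=diamond)
      return 1.0 - eroded.mean()            # 1.0 means "identical"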

I have found that if the amount of white pixels in the difference is
around 0.5% or less, you can be fairly certain that the two images
really show "the same" grapheme, so you put images like that into the
same cluster directly.
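
As a sketch, a naive greedy pass over that idea could look like this,
comparing each image only against the first member of every existing
cluster, which is a simplification:

  # Strict first pass: put an image into the first cluster whose first
  # member it matches at 99.5% or better, else start a new cluster.
  # similarity() is the sketch from above.
  def cluster_strict(images, threshold=0.995):
      clusters = []
      for img in images:
          for cluster in clusters:
              if similarity(img, cluster[0]) >= threshold:
                  cluster.append(img)
                  break
          else:
              clusters.append([img])
      return clusters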

That's pretty strict though; you would end up with too many clusters. A
particular problem is that sometimes there is some overlap: consider, in
italics, a string like "Lj", where the "L" might extend a bit into the
"j" and the "j" a bit into the "L" if you cut them into rectangles. So
after this strict clustering, you may want to merge clusters that are
quite similar "on average".

The criterion I've used: randomly pick 10 images from each of the two
clusters, giving 10 pairs to compare; if the pairs are on average 98% or
more "similar" (meaning 2% or fewer white pixels in the difference image
as discussed above), you can safely merge the two clusters.
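
As a sketch, with the sample size and threshold just mentioned, the
merge test could be as simple as:

  # Merge test: sample up to 10 members from each cluster, pair them up,
  # and require an average similarity of 98% or better.
  # similarity() is the sketch from further above.
  import random

  def should_merge(cluster_a, cluster_b, samples=10, threshold=0.98):
      pick_a = random.sample(cluster_a, min(samples, len(cluster_a)))
      pick_b = random.sample(cluster_b, min(samples, len(cluster_b)))
      scores = [similarity(a, b) for a, b in zip(pick_a, pick_b)]
      return sum(scores) / len(scores) >= threshold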

I did this first with the bigger clusters to keep the number of compari-
sons down and the number of members high (so it's actually hard for very
"rare" characters to become part of bigger clusters), and then for the
very small clusters. I exploited the fact that the OCR software the
Internet Archive is using already offers some pre-clustering, in the
form of attempting to assign Unicode scalar values to the images, so I
did this
"among all 'g's" first, so to speak, and then for the "this image is too
different from all other 'g's" images, i.e., clusters with only one mem-
ber after doing the 99.5% merging step.

The book has about 900 000 images of (supposed) characters. This simple
process (roughly, merge all clusters that are 99.5% similar on average,
then merge all clusters that are 98% similar on average) left me with a-
bout 700 clusters with 50 or more members (the book has 532 pages, so a
smaller cluster would have images that occur at most once every 10 pages
on average) and something like 30 000 clusters with fewer members.

Then I made a simple tool that lets me re-classify or confirm the OCR
software's guess as to which character a cluster of images is, with the
initial suggestion being the "dominant" character in a cluster (take a
pair like "t" and "l" which can easily be confused; if a cluster is, by
the OCR software's reckoning, mostly "t"s, my tool would suggest the
cluster is all "t"s) and fixed some things. For instance, the Internet
Archive's software thinks there are no "a"s with accent or grave or ma-
cron, so these are all "a"s, "ä"s, or something else. That's easily
fixed in this way. There is also the option in my tool to say that some
image cluster is too broken to assign a character, or that I am unsure
my guess is correct.
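
The suggestion logic itself is trivial, something like the following
sketch, assuming each image still carries the character the OCR software
originally assigned to it:

  # Suggest the most common OCR-assigned character among the members of
  # a cluster; ocr_label maps an image to the OCR software's guess.
  from collections import Counter

  def suggest_label(cluster, ocr_label):
      counts = Counter(ocr_label[img] for img in cluster)
      return counts.most_common(1)[0][0]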

I did this for around 2000 clusters, which is quite easy and quickly
done, and then made another tool that shows actual words from the book
so I can check my re-assignments in context, particularly the
assignments I had been unsure about. The main thing to take away from
that is that the Internet Archive's software is not that good at
splitting a word image into character images; say, it would split a
cursive "N" into an "i" and a "V". If you google for "deiVoniMT" right
now, the only hit would be the plain text version of the book, though
the text is actually "de Nánner". There it also confuses "á" and "o",
partly because it ignores the "´", which is not actually within the
boundaries of the grapheme image, and the "ne" it splits into something
looking like a dotless "i" and a dotless "i" followed by "e", except
that it splits the "e" in the middle, and you end up with rubbish like
"deiVoniMT" (vs "de Nánner").

Anyway, this fixed up around 4% of the book, with only around 3% of the
images left to go, the manual work to verify my re-assignments being
redundant. That's something like 20 minutes of human labour investment,
if you take the tools for granted and in proper working order.

In general the manual step would not be very necessary if I had a good
wordlist for the languages involved. If you have, say, "Gänseridi", but
are unsure about the "di" (which is split into something looking like
"cl" followed by a dotless "i") because some of the clusters involved
are very rare, you could grep a wordlist for "Gänseri", would probably
find that the only hit that fits is "Gänserich" (a gander), and go with
that; but such a wordlist is just what I don't have.
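
With such a wordlist, the lookup itself would be trivial, along the
lines of:

  # Hypothetical wordlist lookup: all words starting with the part of
  # the word that is already certain.
  def candidates(prefix, wordlist):
      return sorted(w for w in wordlist if w.startswith(prefix))

  # candidates("Gänseri", wordlist)  ->  ideally just ["Gänserich"]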

The most promising next step seems to be looking at the wrong splits; in
the particular example I could re-combine the "cl" and the dotless "i"
and compare the result to "ch" instances. Looking at bi-grams rather
than unigrams might generally be a good idea, and would also be very
helpful with overlaps like the "Lj" example earlier...
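
As a sketch, re-combining two neighbouring glyph images is simple enough
with Pillow, ignoring baseline alignment here; the result could then be
run through the same similarity measure as before:

  # Paste two neighbouring glyph images next to each other so the result
  # can be compared against, say, the "ch" cluster. Vertical alignment
  # is naive (both images are pasted at the top).
  from PIL import Image

  def stitch(path_left, path_right):
      left = Image.open(path_left).convert("1")
      right = Image.open(path_right).convert("1")
      height = max(left.height, right.height)
      out = Image.new("1", (left.width + right.width, height), 255)
      out.paste(left, (0, 0))
      out.paste(right, (left.width, 0))
      return out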

But in any case, the principal idea here, that you could "cluster" many
of the errors easily to correct them cheaply manually, seems to be quite
sound so far. From the proof-of-concept perspective this is a big win,
even though from the product-development perspective it is somewhat de-
pressing, as I will never release much of my tool chain or the data, ex-
cept my general approach and possibly end-results, should I actually
manage to get a properly OCR'd version of the book out of this.

If I do, I would then look into how to turn the book into something that
might be considered a dictionary. From there it would be finding a way
to match it to modern orthography, and with or without that, I'll like-
ly look into turning this into some form of e-book, for the book as a
whole, and into some Android application, for any dictionary part that
might come out of this, so all the dozens of people interested in my
semi-native language have something to download. More relevant to my own
interests would be folding this into some etymological graph of words
across all the Ingvaeonic languages...
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
