Machine-guided clustering of similar glyph images

So,

  Today we start again with https://gist.github.com/2395307 stripped of
the parts that try to merge plain text. A simple measure of the
distance between two glyph images that works well with this data is to
change the extent of both images to the maximum width and height of the
two, compose their difference, and then apply a morphology filter to
get rid of mostly irrelevant artifacts; in my case I've used the Erode
filter that ImageMagick supports, with a Diamond:1 kernel, essentially
a plus sign. This time I made no attempt to align the images; they are
simply placed in the top left corner.
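
In case the prose is unclear, here is a minimal Python sketch of that
measure; the white padding for the extent change, the binarization
threshold, and the use of scipy instead of ImageMagick are choices made
for the sketch, not necessarily what the actual script does:

  import numpy as np
  from PIL import Image
  from scipy.ndimage import binary_erosion

  # Plus-shaped structuring element, like ImageMagick's Diamond:1 kernel.
  DIAMOND = np.array([[0, 1, 0],
                      [1, 1, 1],
                      [0, 1, 0]], dtype=bool)

  def black_fraction(path_a, path_b, threshold=128):
      """Fraction of black pixels in the eroded difference image."""
      a = np.asarray(Image.open(path_a).convert('L'))
      b = np.asarray(Image.open(path_b).convert('L'))

      # Extend both images to their common maximum extent; the glyphs
      # sit in the top left corner, no attempt is made to align them.
      h, w = max(a.shape[0], b.shape[0]), max(a.shape[1], b.shape[1])
      pa = np.full((h, w), 255, np.uint8)
      pa[:a.shape[0], :a.shape[1]] = a
      pb = np.full((h, w), 255, np.uint8)
      pb[:b.shape[0], :b.shape[1]] = b

      # Difference image: True where the two images disagree noticeably.
      diff = np.abs(pa.astype(int) - pb.astype(int)) >= threshold

      # Erode to get rid of the mostly irrelevant one-pixel artifacts.
      diff = binary_erosion(diff, structure=DIAMOND)

      # "Black" pixels are those where the eroded difference is empty.
      return 1.0 - diff.mean()

The percentages used below are this fraction of black pixels expressed
as a percentage.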

I've attached a simple example that shows, for a couple of characters,
the eroded difference image alongside the percentage of black pixels in
it. Percentages above 97.5 are highlighted in green, if you'll excuse
me using a <span> with a background color instead of a <strong> or <em>
or some other more appropriate element. This turns out to be a fairly
good measure, so I ran it over the entire book.

For each character the Internet Archive's OCR software thought it saw I
made a group (so one group for "ä" and one for "Q" and so on), and then
for each such character I would check whether it is similar to an image
that is already in its group; if not, I would add the new image to the
group, otherwise I simply count the character as another instance of
one seen earlier. Similar here means that at most 2% of the pixels in
the difference image (with both images extended, as above, to their
common maximum dimensions) are not black. Every 5000 characters I would
throw out of all groups those images that have only been seen once so
far; that is a minor optimization to make the script complete within a
few hours; I could have used other measures, but that seemed to work
well enough.
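
A sketch of that loop, reusing the black_fraction function from the
sketch above; the exact bookkeeping is an assumption, only the 2%
threshold and the pruning interval are as described:

  from collections import defaultdict

  SAME_GLYPH = 0.98   # at most 2% of the pixels may differ

  def group_representatives(glyphs):
      """glyphs: iterable of (ocr_character, image_path) pairs in book
      order; returns {character: {representative_path: count}}."""
      groups = defaultdict(dict)
      for seen, (char, path) in enumerate(glyphs, 1):
          reps = groups[char]
          for rep in reps:
              if black_fraction(rep, path) >= SAME_GLYPH:
                  reps[rep] += 1      # another instance of a known image
                  break
          else:
              reps[path] = 1          # a new image for this group

          # Every 5000 characters drop images seen only once so far, a
          # minor optimization so the script completes within hours.
          if seen % 5000 == 0:
              for c in groups:
                  groups[c] = {r: n for r, n in groups[c].items()
                               if n > 1}
      return groups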

The result is, for each character, a set of images that all supposedly
represent the character but are slightly different from each other; a
simple example would be a latin small letter "k" at font size 10,
another one at font size 20, and maybe another one at font size 10 but
cursive instead of regular. In practice you have many more due to
smears and OCR errors, but the idea is the same.

Now, a regular "k" with some smears is more similar to other regular
"k" representations than to a cursive "k", and cursive "h"s that have
been misidentified as "k"s would again be a bit different. So I
compared all the "k"s to each other and put them into a graph. Each
"representative" is a node in the graph, the node weight is the number
of times an instance of it has been seen in the book, and between any
two "k"s there is an edge if at least 96% of the pixels in their eroded
difference image are black, with the edge weight being that percentage
of black pixels. I've used 96% because this measure has to be slightly
more tolerant than the earlier one, otherwise there would be no edges
(since all images within an error rate of 2% would already be instances
of the representatives).
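
In sketch form, with networkx used purely for convenience here (the
actual code does not use it) and black_fraction as above:

  import networkx as nx

  EDGE_THRESHOLD = 0.96   # more tolerant than the 98% grouping measure

  def similarity_graph(reps):
      """reps: {representative_path: instance_count} for one character,
      e.g. all the "k"s."""
      g = nx.Graph()
      for rep, count in reps.items():
          g.add_node(rep, weight=count)
      paths = list(reps)
      for i, a in enumerate(paths):
          for b in paths[i + 1:]:
              similarity = black_fraction(a, b)
              if similarity >= EDGE_THRESHOLD:
                  g.add_edge(a, b, weight=similarity)
      return g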

I then used my http://search.cpan.org/dist/Graph-NewmanGirvan/ module to
cluster the nodes in the graph and put the result into a table. That's:

  http://www.websitedev.de/temp/clustered-graphemes.html.gz
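
For readers without the Perl module, the Girvan-Newman implementation
in networkx together with the usual maximum-modularity stopping rule
can serve as a rough stand-in; this is not necessarily what the Perl
module computes internally:

  from networkx.algorithms.community import girvan_newman, modularity

  def newman_girvan_clusters(g):
      """Pick the partition with the highest modularity from the
      Girvan-Newman dendrogram; assumes g has at least one edge."""
      best = [set(g.nodes)]
      best_q = modularity(g, best)
      for partition in girvan_newman(g):
          communities = [set(c) for c in partition]
          q = modularity(g, communities)
          if q > best_q:
              best, best_q = communities, q
      return best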

In that table you can see which character the OCR software thought the
images are, the number of instances belonging to the cluster, and then
each of the representative images that belong to the cluster along with
their individual instance counts. If you take a look at the "ö" part of
it, you can see there are clusters for "regular ö", "cursive ö", "o
with macron" (misidentified as "ö"), "o with accent" (also
misidentified), very small "ö"s that appear in the Google boilerplate
in front of the actual book, gross errors like "fö" and "a" and "o"
misidentified as "ö", and so on.

For the clustering I've ignored representatives that have fewer than 3
instances, just like I filtered out very rare representatives earlier.
Also note that this is based on the tight bounding boxes of the
grapheme images; so far I ignore their position relative to the
baseline, so you get some odd results for commas versus apostrophes and
issues like that.

A notable problem is that many of the images are not cleanly separated;
a cursive "ij" for instance cannot be split into two rectangles as the
two letters overlap, so there is a bit of "dirt" on the edges of many
of the images due to issues like that. I suspect this would work a lot
better if I had a good method to remove it. I could try a simple
solution, like moving a 5x5 block over the image along the edges and,
if all the pixels on the edge side of the block are black and all the
pixels on the opposing side are white, making all the pixels in the
block white, but so far I am not motivated to do that.
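
For reference, that idea would look roughly like the following,
operating on a numpy boolean array where True means black ink; this is
untested and merely spells out the paragraph above:

  def trim_edge_dirt(ink, block=5):
      """Slide a block x block window along each image edge; where the
      window's outer row or column is all black but the opposite, inner
      side is all white, clear the whole window."""
      out = ink.copy()
      h, w = out.shape
      if h < block or w < block:
          return out
      # top and bottom edges
      for x in range(w - block + 1):
          for y0, outer, inner in ((0, 0, block - 1),
                                   (h - block, block - 1, 0)):
              win = out[y0:y0 + block, x:x + block]
              if win[outer].all() and not win[inner].any():
                  win[:] = False
      # left and right edges
      for y in range(h - block + 1):
          for x0, outer, inner in ((0, 0, block - 1),
                                   (w - block, block - 1, 0)):
              win = out[y:y + block, x0:x0 + block]
              if win[:, outer].all() and not win[:, inner].any():
                  win[:, :] = False
      return out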

Another obvious and anticipated problem is of course that this entirely
ignores characteristic differences; the difference between "ä" and "á"
is usually in a very specific position, but my distance measure takes
all differences anywhere in the two images and considers them equally
important, and the same goes for "l" and "t" and for "u" and "n". Just
like the input classification makes many mistakes in that area, so does
my code so far. Since the OCR software considers "ä" and "á" to be the
same character, and my code has no knowledge of those two being
different either, there would have to be some additional external input
to tell them apart, at least to do a better job than this currently
does.

There are many errors in the table that seem obvious, so a next step is
to try finding those. For instance, some "ö"s and some "ó"s are
identified as "d"s, but they should be more similar to other "ö"s and
"ó"s, so computing the differences between the representatives in the
table would be good. Unfortunately brute force is not a good option for
that; there are over 4000 representatives, so the n²/2 comparisons,
roughly eight million, are not very attractive. It would be helpful to
have "b is often confused with h"-style data, but I'd rather not use
external data for this at this point. The quadratic comparison would
also yield such data...

Italics are another thing of interest; if I could identify those, there
would be a lot less need for human input to correct many of the errors.
I would like to retain that information, and I suppose the eventual
idea is to go through a few hundred characters and manually correct
them, and then do a final pass over the book with better reference
data. That would then inform how to address remaining errors, like the
difference between cursive "l" and cursive "t", possibly with other
measures I've investigated in
http://lists.w3.org/Archives/Public/www-archive/2012Apr/0044.html, like
the n-gram measure. Though in this particular case, I'd think that
needs some manual guidance, as the language of interest here isn't
quite modern High German (but rather a mixture of various North Sea
Germanic languages in 19th century orthography).

(A particular concern is the treatment of "ij"; by the looks of it, the
Internet Archive's OCR software might be treating the two as a single
character, and fixing that might require extraordinary measures. Some
of the images also look like they are glyphs like "M" split in the
middle, causing a similar problem.)

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
