- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Thu, 26 Apr 2012 22:23:59 +0200
- To: www-archive@w3.org
- Message-ID: <ou6jp7944a40abn86fmj8hp8vambnarivo@hive.bjoern.hoehrmann.de>
So, today we start again with https://gist.github.com/2395307, stripped of the parts that try to merge plain text. A simple measure to compute the distance between two glyph images that works well with the data is to change the extent of both images to the maximum width and height among the two, compose their difference, and then apply a morphology filter to get rid of mostly irrelevant artifacts; in my case I've used the Erode filter that ImageMagick supports with a Diamond:1 kernel, essentially a plus sign. This time I made no attempt to align the images, they are simply placed in the top left corner. I've attached a simple example that shows, for a couple of characters, the eroded difference image alongside the percentage of black pixels in it. Percentages above 97.5 are highlighted in green, if you excuse me using a <span> with a background color instead of a <strong> or <em> or some other more appropriate element.

This turns out to be a fairly good measure, so I ran it over the entire book. For each character the Internet Archive's OCR software thought it saw I made a group (so one group for "ä" and one for "Q" and so on), and then for each such character I would check whether it's similar to an image that's already in the group; if not, I would add the new image to the group, otherwise I would simply count the character as another instance of one seen earlier. Similar here means that at most 2% of the pixels in the difference of the two images, extended to their smallest common dimensions as above, are not black. Every 5,000 characters I would throw out of all groups the representatives that had been seen only once so far; that is a minor optimization to make the script complete within a few hours; I could have used other measures, but this seemed to work well enough.

The result is, for each character, a set of images that all supposedly represent the character but are slightly different from each other; a simple example would be a Latin small letter "k" at font size 10, another one at font size 20, and maybe another one at font size 10 but cursive instead of regular. In practice you have many more due to smears and OCR errors, but the idea is the same. Now, a regular "k" with some smears is more similar to other regular "k" representations than to a cursive "k", and cursive "h"s that have been misidentified as "k"s would again be a bit different.

So I compared all the "k"s to each other and put them into a graph. Each "representative" is a node in the graph, the node weight is the number of times an instance of it has been seen in the book, and between any two "k"s there is an edge if at least 96% of the pixels in their eroded difference image are black; the edge weight is that percentage. I've used 96% because this measure has to be slightly more tolerant than the earlier one, otherwise there would be no edges (since all images at an error rate of 2% or less would be instances of the representatives). I then used my http://search.cpan.org/dist/Graph-NewmanGirvan/ module to cluster the nodes in the graph and put the result into a table. That's:

http://www.websitedev.de/temp/clustered-graphemes.html.gz

There you have the character the OCR software thought the images represent, the number of instances belonging to the cluster, and then each of the representative images in the cluster along with its individual instance count.
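In case that is easier to follow in code, here is a rough sketch of the measure in Python; my actual script drives ImageMagick, so the Pillow/scipy calls and the helper names below are only an approximation of the above:

    import numpy as np
    from PIL import Image, ImageChops
    from scipy import ndimage

    # Diamond:1 kernel, essentially a plus sign
    PLUS = np.array([[0, 1, 0],
                     [1, 1, 1],
                     [0, 1, 0]], dtype=bool)

    def pad_to(img, size):
        """Place a grayscale glyph image in the top left corner of a white canvas."""
        canvas = Image.new("L", size, 255)
        canvas.paste(img, (0, 0))
        return canvas

    def similarity(img_a, img_b):
        """Percentage of black pixels in the eroded difference of two glyph images."""
        size = (max(img_a.size[0], img_b.size[0]),
                max(img_a.size[1], img_b.size[1]))
        diff = ImageChops.difference(pad_to(img_a.convert("L"), size),
                                     pad_to(img_b.convert("L"), size))
        # grayscale erosion with the plus-shaped kernel removes thin artifacts
        eroded = ndimage.grey_erosion(np.asarray(diff), footprint=PLUS)
        return 100.0 * np.count_nonzero(eroded == 0) / eroded.size

Two identical glyph images come out at 100; the more they differ, the lower the percentage.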
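The grouping pass over the book is then roughly the following, assuming a stream of (OCR label, glyph image) pairs in book order; the data structures are illustrative, the thresholds are the ones described above:

    from collections import defaultdict

    SAME_GLYPH = 98.0    # at most 2% of the pixels may differ
    PRUNE_EVERY = 5000   # drop one-off representatives periodically

    def group_representatives(glyphs):
        """glyphs: iterable of (ocr_label, glyph_image) pairs in book order."""
        groups = defaultdict(list)  # ocr_label -> list of [image, instance count]
        for seen, (label, img) in enumerate(glyphs, 1):
            for rep in groups[label]:
                if similarity(rep[0], img) >= SAME_GLYPH:
                    rep[1] += 1
                    break
            else:
                groups[label].append([img, 1])
            if seen % PRUNE_EVERY == 0:
                for key in groups:
                    groups[key] = [r for r in groups[key] if r[1] > 1]
        return groups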
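And the per-character graph; in this sketch the Girvan-Newman implementation from networkx stands in for my Graph::NewmanGirvan module, so the clusters will not be exactly the ones in the table:

    import networkx as nx
    from networkx.algorithms.community import girvan_newman

    EDGE_THRESHOLD = 96.0  # slightly more tolerant than the grouping threshold

    def cluster_representatives(reps):
        """reps: list of (image, instance count) pairs for one OCR character."""
        g = nx.Graph()
        for i, (_, count) in enumerate(reps):
            g.add_node(i, weight=count)
        for i in range(len(reps)):
            for j in range(i + 1, len(reps)):
                sim = similarity(reps[i][0], reps[j][0])
                if sim >= EDGE_THRESHOLD:
                    g.add_edge(i, j, weight=sim)
        # first split of the Girvan-Newman hierarchy; iterating further
        # yields successively finer partitions
        return next(girvan_newman(g))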
If you take a look at the "ö" part of the table, you can see there are clusters for "regular ö", "cursive ö", "o with macron" (misidentified as "ö"), "o with accent" (also misidentified), very small "ö"s that appear in the Google boilerplate in front of the actual book, gross errors like "fö" and "a" and "o" misidentified as "ö", and so on.

For the clustering I've ignored representatives that have fewer than 3 instances, just like I filtered out very rare representatives earlier. Also note that this is based on the tight bounding boxes of the grapheme images; I so far ignore their position relative to the baseline, so you get some odd results for commas versus apostrophes and issues like that.

A notable problem is that many of the images are not cleanly separated; cursive "ij" for instance cannot be split into two rectangles as the two glyphs overlap, so there is a bit of "dirt" on the edges in many of the images due to issues like that. I suspect this would work a lot better if I had a good method to remove that. A simple solution would be to move a 5x5 block over the image along the edges and, if all the pixels on the edge side of the block are black and all the pixels on the opposing side are white, make all the pixels in the block white, but so far I am not motivated to do that (a rough, untested sketch of the idea follows further below).

Another obvious and anticipated problem is of course that this entirely ignores characteristic differences: the difference between "ä" and "á" is usually in a very specific position, but my distance measure uses all differences anywhere in the two images and considers them equally important, same for "l" and "t" and "u" and "n", and just like the input classification makes many mistakes in that area, so does my code so far. Since the OCR software considers "ä" and "á" to be the same character, and my code has no knowledge of those two being different either, there would have to be some additional external input to tell these apart, at least to do a better job than this currently does.

There are many errors in the table that seem obvious, so a next step is to try finding those. For instance, some "ö"s and some "ó"s are identified as "d"s, but they should be more similar to other "ö"s and "ó"s, so computing the differences between the representatives in the table would be good. Unfortunately brute force is not a good option for that: there are over 4000 representatives, so O(n²/2) comparisons are not very attractive. It would be helpful to have "b is often confused with h"-style data, but I'd rather not use external data for this at this point. The quadratic comparison would also yield such data...

Italics are another thing of interest; if I could identify those, there would be a lot less need for human input to correct many of the errors. I would like to retain that information, and I suppose the eventual idea is to go through a few hundred characters and manually correct them, and then do a final pass over the book with better reference data. That would then inform how to address remaining errors, like the difference between cursive "l" and cursive "t", possibly with other measures I've investigated in http://lists.w3.org/Archives/Public/www-archive/2012Apr/0044.html, like the n-gram measure. Though in this particular case, I'd think that needs some manual guidance, as the language of interest here isn't quite modern High German (but rather a mixture of various North Sea Germanic languages in 19th-century orthography).
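For what it's worth, the border cleanup idea from above might look roughly like this; untested, and the block size and the pure black/white assumption are just illustrative:

    import numpy as np

    def clean_borders(arr, size=5, black=0, white=255):
        """arr: 2D uint8 glyph image, black ink on a white background."""
        h, w = arr.shape
        out = arr.copy()

        def sweep(get_block, edge, opposite, count):
            for i in range(count):
                block = get_block(i)   # a view into `out`
                if (edge(block) == black).all() and (opposite(block) == white).all():
                    block[:] = white   # wipe the whole block

        # top border: edge is the first row of the block, opposite the last row
        sweep(lambda x: out[0:size, x:x + size],
              lambda b: b[0, :], lambda b: b[-1, :], w - size + 1)
        # bottom border
        sweep(lambda x: out[h - size:h, x:x + size],
              lambda b: b[-1, :], lambda b: b[0, :], w - size + 1)
        # left border
        sweep(lambda y: out[y:y + size, 0:size],
              lambda b: b[:, 0], lambda b: b[:, -1], h - size + 1)
        # right border
        sweep(lambda y: out[y:y + size, w - size:w],
              lambda b: b[:, -1], lambda b: b[:, 0], h - size + 1)
        return out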
(A particular concern is the treatment of "ij"; by the looks of it, the Internet Archive's OCR software might be treating the two as a single character, and fixing that might require extraordinary measures. Some of the images also look like glyphs such as "M" split in the middle, causing a similar problem.)

regards,
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Attachments
- text/html attachment: aes-eroded-differences.html
Received on Thursday, 26 April 2012 20:24:36 UTC