From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Mon, 14 May 2012 04:41:24 +0200
To: www-archive@w3.org
So, how hard can it be? I tend to be driven by that question (http://tinyurl.com/6rrjskn), which may not be very surprising as I seem to have grown up with it; http://cutycapt.sf.net/, for instance, exists mainly because people kept asking me for a http://iecapt.sf.net for Linux. So I sat down one weekend, looked through the Qt documentation and confirmed that it seems to expose all the needed functionality, configured and built Qt (encountering various bugs in Qt and reporting them), coded up something that seemed to work, did a little cleanup, registered a SourceForge project for it, made a brief web page, installed a recent Ubuntu version in a virtual machine, tried to build it there, fixed the bugs that showed up, uploaded it all, and then I could answer the question: despite my having no experience with Qt, it's not really hard.

That isn't without problems: even though I have found CutyCapt useful in a number of situations since, my interest really ended there, and I've not done much to maintain CutyCapt past that proof-of-concept stage, which for some time probably discouraged others from rolling their own and maintaining it with more interest than I do. Well, I am not actually so worried about that; what I find more unfortunate is that I did not really think or care about how interested people were going to be in CutyCapt. If I had, I would have done a number of things differently: in pseudo-commercial terms, I would have cared more about branding, about setting up some community, recruiting some development help by way of such a community, paying more attention to bug reports and the various things people sent me...

You know, back in school people were turning 18 years old or whatever, and friends would want to do something special, like presenting a custom drink mixed according to some "family recipe", and I'd say this needs a proper custom brand name, you need some logo design on the bottle, a good slogan, some custom story on the back, make it an event; and then we sat down and did just that. So instead of "Do you remember when we made that custom drink and gave it to someone or other?" "Oh yeah, that was nice, and someone or other liked it.", you actually got something resembling a story about the whole thing. Right, the idea being that you actually made a finished product, not just a proof of concept, not something purely ad hoc, not just the pure technicality of "we mixed you a drink", but a bottle with a back label explaining:

  Nach alter nordfriesischer Tradition wird zwischen Dagebüll und Fahretoft das DAGEBÜLLER WASSER täglich frisch der Nordsee entnommen...

  (By old North Frisian tradition, DAGEBÜLLER WASSER is drawn fresh from the North Sea between Dagebüll and Fahretoft every day...)

Anyway, optical character recognition has been a recent "how hard can it be" thing for me, and I wanted to record some notes on that. The art, in its digital form, is something like 80 years old, so one would think it is a solved problem by now, but given that neither Google nor the Internet Archive can provide plain text versions of the public domain books they distribute that are in any way readable, at least not of the ones I care about, someone has to answer the question.

Actually, my underlying motivation is somewhat different. Let's say you want to preserve a dying language; what would you do?
Well, death would be having no speakers anymore (and no writers, really, otherwise you'd have to consider modern High German, sometimes misleadingly called Standard German, to have been dead through much of its life, since for much of its life it has been a written language, not so much a spoken one), so you would want more speakers, more writers, more readers, and to get that you would need education, and education depends on having access to relevant educational material; on a very basic level that means things like a dictionary.

Right, if you ask your favourite search engine about a Latin word you do not know, it will probably come up with texts that use it, link to dictionaries, maybe even offer to translate it, and so on. It might find tools that analyse words morphologically, perhaps even spell checkers, a whole ecosystem of useful resources. That's how it should work, right, but not all languages are so lucky. In my particular case your favourite search engine would usually turn up no results unless it's an extremely basic word, or some other language happens to use the "same" word. There is no dictionary online for my semi-native language; there are exactly two "wordlists" with poor coverage and unclear "licensing terms", and the bigger of the two maps between my language and a related one with similar problems. That language in turn does have online dictionaries, but they are somewhere between very insufficient and very unfree, which makes them not very useful. The most useful one maps to a language with around 30 million speakers, so if you wanted to map the wordlist into, say, English, you would have to go over three hops, which would likely produce horrible results, and you could not publish the result either, at least not without a serious talk with a qualified lawyer.

Now there are "dictionaries" in wood media form, and I use the plural lightly here, but they are not online with proper licensing terms, so either nobody knows how to do that, nobody had the idea, or there are knowledge hoarding laws that stand in the way. I don't care which it is, but there is one exception in the form of a public domain book. It does not use modern orthography, but how hard can it be to map between its style and more modern forms? I mean to answer that, so...

Optical character recognition, then. As noted, the Internet Archive makes its scanning data available in forms that allow for post-processing, and as there are many errors in the plain text versions that should not really happen, given that the graphemes in the relevant book look very much alike, I figured it might be possible to use some supervised technique to fix most of the errors. The general idea is that you compare the graphemes to each other and use some sort of clustering algorithm to group them, so supervisors can correct many errors with little effort.

For that I came up with a simple similarity measure. Basically you make two scan images the same size, then run something like ImageMagick's `convert`, à la

  % convert a.png b.png -compose difference -composite -morphology erode diamond:1 out.png

and then determine the ratio of black pixels in the output image. This computes the absolute difference between all the pixels (`xor`, as the images are monochrome) and then "erodes" the result. With the "diamond:1" kernel, that means a pixel is made black if any of the pixels directly above, below, left, or right of it is black.
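For what it's worth, a rough sketch of that measure in Python might look like the following; this is not my actual tool chain, the file names are placeholders, and Pillow's MinFilter(3) erodes over the full 3x3 square, so it is slightly stricter than the diamond:1 kernel, but it should convey the idea:

  from PIL import Image, ImageChops, ImageFilter

  def pad_to_common_size(a, b):
      # Paste both glyphs onto black canvases of the common bounding size.
      w, h = max(a.width, b.width), max(a.height, b.height)
      ca = Image.new("1", (w, h), 0); ca.paste(a, (0, 0))
      cb = Image.new("1", (w, h), 0); cb.paste(b, (0, 0))
      return ca, cb

  def difference_ratio(path_a, path_b):
      a, b = pad_to_common_size(Image.open(path_a).convert("1"),
                                Image.open(path_b).convert("1"))
      diff = ImageChops.difference(a, b)  # `xor` for monochrome images
      # Erode: a white pixel survives only if its whole neighbourhood is
      # white as well (the 3x3 square here, not just the diamond).
      eroded = diff.convert("L").filter(ImageFilter.MinFilter(3))
      data = list(eroded.getdata())
      return sum(1 for p in data if p) / len(data)

  # Images with around 0.5% or fewer differing pixels left after erosion
  # go into the same cluster directly (see below).
  if difference_ratio("a.png", "b.png") <= 0.005:
      print("same cluster")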
To illustrate the erosion: if the difference between two images is all black except for a single white line, which might happen if you compare two "I"s where one is offset by one pixel, or "thicker" by one pixel, the entire line disappears, because every pixel on it has a black neighbour. On the other hand, if you, say, compare a dotless "i" with an "i", the dot would show up in the difference as something like a circle, let's say a 3x3 white square, and there the erosion retains the white pixel in the middle, as all of its neighbours are also white.

I have found that if the amount of white pixels in the difference is around 0.5%, you can be fairly certain that the two images are really "the same", so images like that go into the same cluster directly. That's pretty strict though, and you would end up with too many clusters. A particular problem is that sometimes there is some overlap: consider, in italics, a string like "Lj"; the "L" might extend a bit into the "j", and the "j" a bit into the "L", if you cut them into rectangles. So after this strict clustering, you may want to merge clusters that "on average" are quite similar. The measure I've used: randomly pick 10 images from each of two clusters, giving 10 pairs to compare, and if the pairs are on average 98% or more "similar" (meaning 2% or less white pixels in the difference image as discussed above), you can safely merge the clusters.

I did this first with the bigger clusters, to keep the number of comparisons down and the number of members high (so it's actually hard for very "rare" characters to become part of bigger clusters), and then for the very small clusters. I exploited the fact that the OCR software the Internet Archive is using already offers some pre-clustering, in the form of attempting to actually assign Unicode scalar values to the images, so I did this "among all 'g's" first, so to speak, and then for the "this image is too different from all other 'g's" images, i.e., clusters with only one member after doing the 99.5% merging step.

The book has about 900 000 images of (supposed) characters. This simple process (roughly: merge all clusters that are 99.5% similar on average, then merge all clusters that are 98% similar on average) left me with about 700 clusters with 50 or more members (the book has 532 pages, so a smaller cluster would have images that occur at most once every 10 pages on average) and something like 30 000 clusters with fewer members.

Then I made a simple tool that lets me re-classify or confirm the OCR software's guess as to which character a cluster of images is, with the initial suggestion being the "dominant" character in a cluster (take a pair like "t" and "l", which are easily confused; if a cluster is, by the OCR software's reckoning, mostly "t"s, my tool suggests the cluster is all "t"s), and fixed some things. For instance, the Internet Archive's software thinks there are no "a"s with acute, grave, or macron, so these all come out as "a"s, "ä"s, or something else; that's easily fixed in this way. There is also the option in my tool to say that some image cluster is too broken to assign a character, or that I am unsure my guess is correct. I did this for around 2000 clusters, which is quite easy and quickly done, and then made another tool that shows actual words from the book so I can check my re-assignments in context, particularly the assignments I had been unsure about.
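Again only as a sketch, reusing the hypothetical difference_ratio() from above: the merging check and the "dominant character" suggestion could look roughly like this, where clusters are assumed to be plain lists of image paths and the OCR guesses a parallel list of strings; the sample size of 10 and the 98% threshold are the ones from above, the rest is made up for illustration.

  import random
  from collections import Counter

  def average_similarity(cluster_a, cluster_b, samples=10):
      # Pick up to `samples` images from each cluster, pair them up, and
      # average 1 minus the white-pixel ratio over the pairs.
      pa = random.sample(cluster_a, min(samples, len(cluster_a)))
      pb = random.sample(cluster_b, min(samples, len(cluster_b)))
      ratios = [difference_ratio(a, b) for a, b in zip(pa, pb)]
      return 1.0 - sum(ratios) / len(ratios)

  def maybe_merge(cluster_a, cluster_b, threshold=0.98):
      # Merge when the sampled members are on average at least 98% similar.
      if average_similarity(cluster_a, cluster_b) >= threshold:
          cluster_a.extend(cluster_b)
          cluster_b.clear()
          return True
      return False

  def suggested_label(ocr_guesses):
      # Suggest whatever character the OCR software assigned most often to
      # the members of a cluster, e.g. "t" for a mostly-"t" cluster.
      return Counter(ocr_guesses).most_common(1)[0][0]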
The main thing to take away from that is that the Internet Archive's software is not that good at splitting a word image into character images; say, it would split a cursive "N" into an "i" and a "V". If you google for "deiVoniMT" right now, the only hit is the plain text version of the book, though the word is actually "de Nánner". There it also confuses "á" and "o", partly because it ignores the "´", which is not actually within the boundaries of the grapheme image, and it splits the "ne" into something looking like a dotless "i" and a dotless "i" followed by "e", except that it splits the "e" in the middle, and you end up with rubbish like "deiVoniMT" (vs "de Nánner").

Anyway, this fixed up around 4% of the book, with only around 3% of the images still to go, and the manual work to verify my re-assignments turned out to be redundant; that's something like 20 minutes of human labour, if you take the tools as given and in proper working order. In general the manual step would not be very necessary if I had a good wordlist for the languages involved: if you have, say, "Gänseridi", but are unsure about the "di" (which is split into something looking like "cl" followed by a dotless "i") because some of the clusters involved are very rare, you could grep the wordlist for "Gänseri", would probably find that the only hit that fits is "Gänserich" (a gander), and go with that; but a wordlist is just what I don't have.

The most promising next step seems to be looking at the wrong splits; in this particular example I could re-combine the "cl" and the dotless "i" and compare the result to "ch" instances, for instance. It might generally be a good idea to look at bi-grams rather than unigrams, which would also help with overlaps like the "Lj" example earlier...

But in any case, the principal idea here, that you can "cluster" many of the errors so they can be corrected cheaply by hand, seems quite sound so far. From the proof-of-concept perspective this is a big win, even though from the product-development perspective it is somewhat depressing, as I will never release much of my tool chain or the data, only my general approach and possibly the end results, should I actually manage to get a properly OCR'd version of the book out of this.

If I do, I would then look into how to turn the book into something that might be considered a dictionary. From there it would be finding a way to match it to modern orthography, and with or without that, I'll likely look into turning this into some form of e-book, for the book as a whole, and into some Android application, for any dictionary part that might come out of this, so all the dozens of people interested in my semi-native language have something to download. More relevant to my own interests would be folding this into some etymological graph of words across all the Ingvaeonic languages...

-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/