- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Sat, 28 Apr 2012 00:41:23 +0200
- To: www-archive@w3.org
- Message-ID: <554mp794fk3r4vp2qu82kuuikemhvp9l8r@hive.bjoern.hoehrmann.de>
Now, programming is really about automation, and automation is really about avoiding unnecessary or otherwise undesirable labour by sentient beings. I usually try to keep that in mind, and as a result I tend to prefer solutions that work "always" and not just in "99%" of the cases. I remember encountering people on Usenet back in the 1990s who did not quite think that way; people who weren't bothered by, say, some PHP or Perl CGI script that doesn't escape all free-form text when generating some HTML document or some SQL query. At the time that bothered me not so much for security reasons, but because the code doesn't "always" work right, and it didn't tend to be easy to convince people to fix their code. The same would go for declaring the encoding of your documents and any number of other things. Correspondingly, I found it odd when, a couple of years later, everybody was suddenly concerned about SQL injection and XSS vulnerabilities; I had rather assumed everybody was aware that all the code out there is crap, full of things that seem to work 99% of the time, but not always. I've felt much more at home with, say, the GNU Coding Standards, which say, in section 4.2 "Writing Robust Programs", at the very beginning:

  Avoid arbitrary limits on the length or number of any data structure,
  including file names, lines, files, and symbols, by allocating all
  data structures dynamically. In most Unix utilities, “long lines are
  silently truncated”. This is not acceptable in a GNU utility.

I am all for quick and dirty too, but "effective" is not just some aspect of "efficient"; it's a different dimension.

Going back to http://www.websitedev.de/temp/clustered-graphemes.html.gz, the problem was that the Internet Archive's OCR software identifies some characters incorrectly. I assumed that many of the errors should be easy to identify using the trivial metric I've developed, but comparing all the characters to one another is O(n²). That isn't really a problem for a single book, and I could have gathered much of the data during the initial run that produced the document above, but it's not really satisfactory. So I've been thinking about how to avoid certain comparisons, which is a form of machine procrastination, I suppose; if one day we develop something resembling a sentient robot, I suppose we would still want to find new ways for them, too, to avoid work, while also minimizing the effort I myself spend on the problem.

Now, the easy solution would be to avoid the problem and simply go through the table manually and identify all of the characters. I did that; it takes about 20 minutes, 30 if you have to copy and paste various non-ASCII characters all the time, but it does not really help if you want to fix OCR errors in a second or third book.

After some thinking it occurred to me that I don't actually have to find a way to avoid certain comparisons. It's good enough to pick the low-hanging fruit and then go back to procrastinating. And mind you, that does not contradict making a solution that works always, within the limits of the methods I have chosen. Each of the images contains a certain number of black pixels, and similar images will have similar numbers of black pixels. So you sort the images by the number of black pixels and then compare neighbouring images in the list.
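In Perl terms the idea is roughly the following sketch; the toy bitmap strings and the plain pixel-count difference measure are stand-ins for the real character images and the diamond:1 erosion metric, and it only compares immediate neighbours in the sorted list:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Toy glyph images as bitmap strings ('#' = black, '.' = white);
  # stand-ins for the real extracted character images.
  my @glyphs = (
    { name => 'e-1', bitmap => '.##.#..####..##.' },
    { name => 'c-1', bitmap => '.##.#...#....##.' },
    { name => 'm-1', bitmap => '#.#.###.#.#.#.#.' },
  );

  # The cheap key: how many black pixels an image contains.
  sub black_pixels { my ($bitmap) = @_; return $bitmap =~ tr/#//; }

  # Stand-in difference measure: fraction of differing pixels; the
  # real measure is the ratio of black pixels in the diamond:1 eroded
  # difference image.
  sub difference {
    my ($x, $y) = @_;
    my $diff = grep { substr($x, $_, 1) ne substr($y, $_, 1) }
               0 .. length($x) - 1;
    return $diff / length $x;
  }

  # Sort by black pixel count, then compare neighbouring images only.
  my @ordered = sort {
    black_pixels($a->{bitmap}) <=> black_pixels($b->{bitmap})
  } @glyphs;

  for my $i (0 .. $#ordered - 1) {
    my ($this, $next) = @ordered[$i, $i + 1];
    my $d = difference($this->{bitmap}, $next->{bitmap});
    printf "%s vs %s: %.3f\n", $this->{name}, $next->{name}, $d
      if $d < 0.25;    # accept "good enough" matches only
  }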
If you did this depth-first you would still need some cut-off metric to avoid comparing against all the other images when there is no good-enough match; but it's clear that the bigger the difference in black pixel count, the less likely a good match becomes, so you do it breadth-first. So,

  foreach my $offset (-1, 1, -2, 2, -3, ...) {
    foreach my $image (@images_ordered_by_black_pixel_count) {
      ...
    }
  }

Sadly you cannot write it quite like that, as the index is lost with foreach loops (you only get the item), but you get the idea. The low-hanging fruits are then whatever you choose them to be; the script accepts good-enough solutions, so it's somewhat less bad than O(n²/2) on average, and the longer the script runs, the harder it becomes to find new matches, so you might decide to terminate it at some point, or you might let it run to completion.

This allows for making up internal statistics, say, "It's been an hour since the script started, and in that time we've found 500 matches, but in the last 10 minutes we've only found 1 match, so we should probably give up about now." That is as opposed to using some external statistic: with a depth-first search you might have a condition like "skip images that are twice as big as the one we are trying to match", if you've found that to be a good cut-off mark. That is external because you would have to analyze entire statistical populations to define what's "good". Strictly speaking, the time/rate measure in the example above assumes that you would not find another 1000 matches in the next hour, but the argument for why that is unlikely is much simpler than arguing from image dimensions: if you've got a 50x50 image and add a pixel at 100,100, you make the bounding box much bigger, but the number of black pixels barely changes.

As for me personally, I already like this approach better because I can monitor the progress; this is especially important because this is ad-hoc work. I don't have to wait a long time for results I can review, only to find I made some mistake and have to do it all over again; I can see that pretty much immediately.

I've attached what the script has generated while I was writing this mail; the numbers are the ratio of black pixels in the diamond:1 eroded difference image that covers both images. Note that the matches are only among characters that the OCR software thought were different characters. There are about 4900 images in the table, and so far the script has identified around 600 misidentifications, or about 12% (note that some are wrong; 'e' and 'c' are easily confused, for instance).

When the script is done I'll look into how to re-cluster the images into something that can be studied for any problems that are left, beyond the known ones like the confusion between 't' and 'l' in the "font" the book is using.

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Attachments
- text/html attachment: breadth-first-progress.html
Received on Friday, 27 April 2012 22:41:59 UTC