Procrastination always pays off Now.

Now,

  Programming is really about automation, and automation is really
about sparing sentient beings unnecessary or otherwise undesirable
labour. I try to keep that in mind, and as a result I tend to prefer
solutions that work "always" and not just in "99%" of cases. Back in
the 1990s on Usenet I kept encountering people who did not quite think
that way; people who weren't bothered by, say, some PHP or Perl CGI
script that doesn't escape all free-form text when generating an HTML
document or an SQL query. At the time that bothered me not so much for
security reasons, but because such code doesn't "always" work right,
and it rarely proved easy to convince people to fix their code. The
same went for declaring the encoding of your documents, and any number
of other things.

Correspondingly, I found it odd when, a couple of years later,
everybody was suddenly concerned about SQL injections and XSS
vulnerabilities; I had kind of assumed everybody was aware that all
the code out there is crap, full of things that seem to work 99% of
the time, but not always. I've felt much more at home with, say, the
GNU Coding Standards, which say at the very beginning of section 4.2,
"Writing Robust Programs":

  Avoid arbitrary limits on the length or number of any data structure,
  including file names, lines, files, and symbols, by allocating all
  data structures dynamically. In most Unix utilities, “long lines are
  silently truncated”. This is not acceptable in a GNU utility.

I am also down and dirty with quick and dirty, but "effective" is not
just some aspect of "efficient"; it's a different dimension.

Going back to http://www.websitedev.de/temp/clustered-graphemes.html.gz,
the problem was that the Internet Archive's OCR software identifies
some characters incorrectly. I assumed that many of the errors should
be easy to identify using the trivial metric I've developed, but
comparing all the characters to one another is O(n²). That isn't
really a problem for a single book, and I could have gathered much of
the data during the initial run that produced the document above, but
it's not really satisfactory.

So I've been thinking about how to avoid certain comparisons, which is
a form of machine procrastination, I suppose; if one day we develop
something resembling a sentient robot, we would presumably still want
to find new ways for it, too, to avoid work, while also minimizing the
effort I myself spend on the problem. Now, the easy way out would be
to avoid the problem entirely and simply go through the table manually
and identify all of the characters. I did that; it takes about 20
minutes, or 30 if you have to copy and paste various non-ASCII
characters all the time, but it does not really help if you want to
fix OCR errors in a second or third book.

After some thinking it occurred to me that I don't actually have to
find a way to avoid certain comparisons. It's good enough to pick the
low-hanging fruit and then go back to procrastinating. And mind you,
that does not contradict making a solution that always works, within
the limits of the methods I have chosen.

Each of the images contains a certain number of black pixels, and
similar images will have similar numbers of black pixels. So you sort
the images by their black pixel count and then compare neighbouring
images in the list. Done depth-first, you would still need some
cut-off metric to avoid comparing against all the other images when
there is no good-enough match; but since it's clear that the bigger
the difference in black pixel count, the less likely a good match
becomes, you do it breadth-first instead. So,

  foreach my $offset (-1, 1, -2, 2, -3 ...) {
    foreach my $image (@images_ordered_by_black_pixel_count) {
      ...
    }
  }

Sadly you cannot write it quite like that, since foreach loops lose
the index and only give you the item, but you get the idea; a more
concrete sketch follows below. The low-hanging fruit is then whatever
you choose it to be: the script accepts good-enough matches, so on
average it gets away with somewhat fewer than the roughly n²/2
comparisons of the exhaustive approach, and the longer it runs, the
harder it becomes to find new matches. So you might decide to
terminate it at some point, or you might let it run to completion.
This allows for making up internal statistics, say "It's been an hour
since starting the script, and in that time we've found 500 matches,
but in the last 10 minutes we've only found 1 match, so we should
probably give up about now."
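
For what it's worth, a version of the loop above that keeps track of
the index might look roughly like the sketch below; @images, the
->{black} field and is_good_match() are names made up for this mail,
not from the actual script, and since the comparison is symmetric the
sketch only sweeps positive offsets, which covers the same pairs as
the -1, 1, -2, 2, ... ordering would:

  my @sorted = sort { $a->{black} <=> $b->{black} } @images;

  DISTANCE: for my $distance (1 .. $#sorted) {
    my $found = 0;
    for my $i (0 .. $#sorted - $distance) {
      # is_good_match() stands in for the actual similarity metric
      # and records the pair if the two images are close enough.
      $found++ if is_good_match($sorted[$i], $sorted[$i + $distance]);
    }
    # One possible internal cut-off: stop once an entire distance
    # level produces no new matches at all.
    last DISTANCE unless $found;
  }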

That's as opposed to using some external statistic: with a depth-first
search you might have a condition like "skip images that are twice as
big as the one we are trying to match", if you've found that to be a
good cut-off mark. That's external because you would have to analyze
entire statistical populations to define what counts as "good". Well,
strictly speaking, the time/rate measure in the example above assumes
that you would not find another 1000 matches in the next hour, but the
argument for why that isn't very likely is much simpler than reasoning
from image dimensions: if you take a 50x50 image and add a pixel at
100,100, you make the bounding box much bigger, but the number of
black pixels barely changes.
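
In code, that time/rate measure could be as simple as the following
sketch; the ten-minute window and the threshold are made-up numbers,
this is not literally what my script does, and note_match() would be
called whenever the main loop records a pair:

  my $window  = 10 * 60;   # look at the last ten minutes
  my $started = time();
  my @matches;             # timestamps of the matches found so far

  sub note_match { push @matches, time() }

  sub should_give_up {
    my $now    = time();
    my $recent = grep { $_ > $now - $window } @matches;
    # Give up once we have been running longer than the window and
    # the match rate within the window has dropped to almost nothing.
    return ($now - $started > $window) && ($recent <= 1);
  }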

As for me personally, I already like this approach better because I
can monitor the progress; that is especially important because this is
ad-hoc work. I don't have to wait a long time for results I can
review, only to find I made some mistake and have to do it all over
again; I can see that pretty much immediately. I've attached what the
script has generated while writing this mail; the numbers are the
ratio of black pixels in the diamond:1 eroded difference image that
covers both images. Note that the matches are only among characters
that the OCR software thought were different characters. There are
about 4900 images in the table, and so far the script has identified
around 600 misidentifications, or about 12% (note that some of these
are wrong; 'e' and 'c' are easily confused, for instance).
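
Roughly, the idea behind those numbers is something like the
following simplified sketch, not the actual code; both glyphs are
assumed to be padded and aligned into equal-sized 0/1 bitmaps, as
arrays of array references, and the denominator in particular may
well be chosen differently:

  # Difference image: a pixel is set where the two bitmaps disagree.
  sub difference {
    my ($p, $q) = @_;
    return [ map { my $y = $_;
      [ map { $p->[$y][$_] != $q->[$y][$_] ? 1 : 0 }
        0 .. $#{ $p->[$y] } ]
    } 0 .. $#$p ];
  }

  # Diamond:1 erosion: a pixel survives only if it and its four
  # direct neighbours are all set; everything else, including the
  # border, becomes white.
  sub erode {
    my ($img) = @_;
    my ($h, $w) = (scalar @$img, scalar @{ $img->[0] });
    my $out = [ map { [ (0) x $w ] } 1 .. $h ];
    for my $y (1 .. $h - 2) {
      for my $x (1 .. $w - 2) {
        $out->[$y][$x] = ($img->[$y][$x] && $img->[$y-1][$x] &&
          $img->[$y+1][$x] && $img->[$y][$x-1] && $img->[$y][$x+1])
          ? 1 : 0;
      }
    }
    return $out;
  }

  # Score: how much of the difference survives the erosion, relative
  # to the area of the image covering both glyphs.
  sub score {
    my ($p, $q) = @_;
    my $eroded = erode(difference($p, $q));
    my $black  = 0;
    $black += $_ for map { @$_ } @$eroded;
    return $black / (@$eroded * @{ $eroded->[0] });
  }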

When the script is done I'll look into how to re-cluster the images into
something that can be studied for any problems that are left, beyond the
known ones like the confusion between 't' and 'l' in the "font" the book
is using.

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
