
Inline tables for better diffs

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Tue, 10 Apr 2012 01:37:39 +0200
To: www-archive@w3.org
Message-ID: <hrp6o7djjfsbhm0o74lkkpvgof4f6cikta@hive.bjoern.hoehrmann.de>

  The Internet Archive hosts various public domain books that have been
digitized by Google. That's not terribly useful really: they tend to be
difficult to read, and the OCR employed does not give good results, at
least for the books that interest me. But it turns out they use
different OCR engines, and these perform poorly in different ways! So I
wondered how they compared and what could be done to recover from all
the errors, other than using yet more existing OCR software, which isn't
likely to produce better results.

As a first approximation, directly comparing some examples seemed a good
idea, so I took an interesting excerpt and ran it through diff tools.
Diff tools of course tend to be about as advanced as what you can
witness, say, on Wikipedia, where you get the old text on one side and
the new text on the other side, with "differences" highlighted in some
form, and then it is up to you to actually compare what's being
highlighted, which I've found to be rather involved.

For me (cf. http://lists.w3.org/Archives/Public/www-archive/2012Mar/0033.html)
it would be better if I actually had the differences directly adjacent
while the unchanged text runs as usual. The Ruby approach that works
well for transliterating Russian text does not quite fit, though: here
we really have forks that represent essentially alternatives, so it
needs a bit of a different visual style. So I wrote a script using
Perl's Algorithm::Diff 'sdiff' subroutine that generates inline tables
for the alternatives.
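
A minimal sketch of that merging step, written in Python with the
standard difflib module instead of Perl's Algorithm::Diff, so it is an
illustration of the idea and not the script itself. The function names,
the bracket notation (in place of real inline tables) and the two sample
token lists are made up for this example.

```python
from difflib import SequenceMatcher

def inline_alternatives(words_a, words_b):
    """Merge two token lists into one stream.

    Tokens both readings agree on stay plain strings; every run where
    they disagree becomes an (a, b) tuple, i.e. a "fork".
    """
    merged = []
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(words_a[a0:a1])
        else:
            # replace, delete or insert: keep both sides next to each other
            merged.append((" ".join(words_a[a0:a1]), " ".join(words_b[b0:b1])))
    return merged

def render(merged):
    """Render forks as [a | b]; a real version could emit inline tables."""
    parts = []
    for item in merged:
        if isinstance(item, str):
            parts.append(item)
        else:
            parts.append("[" + item[0] + " | " + item[1] + "]")
    return " ".join(parts)

# Hypothetical OCR readings of the same line from two engines.
ocr_a = "die Friesen lind ein Volk".split()
ocr_b = "die Friesen find ein Volk".split()
print(render(inline_alternatives(ocr_a, ocr_b)))
```

The unchanged text runs through as usual and only the disagreeing
tokens are shown side by side, which is the layout described above.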

I've attached a sample rendering. (Quick hint as an aside, if you want
to make sure you actually attach attachments, write you have attached
them, then actually attach them, and then write an additional comment on
what it shows or something along that line.)

Most of these errors could be corrected simply by looking at which words
occur more frequently in texts of this language from the period. I
expect the same would hold for the n-grams that make up the words:
you'll find "ö" much more frequently than "ô", for instance, and
"usuiuer" is not something you would expect in a German word of the era
either.
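
As a rough sketch of the frequency idea, assuming word counts from some
reference corpus of the period are available: the counts below are
invented for illustration, and pick_by_frequency is a hypothetical
helper, not part of any existing tool.

```python
from collections import Counter

def pick_by_frequency(alternatives, counts):
    """Return the alternative seen most often in the reference counts.

    Returns None when no alternative occurs in the counts at all, since
    frequency then gives no basis for a decision.
    """
    best = max(alternatives, key=lambda word: counts[word])
    return best if counts[best] > 0 else None

# Invented counts standing in for a real period corpus.
reference = Counter({"friesisch": 12, "sind": 40, "und": 95})

print(pick_by_frequency(["friefisch", "friesisch"], reference))
print(pick_by_frequency(["lind", "find"], reference))
```

With these toy counts the first call picks "friesisch", while the second
returns None because neither candidate is in the reference data. The
same lookup could be run over character n-grams instead of whole words.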

There are other problems that should be easy to detect, like the many
misrecognized variants of "friesisch". Having "Fortoft" and "Fartnft" so
close to each other should also be enough of a hint that you ought to
spend more time analysing. And morphological analysis should easily fix
the "lind"/"find" issue in the first sentence: it is really a "sind",
and at the end of a sentence neither of the results is likely, and
certainly not grammatically correct, in German text.
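
One way to flag such near-duplicates, sketched under assumptions: the
similarity measure (difflib's ratio), the window size and the threshold
are arbitrary choices for illustration, and the token list is a made-up
context for the two words.

```python
from difflib import SequenceMatcher

def suspicious_pairs(tokens, window=10, threshold=0.7):
    """List pairs of different tokens that are close together in the
    text and also spelled very similarly."""
    pairs = []
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left == right:
                continue
            if SequenceMatcher(None, left, right).ratio() >= threshold:
                pairs.append((left, right))
    return pairs

# Hypothetical surrounding text for the two OCR results.
tokens = "die Fortoft und die Fartnft".split()
print(suspicious_pairs(tokens))
```

Pairs reported this way are only candidates for a closer look; the
check says nothing about which spelling, if either, is right.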

The Internet Archive data includes reports from the OCR software on
where it recognized each character, along with the image data, and the
DjVu data has the same for words. There are some patterns to the errors,
like the frequent misinterpretation of cursive (italicized) "ll" in
names, which occurs a number of times in the relevant book (as part of
"büll", a toponymic suffix indicating settlements or dwellings). I would
like to check whether you can exploit the size of the corpus (a whole
book) to avoid this kind of error by clustering graphemes. The idea is
something along these lines: if you have a lot of cursive "l"s and many
non-cursive "l"s, and one cluster is collectively quite different from
the other, you might want to treat either both clusters as "l" or one of
the clusters as never being "l"s, avoiding issues like "büil" if the
data allows for that.
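
A toy sketch of the clustering idea, with loud assumptions: plain
k-means with two clusters is just one possible technique, the
two-number feature vectors per glyph (say, slant and relative height)
are invented, and nothing here reads the actual Internet Archive or
DjVu data. It needs Python 3.8 or later for math.dist.

```python
import math

def two_means(points, rounds=10):
    """Split feature vectors into two clusters with plain k-means."""
    first = points[0]
    # Deterministic start: the first point and the point farthest from it.
    second = max(points, key=lambda p: math.dist(p, first))
    centers = [first, second]
    groups = [[], []]
    for _ in range(rounds):
        groups = [[], []]
        for p in points:
            nearer = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            groups[nearer].append(p)
        # Recompute each center as the mean of its group (keep it if empty).
        centers = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else center
            for group, center in zip(groups, centers)
        ]
    return groups, centers

def separation(groups, centers):
    """Distance between the centers relative to the widest cluster."""
    spread = max(math.dist(p, c) for g, c in zip(groups, centers) for p in g)
    return math.dist(centers[0], centers[1]) / spread if spread else math.inf

# Invented (slant, relative height) features for glyphs labelled "l".
upright = [(0.0, 1.0), (0.1, 1.1), (0.05, 0.95)]
cursive = [(0.9, 1.0), (1.0, 1.05), (0.95, 0.9)]
groups, centers = two_means(upright + cursive)
print(len(groups[0]), len(groups[1]), separation(groups, centers) > 3)
```

A large separation value would suggest two distinct glyph shapes behind
the same label, which is the point at which one would decide whether
both clusters are "l" or one of them never is.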

(Here you would check again that the attachment is indeed present.)

Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

(image/png attachment: diff.png)

Received on Monday, 9 April 2012 23:38:07 UTC
