
Grave condition: macron suffers acute diaeresis...

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Wed, 25 Apr 2012 01:32:37 +0200
To: www-archive@w3.org
Message-ID: <0bcep7d1h6hj1to0885k0ljeb53gimsgci@hive.bjoern.hoehrmann.de>

  I picked some page and, using https://gist.github.com/2395307 (after I
stripped out the parts to merge the OCR data with other OCR data), I
extracted all the shapes the Internet Archive software considers to be a
"ä" and compared all instances to each other. Actually I started with an
image, http://www.websitedev.de/temp/d-stddev-inverted.png, computed from
the "d" shapes on the page: each pixel in the image is proportional to
the standard deviation of all pixels at that position across all "d"
shapes. Looking at that and similar results I figured I have to account
a little bit for the minor differences among the monochrome shapes due
to rounding errors and similar issues. The dark parts in the image
basically indicate that there is a good bit of fluctuation at those
points, while white parts indicate stability: they are more or less
always the same colour.
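
As a rough sketch of the per-pixel standard deviation described above
(not the script actually used), assuming every shape has already been
padded to the same size and is stored as rows of 0 (white) and 1 (ink);
the names pixel_stddev and to_inverted_grey are made up for this
illustration:

```python
from math import sqrt


def pixel_stddev(shapes):
    """Per-pixel population standard deviation across equally sized bitmaps.

    Each shape is a list of rows; each row is a list of 0 (white) or 1 (ink).
    Assumption: all shapes share the same width and height.
    """
    n = len(shapes)
    height, width = len(shapes[0]), len(shapes[0][0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            values = [shape[y][x] for shape in shapes]
            mean = sum(values) / n
            variance = sum((v - mean) ** 2 for v in values) / n
            row.append(sqrt(variance))
        result.append(row)
    return result


def to_inverted_grey(stddev, max_value=0.5):
    """Map deviations to grey levels 0..255, inverted: high deviation is dark.

    For 0/1 pixels the standard deviation cannot exceed 0.5, hence the default.
    """
    return [
        [round(255 * (1 - min(v / max_value, 1.0))) for v in row]
        for row in stddev
    ]
```

Pixels that are identical in every shape come out white (255), and
pixels that are ink in half of the shapes come out black (0).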

Now, aligning two images like the images here should be a solved problem
but apparently implementations for that are hard to come by, so I made a
simple brute force script that takes two images and compares them with
an offset of at most three pixels and among all the options picks the
one that produces the least "white" difference image. For illustration I
also manually ordered the shapes according to what I find to be similar
shapes, and the attached document shows the individual shapes in a
comparison matrix with the corresponding difference image after the
slight adjustment. The matrix should be relatively dark around the
diagonal; it might take some getting used to, but generally at least the
top left and the bottom right corner should form fairly obvious
clusters, despite the very primitive comparison logic involved here.
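
The brute force alignment could look roughly like the following sketch.
It is not the original script: it assumes monochrome shapes stored as
rows of 0 and 1, treats anything outside a bitmap as background, and
takes "least white" to mean the fewest differing pixels. The names
count_differences and best_offset are hypothetical.

```python
def count_differences(a, b, dx, dy):
    """Count differing pixels when bitmap b is shifted by (dx, dy) over a.

    Pixels outside either bitmap are treated as background (0).
    """
    def pixel(img, x, y):
        if 0 <= y < len(img) and 0 <= x < len(img[0]):
            return img[y][x]
        return 0

    # Cover the union of a and the shifted copy of b.
    y_range = range(min(0, dy), max(len(a), len(b) + dy))
    x_range = range(min(0, dx), max(len(a[0]), len(b[0]) + dx))
    total = 0
    for y in y_range:
        for x in x_range:
            if pixel(a, x, y) != pixel(b, x - dx, y - dy):
                total += 1
    return total


def best_offset(a, b, max_shift=3):
    """Try every shift of at most max_shift pixels; keep the smallest difference.

    Returns (differences, dx, dy). On ties the first shift tried wins.
    """
    best = None
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            score = count_differences(a, b, dx, dy)
            if best is None or score < best[0]:
                best = (score, dx, dy)
    return best
```

Running best_offset on every pair of shapes gives the entries of a
comparison matrix like the one described; the difference count is only
one possible score and ignores how far apart the mismatched pixels are.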

It should also be fairly clear that a cursive 'a with macron' is nothing
like a regular 'a with diaeresis'.
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

Received on Tuesday, 24 April 2012 23:33:09 UTC
