- From: Butler, Mark <mark-h.butler@hp.com>
- Date: Tue, 20 Apr 2004 13:49:59 +0100
- To: SIMILE public list <www-rdf-dspace@w3.org>
Hi team,

I have received an email from Nick Matsakis - he is interested in record linkage and whether there is a potential SIMILE use case. He writes:

> I view this problem as having two primary aspects: the algorithmic and
> the social. The algorithmic aspect is the easy one for computer
> scientists to understand: what techniques are used to find
> corresponding entries across datasets? (e.g. edit distances between
> strings, probabilistic models, etc.)
>
> However, the social aspect is the more important one. Where is the
> data coming from? Who is managing it? How much effort can be expended
> linking or reviewing linked records, and what are the costs of
> incorrect or incomplete linkage? It is only in the context of the
> answers to these questions that competing algorithms can be evaluated.
>
> With these thoughts in mind, it seems reasonable to begin by writing a
> use case scenario that focuses on this issue. Then, a suitable
> dataset can be chosen that is both representative and illustrative of
> this problem. Finally, the algorithmic work can be applied to the
> dataset.

I've started a SIMILE wiki entry on this at http://simile.mit.edu/wiki/DatasetLinkage but I guess MacKenzie and Eric may have some interesting comments here.

Basically, at the moment we use the Levenshtein edit distance to match labels in different collections. We also look up people in the OCLC authority server. Here, if we get back more than a single match, we try to disambiguate using year of birth and year of death (if that information is present; it may not be). I'm not sure which edit distance measure the OCLC server uses, but it must use one, as it returns inexact matches.

We also look up people in the Wikipedia. Sometimes this is helpful for disambiguation, as the Wikipedia contains text which can be used to confirm we have the right individual (e.g. the artist as opposed to the mass murderer), and it may contain year of birth / year of death, although there is no obvious heuristic for extracting that information automatically via screenscraping, so using it for OCLC disambiguation must be done by hand. We automatically separate records with ambiguous matches from those that appear to have unambiguous matches to speed up human review, but really a human should review all computer-generated matches for errors.

The Levenshtein edit distance is not ideal, because we are really looking more for an ad hoc stemming algorithm than for something that catches typing errors. With Levenshtein, though, corpse and corpses are close together (which is good), but so are mountain and fountain (which is bad). I'm trying to gather some resources on this, see http://simile.mit.edu/wiki/EditDistances

However, answering Nick's second set of questions is more difficult.

> Where is the data coming from?

Well, at the moment Artstor, OCW, LOC, and material screenscraped from the Wikipedia - see http://simile.mit.edu/wiki/DatasetsWeCurrentlyUse

> Who is managing it?

We are, but we are not really in the business of managing the data. The approach we have adopted is to have a canonical dataset (the original XML) plus a transform we can run multiple times to get the data into the form we want. The situation for record linkage is slightly different: we generate a set of matches that then needs to be reviewed by hand. I've paid some attention to reducing the human burden, e.g. using heuristics to try to pick the best match automatically and sorting so the human can review the worst records first (see the sketch below).
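To make that last point a bit more concrete, here is a rough, purely illustrative sketch of what I mean - this is not the actual Longwell code (see Authorities.java and Distance.java there for what we really use), and the class name and test data are invented for the example. It scores candidate label pairs with the classic Levenshtein distance and sorts them so the worst (largest-distance) matches come up first for review. It also shows the weakness mentioned above: an inflected form and an unrelated word can both end up at distance 1.

    import java.util.*;

    // Illustrative only - not the actual Longwell code. Scores candidate
    // label pairs with the Levenshtein edit distance and sorts the matches
    // so a human reviewer sees the worst ones first.
    public class MatchSortingSketch {

        // Classic dynamic-programming Levenshtein edit distance.
        static int levenshtein(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++)
                for (int j = 1; j <= b.length(); j++) {
                    int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + cost);
                }
            return d[a.length()][b.length()];
        }

        public static void main(String[] args) {
            // The stemming-versus-typo problem discussed above:
            System.out.println(levenshtein("corpse", "corpses"));    // 1 (good)
            System.out.println(levenshtein("mountain", "fountain")); // 1 (bad)

            // Invented candidate matches: local label -> label from another collection.
            String[][] candidates = {
                { "Yoko Ono", "Ono, Yoko" },
                { "John Graham", "John Gilbert Graham" },
                { "Pablo Picasso", "Pablo Picasso" }
            };

            // Largest distances (worst matches) first, for human review.
            Arrays.sort(candidates, new Comparator<String[]>() {
                public int compare(String[] p, String[] q) {
                    return levenshtein(q[0], q[1]) - levenshtein(p[0], p[1]);
                }
            });
            for (String[] pair : candidates)
                System.out.println(levenshtein(pair[0], pair[1]) + "\t"
                                   + pair[0] + " -> " + pair[1]);
        }
    }

In the real pipeline the score would of course come from whatever distance measure we settle on, and anything at distance zero could be routed straight to the "appears unambiguous" pile.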
Once a human has reviewed the matches, they copy the corrected information across by hand so it can be used in the application. One possible future improvement would be to store the human's decisions, so they only need to be made once and are retained across multiple runs of the algorithm - a bit like adding words to a dictionary in a word processor so you don't get reminded about them each time you run the spell checker (there is a rough sketch of this in the PS below).

> How much effort can be expended linking or reviewing linked records

We want to minimize this, although the automated procedure we have now is a lot better than the by-hand approach we used for the first demo. It's not ideal though ... there are still matches we must be missing.

> and what are the costs of incorrect or incomplete linkage?

We need linkage, as that is what we are interested in, so the more linkage we can generate the better. However, incorrect linkage confuses the user, and it is hard to assign a "cost" to this. For example, given the query "find me all the artworks by Yoko Ono", incomplete linkage might mean I don't get pieces from the period when Yoko Ono was involved in Fluxus. For an example of incorrect linkage, the Artstor collection features an artist called John Graham, but when we look that up in the Wikipedia it gets matched to John Gilbert Graham, the plane bomber who killed 44 people. I would say the latter is less acceptable than the former, but that's a very subjective judgement, driven primarily by the unsavoury nature of the mismatch.

> It is only in the context of the
> answers to these questions that competing algorithms can be evaluated.

Yes, but answering these questions may be hard - comments, anyone? One observation is that we can dramatically improve our accuracy if we have two disjoint pieces of information, e.g. name and year of birth.

For the current code, see Authorities.java and Distance.java in Longwell.

Dr Mark H. Butler
Research Scientist, HP Labs Bristol
http://www-uk.hpl.hp.com/people/marbut
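PS. To make the "word processor dictionary" idea above a little more concrete, here is a very rough sketch of the sort of thing I have in mind. Nothing like this exists in Longwell yet - the class name, the key format, and the use of a properties file are all just invented for illustration.

    import java.io.*;
    import java.util.Properties;

    // Hypothetical sketch: persist the reviewer's accept/reject decisions
    // between runs of the matcher, so the same candidate pair never has to
    // be reviewed twice - much like adding a word to a word processor's
    // dictionary so the spell checker stops flagging it.
    public class DecisionStore {
        private final Properties decisions = new Properties();
        private final File file;

        public DecisionStore(File file) throws IOException {
            this.file = file;
            if (file.exists()) {
                InputStream in = new FileInputStream(file);
                try { decisions.load(in); } finally { in.close(); }
            }
        }

        private static String key(String localLabel, String candidateUri) {
            return localLabel + "|" + candidateUri;
        }

        // Returns "accepted", "rejected", or null if the human has not seen
        // this candidate pair yet.
        public String lookup(String localLabel, String candidateUri) {
            return decisions.getProperty(key(localLabel, candidateUri));
        }

        // Records a decision and writes the whole store back to disk.
        public void record(String localLabel, String candidateUri, boolean accepted)
                throws IOException {
            decisions.setProperty(key(localLabel, candidateUri),
                                  accepted ? "accepted" : "rejected");
            OutputStream out = new FileOutputStream(file);
            try { decisions.store(out, "reviewer decisions"); } finally { out.close(); }
        }
    }

The matcher would consult lookup() before putting a candidate pair in front of the reviewer, and call record() whenever the reviewer makes a decision.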
Received on Tuesday, 20 April 2004 08:50:34 UTC