- From: Butler, Mark <mark-h.butler@hp.com>
- Date: Tue, 20 Apr 2004 13:49:59 +0100
- To: SIMILE public list <www-rdf-dspace@w3.org>
Hi team,

I have received an email from Nick Matsakis - he is interested in record linkage and whether there is a potential SIMILE use case. He writes:

> I view this problem as having two primary aspects: the algorithmic and
> the social. The algorithmic aspect is the easy one for computer
> scientists to understand: what techniques are used to find
> corresponding entries across datasets? (e.g. edit distances between
> strings, probabilistic models, etc.)
>
> However, the social aspect is the more important one. Where is the
> data coming from? Who is managing it? How much effort can be expended
> linking or reviewing linked records, and what are the costs of
> incorrect or incomplete linkage? It is only in the context of the
> answers to these questions that competing algorithms can be evaluated.
>
> With these thoughts in mind, it seems reasonable to begin by writing a
> use case scenario that focuses on this issue. Then, a suitable
> dataset can be chosen that is both representative and illustrative of
> this problem. Finally, the algorithmic work can be applied to the
> dataset.

I've started a SIMILE wiki entry on this at http://simile.mit.edu/wiki/DatasetLinkage but I guess MacKenzie and Eric may have some interesting comments here.

Basically, at the moment we use the Levenshtein edit distance to match labels in different collections. We also look up people in the OCLC authority server. Here, if we get back more than a single match, we try to disambiguate using year of birth and year of death (if that information is present; it may not be). I'm not sure which edit distance measure the OCLC server uses, but it must use one, as it returns inexact matches.

We also look up people in the Wikipedia. Sometimes this is helpful for disambiguation, as the Wikipedia contains text which can be used to confirm we have the right individual (e.g. the artist as opposed to the mass murderer), and it may contain year of birth / year of death, although there is no obvious heuristic for extracting that information automatically via screenscraping, so using it for OCLC disambiguation must be done by hand. We automatically separate records with ambiguous matches from those that appear to have unambiguous matches to speed up human review, but really a human should review all computer-generated matches for errors.

The Levenshtein edit distance is not ideal, because we are really looking more for an ad hoc stemming algorithm than for something that catches typing errors. With Levenshtein, though, corpse and corpses are close together (which is good), but so are mountain and fountain (which is bad). I'm trying to gather some resources on this, see http://simile.mit.edu/wiki/EditDistances

However, answering Nick's second set of questions is more difficult.

> Where is the data coming from?

Well, at the moment Artstor, OCW, LOC, and material screenscraped from the Wikipedia - see http://simile.mit.edu/wiki/DatasetsWeCurrentlyUse

> Who is managing it?

We are, but we are not really in the business of managing the data. The approach we have adopted is to have a canonical dataset (the original XML) plus a transform we can run multiple times to get the data into the form we want. The situation for record linkage is slightly different: we generate a set of matches that then needs to be reviewed by hand. I've paid some attention to reducing the human burden, e.g. using heuristics to try to pick the best match automatically and sorting so the human can review the worst records first (see the sketch below).
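To make that last point a bit more concrete, here is a rough, purely illustrative sketch of what I mean - this is not the actual Longwell code (see Authorities.java and Distance.java there for what we really use), and the class name and test data are invented for the example. It scores candidate label pairs with the classic Levenshtein distance and sorts them so the worst (largest-distance) matches come up first for review. It also shows the weakness mentioned above: an inflected form and an unrelated word can both end up at distance 1.

    import java.util.*;

    // Illustrative only - not the actual Longwell code. Scores candidate
    // label pairs with the Levenshtein edit distance and sorts the matches
    // so a human reviewer sees the worst ones first.
    public class MatchSortingSketch {

        // Classic dynamic-programming Levenshtein edit distance.
        static int levenshtein(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++)
                for (int j = 1; j <= b.length(); j++) {
                    int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + cost);
                }
            return d[a.length()][b.length()];
        }

        public static void main(String[] args) {
            // The stemming-versus-typo problem discussed above:
            System.out.println(levenshtein("corpse", "corpses"));    // 1 (good)
            System.out.println(levenshtein("mountain", "fountain")); // 1 (bad)

            // Invented candidate matches: local label -> label from another collection.
            String[][] candidates = {
                { "Yoko Ono", "Ono, Yoko" },
                { "John Graham", "John Gilbert Graham" },
                { "Pablo Picasso", "Pablo Picasso" }
            };

            // Largest distances (worst matches) first, for human review.
            Arrays.sort(candidates, new Comparator<String[]>() {
                public int compare(String[] p, String[] q) {
                    return levenshtein(q[0], q[1]) - levenshtein(p[0], p[1]);
                }
            });
            for (String[] pair : candidates)
                System.out.println(levenshtein(pair[0], pair[1]) + "\t"
                                   + pair[0] + " -> " + pair[1]);
        }
    }

In the real pipeline the score would of course come from whatever distance measure we settle on, and anything at distance zero could be routed straight to the "appears unambiguous" pile.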
Once a human has reviewed the matches, they copy the corrected information across by hand so it can be used in the application. One possible future improvement would be to store the human's decisions, so they only need to be made once and are retained across multiple runs of the algorithm - a bit like adding words to a dictionary in a word processor so you don't get reminded about them each time you run the spell checker (there is a rough sketch of this in the PS below).

> How much effort can be expended linking or reviewing linked records

We want to minimize this, although the automated procedure we have now is a lot better than the by-hand approach we used for the first demo. It's not ideal though ... there are still matches we must be missing.

> and what are the costs of incorrect or incomplete linkage?

We need linkage, as that is what we are interested in, so the more linkage we can generate the better. However, incorrect linkage confuses the user, and it is hard to assign a "cost" to this. For example, given the query "find me all the artworks by Yoko Ono", incomplete linkage might mean I don't get pieces from the period when Yoko Ono was involved in Fluxus. For an example of incorrect linkage, the Artstor collection features an artist called John Graham, but when we look that up in the Wikipedia it gets matched to John Gilbert Graham, the plane bomber who killed 44 people. I would say the latter is less acceptable than the former, but that's a very subjective judgement, driven primarily by the unsavoury nature of the mismatch.

> It is only in the context of the
> answers to these questions that competing algorithms can be evaluated.

Yes, but answering these questions may be hard - comments, anyone? One observation is that we can dramatically improve our accuracy if we have two disjoint pieces of information, e.g. name and year of birth.

For the current code, see Authorities.java and Distance.java in Longwell.

Dr Mark H. Butler
Research Scientist, HP Labs Bristol
http://www-uk.hpl.hp.com/people/marbut
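PS. To make the "word processor dictionary" idea above a little more concrete, here is a very rough sketch of the sort of thing I have in mind. Nothing like this exists in Longwell yet - the class name, the key format, and the use of a properties file are all just invented for illustration.

    import java.io.*;
    import java.util.Properties;

    // Hypothetical sketch: persist the reviewer's accept/reject decisions
    // between runs of the matcher, so the same candidate pair never has to
    // be reviewed twice - much like adding a word to a word processor's
    // dictionary so the spell checker stops flagging it.
    public class DecisionStore {
        private final Properties decisions = new Properties();
        private final File file;

        public DecisionStore(File file) throws IOException {
            this.file = file;
            if (file.exists()) {
                InputStream in = new FileInputStream(file);
                try { decisions.load(in); } finally { in.close(); }
            }
        }

        private static String key(String localLabel, String candidateUri) {
            return localLabel + "|" + candidateUri;
        }

        // Returns "accepted", "rejected", or null if the human has not seen
        // this candidate pair yet.
        public String lookup(String localLabel, String candidateUri) {
            return decisions.getProperty(key(localLabel, candidateUri));
        }

        // Records a decision and writes the whole store back to disk.
        public void record(String localLabel, String candidateUri, boolean accepted)
                throws IOException {
            decisions.setProperty(key(localLabel, candidateUri),
                                  accepted ? "accepted" : "rejected");
            OutputStream out = new FileOutputStream(file);
            try { decisions.store(out, "reviewer decisions"); } finally { out.close(); }
        }
    }

The matcher would consult lookup() before putting a candidate pair in front of the reviewer, and call record() whenever the reviewer makes a decision.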
Received on Tuesday, 20 April 2004 08:50:34 UTC