RE: Record Linkage in Simile from Nick Matsakis on 2003-10-28 (www-rdf-dspace@w3.org from October 2003)

From: Nick Matsakis <matsakis@mit.edu>
Date: Tue, 28 Oct 2003 14:59:55 -0500 (EST)
To: "Butler, Mark" <Mark_Butler@hplb.hpl.hp.com>
Cc: SIMILE public list <www-rdf-dspace@w3.org>
Message-ID: <Pine.OSX.4.56.0310281432210.3019@artoo.ai.mit.edu>

On Tue, 28 Oct 2003, Butler, Mark wrote:

> in our conceptual model we have first class objects that do not
> represent people, and where appropriate if those objects are about the
> same "thing" then we potentially want to link them.

This is definitely true.  In the vast majority of record linkage
literature, the data is stored as a flat file of records and thus there is
only one type of record being linked; some studies work with
people entities while others work with bibliographic entities.  More
recently, there has been interest in working with relational data and
simultaneously matching up records of different types, for example
matching works, authors, and institutions from multiple sources.  This
problem is the core of my research interests.

> Can you explain the difference between "record linking" and "record
> merging"?

Record linking is the problem of identifying duplicate entities in your
data.  Record merging is the problem of what to do with this knowledge.
For example, suppose at the end of the linking stage our program says
"these three URIs refer to the same person".  What then?  You could add
equivalentTo statements between each pair; you could discard all but one
URI and map the statements about the others to be statements about that
one; or you could keep only the most recent statements, or only the
statements where the records agree.  I'm not suggesting these as good
ideas, but rather pointing out that the choice is nontrivial.  The right
thing to do will ultimately depend on the requirements of the application.
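
To make three of those options concrete, here is a minimal sketch in
Python.  It assumes statements are plain (subject, predicate, object)
tuples; the function names, the example URIs, and the literal predicate
name "equivalentTo" are all hypothetical and chosen only for
illustration.  The "most recent statements" option is left out because it
would need timestamps that this toy model does not carry.

```python
from itertools import combinations

# Assumption: a statement is a (subject, predicate, object) tuple.

def add_equivalence_statements(statements, linked_uris, predicate="equivalentTo"):
    """Keep every statement and assert equivalence between each pair of URIs."""
    extra = [(a, predicate, b) for a, b in combinations(sorted(linked_uris), 2)]
    return list(statements) + extra

def collapse_to_canonical(statements, linked_uris, canonical):
    """Keep one URI and rewrite statements about the others onto it."""
    linked = set(linked_uris)
    merged = []
    for s, p, o in statements:
        s = canonical if s in linked else s
        o = canonical if o in linked else o
        if (s, p, o) not in merged:  # drop duplicates created by the rewrite
            merged.append((s, p, o))
    return merged

def keep_agreed_statements(statements, linked_uris, canonical):
    """Keep only the (predicate, object) pairs that every linked URI asserts."""
    linked = set(linked_uris)
    asserted_by = {}   # (predicate, object) -> linked URIs that assert it
    passthrough = []   # statements about unrelated subjects are left alone
    for s, p, o in statements:
        if s in linked:
            asserted_by.setdefault((p, o), set()).add(s)
        else:
            passthrough.append((s, p, o))
    agreed = [(canonical, p, o)
              for (p, o), uris in asserted_by.items() if uris == linked]
    return passthrough + agreed
```

Each function gives a different answer on the same input, which is the
point: the linker's output alone does not determine the merged result.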

While I believe there is utility in keeping these problems separate,
linking and merging are tightly coupled and it could be advantageous for
the interface between the two stages to be more than simply "these records
should be linked". For example, if your data has many misspellings, the
linker might have an idea of which variant is most likely to be correct.
The merger could be informed of this, rather than solving the same problem
all over again.
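
As a rough sketch of what such a richer interface might look like (in
Python, with every name hypothetical), the linker below returns not just
the set of linked URIs but also a per-field hint about which value it
considers most likely correct.  The hint here is a simple majority vote
over the variants, which is only a stand-in assumption; a real linker
would presumably have better evidence.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class LinkResult:
    """What the linker hands to the merger (hypothetical interface)."""
    uris: frozenset                                 # records judged to be one entity
    preferred: dict = field(default_factory=dict)   # predicate -> likeliest value

def link_with_hints(records):
    """records maps uri -> {predicate: value}; all are assumed already linked.

    Assumption: the most frequent variant of a field is the likeliest one.
    """
    counts = {}
    for fields in records.values():
        for predicate, value in fields.items():
            counts.setdefault(predicate, Counter())[value] += 1
    preferred = {p: c.most_common(1)[0][0] for p, c in counts.items()}
    return LinkResult(frozenset(records), preferred)

def merge(records, result):
    """Build one merged record, letting the linker's hints settle conflicts."""
    merged = {}
    for uri in sorted(result.uris):
        for predicate, value in records[uri].items():
            merged.setdefault(predicate, value)
    merged.update(result.preferred)  # hints override first-seen values
    return merged
```

The merger never re-examines the spelling variants itself; it just
trusts the hints in LinkResult, which is the kind of reuse described
above.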

Nick Matsakis

Received on Tuesday, 28 October 2003 14:59:59 UTC