- From: Nick Matsakis <matsakis@mit.edu>
- Date: Tue, 28 Oct 2003 14:59:55 -0500 (EST)
- To: "Butler, Mark" <Mark_Butler@hplb.hpl.hp.com>
- Cc: SIMILE public list <www-rdf-dspace@w3.org>
On Tue, 28 Oct 2003, Butler, Mark wrote:

> in our conceptual model we have first class objects that do not
> represent people, and where appropriate if those objects are about the
> same "thing" then we potentially want to link them.

This is definitely true. In the vast majority of the record linkage literature, the data is stored as a flat file of records, so only one type of record is being linked: some studies work with person entities, while others work with bibliographic entities. More recently, there has been interest in working with relational data and matching up records of different types simultaneously, for example matching works, authors, and institutions from multiple sources at once. This problem is the core of my research interests.

> Can you explain the difference between "record linking" and "record
> merging"?

Record linking is the problem of identifying duplicate entities in your data. Record merging is the problem of what to do with this knowledge. For example, suppose at the end of the linking stage our program says "these three URIs refer to the same person". What then? You could add an equivalentTo statement between each pair; you could discard all but one URI and map the statements about the others to be statements about that one; you could take the most recent statements, or only the statements where the records agree. I'm not suggesting these as good ideas, but rather to say this is nontrivial. The right thing to do will ultimately depend on the requirements of the application.

While I believe there is utility in keeping these problems separate, linking and merging are tightly coupled, and it could be advantageous for the interface between the two stages to be more than simply "these records should be linked". For example, if your data has many misspellings, the linker might have an idea of which variant is most likely to be correct.
The merger could be informed of this, rather than solving the same problem all over again.

Nick Matsakis
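
To make the linking/merging distinction above concrete, here is a rough sketch of three of the merging options mentioned, applied to a hand-written linker result. Everything in it is an assumption for illustration: the `ex:` URIs, the `name` and `org` predicates, the sample triples, and the function names are all made up, and none of this is meant as a recommended merging policy.

```python
from itertools import combinations

# Hypothetical data: (subject, predicate, object) triples. The URIs and
# predicate names are invented for this example.
triples = [
    ("ex:p1", "name", "Nick Matsakis"),
    ("ex:p2", "name", "Nick Matsakis"),
    ("ex:p3", "name", "N. Matsakis"),
    ("ex:p1", "org", "MIT"),
    ("ex:p2", "org", "MIT"),
    ("ex:p3", "org", "MIT"),
]

# Assumed output of the linking stage: these URIs refer to the same person.
linked = ["ex:p1", "ex:p2", "ex:p3"]


def add_equivalences(triples, linked):
    """Option 1: keep every statement, add equivalentTo between each pair."""
    extra = [(a, "equivalentTo", b) for a, b in combinations(linked, 2)]
    return triples + extra


def collapse_to_canonical(triples, linked):
    """Option 2: keep one URI (arbitrarily the first) and remap the rest."""
    canonical = linked[0]
    members = set(linked)
    merged = []
    for s, p, o in triples:
        t = (canonical if s in members else s, p, o)
        if t not in merged:  # drop exact duplicates created by the remap
            merged.append(t)
    return merged


def keep_agreeing(triples, linked):
    """Option 3: keep only predicates on which all linked records agree."""
    canonical = linked[0]
    members = set(linked)
    values = {}  # predicate -> {uri: set of objects}
    for s, p, o in triples:
        if s in members:
            values.setdefault(p, {}).setdefault(s, set()).add(o)
    # Statements about unrelated subjects pass through unchanged.
    merged = [t for t in triples if t[0] not in members]
    for p, by_uri in values.items():
        sets = list(by_uri.values())
        if len(by_uri) == len(members) and all(v == sets[0] for v in sets):
            merged.extend((canonical, p, o) for o in sorted(sets[0]))
    return merged
```

With this sample data, option 2 keeps both spellings of the name under one URI, while option 3 drops the name entirely because the records disagree. That is the kind of case where a linker that reports which variant it believes is correct would save the merger from guessing.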
Received on Tuesday, 28 October 2003 14:59:59 UTC