FW: Record Linkage in Simile

FYI

-----Original Message-----
From: Nick Matsakis [mailto:matsakis@mit.edu]
Sent: 24 October 2003 17:52
To: Mark_Butler@hplb.hpl.hp.com
Subject: Record Linkage in Simile



Mark,

I'm not sure if you remember, but I'm David Karger's student who is
working on record linkage for RDF.  I've been following the discussion on
the simile list about the issues you are encountering, and I'd like to
volunteer to do some work on this problem for Simile.

For full disclosure, my interest is in developing robust algorithms for
matching up URIs that refer to the "same" entity. I'd like to do this in
some way that is very general, in that it minimizes the amount of
domain-specific knowledge that must be entered.  In other words, I'd like to use
the same algorithm for matching papers that I use for matching authors or
other resources.  I expect that this generality will result in
performance that is not quite as good as that of algorithms that include a
lot of hand-tuning.  However, confirming this will still require me to implement
domain-specific algorithms for comparison, and I would very much like to
compare my stuff to OCLC and others.

So, if you are interested, please let me know what I can do to help out.
At the end of the day, what I would like to have is a corpus of RDF data
with the duplicate entries labelled so that I can test different
algorithms on it, and hopefully a body of code to support such tests.

I've only been following the simile discussions at a high level, so I'm
not sure exactly what your needs are or exactly what the dataflow in
Simile looks like.  Do you have a fixed corpus for which you need to
extract duplicates, or do you expect to be doing this frequently?  What
are the short-term and long-term applications for record linking in
Simile?

Regards,

Nick Matsakis

Received on Monday, 27 October 2003 05:46:28 UTC