- From: Butler, Mark <Mark_Butler@hplb.hpl.hp.com>
- Date: Mon, 27 Oct 2003 10:45:36 -0000
- To: SIMILE public list <www-rdf-dspace@w3.org>
FYI

-----Original Message-----
From: Nick Matsakis [mailto:matsakis@mit.edu]
Sent: 24 October 2003 17:52
To: Mark_Butler@hplb.hpl.hp.com
Subject: Record Linkage in Simile

Mark,

I'm not sure if you remember, but I'm David Karger's student who is working on record linkage for RDF. I've been following the discussion on the simile list about the issues you are encountering, and I'd like to volunteer to do some work on this problem for Simile.

For full disclosure, my interest is in developing robust algorithms for matching up URIs that refer to the "same" entity. I'd like to do this in some way that is very general, in that it minimizes the amount of domain specific knowledge that must be entered. In other words, I'd like to use the same algorithm for matching papers that I use for matching authors or other resources. One of the results I expect is that this will result in performance that is not quite as good as algorithms that include a lot of hand-tuning. However, discovering this will still require me to implement domain-specific algorithms for comparison, and I would very much like to compare my stuff to OCLC and others.

So, if you are interested, please let me know what I can do to help out. At the end of the day, what I would like to have is a corpus of RDF data with the duplicate entries labelled so that I can test different algorithms on it, and hopefully a body of code to support such tests.

I've only been following the simile discussions at a high level, so I'm not sure exactly what your needs are or exactly what the dataflow in Simile looks like. Do you have a fixed corpus for which you need to extract duplicates, or do you expect to be doing this frequently? What are the short term and long term applications for record linking in Simile?

Regards,

Nick Matsakis
Received on Monday, 27 October 2003 05:46:28 UTC