Re: Record Linkage in Simile from Nick Matsakis on 2003-10-27 (www-rdf-dspace@w3.org from October 2003)

From: Nick Matsakis <matsakis@mit.edu>
Date: Mon, 27 Oct 2003 14:29:39 -0500 (EST)
To: Kevin Smathers <kevin.smathers@hp.com>
Cc: SIMILE public list <www-rdf-dspace@w3.org>
Message-ID: <Pine.OSX.4.56.0310271419430.1719@artoo.ai.mit.edu>

On Mon, 27 Oct 2003, Kevin Smathers wrote:

NM> My second requirement for ingestion software is that any record
NM> linkage it does, including name canonicalization, err on the side of
NM> caution. ... Of these, linking distinct entities is the more grave,
NM> for reasons I hope are obvious.

> Not sure I agree here.  When performing a search it is usually better to
> get back extra information that wasn't requested than to miss data that
> was requested.  A user can usually quickly sort out records that don't
> apply, so as long as the extra data is within a small fraction of the
> targeted data there is at least something to work with.

First, I am not suggesting that we never attempt a linkage that may result
in an incorrect match, but rather that such linkages never happen in
software that is simply intended to translate one metadata format to RDF.

In my terminology, ingesting should be simple; digesting can be complex.
The idea here is that we're going to need to write custom software for
each format that we want to import, but it would be nice if the same
frameworks could be used for identifying duplicates and translating schema
once the data is in RDF.
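
As a rough sketch of that split (not actual SIMILE code; the function names,
field names, and urn:example: URIs are all made up for illustration), each
format gets its own small ingester that only translates fields into triples
under a freshly minted URI, while a single shared step looks for possible
duplicates afterwards and reports them without merging anything:

```python
import itertools

_counter = itertools.count(1)

def mint_uri():
    # Hypothetical URI scheme: every ingested source record gets its own URI.
    return f"urn:example:record/{next(_counter)}"

def ingest_csv_row(row):
    """Format-specific ingester: translate fields to triples, no linkage."""
    uri = mint_uri()
    return [(uri, "creator", row["author"]), (uri, "title", row["title"])]

def ingest_format_b(fields):
    """A second format-specific ingester for a different (invented) layout."""
    uri = mint_uri()
    return [(uri, "creator", fields["creator_name"]),
            (uri, "title", fields["main_title"])]

def candidate_duplicates(triples, predicate="title"):
    """Format-independent digest step.

    Groups subject URIs whose value for `predicate` is equal after trimming
    and lowercasing. It only reports the groups; it does not merge them.
    """
    by_value = {}
    for s, p, o in triples:
        if p == predicate:
            by_value.setdefault(o.strip().lower(), []).append(s)
    return [tuple(uris) for uris in by_value.values() if len(uris) > 1]
```

The point of the design is that the matching heuristic lives in one place and
can be made as complex (or as cautious) as needed without touching the
per-format translators.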

On the matter of whether false matches are worse than false misses, I
still think they are.  If you give two distinct resources the
same URI, the result isn't that a user will get an irrelevant record as a
result of a search but rather that a user will get a relevant record with
incorrect information.  This seems worse to me than getting back two
relevant records.
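
A toy illustration of the difference, using plain Python tuples in place of
an RDF store (the data, predicates, and urn:example: URIs are invented):

```python
from collections import defaultdict

def describe(triples):
    """Group (subject, predicate, object) triples into one record per URI."""
    records = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        records[s][p].add(o)
    return records

# Two distinct authors who happen to share a name, kept under separate URIs.
cautious = [
    ("urn:example:person/1", "name", "J. Smith"),
    ("urn:example:person/1", "wrote", "Paper A"),
    ("urn:example:person/2", "name", "J. Smith"),
    ("urn:example:person/2", "wrote", "Paper B"),
]

# The same data after an over-eager linkage step gives both people one URI.
merged = [("urn:example:person/1", p, o) for _, p, o in cautious]

cautious_records = describe(cautious)  # two records, each correct
merged_records = describe(merged)      # one record crediting both papers
```

With separate URIs a search for "J. Smith" returns two records and the user
sorts them out; with a shared URI it returns one record that credits a single
person with both papers, and nothing in the data says that this is wrong.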

Nick

Received on Monday, 27 October 2003 14:29:47 UTC