- From: Nick Matsakis <matsakis@mit.edu>
- Date: Mon, 27 Oct 2003 11:59:26 -0500 (EST)
- To: "Butler, Mark" <Mark_Butler@hplb.hpl.hp.com>
- Cc: SIMILE public list <www-rdf-dspace@w3.org>
On Mon, 27 Oct 2003, Butler, Mark wrote:

> We now have two [corpora]:
> 1. a large one from ArtStor of visual image metadata
> 2. a much smaller one of learning object data from OpenCourseWare

Is my impression correct that, in these corpora, it is only 'people' that are being matched up, rather than, say, published works or other resource classes?

> One of the issues we've been discussing is actually how much of the name
> canonicalization should be done in the XSLT transform, and how much
> should be left to programs that work on the [ingested] RDF

I have faced this problem in bibliographic data extracted from bibtex files. My philosophy is that software that translates other formats into RDF should be split into two functional stages. The first stage should be a simple translation from the native format to RDF; the second stage should apply more complex transformations to the translated RDF to ensure the resulting metadata is as clean as possible. You can think of these stages as first 'ingesting' and then 'digesting' the source metadata. In the case of the current corpora, it seems like the XSLT transformations should be considered 'ingesting' software.

There are two properties that I think ingesting software should have. First, it should translate the source format (vcard, bibtex, vera, ...) into an RDF schema that has a one-to-one mapping between the properties in the source format and those in the RDF schema. This separates domain knowledge about the syntax of the original format from domain knowledge about the semantics of the original format. As a side benefit, it encourages the creation of an RDF schema for the original source format. This allows us to go 'back to the source' when debugging digesting software without literally going back to the source.

My second requirement for ingestion software is that any record linkage it does, including name canonicalization, should err on the side of caution. There are two types of linkage error: linking distinct entities, or failing to link identical entities. Of these, linking distinct entities is the more grave, for reasons I hope are obvious. Thus, it should be unacceptable for ingestion software to assign the same URI to distinct entities, even if the cost is that it will miss some 'obvious' connections. Making those connections is the task of digesting software.

Nick Matsakis
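
A minimal sketch of the ingest/digest split described above, assuming a hypothetical bibtex record; the namespace URIs, field names, and the exact-string linkage rule are illustrative choices, not anything prescribed in the message:

```python
# Stage 1: 'ingest' -- a one-to-one mapping from bibtex fields to an RDF
# schema that mirrors the source format, with no interpretation applied.
BIBTEX_NS = "http://example.org/schema/bibtex#"   # hypothetical namespace

def ingest(record_uri, bibtex_fields):
    """Emit exactly one triple per source field, preserving the source
    property names and literal values verbatim."""
    return [(record_uri, BIBTEX_NS + field, value)
            for field, value in bibtex_fields.items()]

# Stage 2: 'digest' -- cleanup and cautious record linkage over the
# already-ingested RDF.  The linkage rule here is deliberately strict:
# author literals are merged only on an exact string match, so distinct
# people are never assigned the same URI at this stage.
PERSON_NS = "http://example.org/id/person/"       # hypothetical namespace

def digest(triples):
    canonical = {}   # exact author string -> person URI
    derived = []
    for subj, pred, obj in triples:
        if pred == BIBTEX_NS + "author":
            uri = canonical.setdefault(obj, PERSON_NS + str(len(canonical)))
            derived.append((subj, "http://purl.org/dc/elements/1.1/creator", uri))
    return triples + derived

triples = ingest("http://example.org/id/record/1",
                 {"author": "Matsakis, Nick",
                  "title": "Some Paper",
                  "year": "2003"})
for t in digest(triples):
    print(t)
```

The point of the split in this sketch is that stage 1 keeps a faithful RDF copy of the source record under its own schema, so stage 2 can be re-run or debugged against it without going back to the original file, and stage 2's conservative matching prefers missing 'obvious' variants over ever giving two distinct people the same URI.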
Received on Monday, 27 October 2003 12:07:43 UTC