RE: Record Linkage in Simile

Hi Nick

> Is my impression correct that, in these corpora, it is only 'people' that
> are being matched up, rather than say, published works or other resource
> classes?

At the moment we have mainly looked at people, because there is obvious
variation in the format used for people's names. But I'd expect us to want
to match up FRBR group 1, group 2 and group 3 entities, e.g.

i) a LOM with visual image A as subject, and visual image A (group 1
matching)
ii) a LOM about artist B, and a visual image A created by artist B (group 2
matching)
iii) a LOM discussing concept C, and a visual image discussing concept C
(group 3 matching)

It is interesting to note that the library community does have ways of doing
ii) and iii), e.g. authority files and thesauri like the Getty AAT,
respectively.

> My philosophy is that software that translates other formats into
> RDF should be split into two functional stages. The first stage should
> be a simple translation from the native format to RDF, the second
> stage should apply more complex transformations to the translated RDF to
> ensure the resulting metadata is as clean as possible.
> ....
> This separates domain
> knowledge about the syntax of the original format from domain knowledge
> about the semantics of the original format.  As a side benefit, it
> encourages the creation of an RDF schema for the original source format.
> This allows us to go 'back to the source' when debugging digesting
> software without literally going back to the source.

Yes, I think we are in agreement here; I'm keen to see this separation for
the same reason.
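
To make the separation concrete, here is a rough sketch of the two stages.
It is only an illustration under my own assumptions: triples are plain
tuples rather than a real RDF API, the source field names (AUTH, TTL), the
example.org namespaces and the field mapping are all hypothetical, and none
of it is taken from the SIMILE code.

```python
# Hypothetical namespace for a schema describing the source format.
SRC = "http://example.org/source-schema#"
# Dublin Core element set namespace, used here as the target vocabulary.
DC = "http://purl.org/dc/elements/1.1/"


def stage1_translate(record_id, record):
    """Stage 1: mechanical translation of native fields to triples.

    Only knowledge of the source *syntax* is needed; values are untouched.
    """
    subject = "http://example.org/record/" + record_id
    return [(subject, SRC + field, value) for field, value in record.items()]


# Stage 2 knowledge: what the source fields *mean* (an assumed mapping).
FIELD_MAP = {SRC + "AUTH": DC + "creator", SRC + "TTL": DC + "title"}


def stage2_clean(triples):
    """Stage 2: semantic clean-up applied to the stage 1 triples."""
    cleaned = []
    for s, p, o in triples:
        p = FIELD_MAP.get(p, p)        # unmapped predicates pass through
        o = " ".join(o.split())        # collapse stray whitespace
        cleaned.append((s, p, o))
    return cleaned


raw = stage1_translate("42", {"AUTH": "  Butler,  Mark H. ", "TTL": "Record Linkage"})
clean = stage2_clean(raw)
```

Because `raw` is kept, the stage 1 output can be inspected when debugging
stage 2 without re-reading the native record.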

> My second requirement for ingestion software is that any record linkage it
> does, including name canonicalization, err on the side of caution.  ...
> Of these, linking distinct entities
> is the more grave, for reasons I hope are obvious.

That sounds like sensible advice.
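
As a minimal sketch of what "erring on the side of caution" might look like
in code: link two names only when they agree exactly after a very
conservative normalisation, and otherwise keep them separate. The
normalisation rules and the function names are my own assumptions, not
something we have implemented.

```python
def normalise(name):
    """Conservative normalisation: undo a single 'Surname, Forenames'
    inversion, drop punctuation and case, and keep the token order."""
    if name.count(",") == 1:
        surname, forenames = name.split(",")
        name = forenames + " " + surname
    spaced = "".join(c if c.isalnum() else " " for c in name.lower())
    return tuple(spaced.split())


def link_decision(a, b):
    """Return 'link' only on an exact normalised match.

    Anything less certain (missing initials, abbreviations) is left as
    two distinct entities, since wrongly merging them is the worse error.
    """
    return "link" if normalise(a) == normalise(b) else "keep-separate"
```

With this, "Butler, Mark H." and "Mark H. Butler" are linked, while
"Mark Butler" and "Mark H. Butler" are kept separate.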

Are you aware of any literature on name canonicalization? It's such a common
problem, and people have been trying to integrate disparate databases since
the 1970s, so it's possible someone has published a survey paper on this. I
did a quick search this morning, but I'm guessing they may have used a term
other than name canonicalization.

One problem here is that name canonicalization is very locale-dependent
(consider the differences in honorifics between English and French).
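
A small sketch of why the locale matters, assuming a hypothetical
per-locale table of honorifics (the lists below are illustrative and far
from complete): the same leading "M." is an honorific in French but could
just as well be an initial in English.

```python
# Illustrative, incomplete honorific tables keyed by locale (assumed).
HONORIFICS = {
    "en": {"mr", "mrs", "ms", "dr", "prof"},
    "fr": {"m", "mme", "mlle", "dr", "pr"},
}


def strip_honorific(name, locale):
    """Drop a leading honorific if it is one in the given locale."""
    parts = name.split()
    known = HONORIFICS.get(locale, set())
    if len(parts) > 1 and parts[0].rstrip(".").lower() in known:
        return " ".join(parts[1:])
    return name  # unknown locale or no honorific: leave the name alone
```

For "M. Jean Dupont" the French table gives "Jean Dupont", whereas the
English table leaves the name untouched.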

Dr Mark H. Butler
Research Scientist                HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/

Received on Monday, 27 October 2003 12:58:07 UTC