Re: Record Linkage in Simile

>>My second requirement for ingestion software is that any record linkage
>>it does, including name canonicalization, err on the side of caution.  ...
>>Of these, linking distinct entities is the more grave, for reasons I hope
>>are obvious.
>
>That sounds like sensible advice.
>

Not sure I agree here.  When performing a search, it is usually better to
get back extra records that weren't requested than to miss records that
were.  A user can usually sort out the records that don't apply fairly
quickly, so as long as the extra data amounts to only a small fraction of
the targeted data there is at least something to work with.  Compare that
to losing results: the missing data may never be found by a searcher who
may not even be aware that anything is missing.
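
To make the trade-off concrete, here is a toy sketch in Python.  The
records, the "relevant" flags, and both matching rules (strict_match,
loose_match) are invented for illustration; they are not taken from any
Simile code.  The sketch only shows how a cautious rule can miss records
that a looser rule returns along with some noise.

```python
# Toy illustration: strict vs. loose name matching for a search.
# The records and both matching rules are made up for this sketch.

RECORDS = [
    # (record id, name as stored, is this really the person sought?)
    ("r1", "Robert Smith", True),
    ("r2", "Bob Smith", True),
    ("r3", "R. Smith", True),
    ("r4", "Roberta Smith", False),
    ("r5", "Alice Jones", False),
]

def strict_match(query, name):
    """Match only when the full name is identical (case-insensitive)."""
    return query.lower() == name.lower()

def loose_match(query, name):
    """Match whenever the surnames (last tokens) agree."""
    return query.lower().split()[-1] == name.lower().split()[-1]

def evaluate(match, query):
    """Return (precision, recall) of a matching rule over RECORDS."""
    hits = [relevant for _, name, relevant in RECORDS if match(query, name)]
    relevant_total = sum(1 for _, _, relevant in RECORDS if relevant)
    true_hits = sum(1 for relevant in hits if relevant)
    precision = true_hits / len(hits) if hits else 0.0
    recall = true_hits / relevant_total if relevant_total else 0.0
    return precision, recall

if __name__ == "__main__":
    for rule in (strict_match, loose_match):
        print(rule.__name__, evaluate(rule, "Robert Smith"))
```

On this made-up data the strict rule returns only r1 (precision 1.0,
recall 1/3), while the loose rule returns r1 through r4 (precision 0.75,
recall 1.0).  The one extra record is easy for a user to discard; the two
records the strict rule drops are never seen at all.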

>
>Are you aware of any literature on name canonicalization? It's just such
>a common problem, and people have been trying to integrate disparate
>databases since the '70s, so it's possible someone has published a survey
>paper on this. I did a quick search this morning, but I'm guessing they may
>have used a term other than name canonicalization.
>
>One problem here is that name canonicalization is very locale-dependent
>(consider the differences in honorifics between English and French).
>
>  
>
Also name translation (Johannes versus John), simplification (Robert
versus Bob), abbreviation, legal name change (Hillary Rodham versus
Hillary Clinton), non-canonical romanization (Taiwanese versus Chinese
Pinyin), and in some cultures a confusion between matronymic and
patronymic surnames (Madoc ap Owain ab Gwynedd versus Madoc ab Gruffedd)
all make canonicalization problematic in any absolute sense.
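
One way to live with that, sketched below in Python, is to stop looking
for a single canonical form and keep a set of candidate forms per name
instead.  Everything here is an assumption made for illustration: the
variant table GIVEN_NAME_VARIANTS is hypothetical and deliberately tiny,
and candidate_keys and possibly_same are names I made up.  It handles only
given-name variants, so most of the cases listed above remain unsolved.

```python
# Hypothetical, deliberately tiny variant table; a real one would be
# locale-specific and far larger.
GIVEN_NAME_VARIANTS = {
    "bob": {"robert"},
    "johannes": {"john"},
}

def candidate_keys(full_name):
    """Return the set of plausible normalized forms of a name.

    Nothing is collapsed to a single "canonical" value: each known
    variant of the given name yields one more candidate key.
    """
    parts = full_name.lower().replace(".", "").split()
    if not parts:
        return set()
    given, rest = parts[0], parts[1:]
    givens = {given} | GIVEN_NAME_VARIANTS.get(given, set())
    return {" ".join([g] + rest) for g in givens}

def possibly_same(name_a, name_b):
    """True when two names share at least one candidate key."""
    return bool(candidate_keys(name_a) & candidate_keys(name_b))

if __name__ == "__main__":
    print(possibly_same("Bob Smith", "Robert Smith"))
    print(possibly_same("Johannes Kepler", "John Kepler"))
    print(possibly_same("Hillary Rodham", "Hillary Clinton"))
```

With this table the first two pairs come back as possible matches and the
third does not, since a legal name change cannot be recovered from the
name alone.  The result is a candidate link for a search to surface, not
a decision to merge two records.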



-- 
========================================================
   Kevin Smathers                kevin.smathers@hp.com    
   Hewlett-Packard               kevin@ank.com            
   Palo Alto Research Lab                                 
   1501 Page Mill Rd.            650-857-4477 work        
   M/S 1135                      650-852-8186 fax         
   Palo Alto, CA 94304           510-247-1031 home        
========================================================
use "Standard::Disclaimer";
carp("This message was printed on 100% recycled bits.");

Received on Monday, 27 October 2003 14:03:29 UTC