- From: Kevin Smathers <kevin.smathers@hp.com>
- Date: Mon, 27 Oct 2003 10:57:17 -0800
- To: "Butler, Mark" <Mark_Butler@hplb.hpl.hp.com>
- Cc: "'Nick Matsakis'" <matsakis@mit.edu>, SIMILE public list <www-rdf-dspace@w3.org>
>>My second requirement for ingestion software is that any
>>record linkage it
>>does, including name canonicalization, err on the side of
>>caution. ...
>>Of these, linking distinct entities
>>is the more grave, for reasons I hope are obvious.
>>
>>
>
>That sounds like sensible advice.
>
I'm not sure I agree here. When performing a search it is usually better to
get back extra information that wasn't requested than to miss data that
was requested. A user can usually sort out records that don't apply fairly
quickly, so as long as the extra data amounts to only a small fraction of the
targeted data there is at least something to work with. When results are
lost, by contrast, the missing data may never be found, because the searcher
may not even be aware that some data are missing.
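
As a rough illustration of that trade-off, here is a toy sketch in Python. The
records, the relevance labels, and both matching rules are invented for the
example; it is not taken from any actual ingestion code.

```python
# Toy data, invented for illustration: (record id, name, is it the person sought?)
RECORDS = [
    ("r1", "Hillary Rodham", True),
    ("r2", "Hillary Clinton", True),
    ("r3", "H. Clinton", True),
    ("r4", "Bill Clinton", False),
    ("r5", "George Clinton", False),
]

def strict_match(query, name):
    """Cautious linkage: only an exact, case-insensitive match counts."""
    return query.lower() == name.lower()

def loose_match(query, name):
    """Liberal linkage: a shared final (surname) token counts."""
    return query.lower().split()[-1] == name.lower().split()[-1]

def precision_recall(matcher, query):
    """Return (precision, recall) of a matcher against the toy records."""
    hits = [relevant for _, name, relevant in RECORDS if matcher(query, name)]
    relevant_total = sum(1 for _, _, relevant in RECORDS if relevant)
    true_hits = sum(1 for relevant in hits if relevant)
    precision = true_hits / len(hits) if hits else 0.0
    recall = true_hits / relevant_total if relevant_total else 0.0
    return precision, recall

if __name__ == "__main__":
    for matcher in (strict_match, loose_match):
        print(matcher.__name__, precision_recall(matcher, "Hillary Clinton"))
```

On this toy data the strict rule returns nothing irrelevant but silently misses
two of the three relevant records, while the loose rule finds more of them at
the cost of extra records the user has to discard. Neither rule finds the
"Rodham" record.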
>
>Are you aware of any literature on name canonicalization? It's just that it's such
>a common problem: people have been trying to integrate disparate databases
>since the '70s, so it's possible someone has published a survey paper on
>this. I did a quick search this morning, but I'm guessing they may have used
>a term other than name canonicalization.
>
>One problem here is that name canonicalization is very locale-dependent (consider
>the differences in honorifics between English and French).
>
>
>
Also, name translation (Johannes versus John), simplification (Robert
versus Bob), abbreviation, confusion in some cultures between
matronymic and patronymic surnames (Madoc ap Owain ab Gwynedd versus
Madoc ab Gruffedd), legal name change (Hillary Rodham versus Hillary
Clinton), and non-canonical romanization (Taiwanese versus Chinese
Pinyin) all make canonicalization problematic in any absolute sense.
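
One way to live with that, sketched below in Python, is to keep a set of
candidate forms for each name instead of forcing a single canonical form, and
to report "may be the same" without merging the records. The variant table and
both function names are hypothetical; the table holds only the given-name
examples above, and a real one would have to be locale-specific.

```python
# Hypothetical variant groups, taken only from the examples in this message.
GIVEN_NAME_VARIANTS = [
    {"john", "johannes"},   # translation
    {"robert", "bob"},      # simplification
]

def given_name_keys(name):
    """Return every known form that might stand for the same given name."""
    key = name.strip().lower()
    for group in GIVEN_NAME_VARIANTS:
        if key in group:
            return set(group)
    # Unknown names are left alone instead of being guessed at.
    return {key}

def may_be_same_given_name(a, b):
    """True if the two names share at least one candidate form."""
    return bool(given_name_keys(a) & given_name_keys(b))

if __name__ == "__main__":
    print(may_be_same_given_name("Johannes", "John"))
    print(may_be_same_given_name("John", "Robert"))
```

Because nothing is rewritten in place, a wrong entry in the table only produces
an extra candidate for a person to reject; it does not irreversibly link two
distinct people.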
--
========================================================
Kevin Smathers kevin.smathers@hp.com
Hewlett-Packard kevin@ank.com
Palo Alto Research Lab
1501 Page Mill Rd. 650-857-4477 work
M/S 1135 650-852-8186 fax
Palo Alto, CA 94304 510-247-1031 home
========================================================
use "Standard::Disclaimer";
carp("This message was printed on 100% recycled bits.");
Received on Monday, 27 October 2003 14:03:29 UTC