- From: Kevin Smathers <kevin.smathers@hp.com>
- Date: Mon, 27 Oct 2003 10:57:17 -0800
- To: "Butler, Mark" <Mark_Butler@hplb.hpl.hp.com>
- Cc: "'Nick Matsakis'" <matsakis@mit.edu>, SIMILE public list <www-rdf-dspace@w3.org>
>>My second requirement for ingestion software is that any record
>>linkage it does, including name canonicalization, err on the side
>>of caution. ...
>>Of these, linking distinct entities is the more grave, for reasons
>>I hope are obvious.
>
>That sounds like sensible advice.

Not sure I agree here. When performing a search it is usually better to
get back extra information that wasn't requested than to miss data that
was requested. A user can usually sort out records that don't apply
quickly, so as long as the extra data amount to only a small fraction of
the targeted data there is at least something to work with. Compare that
to losing results: the missing data may never be found by the searcher,
who may not even be aware that some data are missing.

>Are you aware of any literature on name canonicalization? It's just
>that it's such a common problem; people have been trying to integrate
>disparate databases since the '70s, so it's possible someone has
>published a survey paper on this. I did a quick search this morning,
>but I'm guessing they may have used another term apart from name
>canonicalization.
>
>One problem here is that name canonicalization is very locale
>dependent (consider the differences in honorifics between English and
>French).

Also name translation (Johannes versus John), simplification (Robert
versus Bob), abbreviation, and in some cultures a confusion between
matronymic and patronymic surnames (Madoc ap Owain ab Gwynedd versus
Madoc ab Gruffedd), legal name change (Hillary Rodham versus Hillary
Clinton), and non-canonical romanization (Taiwanese versus Chinese
Pinyin) all make canonicalization problematic in any absolute sense.

-- 
========================================================
Kevin Smathers                kevin.smathers@hp.com
Hewlett-Packard               kevin@ank.com
Palo Alto Research Lab
1501 Page Mill Rd.            650-857-4477 work
M/S 1135                      650-852-8186 fax
Palo Alto, CA 94304           510-247-1031 home
========================================================
use "Standard::Disclaimer";
carp("This message was printed on 100% recycled bits.");
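
One way to reconcile the two positions in this thread (cautious linkage at ingestion, generous recall at search time) might be to leave stored records unmerged and expand name variants only when matching a query. The following is a minimal sketch under that assumption, not a description of any SIMILE component: the VARIANTS table, the function names, and the sample records are all hypothetical, and a real variant table would need to be locale-specific and far larger.

```python
import unicodedata

# Hypothetical variant table (assumption for illustration only).
VARIANTS = {
    "bob": "robert",
    "johannes": "john",
}


def normalize(name):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in stripped)
    return " ".join(cleaned.lower().split())


def match_key(name):
    """Map each token to its assumed canonical form; used for search only."""
    return tuple(VARIANTS.get(tok, tok) for tok in normalize(name).split())


def search(query, records):
    """Return (exact, possible) matches; stored records are never merged."""
    q_norm, q_key = normalize(query), match_key(query)
    exact, possible = [], []
    for rec in records:
        if normalize(rec) == q_norm:
            exact.append(rec)      # same name after light normalization
        elif match_key(rec) == q_key:
            possible.append(rec)   # same only after variant expansion
    return exact, possible


records = ["Robert Smith", "Bob Smith", "Rob Smyth", "Johannes Kepler"]
exact, possible = search("bob smith", records)
```

Here the searcher sees "Robert Smith" as a possible match for "bob smith" and can discard it if it does not apply, while the two records remain distinct in the store, so a wrong guess in the variant table costs a spurious hit rather than a bad merge.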
Received on Monday, 27 October 2003 14:03:29 UTC