Canonicalizing names (was Re: XSLT script for IMS)

I've been investigating the name formats used in the OCW xml files.  
I've attached a complete listing of the names as found using the 
following xmlstarlet command:

$ xml sel -T -t -m //Entity -v . -n *.xml | sort | uniq >namelist.txt

There are several names here that I would expect to cause trouble:

Gleason's Pictorial
Brown
United States of America
Smithsonian Institution
Glenn Ellison; Sara Ellison
Getty Images
Peters, W. T.
Prof. Joseph Ferriera, Thomas Grayson

The two main formats are "[honorific] firstname lastname[, appellation]" 
and "lastname, firstname [middlename or initial]", but these make up 
fewer than half of the records as a whole.
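
As a rough illustration of how these shapes could be told apart 
mechanically, here is a minimal Python sketch.  The function name, the 
labels, and the honorific list are my own assumptions (only "Prof." 
actually appears in the listing), and this is not part of the XSLT 
under discussion:

```python
import re

# Assumed honorific list; only "Prof." occurs in the attached listing.
HONORIFIC = re.compile(r"^(Prof|Dr|Mr|Mrs|Ms)\.\s+")

def classify(name):
    """Return a rough format label for one Entity string (hypothetical helper)."""
    name = " ".join(name.split())  # trim and collapse whitespace
    if ";" in name:
        return "multiple"
    if HONORIFIC.match(name):
        # An honorific followed by a comma suggests two people in one record.
        return "multiple" if "," in name else "honorific-first-last"
    if "," in name:
        return "last-first"
    if len(name.split()) >= 2:
        return "first-last"
    return "single-word"

for sample in ["Peters, W. T.", "Prof. Tom Leighton", "Brown",
               "Glenn Ellison; Sara Ellison", "Getty Images"]:
    print(sample, "->", classify(sample))
```

Note that "Getty Images" and "United States of America" come out as 
"first-last": organization names cannot be separated from personal 
names by shape alone, which is part of why they cause trouble.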

The OCLC web service does a pretty good job of finding matches in the 
"lastname, firstname [middlename or initial]" case, but only attempts 
word-matches in the "firstname lastname" case and fails completely if 
the honorific is left attached.  To try this yourself, search for 
"Tom Leighton", for example (see MacKenzie's e-mail for the value of 
oclcservice):

$ wget "http://$oclcservice?method=getCompleteSelectedNameAuthority&name=Tom+Leighton&maxList=10&serviceType=rest&isPersonalName=true" -O leighton.tmp
$ xml fo leighton.tmp >leighton.xml


The results are in the second attachment.  As you can see, 'Tom 
Leighton' was matched against 'Wendt, Thomas Leighton' using word-match, 
whereas 'Leighton, Tom' would return a superior phrase-match.
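
Given that behaviour, one option would be to strip the honorific and 
reorder the name before querying.  The Python sketch below is only an 
illustration under that assumption: the helper names and the honorific 
list are hypothetical, and the query parameters are simply copied from 
the wget command above.  It builds the query string only, and does not 
contact the service:

```python
import re
from urllib.parse import urlencode

# Assumed honorific list; only "Prof." occurs in the attached listing.
HONORIFIC = re.compile(r"^(Prof|Dr)\.\s+")

def to_last_first(name):
    """Rewrite 'firstname lastname' as 'lastname, firstname' (hypothetical helper)."""
    name = " ".join(name.split())      # trim and collapse whitespace
    name = HONORIFIC.sub("", name)     # drop a leading honorific
    if "," in name or " " not in name:
        return name                    # already inverted, or a single word
    forenames, surname = name.rsplit(" ", 1)
    return f"{surname}, {forenames}"

def query_string(name, max_list=10):
    """Build the query string, using the parameters from the wget example."""
    return urlencode({
        "method": "getCompleteSelectedNameAuthority",
        "name": to_last_first(name),
        "maxList": max_list,
        "serviceType": "rest",
        "isPersonalName": "true",
    })

print(to_last_first("Prof. Tom Leighton"))
print(query_string("Tom Leighton"))
```

Treating the last word as the surname is a simplification: a record 
such as "Charles Stewart III" would be reordered incorrectly, so 
suffixes would need separate handling.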

The degenerate cases shown above don't yield any useful results from the 
OCLC web service.

Cheers,
-kls

Butler, Mark wrote:

>Hi Kevin
>
>I've written some code that can do some canonicalization on names and
>included it in templateSaxon.xsl - for people who are interested I include a
>code fragment below.
>
>There are still problems with canonicalizing names in the files - for
>example they use John W. Dower and John Dower, but as Andy said there is a
>limit to how much we need to address this problem at the moment. 
>
>Dr Mark H. Butler
>Research Scientist                HP Labs Bristol
>mark-h_butler@hp.com
>Internet: http://www-uk.hpl.hp.com/people/marbut/
>
><!-- This function canonicalizes names of the form "butler, mark" to
>"mark_butler" -->
>
><xsl:function name="str:orderName">
>	<xsl:param name="name"/>
>	<xsl:choose>
>		<xsl:when test="matches($name,'.*,.*')">
>			<xsl:variable name="tokenizedName"
>select="tokenize($name,',')"/>
>			<xsl:variable name="surname"
>select="item-at($tokenizedName,1)"/>
>			<xsl:variable name="forename"
>select="item-at($tokenizedName,2)"/>
>			<xsl:value-of
>select="normalize-space(concat($forename,concat(' ',$surname)))"/>
>		</xsl:when>
>		<xsl:otherwise>
>			<xsl:value-of select="normalize-space($name)"/>
>		</xsl:otherwise>
>	</xsl:choose>
></xsl:function>
>
><!-- This function does further canonicalization on names to turn them into
>URIs
>  1. It strips out spaces and colons and replaces them with underscores.
>  2. It converts the name to lower-case
>-->
>
><xsl:function name="str:canonicalizeName">
>	<xsl:param name="name"/>
>	<xsl:variable name="spacefreename"
>select="replace(replace(str:orderName($name),': ','_'),' ','_')"/>  
>	<xsl:variable
>name="lcletters">abcdefghijklmnopqrstuvwxyz</xsl:variable>
>	<xsl:variable
>name="ucletters">ABCDEFGHIJKLMNOPQRSTUVWXYZ</xsl:variable>
>	<xsl:value-of
>select="translate($spacefreename,$ucletters,$lcletters)"/>
></xsl:function>
>
><xsl:template name="contrib">
>	<lom:Entity>
>		<xsl:attribute
>name="rdf:about">&ocw;contributors#<xsl:value-of
>select="str:canonicalizeName(Entity)"/></xsl:attribute>
>		<vc:FN><xsl:value-of
>select="str:orderName(Entity)"/></vc:FN>
>    	</lom:Entity>
></xsl:template>
>
>  
>


-- 
========================================================
   Kevin Smathers                kevin.smathers@hp.com    
   Hewlett-Packard               kevin@ank.com            
   Palo Alto Research Lab                                 
   1501 Page Mill Rd.            650-857-4477 work        
   M/S 1135                      650-852-8186 fax         
   Palo Alto, CA 94304           510-247-1031 home        
========================================================
use "Standard::Disclaimer";
carp("This message was printed on 100% recycled bits.");

Attachment: namelist.txt

Ahuja, Ravindra K.
Belcher, John
Bob Sauer
Brady, Mathew 
Brown
Brown, Eliphalet, Jr.
Charles Stewart III
Chew, Elaine 
Daniel Jackson
David Cory
David Gossard
Dower, John W.
Eisen
Ellison, Sara Fisher
Ergun, Özlem 
Eric Lander
Ernst, Michael
Evans
Eye Wire Collection
Fernández, José Ramón Díaz
Fillmore, Millard
Franco, Francesco
Frank Solomon
George Barbastathis
Getty Images
Girin, Jo
Gleason's Pictorial
Glenn Ellison; Sara Ellison
Goya, Francisco, 1746-1828.
Grayson, Thomas H.
Hass, P.
Hawks, Francis L.
Heidi Nepf
Heine, William
Heine, WIlliam
Helen M. Hanson
Janet Slifka 
Jeff Freidberg
John Dower
Judson Harward
Kang, James S.
Keiga, Kawahara
Kenneth Stevens
MacRobert, Alan
Miyagawa, Shigeru 
Olivier Blanchard
Perry, Matthew Calbraith
Peter S. Donaldson
Peters, W. T.
Peters, W.T.
 Prof. Amedeo Odoni
Prof. Arnold Barnett
Prof. Bruce Maggs
Prof. Gilbert Strang
Prof. James Orlin
Prof. Jeffrey S. Ravel
Prof. Joseph Ferriera, Thomas Grayson
Prof. Joseph S. Perkell
Prof. Lorene Hoyt
 Prof. Richard Larson
Prof. Shang-Hua Teng
Prof. Shigeru Miyagawa
Prof. Stefanie Shattuck
Prof. Tom Leighton
Ravi Sundaram
Renjo, Shimooka
Robert Weinberg
Rosa Blackwood
Ryder, J.S. 
Ryosenji Museum Collection
Sara Ellison
Shigeru Miyagawa
Shikyo, Hayashi
Smithsonian Institution
Srinivas Devadas
Stephen P. Bell
Steven Lerman
Stuart, C.B. 
Suzanne Flynn
Sweeting, Andrew
Sweeting, Andrew 
Tania A. Baker
Tania Baker
Test Faculty
Thornton, Jayme
United States of America
Unknown
Walter Lewin
Wells, Scott A. 
Woodhouse, Jeremy

Received on Friday, 24 October 2003 13:12:00 UTC