- From: Kevin Smathers <kevin.smathers@hp.com>
- Date: Fri, 24 Oct 2003 10:10:19 -0700
- To: "Butler, Mark" <Mark_Butler@hplb.hpl.hp.com>
- Cc: SIMILE public list <www-rdf-dspace@w3.org>
- Message-ID: <3F995CFB.4070505@hp.com>
I've been investigating the name formats used in the OCW xml files. I've
attached a complete listing of the names as found using the following
xmlstarlet command:

  $ xml sel -T -t -m //Entity -v . -n *.xml | sort | uniq >namelist.txt

There are several names here that I would expect to cause trouble:

  Gleason's Pictorial
  Brown
  United States of America
  Smithsonian Institution
  Glenn Ellison; Sara Ellison
  Getty Images
  Peters, W. T.
  Prof. Joseph Ferriera, Thomas Grayson

The main two formats are "[honorific] firstname lastname[, appellation]"
and "lastname, firstname [middlename or initial]", but these make up fewer
than half of the records as a whole.

The OCLC web service does a pretty good job of finding matches in the
"lastname, firstname [middlename or initial]" case, but only attempts
word-matches in the "firstname lastname" case and fails completely if the
honorific is left attached.

To try this yourself, search for example for "Tom Leighton" (see
MacKenzie's e-mail for the value of oclcservice):

  $ wget "http://$oclcservice?method=getCompleteSelectedNameAuthority&name=Tom+Leighton&maxList=10&serviceType=rest&isPersonalName=true" -O leighton.tmp
  $ xml fo leighton.tmp >leighton.xml

The results are in the second attachment. As you can see, 'Tom Leighton'
was matched against 'Wendt, Thomas Leighton' using word-match, whereas
'Leighton, Tom' would return a superior phrase-match.

The degenerate cases shown above don't yield any useful results from the
OCLC web service.

Cheers,
-kls

Butler, Mark wrote:

>Hi Kevin
>
>I've written some code that can do some canonicalization on names and
>included it in templateSaxon.xsl - for people who are interested I include a
>code fragment below.
>
>There are still problems with canonicalizing names in the files - for
>example they use John W. Dower and John Dower, but as Andy said there is a
>limit to how much we need to address this problem at the moment.
>
>Dr Mark H. Butler
>Research Scientist HP Labs Bristol
>mark-h_butler@hp.com
>Internet: http://www-uk.hpl.hp.com/people/marbut/
>
><!-- This function canonicalizes names of the form "butler, mark" to
>"mark_butler" -->
>
><xsl:function name="str:orderName">
>  <xsl:param name="name"/>
>  <xsl:choose>
>    <xsl:when test="matches($name,'.*,.*')">
>      <xsl:variable name="tokenizedName" select="tokenize($name,',')"/>
>      <xsl:variable name="surname" select="item-at($tokenizedName,1)"/>
>      <xsl:variable name="forename" select="item-at($tokenizedName,2)"/>
>      <xsl:value-of select="normalize-space(concat($forename,concat(' ',$surname)))"/>
>    </xsl:when>
>    <xsl:otherwise>
>      <xsl:value-of select="normalize-space($name)"/>
>    </xsl:otherwise>
>  </xsl:choose>
></xsl:function>
>
><!-- This function does further canonicalization on names to turn them into
>URIs
>  1. It strips out spaces and colons and replaces them with underscores.
>  2. It converts the name to lower-case
>-->
>
><xsl:function name="str:canonicalizeName">
>  <xsl:param name="name"/>
>  <xsl:variable name="spacefreename" select="replace(replace(str:orderName($name),': ','_'),' ','_')"/>
>  <xsl:variable name="lcletters">abcdefghijklmnopqrstuvwxyz</xsl:variable>
>  <xsl:variable name="ucletters">ABCDEFGHIJKLMNOPQRSTUVWXYZ</xsl:variable>
>  <xsl:value-of select="translate($spacefreename,$ucletters,$lcletters)"/>
></xsl:function>
>
><xsl:template name="contrib">
>  <lom:Entity>
>    <xsl:attribute name="rdf:about">&ocw;contributors#<xsl:value-of select="str:canonicalizeName(Entity)"/></xsl:attribute>
>    <vc:FN><xsl:value-of select="str:orderName(Entity)"/></vc:FN>
>  </lom:Entity>
></xsl:template>
>
>
>

-- 
========================================================
Kevin Smathers                kevin.smathers@hp.com
Hewlett-Packard               kevin@ank.com
Palo Alto Research Lab
1501 Page Mill Rd.            650-857-4477 work
M/S 1135                      650-852-8186 fax
Palo Alto, CA 94304           510-247-1031 home
========================================================
use "Standard::Disclaimer";
carp("This message was printed on 100% recycled bits.");
Ahuja, Ravindra K. Belcher, John Bob Sauer Brady, Mathew Brown Brown, Eliphalet, Jr. Charles Stewart III Chew, Elaine Daniel Jackson David Cory David Gossard Dower, John W. Eisen Ellison, Sara Fisher Ergun, Özlem Eric Lander Ernst, Michael Evans Eye Wire Collection Fernández, José Ramón Díaz Fillmore, Millard Franco, Francesco Frank Solomon George Barbastathis Getty Images Girin, Jo Gleason's Pictorial Glenn Ellison; Sara Ellison Goya, Francisco, 1746-1828. Grayson, Thomas H. Hass, P. Hawks, Francis L. Heidi Nepf Heine, William Heine, WIlliam Helen M. Hanson Janet Slifka Jeff Freidberg John Dower Judson Harward Kang, James S. Keiga, Kawahara Kenneth Stevens MacRobert, Alan Miyagawa, Shigeru Olivier Blanchard Perry, Matthew Calbraith Peter S. Donaldson Peters, W. T. Peters, W.T. Prof. Amedeo Odoni Prof. Arnold Barnett Prof. Bruce Maggs Prof. Gilbert Strang Prof. James Orlin Prof. Jeffrey S. Ravel Prof. Joseph Ferriera, Thomas Grayson Prof. Joseph S. Perkell Prof. Lorene Hoyt Prof. Richard Larson Prof. Shang-Hua Teng Prof. Shigeru Miyagawa Prof. Stefanie Shattuck Prof. Tom Leighton Ravi Sundaram Renjo, Shimooka Robert Weinberg Rosa Blackwood Ryder, J.S. Ryosenji Museum Collection Sara Ellison Shigeru Miyagawa Shikyo, Hayashi Smithsonian Institution Srinivas Devadas Stephen P. Bell Steven Lerman Stuart, C.B. Suzanne Flynn Sweeting, Andrew Sweeting, Andrew Tania A. Baker Tania Baker Test Faculty Thornton, Jayme United States of America Unknown Walter Lewin Wells, Scott A. Woodhouse, Jeremy
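
The message reports that the OCLC service phrase-matches names given as "lastname, firstname" but only word-matches "firstname lastname" and fails when an honorific is attached. As a rough illustration only, and not code from the original thread, the Python sketch below reorders the two main formats into surname-first form and builds the query string with the parameter names shown in the wget example. The function names `to_surname_first` and `oclc_query_string` are hypothetical, and every honorific or suffix in the lookup tables other than "Prof." and "III" is an assumption, since only those appear in the name list.

```python
from urllib.parse import urlencode

# Only "Prof." occurs in the name list; the other honorifics are assumptions.
HONORIFICS = ("Prof.", "Dr.", "Mr.", "Ms.", "Mrs.")
# Only "III" occurs in a firstname-lastname entry; the others are assumptions.
SUFFIXES = ("Jr.", "Sr.", "II", "III", "IV")


def to_surname_first(name):
    """Return 'lastname, firstname [middle]' or None if the format is unclear."""
    name = " ".join(name.split())  # collapse runs of whitespace
    if ";" in name:
        return None  # several people in one field, e.g. "Glenn Ellison; Sara Ellison"
    for honorific in HONORIFICS:
        if name.startswith(honorific + " "):
            name = name[len(honorific) + 1:]
            break
    if "," in name:
        head = name.split(",", 1)[0]
        # A space before the first comma suggests "firstname lastname, other person"
        # rather than "lastname, firstname"; give up instead of guessing.
        return None if " " in head else name
    tokens = name.split()
    suffix = tokens.pop() if tokens and tokens[-1] in SUFFIXES else None
    if len(tokens) < 2:
        return None  # bare surname or placeholder, e.g. "Brown", "Unknown"
    reordered = tokens[-1] + ", " + " ".join(tokens[:-1])
    return reordered + ", " + suffix if suffix else reordered


def oclc_query_string(name, max_list=10):
    """Build the query string only; the service host is not given in the message."""
    return urlencode({
        "method": "getCompleteSelectedNameAuthority",
        "name": name,
        "maxList": max_list,
        "serviceType": "rest",
        "isPersonalName": "true",
    })


if __name__ == "__main__":
    for raw in ("Prof. Tom Leighton", "Dower, John W.", "Glenn Ellison; Sara Ellison"):
        print(repr(raw), "->", repr(to_surname_first(raw)))
    print(oclc_query_string("Leighton, Tom"))
```

This heuristic cannot tell corporate names from personal ones: an entry such as "Getty Images" or "Smithsonian Institution" would be reordered as if it were a person, so such entries would still need separate handling, for example a hand-maintained exclusion list.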
Attachments
- text/xml attachment: leighton.xml
Received on Friday, 24 October 2003 13:12:00 UTC