Re: Canonicalizing names (was Re: XSLT script for IMS)

Butler, Mark wrote:

>However I've been looking at the OCW data today and I do think it would be a
>good idea to put the contents of the <Keyword> elements into the OCW RDF.
>While it is true that <taxonpath><source> only has two values - LCSH and CIP
>- there is variation in keyword as shown in the enclosed file.
>  
>

Agreed, and that is why Keyword is included in the transformation.  
According to the IMS RDF specification, LOM-Keyword should be translated 
into dc:subject, which is what I have done.
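For illustration only, here is a minimal Python sketch of the Keyword-to-dc:subject mapping. It is not the XSLT script discussed in this thread. It assumes the <Keyword> elements carry no XML namespace, and the function name and the example.org URI are made up for the example. It also collapses whitespace and drops duplicates, since several keywords in the enclosed listing carry trailing spaces.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def keywords_to_dc_subjects(lom_xml, about):
    """Build an rdf:Description with one dc:subject per distinct Keyword.

    Hypothetical helper: assumes un-namespaced <Keyword> elements.
    """
    root = ET.fromstring(lom_xml)
    desc = ET.Element("{%s}Description" % RDF_NS,
                      {"{%s}about" % RDF_NS: about})
    seen = set()
    for kw in root.iter("Keyword"):
        # Collapse runs of whitespace and trim the ends.
        text = " ".join((kw.text or "").split())
        if text and text not in seen:
            seen.add(text)
            ET.SubElement(desc, "{%s}subject" % DC_NS).text = text
    return desc


sample = """<record>
  <Keyword>Calculus of operations </Keyword>
  <Keyword>Linear algebra</Keyword>
  <Keyword>Linear algebra</Keyword>
</record>"""

# Use the conventional prefixes when serializing.
ET.register_namespace("dc", DC_NS)
ET.register_namespace("rdf", RDF_NS)
desc = keywords_to_dc_subjects(sample, "http://example.org/course/1")
print(ET.tostring(desc, encoding="unicode"))
```

Whether duplicates should be dropped per record or kept as-is is a design choice; the sketch drops them.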

>Dr Mark H. Butler
>Research Scientist                HP Labs Bristol
>mark-h_butler@hp.com
>Internet: http://www-uk.hpl.hp.com/people/marbut/
>
>  
>
>>-----Original Message-----
>>From: Kevin Smathers [mailto:kevin.smathers@hp.com]
>>Sent: 24 October 2003 18:12
>>To: Butler, Mark
>>Cc: SIMILE public list
>>Subject: Canonicalizing names (was Re: XSLT script for IMS)
>>
>>
>>I've been investigating the name formats used in the OCW xml files.  
>>I've attached a complete listing of the names as found using the 
>>following xmlstarlet command:
>>
>>$ xml sel -T -t -m //Entity -v . -n *.xml | sort | uniq >namelist.txt
>>
>>There are several names here that I would expect to cause trouble:
>>
>>Gleason's Pictorial
>>Brown
>>United States of America
>>Smithsonian Institution
>>Glenn Ellison; Sara Ellison
>>Getty Images
>>Peters, W. T.
>>Prof. Joseph Ferriera, Thomas Grayson
>>
>>The two main formats are "[honorific] firstname lastname[, appellation]"
>>and "lastname, firstname [middlename or initial]", but these account
>>for fewer than half of all records.
>>
>>The OCLC web service does a pretty good job of finding matches in the 
>>"lastname, firstname [middlename or initial]" case, but only attempts 
>>word-matches in the "firstname lastname" case and fails completely if 
>>the honorific is left attached.  To reproduce this yourself, try
>>searching for "Tom Leighton" (see MacKenzie's e-mail for the value of
>>oclcservice):
>>
>>$ wget "http://$oclcservice?method=getCompleteSelectedNameAuthority&name=Tom+Leighton&maxList=10&serviceType=rest&isPersonalName=true" -O leighton.tmp
>>$ xml fo leighton.tmp >leighton.xml
>>
>>
>>The results are in the second attachment.  As you can see, 'Tom
>>Leighton' was matched against 'Wendt, Thomas Leighton' using word-match,
>>whereas 'Leighton, Tom' would return a superior phrase-match.
>>
>>The degenerate cases shown above don't yield any useful results from the
>>OCLC web service.
>>
>>Cheers,
>>-kls
>>
>>    
>>
> 
>
>  
>
>------------------------------------------------------------------------
>
>    <Keyword>AM and FM modulation</Keyword>
>    <Keyword>Age of Reason</Keyword>
>    <Keyword>Algebra, Universal</Keyword>
>    <Keyword>Almereyda</Keyword>
>    <Keyword>Ampere's law</Keyword>
>    <Keyword>Basic electric circuits</Keyword>
>    <Keyword>C++</Keyword>
>    <Keyword>C</Keyword>
>    <Keyword>Calculus of operations </Keyword>
>    <Keyword>Concepts of electrostatic field and potential, electrostatic energy</Keyword>
>    <Keyword>Congress</Keyword>
>    <Keyword>Congressional behavior</Keyword>
>    <Keyword>Coulomb's law</Keyword>
>    <Keyword>DNA replication</Keyword>
>    <Keyword>Doppler effect</Keyword>
>    <Keyword>E-commerce</Keyword>
>    <Keyword>Economics</Keyword>
>    <Keyword>Electric currents</Keyword>
>    <Keyword>Electromagnetic waves</Keyword>
>    <Keyword>Faraday's law of induction</Keyword>
>    <Keyword>Fourier transforms</Keyword>
>    <Keyword>Fresnel and Faunhofer diffraction </Keyword>
>    <Keyword>GIS</Keyword>
>    <Keyword>Generalized spaces</Keyword>
>    <Keyword>Industrial engineering</Keyword>
>    <Keyword>Introduction to electromagnetism and electrostatics</Keyword>
>    <Keyword>Java</Keyword>
>    <Keyword>Julie Taymor</Keyword>
>    <Keyword>Kenneth Branagh</Keyword>
>    <Keyword>Kurosawa</Keyword>
>    <Keyword>Laurence Olivier</Keyword>
>    <Keyword>Line geometry</Keyword>
>    <Keyword>Linear algebra</Keyword>
>    <Keyword>MIT</Keyword>
>    <Keyword>Macroeconomics</Keyword>
>    <Keyword>Magnetic materials</Keyword>
>    <Keyword>Management science</Keyword>
>    <Keyword>Markov processes</Keyword>
>    <Keyword>Mathematical analysis </Keyword>
>    <Keyword>Maxwell's equations</Keyword>
>    <Keyword>Mechanical translation</Keyword>
>    <Keyword>Orson Welles</Keyword>
>    <Keyword>Polanski</Keyword>
>    <Keyword>Richard Loncraine</Keyword>
>    <Keyword>Scheme+</Keyword>
>    <Keyword>Scheme</Keyword>
>    <Keyword>Shakespeare</Keyword>
>    <Keyword>Speech</Keyword>
>    <Keyword>Systems engineering</Keyword>
>    <Keyword>TV</Keyword>
>    <Keyword>Time-varying fields</Keyword>
>    <Keyword>Topology</Keyword>
>    <Keyword>Wave optics</Keyword>
>    <Keyword>Zeffirelli</Keyword>
>    <Keyword>abstract types</Keyword>
>    <Keyword>advertising</Keyword>
>    <Keyword>air transportation systems</Keyword>
>    <Keyword>air-water exchange</Keyword>
>    <Keyword>apertures and stops</Keyword>
>    <Keyword>auctions</Keyword>
>    <Keyword>aurora borealis</Keyword>
>    <Keyword>bed-water exchange</Keyword>
>    <Keyword>binary stars</Keyword>
>    <Keyword>black holes</Keyword>
>    <Keyword>blue skies</Keyword>
>    <Keyword>boundary layers</Keyword>
>    <Keyword>bullet trains</Keyword>
>    <Keyword>buoyancy-driven flows</Keyword>
>    <Keyword>car coils</Keyword>
>    <Keyword>catalytic proteins</Keyword>
>    <Keyword>color perception</Keyword>
>    <Keyword>competition</Keyword>
>    <Keyword>computer graphics</Keyword>
>    <Keyword>computer</Keyword>
>    <Keyword>conductors</Keyword>
>    <Keyword>cultural history</Keyword>
>    <Keyword>customer orientation</Keyword>
>    <Keyword>data abstraction</Keyword>
>    <Keyword>data structures</Keyword>
>    <Keyword>denotational semantics</Keyword>
>    <Keyword>dielectrics</Keyword>
>    <Keyword>differential equations</Keyword>
>    <Keyword>digital circuits</Keyword>
>    <Keyword>diode circuits</Keyword>
>    <Keyword>dissolution</Keyword>
>    <Keyword>distribution policy</Keyword>
>    <Keyword>dynamic programming</Keyword>
>    <Keyword>econometrics</Keyword>
>    <Keyword>educational technology</Keyword>
>    <Keyword>eigen values</Keyword>
>    <Keyword>electric charge</Keyword>
>    <Keyword>electric motors</Keyword>
>    <Keyword>electric shock treatment</Keyword>
>    <Keyword>electric structure of matter</Keyword>
>    <Keyword>electrical circuits</Keyword>
>    <Keyword>electro-mechanical devices</Keyword>
>    <Keyword>electrocardiograms</Keyword>
>    <Keyword>electrodynamics</Keyword>
>    <Keyword>empirical economics</Keyword>
>    <Keyword>engineering</Keyword>
>    <Keyword>finance</Keyword>
>    <Keyword>functional programming language</Keyword>
>    <Keyword>gene regulation</Keyword>
>    <Keyword>genetic recombination</Keyword>
>    <Keyword>graphical user interface</Keyword>
>    <Keyword>haloes around sun and moon</Keyword>
>    <Keyword>heuristics</Keyword>
>    <Keyword>highway systems</Keyword>
>    <Keyword>image formation </Keyword>
>    <Keyword>imperative programming language</Keyword>
>    <Keyword>inference</Keyword>
>    <Keyword>integer programming</Keyword>
>    <Keyword>intellectual history</Keyword>
>    <Keyword>interferometers</Keyword>
>    <Keyword>lake systems</Keyword>
>    <Keyword>language</Keyword>
>    <Keyword>large systems</Keyword>
>    <Keyword>lens design</Keyword>
>    <Keyword>lightning</Keyword>
>    <Keyword>linear differential equations </Keyword>
>    <Keyword>linear programming</Keyword>
>    <Keyword>linguistics</Keyword>
>    <Keyword>logistics</Keyword>
>    <Keyword>magnetic fields</Keyword>
>    <Keyword>magnetic levitation</Keyword>
>    <Keyword>marketing</Keyword>
>    <Keyword>mass spectrometers</Keyword>
>    <Keyword>mathematical economics</Keyword>
>    <Keyword>matrix theory </Keyword>
>    <Keyword>media design</Keyword>
>    <Keyword>meta-circular interpreters</Keyword>
>    <Keyword>metal detectors</Keyword>
>    <Keyword>modularity</Keyword>
>    <Keyword>modules</Keyword>
>    <Keyword>molecular diffusion</Keyword>
>    <Keyword>molecules</Keyword>
>    <Keyword>momentum transport in environmental flows</Keyword>
>    <Keyword>monopoly</Keyword>
>    <Keyword>moon</Keyword>
>    <Keyword>multiprocessing</Keyword>
>    <Keyword>musical instruments</Keyword>
>    <Keyword>network optimization</Keyword>
>    <Keyword>neutron stars</Keyword>
>    <Keyword>non-linear programming</Keyword>
>    <Keyword>numerical methods</Keyword>
>    <Keyword>object modeling</Keyword>
>    <Keyword>object oriented programming</Keyword>
>    <Keyword>object oriented</Keyword>
>    <Keyword>ocean transportation systems</Keyword>
>    <Keyword>ogic programming languages</Keyword>
>    <Keyword>oligopoly</Keyword>
>    <Keyword>op-amps</Keyword>
>    <Keyword>operational semantics</Keyword>
>    <Keyword>pacemakers</Keyword>
>    <Keyword>particle accelerators (a.k.a. atom smashers or colliders)</Keyword>
>    <Keyword>phase partitioning</Keyword>
>    <Keyword>philosophy</Keyword>
>    <Keyword>photometry</Keyword>
>    <Keyword>planets</Keyword>
>    <Keyword>polarization</Keyword>
>    <Keyword>political process</Keyword>
>    <Keyword>polymorphism</Keyword>
>    <Keyword>positive definite matrices</Keyword>
>    <Keyword>price discrimination</Keyword>
>    <Keyword>pricing</Keyword>
>    <Keyword>problem solving</Keyword>
>    <Keyword>product strategy</Keyword>
>    <Keyword>programming language</Keyword>
>    <Keyword>programming</Keyword>
>    <Keyword>project management</Keyword>
>    <Keyword>protein binding</Keyword>
>    <Keyword>public management</Keyword>
>    <Keyword>public opinion surveys</Keyword>
>    <Keyword>public policy</Keyword>
>    <Keyword>radio telescopes</Keyword>
>    <Keyword>radiometry</Keyword>
>    <Keyword>radios</Keyword>
>    <Keyword>rainbows</Keyword>
>    <Keyword>ray-tracing</Keyword>
>    <Keyword>red sunsets</Keyword>
>    <Keyword>resolution </Keyword>
>    <Keyword>river systems</Keyword>
>    <Keyword>scalar transport in environmental flows</Keyword>
>    <Keyword>searching</Keyword>
>    <Keyword>settling and coagulation</Keyword>
>    <Keyword>software design</Keyword>
>    <Keyword>software development</Keyword>
>    <Keyword>software testing</Keyword>
>    <Keyword>software</Keyword>
>    <Keyword>sorting</Keyword>
>    <Keyword>space-bandwidth product </Keyword>
>    <Keyword>spatial analysis </Keyword>
>    <Keyword>specification</Keyword>
>    <Keyword>spectral analysis</Keyword>
>    <Keyword>spectroscopy</Keyword>
>    <Keyword>speech disorders</Keyword>
>    <Keyword>speech prosody</Keyword>
>    <Keyword>speech recognition</Keyword>
>    <Keyword>stars</Keyword>
>    <Keyword>statistics</Keyword>
>    <Keyword>stratification in lakes</Keyword>
>    <Keyword>super-novae</Keyword>
>    <Keyword>superconductivity</Keyword>
>    <Keyword>systems of equations</Keyword>
>    <Keyword>telescopes</Keyword>
>    <Keyword>transients</Keyword>
>    <Keyword>transistor circuits</Keyword>
>    <Keyword>turbulent diffusion</Keyword>
>    <Keyword>type systems</Keyword>
>    <Keyword>uniaxial rotation</Keyword>
>    <Keyword>vector spaces</Keyword>
>    <Keyword>voting behavior</Keyword>
>    <Keyword>water transportation</Keyword>
>    <Keyword>wave-guiding </Keyword>
>    <Keyword>waveform analysis</Keyword>
>
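As a rough illustration of the name problem quoted above, the following Python sketch reorders "[honorific] firstname lastname[, appellation]" into "lastname, firstname" before building the query, so that the service can attempt a phrase-match. The honorific list and both function names are hypothetical. The query parameters are copied from the wget example, and the service host is passed in because its value is not given here.

```python
from urllib.parse import urlencode

# Hypothetical honorific list; extend it to whatever appears in namelist.txt.
HONORIFICS = {"prof.", "prof", "dr.", "dr", "mr.", "mrs.", "ms."}


def canonicalize(name):
    """Return 'lastname, firstname ...', or None if no single name is found."""
    name = " ".join(name.split())
    if not name or ";" in name:
        return None  # empty, or several people in one field
    if "," in name:
        head, tail = (p.strip() for p in name.split(",", 1))
        if head and tail and " " not in head:
            return f"{head}, {tail}"  # already 'lastname, firstname'
        name = head  # treat the text after the comma as an appellation
    words = name.split()
    if words and words[0].lower() in HONORIFICS:
        words = words[1:]  # drop the leading honorific
    if len(words) < 2:
        return None  # a bare surname cannot be reordered
    return f"{words[-1]}, {' '.join(words[:-1])}"


def build_query(oclcservice, name):
    """Build the REST URL using the parameters from the wget example."""
    params = {
        "method": "getCompleteSelectedNameAuthority",
        "name": name,
        "maxList": 10,
        "serviceType": "rest",
        "isPersonalName": "true",
    }
    return f"http://{oclcservice}?{urlencode(params)}"


for raw in ["Prof. Tom Leighton", "Peters, W. T.", "Brown"]:
    print(raw, "->", canonicalize(raw))
```

This heuristic cannot tell a corporate name from a personal one, so "Getty Images" or "United States of America" would be reordered incorrectly. It also reads "Prof. Joseph Ferriera, Thomas Grayson" as one name plus an appellation and drops the second person. Entries like these would still need separate handling.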


-- 
========================================================
   Kevin Smathers                kevin.smathers@hp.com    
   Hewlett-Packard               kevin@ank.com            
   Palo Alto Research Lab                                 
   1501 Page Mill Rd.            650-857-4477 work        
   M/S 1135                      650-852-8186 fax         
   Palo Alto, CA 94304           510-247-1031 home        
========================================================
use "Standard::Disclaimer";
carp("This message was printed on 100% recycled bits.");

Received on Monday, 27 October 2003 10:43:38 UTC