RE: Canonicalizing names (was Re: XSLT script for IMS) from Butler, Mark on 2003-10-27 (www-rdf-dspace@w3.org from October 2003)

From: Butler, Mark <Mark_Butler@hplb.hpl.hp.com>
Date: Mon, 27 Oct 2003 14:42:11 -0000
To: SIMILE public list <www-rdf-dspace@w3.org>
Message-ID: <E864E95CB35C1C46B72FEA0626A2E808206212@0-mail-br1.hpl.hp.com>

Hi Kevin

So to summarise the issues:

1. It sounds like it would be desirable to separate honorifics out. For the
OCW case this is easy as the honorifics we are likely to encounter are Prof.
or Professor - a few lines of XSLT would suffice.

Honorifics for the Artstor case are more difficult - they include Bishop,
Queen, Sir etc.

2. For OCW, a bigger problem is that personal and organisational names are
mixed. There are a few instances of this in the Artstor data, but generally
Artstor distinguishes between the two.

3. Another problem with OCW is when two names have been combined in a single
field. We could just regard this as an encoding error?

4. It's unlikely the OCLC web service will solve these issues - we need to
get the name in something approaching a canonicalized form first. 
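
As a rough illustration of points 1, 3 and 4, here is a minimal Python sketch
(not the XSLT mentioned above) of the kind of pre-processing that might get a
name closer to the "lastname, firstname" form before it is sent to the OCLC
service. The function names and the honorific list are assumptions made for
this example. It deliberately does nothing about point 2: it cannot tell
"Smithsonian Institution" from a personal name, and it leaves comma-separated
pairs such as "Prof. Joseph Ferriera, Thomas Grayson" alone.

```python
# Honorifics named in this thread for the OCW data. This list is an
# assumption and would need extending for Artstor (Bishop, Queen, Sir, ...).
HONORIFICS = ("Professor", "Prof.")


def split_honorific(name):
    """Return (honorific, rest); honorific is '' when none is found."""
    name = name.strip()
    for h in HONORIFICS:
        if name.startswith(h + " "):
            return h, name[len(h):].strip()
    return "", name


def to_sort_form(name):
    """Turn 'firstname lastname' into 'lastname, firstname'.

    Names that already contain a comma are returned without the honorific
    but otherwise unchanged; single-word names cannot be split and are
    returned as-is.
    """
    _honorific, rest = split_honorific(name)
    if "," in rest:
        return rest
    parts = rest.split()
    if len(parts) < 2:
        return rest
    return parts[-1] + ", " + " ".join(parts[:-1])


def canonicalize(field):
    """Split a field on ';' (point 3) and canonicalize each name in it."""
    return [to_sort_form(n) for n in field.split(";") if n.strip()]
```

For example, canonicalize("Glenn Ellison; Sara Ellison") yields two separate
entries, and canonicalize("Prof. Tom Leighton") yields the single entry
"Leighton, Tom".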

However, as Andy has noted, I don't think it is worth expending much more
time trying to solve these problems now.

That said, I've been looking at the OCW data today and I do think it would be a
good idea to put the contents of the <Keyword> elements into the OCW RDF.
While it is true that <taxonpath><source> only has two values - LCSH and CIP
- there is variation in keyword as shown in the enclosed file.
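
One way to check how much variation there is would be to count the distinct
<Keyword> values in a file. The following is only a sketch: the function name
is made up for this example, and I am assuming nothing about where <Keyword>
sits in the document, so it matches on the local element name wherever it
occurs.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def keyword_counts(xml_text):
    """Count the text of every element whose local name is 'Keyword'."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for elem in root.iter():
        if not isinstance(elem.tag, str):
            continue  # skip anything that is not a normal element
        # Drop any '{namespace}' prefix so the match works with or without one.
        local = elem.tag.rsplit("}", 1)[-1]
        text = "".join(elem.itertext()).strip()
        if local == "Keyword" and text:
            counts[text] += 1
    return counts
```

Calling keyword_counts() on the text of each OCW file and summing the
resulting counters would give a listing comparable to the enclosed file.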

Dr Mark H. Butler
Research Scientist                HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/

> -----Original Message-----
> From: Kevin Smathers [mailto:kevin.smathers@hp.com]
> Sent: 24 October 2003 18:12
> To: Butler, Mark
> Cc: SIMILE public list
> Subject: Canonicalizing names (was Re: XSLT script for IMS)
> 
> 
> I've been investigating the name formats used in the OCW xml files.  
> I've attached a complete listing of the names as found using the 
> following xmlstarlet command:
> 
> $ xml sel -T -t -m //Entity -v . -n *.xml | sort | uniq >namelist.txt
> 
> There are several names here that I would expect to cause trouble:
> 
> Gleason's Pictorial
> Brown
> United States of America
> Smithsonian Institution
> Glenn Ellison; Sara Ellison
> Getty Images
> Peters, W. T.
> Prof. Joseph Ferriera, Thomas Grayson
> 
> The main two formats are "[honorific] firstname lastname[, appellation]",
> and "lastname, firstname [middlename or initial]", but these make up
> fewer than half of the records as a whole.
> 
> The OCLC web service does a pretty good job of finding matches in the 
> "lastname, firstname [middlename or initial]" case, but only attempts 
> word-matches in the "firstname lastname" case and fails completely if 
> the honorific is left attached.  To do this yourself try for example 
> searching for "Tom Leighton" (see MacKenzie's e-mail for the value of 
> oclcservice):
> 
> $ wget "http://$oclcservice?method=getCompleteSelectedNameAuthority&name=Tom+Leighton&maxList=10&serviceType=rest&isPersonalName=true" -O leighton.tmp
> $ xml fo leighton.tmp >leighton.xml
> 
> 
> The results are in the second attachment.  As you can see, 'Tom
> Leighton' was matched against 'Wendt, Thomas Leighton' using word-match,
> whereas 'Leighton, Tom' would return a superior phrase-match.
> 
> The degenerate cases shown above don't yield any useful results from the
> OCLC web service.
> 
> Cheers,
> -kls
>

Attachments

text/plain attachment: keywords.txt

Received on Monday, 27 October 2003 09:42:34 UTC