- From: Butler, Mark <Mark_Butler@hplb.hpl.hp.com>
- Date: Mon, 27 Oct 2003 14:42:11 -0000
- To: SIMILE public list <www-rdf-dspace@w3.org>
- Message-ID: <E864E95CB35C1C46B72FEA0626A2E808206212@0-mail-br1.hpl.hp.com>
Hi Kevin

So to summarise the issues:

1. It sounds like it would be desirable to separate honorifics out. For the OCW case this is easy, as the honorifics we are likely to encounter are Prof. or Professor - a few lines of XSLT would suffice. Honorifics for the Artstor case are more difficult - they include Bishop, Queen, Sir etc.

2. For OCW, a bigger problem is that personal and organisational names are mixed. There are a few instances of this in the Artstor data, but generally Artstor distinguishes between the two.

3. Another problem with OCW is when two names have been combined in a single field. We could just regard this as an encoding error?

4. It's unlikely the OCLC web service will solve these issues - we need to get the name into something approaching a canonicalized form first.

However, as Andy has noted, I don't think it is worth expending much more time trying to solve these problems now.

That said, I've been looking at the OCW data today and I do think it would be a good idea to put the contents of the <Keyword> elements into the OCW RDF. While it is true that <taxonpath><source> only has two values - LCSH and CIP - there is variation in the keywords, as shown in the enclosed file.

Dr Mark H. Butler
Research Scientist
HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/

> -----Original Message-----
> From: Kevin Smathers [mailto:kevin.smathers@hp.com]
> Sent: 24 October 2003 18:12
> To: Butler, Mark
> Cc: SIMILE public list
> Subject: Canonicalizing names (was Re: XSLT script for IMS)
>
> I've been investigating the name formats used in the OCW xml files.
> I've attached a complete listing of the names as found using the
> following xmlstarlet command:
>
> $ xml sel -T -t -m //Entity -v . -n *.xml | sort | uniq >namelist.txt
>
> There are several names here that I would expect to cause trouble:
>
> Gleason's Pictorial
> Brown
> United States of America
> Smithsonian Institution
> Glenn Ellison; Sara Ellison
> Getty Images
> Peters, W. T.
> Prof. Joseph Ferriera, Thomas Grayson
>
> The main two formats are "[honorific] firstname lastname[, appellation]",
> and "lastname, firstname [middlename or initial]", but these make up
> fewer than half of the records as a whole.
>
> The OCLC web service does a pretty good job of finding matches in the
> "lastname, firstname [middlename or initial]" case, but only attempts
> word-matches in the "firstname lastname" case and fails completely if
> the honorific is left attached. To do this yourself try for example
> searching for "Tom Leighton" (see MacKenzie's e-mail for the value of
> oclcservice):
>
> $ wget "http://$oclcservice?method=getCompleteSelectedNameAuthority&name=Tom+Leighton&maxList=10&serviceType=rest&isPersonalName=true" -O leighton.tmp
> $ xml fo leighton.tmp >leighton.xml
>
> The results are in the second attachment. As you can see, 'Tom
> Leighton' was matched against 'Wendt, Thomas Leighton' using
> word-match, whereas 'Leighton, Tom' would return a superior
> phrase-match.
>
> The degenerate cases shown above don't yield any useful results from
> the OCLC web service.
>
> Cheers,
> -kls
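
The clean-up discussed in this thread (dropping an OCW honorific, splitting a field that holds two names, and turning "firstname lastname" into the "lastname, firstname" form that the OCLC service phrase-matches) might be sketched roughly as follows. This is an illustration only, not the script used on the project: the function names are hypothetical, the honorific list assumes only "Prof." and "Professor" occur, and only ";" is treated as a separator because a comma is ambiguous between "Peters, W. T." and "Prof. Joseph Ferriera, Thomas Grayson". Organisational names such as "Smithsonian Institution" are not detected and would need to be filtered out beforehand.

```python
# Assumption: only the OCW honorifics mentioned in the thread are handled.
HONORIFICS = ("Prof.", "Professor")


def split_combined(field):
    """Split a field holding several names joined by ';' (hypothetical helper).

    Commas are deliberately not treated as separators, since they also
    appear inside 'lastname, firstname' entries.
    """
    return [part.strip() for part in field.split(";") if part.strip()]


def strip_honorific(name):
    """Remove a leading honorific followed by a space, if present."""
    for honorific in HONORIFICS:
        if name.startswith(honorific + " "):
            return name[len(honorific):].strip()
    return name


def to_inverted(name):
    """Return 'lastname, firstname' for a 'firstname lastname' string.

    Names that already contain a comma, or that consist of a single word,
    are returned unchanged because their structure cannot be inferred.
    """
    name = strip_honorific(name.strip())
    if "," in name:
        return name
    parts = name.split()
    if len(parts) < 2:
        return name
    return f"{parts[-1]}, {' '.join(parts[:-1])}"


if __name__ == "__main__":
    for raw in ["Prof. Tom Leighton", "Glenn Ellison; Sara Ellison", "Peters, W. T."]:
        print([to_inverted(n) for n in split_combined(raw)])
```

A name such as "Tom Leighton" would come out as "Leighton, Tom", the form reported above to give a phrase-match, while single-word entries like "Brown" pass through unchanged and would still need manual review.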
Attachments
- text/plain attachment: keywords.txt
Received on Monday, 27 October 2003 09:42:34 UTC