- From: Kevin Smathers <kevin.smathers@hp.com>
- Date: Wed, 15 Oct 2003 12:10:36 -0700
- To: "Butler, Mark" <Mark_Butler@hplb.hpl.hp.com>
- Cc: "'www-rdf-dspace@w3.org'" <www-rdf-dspace@w3.org>
Butler, Mark wrote:

>I've also been playing with the XSLT stylesheet to translate the data into
>RDF. I added some code to generate URIs for creators, in order to normalize
>duplicates. Andy spotted that this was including spaces in URIs which is
>illegal, but I was able to fix this by moving to XSLT 2.0 which has several
>new functions for string manipulation. It was possible to attempt to split
>personal_name into forename, surname, DOB and DOD, but there are some
>degenerate cases in the data e.g.
>
>Brosse, Salomon de,c.1562-1626
>Smythson, Robert,1534 or 5-1614
>Lescot, Pierre,ca. 1510-1578
>Goujon, Jean,16th cent
>Catherine de Médicis,Queen, consort of Henry II, King of France,1519-1589
>Pritchard, T. F.,D. 1775
>Paxton, Joseph,Sir,1803-1865
>
>I guess the best way to eliminate these problems would be to only split it
>if DOB and DOD contain numerics or are empty, otherwise not split? Any other
>suggestions here are welcome?
>

I've also taken a brief look at the ArtStor corpus. Nothing by Frank
Lloyd Wright, nor I. M. Pei in their data, so I doubt that architecture
is in the corpus.

Several works of art that I did find include the word 'circa' or an
abbreviation for the same in their annotations of dates, especially the
date that the work of art was created. You could derive a subproperty
ApproximateDate from Date and fill in with the dates found (or what can
be extracted from them).

I hadn't found most of the ones you list above, but Perl patterns could
certainly distinguish between the types that you've presented. At least
the data seems to consistently place dates in the final field.
Catherine de Medicis could be tricky, as would preserving the
honorifics.

More troubling is translation of certain fields; "Hans von Koln" is
sometimes translated as "Hans from Cologne", or "Polymedes of Argos",
and location indicators like 'da', 'de', 'le', and 'von' are sometimes
found in the Surname, sometimes in the next field.
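
As a rough illustration of the pattern-based approach to the sample records quoted above, here is a minimal sketch in Python (chosen only for brevity; the same patterns would carry over to Perl or XSLT 2.0). The function name split_personal_name, the list of "circa" markers, and the result keys are all hypothetical, and the rule is simplified: the final comma-separated field is split off only when it contains digits, and anything that does not look like a plain year range is handed back unparsed.

```python
import re

# Assumed markers for an approximate date, taken from the sample records
# in this thread rather than from a survey of the real data.
APPROX = re.compile(r"\b(?:c\.|ca\.|circa)\s*", re.IGNORECASE)
YEAR_RANGE = re.compile(r"^(\d{3,4})\s*-\s*(\d{3,4})$")


def split_personal_name(value):
    """Split off a trailing date field; leave the name part as found."""
    name, sep, last = value.rpartition(",")
    last = last.strip()
    if not sep or not re.search(r"\d", last):
        # No final field containing digits: do not split at all.
        return {"name": value, "dates": None, "kind": "unsplit"}
    approximate = bool(APPROX.search(last))
    cleaned = APPROX.sub("", last).strip()
    match = YEAR_RANGE.match(cleaned)
    if match:
        years = (int(match.group(1)), int(match.group(2)))
        kind = "approximate" if approximate else "exact"
        return {"name": name, "dates": years, "kind": kind}
    # Digits are present but the shape is unfamiliar ("1534 or 5-1614",
    # "16th cent", "D. 1775"): keep the raw text for later handling.
    return {"name": name, "dates": last, "kind": "unparsed"}


if __name__ == "__main__":
    for record in [
        "Brosse, Salomon de,c.1562-1626",
        "Smythson, Robert,1534 or 5-1614",
        "Goujon, Jean,16th cent",
        "Paxton, Joseph,Sir,1803-1865",
        "Holabird & Root (Chicago, Ill.)",
    ]:
        print(split_personal_name(record))
```

The name part is returned untouched, so honorifics, name order, and particles such as 'de' or 'von' are not interpreted. Records flagged "approximate" are the ones that could populate an ApproximateDate subproperty.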
Also, sometimes the surname and given names are in forward order rather
than reversed, and sometimes the Personal_Name field is used to name
multiple individuals and even companies (Holabird & Root (Chicago,
Ill.)).

May I return to my suggestion that we should not try to canonicalize
Person records, but instead use them as we find them?

-- 
========================================================
Kevin Smathers                kevin.smathers@hp.com
Hewlett-Packard               kevin@ank.com
Palo Alto Research Lab
1501 Page Mill Rd.            650-857-4477 work
M/S 1135                      650-852-8186 fax
Palo Alto, CA 94304           510-247-1031 home
========================================================
use "Standard::Disclaimer";
carp("This message was printed on 100% recycled bits.");
Received on Wednesday, 15 October 2003 15:17:51 UTC