Re: Progress with ArtStor

Butler, Mark wrote:

>I've also been playing with the XSLT stylesheet to translate the data into
>RDF. I added some code to generate URIs for creators, in order to normalize
>duplicates. Andy spotted that this was including spaces in URIs which is
>illegal, but I was able to fix this by moving to XSLT 2.0 which has several
>new functions for string manipulation. It was possible to attempt to split
>personal_name into forename, surname, DOB and DOD, but there are some
>degenerate cases in the data e.g.
>
>Brosse, Salomon de,c.1562-1626
>Smythson, Robert,1534 or 5-1614
>Lescot, Pierre,ca. 1510-1578
>Goujon, Jean,16th cent
>Catherine de Médicis,Queen, consort of Henry II, King of France,1519-1589
>Pritchard, T. F.,D. 1775
>Paxton, Joseph,Sir,1803-1865
>
>I guess the best way to eliminate these problems would be to split it only
>if DOB and DOD contain numerics or are empty, and otherwise leave it
>unsplit. Any other suggestions here are welcome.
>
I've also taken a brief look at the ArtStor corpus.  There is nothing by 
Frank Lloyd Wright or I. M. Pei in their data, so I doubt that 
architecture is in the corpus.  Several works of art that I did find 
include the word 'circa', or an abbreviation of it, in their date 
annotations, especially the date that the work of art was created.

You could derive a subproperty ApproximateDate from Date and fill it in 
with the dates found (or whatever can be extracted from them).  I hadn't 
found most of the ones you list above, but Perl patterns could certainly 
distinguish between the types that you've presented.  At least the data 
seems to place dates consistently in the final field.  Catherine de 
Medicis could be tricky, as would preserving the honorifics.
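
To make the pattern idea concrete, here is a rough sketch (in Python 
rather than Perl).  It assumes that fields are comma-separated and that 
the final field is a date only when it contains a digit.  The function 
names and category labels are invented for illustration, and the patterns 
cover only the sample values quoted above, not the whole corpus.

```python
import re

# Markers of an approximate date seen in the quoted samples: "c.", "ca.", "circa".
APPROX = re.compile(r"^(c\.|ca\.|circa)\s*", re.IGNORECASE)


def split_name(personal_name):
    """Split on commas; treat the final field as a date only if it has a digit."""
    fields = [f.strip() for f in personal_name.split(",")]
    last = fields[-1]
    if len(fields) > 1 and re.search(r"\d", last):
        return fields[:-1], last
    return fields, None


def classify_date(date):
    """Return a coarse label for the date field (labels are hypothetical)."""
    if date is None:
        return "none"
    if re.search(r"\bcent", date):
        return "century"        # e.g. "16th cent"
    if date.upper().startswith("D."):
        return "death-only"     # e.g. "D. 1775"
    if APPROX.match(date) or " or " in date:
        return "approximate"    # e.g. "ca. 1510-1578", "1534 or 5-1614"
    if re.fullmatch(r"\d{4}-\d{4}", date):
        return "exact"          # e.g. "1803-1865"
    return "unknown"


if __name__ == "__main__":
    for sample in ["Lescot, Pierre,ca. 1510-1578", "Paxton, Joseph,Sir,1803-1865"]:
        name_fields, date = split_name(sample)
        print(name_fields, date, classify_date(date))
```

Anything labelled "approximate" or "century" would be a candidate for the 
ApproximateDate subproperty.  The honorifics ("Sir", "Queen") simply stay 
in the name fields untouched.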

More troubling is the translation of certain fields: names such as 
"Hans von Koln" are sometimes translated ("Hans from Cologne", or 
"Polymedes of Argos"), and location indicators like 'da', 'de', 'le', 
and 'von' are sometimes found in the Surname, sometimes in the next 
field.  Also, the surname and given names are sometimes in forward order 
rather than reverse, and the Personal_Name field is sometimes used to 
name multiple individuals and even companies (Holabird & Root (Chicago, Ill.)).
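
One way to see how widespread these cases are, without rewriting any 
records, would be to flag them.  The sketch below is only a heuristic: 
the flag names are made up, the particle list is just the four indicators 
mentioned above, and "no comma" is assumed to mean forward order.

```python
# Location indicators mentioned above; a real list would need to be longer.
PARTICLES = {"da", "de", "le", "von"}


def flag_record(personal_name):
    """Return a set of hypothetical warning flags for a Personal_Name value."""
    flags = set()
    # An ampersand or parenthesis suggests a firm or several individuals.
    if "&" in personal_name or "(" in personal_name:
        flags.add("possible-group-or-company")
    fields = [f.strip() for f in personal_name.split(",")]
    # No comma but several words: the name is probably in forward order.
    if len(fields) == 1 and len(fields[0].split()) > 1:
        flags.add("possibly-forward-order")
    # Record which field each location indicator turns up in.
    for index, field in enumerate(fields):
        if PARTICLES & set(field.lower().split()):
            flags.add(f"particle-in-field-{index}")
    return flags


if __name__ == "__main__":
    for sample in ["Hans von Koln", "Brosse, Salomon de",
                   "Holabird & Root (Chicago, Ill.)"]:
        print(sample, sorted(flag_record(sample)))
```

Flagged records could then be left exactly as found, which fits the 
suggestion below.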

May I return to my suggestion that we should not try to canonicalize 
Person records, but instead use them as we find them?

-- 
========================================================
   Kevin Smathers                kevin.smathers@hp.com    
   Hewlett-Packard               kevin@ank.com            
   Palo Alto Research Lab                                 
   1501 Page Mill Rd.            650-857-4477 work        
   M/S 1135                      650-852-8186 fax         
   Palo Alto, CA 94304           510-247-1031 home        
========================================================
use "Standard::Disclaimer";
carp("This message was printed on 100% recycled bits.");

Received on Wednesday, 15 October 2003 15:17:51 UTC