- From: Butler, Mark <Mark_Butler@hplb.hpl.hp.com>
- Date: Wed, 15 Oct 2003 18:25:53 +0100
- To: "'www-rdf-dspace@w3.org'" <www-rdf-dspace@w3.org>
Hi team,

I have been playing with the artstor corpus. I have extracted two data sets, sample_small.xml and sample_medium.xml, for testing purposes. They contain approximately 100 and 1000 records respectively, so they are much more manageable than the entire dataset. These data sets are in the IPS Sources CVS at simile/corpus/artstor, along with the stylesheet artstor.xsl.

I've also been playing with the XSLT stylesheet that translates the data into RDF. I added some code to generate URIs for creators, in order to normalize duplicates. Andy spotted that this was including spaces in URIs, which is illegal, but I was able to fix this by moving to XSLT 2.0, which has several new functions for string manipulation.

It was possible to attempt to split personal_name into forename, surname, DOB and DOD, but there are some degenerate cases in the data, e.g.

Brosse, Salomon de,c.1562-1626
Smythson, Robert,1534 or 5-1614
Lescot, Pierre,ca. 1510-1578
Goujon, Jean,16th cent
Catherine de Médicis,Queen, consort of Henry II, King of France,1519-1589
Pritchard, T. F.,D. 1775
Paxton, Joseph,Sir,1803-1865

I guess the best way to eliminate these problems would be to split the field only if DOB and DOD contain numerics or are empty, and otherwise leave it unsplit? Any other suggestions here are welcome.

To process the stylesheet you'll need Saxon 7.7, since it uses XSLT 2.0. Apart from this, the data coming out of the stylesheet seems to look okay in N3. I need to adopt a namespace that we have control of, as Andy suggested, but apart from that I think it should be reasonably useful. Also, some of the property names are counterintuitive, which is due to flattening out the nesting in the XML version. For example, I changed creation_date to metadata_creation_date as this made more sense. There may be others that could do with changing.

However, I have broken protocol, as we said at the core meeting on Monday that Andy would work on VRA and I would work on CIDOC.
Andy and I spoke this morning, so he's aware I've been working on this. What I'd like to happen next is to get some feedback on the RDF/XML or N3 output from other team members. However, we can't post the output on the web. I could check it into the CVS, but is that useful for others? Any other suggestions? The small and medium files are 392KB and 4300KB respectively, so I could email the small one to individual team members, but the medium one is too big.

Dr Mark H. Butler
Research Scientist
HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/
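
For anyone who wants to try the "only split if DOB and DOD are numeric or empty" rule for personal_name outside the stylesheet, here is a minimal sketch in Python (not the XSLT 2.0 used in artstor.xsl). It assumes the field has the shape "Surname, Forename,DOB-DOD" and reads "contain numerics" as "consist only of digits". Both are assumptions, as are the names split_personal_name and Creator and the "Doe, Jane" records, which are made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class Creator(NamedTuple):
    """Hypothetical container for a successfully split personal_name."""
    surname: str
    forename: str
    dob: str  # date of birth, digits only or empty
    dod: str  # date of death, digits only or empty


# Assumed shape: "Surname, Forename,DOB-DOD", where DOB and DOD are each
# either empty or made up entirely of digits. Anything else does not match.
_SPLITTABLE = re.compile(r"^([^,]+),\s*([^,]+),\s*(\d*)-(\d*)$")


def split_personal_name(personal_name: str) -> Optional[Creator]:
    """Return the split parts, or None if the value should be left unsplit."""
    match = _SPLITTABLE.match(personal_name.strip())
    if match is None:
        return None
    surname, forename, dob, dod = (part.strip() for part in match.groups())
    return Creator(surname, forename, dob, dod)


# Made-up records that fit the assumed shape are split:
print(split_personal_name("Doe, Jane,1800-1850"))
print(split_personal_name("Doe, Jane,1800-"))

# A degenerate case from the corpus is left alone (prints None):
print(split_personal_name("Lescot, Pierre,ca. 1510-1578"))
```

Under these assumptions all seven degenerate cases listed above return None and so would be left unsplit. A looser pattern could recover some of them, for example by accepting a "c." or "ca." prefix, but only at the cost of guessing what the cataloguer meant.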
Received on Wednesday, 15 October 2003 13:26:29 UTC