Progress with ArtStor from Butler, Mark on 2003-10-15 (www-rdf-dspace@w3.org from October 2003)

From: Butler, Mark <Mark_Butler@hplb.hpl.hp.com>
Date: Wed, 15 Oct 2003 18:25:53 +0100
To: "'www-rdf-dspace@w3.org'" <www-rdf-dspace@w3.org>
Message-ID: <E864E95CB35C1C46B72FEA0626A2E8082061D2@0-mail-br1.hpl.hp.com>

Hi team,

I have been playing with the ArtStor corpus. I have extracted two data sets,
sample_small.xml and sample_medium.xml, for testing purposes. They contain
approximately 100 and 1000 records respectively, so they are much more
manageable than the entire dataset. These data sets are in the IPS Sources
CVS at simile/corpus/artstor, along with the stylesheet artstor.xsl.

I've also been playing with the XSLT stylesheet to translate the data into
RDF. I added some code to generate URIs for creators, in order to normalize
duplicates. Andy spotted that this was including spaces in the URIs, which
is illegal, but I was able to fix this by moving to XSLT 2.0, which has
several new functions for string manipulation. It is also possible to
attempt to split personal_name into forename, surname, DOB and DOD, but
there are some degenerate cases in the data, e.g.

Brosse, Salomon de,c.1562-1626
Smythson, Robert,1534 or 5-1614
Lescot, Pierre,ca. 1510-1578
Goujon, Jean,16th cent
Catherine de Médicis,Queen, consort of Henry II, King of France,1519-1589
Pritchard, T. F.,D. 1775
Paxton, Joseph,Sir,1803-1865

I guess the best way to eliminate these problems would be to split the name
only if DOB and DOD contain numerics or are empty, and otherwise leave it
unsplit. Any other suggestions here are welcome.
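
A minimal sketch of that rule, written in Python rather than XSLT and not
taken from artstor.xsl. It assumes personal_name is comma-separated as
"surname, forename,DOB-DOD", and it reads "contain numerics" strictly, as
"digits only"; both are assumptions. The function name and the returned keys
are hypothetical, and the Wren record is an illustrative well-formed value,
not one taken from the corpus.

```python
import re

# Strict reading of the rule: a date is acceptable if it is all digits or empty.
DIGITS_OR_EMPTY = re.compile(r"\d*")


def split_personal_name(personal_name):
    """Return a dict of name parts, or None if the value should stay unsplit."""
    fields = [field.strip() for field in personal_name.split(",")]
    # Anything other than surname, forename, dates (e.g. an extra title
    # such as "Sir" or "Queen") is treated as a degenerate case.
    if len(fields) != 3:
        return None
    surname, forename, dates = fields
    # partition() leaves dod empty when there is no hyphen at all.
    dob, _, dod = dates.partition("-")
    dob, dod = dob.strip(), dod.strip()
    if not (DIGITS_OR_EMPTY.fullmatch(dob) and DIGITS_OR_EMPTY.fullmatch(dod)):
        return None
    return {"surname": surname, "forename": forename, "dob": dob, "dod": dod}


# Illustrative well-formed record (not from the corpus): gets split.
print(split_personal_name("Wren, Christopher,1632-1723"))
# Degenerate record quoted above: left unsplit.
print(split_personal_name("Lescot, Pierre,ca. 1510-1578"))
```

Under this strict reading all seven degenerate values listed above are left
unsplit, so nothing is mis-parsed, at the cost of losing the dates for
records such as "c.1562-1626" that a looser rule could recover.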

To process the stylesheet you'll need Saxon 7.7, as it uses XSLT 2.0.
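
For anyone who wants to check creator URIs outside the stylesheet, here is a
rough Python sketch of the same idea: derive the URI from personal_name so
that duplicate creators collapse to one resource, and make sure no spaces
survive. It is not the XSLT 2.0 code in artstor.xsl. The base namespace is a
placeholder (we have not yet chosen one), and the underscore-plus-percent-
encoding scheme and the function name are assumptions.

```python
from urllib.parse import quote

# Placeholder namespace, assumed for illustration only.
BASE = "http://example.org/artstor/creator/"


def creator_uri(personal_name):
    """Build a space-free URI for a creator from the raw personal_name."""
    # Collapse runs of whitespace so trivially different spellings of the
    # same name map to the same URI.
    normalized = " ".join(personal_name.split())
    # Replace spaces with underscores, then percent-encode every remaining
    # character that is not an unreserved URI character.
    return BASE + quote(normalized.replace(" ", "_"), safe="")


print(creator_uri("Brosse, Salomon de,c.1562-1626"))
# prints http://example.org/artstor/creator/Brosse%2C_Salomon_de%2Cc.1562-1626
```

Two records with the same personal_name then share a URI, which is the
normalization of duplicates; names that differ in anything other than
whitespace still get separate URIs.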

Apart from this, the data coming out of the stylesheet seems to look okay in
N3. I need to adopt a namespace that we have control of, as Andy suggested,
but apart from that I think it should be reasonably useful. Also, some of
the property names are counterintuitive as a result of flattening out the
nesting in the XML version; for example, I changed creation_date to
metadata_creation_date as this made more sense. There may be others that
could do with changing.

However, I have broken protocol, as we said at the core meeting on Monday
that Andy would work on VRA and I would work on CIDOC. Andy and I spoke this
morning, so he's aware I've been working on this, but what I'd like to
happen next is to get some feedback on the RDF/XML or N3 output from other
team members. However, we can't post the output on the web. I could check it
into the CVS, but is that useful for others? Any other suggestions?

The small and medium files are 392KB and 4300KB respectively, so I could
email the small one to individual team members, but the medium is too big. 

Dr Mark H. Butler
Research Scientist                HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/

Received on Wednesday, 15 October 2003 13:26:29 UTC