Update on corpus: Artstor and IMS

Hi team,

I've finished the work on the XSLT transform and RDFS Schema for Artstor. I
took Andy's schema and my transform, and after some discussion I've managed
to merge them. I think there are just two outstanding issues:

1. The subject field can hold both the subject of the work and indexing
terms. At the moment, all of the subject literal values are turned into
URIs, at Eric's suggestion, e.g.

<http://web.mit.edu/simile/metadata/artstor/id#UCSD_41822000002277>
  vra:subject
  <http://web.mit.edu/simile/metadata/artstor/subject#palm_columns> ,
  <http://web.mit.edu/simile/metadata/artstor/subject#palm_vaults> ,
  <http://web.mit.edu/simile/metadata/artstor/subject#Ribbed_vaults> ,
  <http://web.mit.edu/simile/metadata/artstor/subject#Gothic> ,
  <http://web.mit.edu/simile/metadata/artstor/subject#Toulouse_(France)--Jacobin_church> .

However, Andy suggested we should really only be turning the indexing terms
(the first four in the example) into URIs, not the subject of the work (the
last one in the example). It's difficult to do this without access to some
kind of thesaurus to determine whether a value is an indexing term or not.
For now, though, I suggest we leave this as it is. Any comments?
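
To make the distinction concrete, here is a rough Python sketch of what
Andy's suggestion might look like. It is not what artstor.xsl does. The
function names, the thesaurus set, and the rule of replacing whitespace with
underscores are all assumptions on my part, inferred only from the shape of
the URIs in the example above.

```python
# Namespace taken from the example; everything else is hypothetical.
SUBJECT_NS = "http://web.mit.edu/simile/metadata/artstor/subject#"


def subject_uri(literal):
    """Mint a subject URI by joining the literal's words with underscores.

    Assumption: this mirrors the naming seen in the example URIs.
    """
    return SUBJECT_NS + "_".join(literal.split())


def convert_subjects(literals, thesaurus):
    """Split subject values into URIs (indexing terms) and plain literals.

    `thesaurus` is a hypothetical set of lower-cased indexing terms. Values
    found in it become URIs; everything else is kept as a literal.
    """
    uris, kept = [], []
    for literal in literals:
        if literal.lower() in thesaurus:
            uris.append(subject_uri(literal))
        else:
            kept.append(literal)
    return uris, kept


# Illustrative input; the original literal spellings are assumed.
subjects = ["palm columns", "Ribbed vaults", "Toulouse (France)--Jacobin church"]
thesaurus = {"palm columns", "ribbed vaults"}
uris, kept = convert_subjects(subjects, thesaurus)
```

Without a real thesaurus the lookup set would have to be built by hand,
which is the difficulty described above.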

2. It may be possible to do further processing on the names, as we have
discussed on the list. 

You can find the transform and the schema at:
/simile/corpus/artstor/artstor.xsl
/simile/corpus/artstor/vra-schema.n3

So I guess the next step is to run the transform on the entire Artstor
corpus. To do this we need to:

1. Resolve the internationalization / UTF errors in the corpus. Kevin and I
discussed this and Kevin suggested adding an
encoding="ISO8859-1"
attribute to the XML declaration. Is that okay?

2. Split up the corpus.

3. Run the transform on subsections of it.
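
Steps 1 and 2 could be sketched roughly as follows in Python. This is only
an illustration under some assumptions: that the corpus is a single XML
file whose root element directly contains the records, and that its bytes
really are Latin-1. The function names are hypothetical. The sketch uses the
spelling "ISO-8859-1", which is the form the XML specification recommends;
whether a given parser also accepts "ISO8859-1" I have not checked.

```python
import xml.etree.ElementTree as ET


def declare_encoding(raw, encoding="ISO-8859-1"):
    """Return the document bytes with an XML declaration naming `encoding`.

    Any existing declaration is replaced. The bytes themselves are not
    re-encoded; only the label the parser sees is changed.
    """
    body = raw.lstrip()
    if body.startswith(b"<?xml"):
        body = body[body.index(b"?>") + 2:]
    decl = '<?xml version="1.0" encoding="{}"?>\n'.format(encoding)
    return decl.encode("ascii") + body.lstrip()


def split_records(raw, chunk_size):
    """Yield standalone XML documents of at most `chunk_size` records each.

    Assumption: every child of the root element is one record.
    """
    root = ET.fromstring(raw)
    records = list(root)
    for start in range(0, len(records), chunk_size):
        chunk = ET.Element(root.tag, root.attrib)
        chunk.extend(records[start:start + chunk_size])
        # A non-UTF-8 encoding makes tostring() emit an XML declaration.
        yield ET.tostring(chunk, encoding="ISO-8859-1")


# Tiny made-up corpus containing one Latin-1 byte (0xE9, e-acute).
raw = b"<records><r>caf\xe9</r><r>b</r><r>c</r></records>"
fixed = declare_encoding(raw)
chunks = list(split_records(fixed, 2))
```

Each chunk is a complete document, so it could be written to its own file
and passed to whichever XSLT processor we use to run artstor.xsl.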

Regarding the IMS corpus, Kevin and I checked the RDF / N3 versions of the
corpus, generated from the XML, into CVS. I think it would be better to
keep just the XML version in CVS for now, along with the necessary
stylesheets and makefiles, otherwise the CVS repository gets too big. What
do others think?

Hopefully we are not too far away from making the IMS and Artstor records
available as RDF to the team as a whole. We've mainly been thinking about
doing this using Joseki and Tomcat, right? Will this be sufficient for the
Haystack team, or do we need to make the data available as raw RDF/XML or
N3 as well?

Dr Mark H. Butler
Research Scientist                HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/

Received on Thursday, 30 October 2003 08:41:58 UTC