RE: XSLT script for IMS

Hi Kevin

Thanks for doing this. 

> I don't think I've mentioned xmlstarlet before, but it has a pretty
> nice command-line syntax, and is basically an XML swiss-army knife of
> a tool.

Yes, I've seen this but not used it. It seemed to have good EXSLT support
(EXSLT is a set of community-developed extensions to XSLT, and in some ways
a forerunner of XSLT 2.0).

My first step was to try converting your RDF files to N3 because I find it
easier to examine them. However this hit some snags:

1. Jena didn't like filenames containing spaces, so I removed all the
whitespace from the filenames, then reloaded the files into CVS.

2. I'm afraid I didn't fancy compiling XMLStarlet for Cygwin, so I adapted
your script to do the transforms using Saxon.

3. To make the RDF more readable, I transformed the RDF files to N3 using
Jena, e.g.

for i in RDF/*.rdf
do
	echo "$i"
	java -cp "$CP" jena.rdfcopy "$i" RDF/XML N3 >"N3/$(basename "$i" .rdf).n3"
done

4. Loading the files into Jena identified some syntactic errors in the
RDF/XML output by the XSLT stylesheet, mainly unqualified uses of rdf:about
and rdf:ID values that are not XML names, e.g.

WARN [main] (RDFDefaultErrorHandler.java:29) - file:///C:/jcvs/simile4/simile/corpus/ims/OCW/RDF/1_00IntroToComputersAndEngineeringProblemSolving.rdf[3:182]: {W101} Unqualified use of rdf:about has been deprecated.
WARN [main] (RDFDefaultErrorHandler.java:29) - file:///C:/jcvs/simile4/simile/corpus/ims/OCW/RDF/1_00IntroToComputersAndEngineeringProblemSolving.rdf[24:43]: {W108} Not an XML Name: 'Judson Harward'
WARN [main] (RDFDefaultErrorHandler.java:29) - file:///C:/jcvs/simile4/simile/corpus/ims/OCW/RDF/1_00IntroToComputersAndEngineeringProblemSolving.rdf[29:42]: {W108} Not an XML Name: 'Steven Lerman'
WARN [main] (RDFDefaultErrorHandler.java:29) - file:///C:/jcvs/simile4/simile/corpus/ims/OCW/RDF/1_00IntroToComputersAndEngineeringProblemSolving.rdf[36:236]: {W101} Unqualified use of rdf:about has been deprecated.
etc.

So I changed line 48 of template.xsl from name="about" to name="rdf:about",
and likewise lines 118 and 136. I've committed template.xsl back to CVS
with these problems fixed.
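
A quick way to confirm that no unqualified attributes remain is to scan the
generated RDF/XML before loading it into Jena. The following is only a rough
Python sketch of that idea, not part of the stylesheet or of Jena. The
function name, the list of attribute names and the sample document (with its
example.org namespace) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# RDF syntax attributes that should carry the rdf: prefix.
# This list is an assumption for the sketch, not taken from Jena.
RDF_SYNTAX_ATTRS = {"about", "ID", "resource", "parseType"}


def unqualified_rdf_attrs(xml_text):
    """Return (element tag, attribute name) pairs for RDF attributes with no namespace."""
    hits = []
    for elem in ET.fromstring(xml_text).iter():
        for name in elem.attrib:
            # ElementTree reports namespaced attributes as "{uri}local",
            # so a bare local name means the attribute was unqualified.
            if not name.startswith("{") and name in RDF_SYNTAX_ATTRS:
                hits.append((elem.tag, name))
    return hits


# Hypothetical sample: the first Entity uses a bare "about", the second is qualified.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                     xmlns:lom="http://example.org/lom#">
  <lom:Entity about="http://example.org/a"/>
  <lom:Entity rdf:about="http://example.org/b"/>
</rdf:RDF>"""

print(unqualified_rdf_attrs(SAMPLE))
```

Only the first Entity element in the sample should be reported, since the
second already uses rdf:about.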

The next step was to fix the XML names problem, which was caused by RDF
like this:

    <lom-life:author>
      <lom:Entity rdf:ID="Judson Harward">
        <vc:FN>Judson Harward</vc:FN>
      </lom:Entity>
    </lom-life:author>
    <dc:contributor>
      <lom:Entity rdf:ID="Steven Lerman">
        <vc:FN>Steven Lerman</vc:FN>
      </lom:Entity>

which is being generated by this code in the stylesheet

<xsl:template name="contrib">
    <lom:Entity>
	<xsl:attribute name="rdf:ID">
	    <xsl:value-of select="Entity"/>
	</xsl:attribute>
	<vc:FN><xsl:value-of select="Entity"/></vc:FN>
    </lom:Entity>
</xsl:template>

In the artstor transform I solved this using replace(), but unfortunately
this is an XSLT 2.0 feature, so I don't know if XMLStarlet will support it,
e.g.

<xsl:function name="str:urlencode">
  <xsl:param name="url"/>
  <xsl:value-of select="replace(replace(normalize-space($url),': ','_'),' ','_')"/>
</xsl:function>

<xsl:template name="contrib">
    <lom:Entity>
	<xsl:attribute name="rdf:about">&ocw;contributors#<xsl:value-of select="str:urlencode(Entity)"/></xsl:attribute>
	<vc:FN><xsl:value-of select="Entity"/></vc:FN>
    </lom:Entity>
</xsl:template>

Note that this uses a user-defined function in the str namespace, so you
need to declare this namespace at the top of the stylesheet. I've committed
this revised stylesheet to CVS as templateSaxon.xsl.
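
For anyone who wants to check the mapping without running Saxon, here is a
rough Python equivalent of what str:urlencode does to a name. It is only a
sketch: the function names are mine, and the base URI is assumed to be what
the &ocw; entity expands to, judging by the generated output.

```python
import re


def urlencode(name):
    """Mimic the stylesheet's str:urlencode on a contributor name."""
    # normalize-space(): collapse runs of XML whitespace to one space, then trim the ends.
    collapsed = re.sub(r"[ \t\r\n]+", " ", name).strip(" ")
    # The two nested replace() calls: first ': ', then every remaining space.
    return collapsed.replace(": ", "_").replace(" ", "_")


def contributor_uri(name, base="http://ocw.mit.edu/"):
    """Build the rdf:about value; 'base' is an assumed expansion of &ocw;."""
    return base + "contributors#" + urlencode(name)


print(contributor_uri("Judson Harward"))
print(urlencode("Miyagawa, Shigeru "))
```

Note that commas are left alone, which is why inverted names such as
"Miyagawa, Shigeru" still get a different URI from "Shigeru Miyagawa".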

This generates names of the form

      <lom-life:author>
         <lom:Entity rdf:about="http://ocw.mit.edu/contributors#Judson_Harward">
            <vc:FN>Judson Harward</vc:FN>
         </lom:Entity>
      </lom-life:author>
      <dc:contributor>
         <lom:Entity rdf:about="http://ocw.mit.edu/contributors#Steven_Lerman">
            <vc:FN>Steven Lerman</vc:FN>
         </lom:Entity>
      </dc:contributor>

So it does generate unique URIs for people. Of course we still have the
problem that these URIs are collection-dependent, e.g. we might have a URI
http://ocw.mit.edu/contributors#Judson_Harward
but also a URI
http://web.mit.edu/simile/metadata/artstor/person#Harward,_Judson

but I think that is okay for now. Our main aim in converting the XML to RDF
is to get rid of duplicates within this collection; removing
inter-collection duplicates will come later.

If you look at the N3, you will see that this collapses the names except
where data entry has been inconsistent. For example, in
N3\21F_027J_VisualizingCultures.n3 some names are encoded
surname, forename whereas others are encoded forename surname, e.g.

ocw:contributors#Miyagawa,_Shigeru
      a       lom:Entity ;
      vc:FN   "Miyagawa, Shigeru " .

ocw:contributors#Shigeru_Miyagawa
      a       lom:Entity ;
      vc:FN   "Shigeru Miyagawa" .

In other names there are differences in capitalization:

ocw:contributors#Heine,_WIlliam
      a       lom:Entity ;
      vc:FN   "Heine, WIlliam" .

ocw:contributors#Heine,_William
      a       lom:Entity ;
      vc:FN   "Heine, William" .

It should be possible to write some XSLT code to fix these errors i.e. to
further improve the canonicalisation of the name. I'll take a look at it.
I'll also take a look at the N3 output to see if I can suggest any further
improvements to the transform.
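
To give an idea of the kind of canonicalisation I have in mind, here is a
rough sketch in Python rather than XSLT, so treat it as a statement of the
rules and not as the eventual stylesheet code. The function name is
hypothetical, and the rules (swap "surname, forename" around, then capitalise
each word) are assumptions that will mishandle names such as "McDonald" or
hyphenated forenames.

```python
def canonical_name(raw):
    """Return a 'Forename Surname' form with normalised spacing and capitalisation."""
    # Collapse whitespace runs and trim, so "Miyagawa, Shigeru " loses its trailing space.
    collapsed = " ".join(raw.split())
    if "," in collapsed:
        # Treat "Surname, Forename" as inverted and swap the two halves.
        surname, forename = (part.strip() for part in collapsed.split(",", 1))
        collapsed = f"{forename} {surname}".strip()
    # Capitalise each word; crude, e.g. "WIlliam" becomes "William".
    return " ".join(word.capitalize() for word in collapsed.split(" "))


for raw in ["Miyagawa, Shigeru ", "Shigeru Miyagawa", "Heine, WIlliam", "Heine, William"]:
    print(repr(raw), "->", canonical_name(raw))
```

With these rules the two Miyagawa entries and the two Heine entries above
would each collapse to a single name, and so to a single URI once passed
through the URI-minting step.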

kind regards,

Dr Mark H. Butler
Research Scientist                HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/

Received on Friday, 24 October 2003 06:36:58 UTC