- From: Butler, Mark <Mark_Butler@hplb.hpl.hp.com>
- Date: Fri, 24 Oct 2003 11:27:56 +0100
- To: SIMILE public list <www-rdf-dspace@w3.org>
Hi Kevin

Thanks for doing this.

> I don't think I've mentioned xmlstarlet before, but it has a pretty
> nice command-line syntax, and is basically an XML swiss-army knife of
> a tool.

Yes, I've seen this but not used it. It seemed to have good EXSLT support (EXSLT is like a forerunner of XSLT 2.0; it is a set of extensions developed by XML developers).

My first step was to try converting your RDF files to N3, because I find them easier to examine in that form. However, this hit some snags:

1. Jena didn't like filenames containing spaces, so I removed all the whitespace from the filenames and recommitted the files to CVS.

2. I'm afraid I didn't fancy compiling XMLStarlet for Cygwin, so I adapted your script to do the transforms using Saxon.

3. To make the RDF output more readable, I transformed the RDF files to N3 using Jena, e.g.

    for i in RDF/*.rdf
    do
        echo "$i"
        java -cp "$CP" jena.rdfcopy "$i" RDF/XML N3 > "N3/$(basename "$i" .rdf).n3"
    done

4. Loading the files into Jena identified some syntactic errors in the RDF/XML output by the XSLT stylesheet, mainly unqualified uses of rdf:about and non-XML names, e.g.

    WARN [main] (RDFDefaultErrorHandler.java:29) - file:///C:/jcvs/simile4/simile/corpus/ims/OCW/RDF/1_00IntroToComputersAndEngineeringProblemSolving.rdf[3:182]: {W101} Unqualified use of rdf:about has been deprecated.
    WARN [main] (RDFDefaultErrorHandler.java:29) - file:///C:/jcvs/simile4/simile/corpus/ims/OCW/RDF/1_00IntroToComputersAndEngineeringProblemSolving.rdf[24:43]: {W108} Not an XML Name: 'Judson Harward'
    WARN [main] (RDFDefaultErrorHandler.java:29) - file:///C:/jcvs/simile4/simile/corpus/ims/OCW/RDF/1_00IntroToComputersAndEngineeringProblemSolving.rdf[29:42]: {W108} Not an XML Name: 'Steven Lerman'
    WARN [main] (RDFDefaultErrorHandler.java:29) - file:///C:/jcvs/simile4/simile/corpus/ims/OCW/RDF/1_00IntroToComputersAndEngineeringProblemSolving.rdf[36:236]: {W101} Unqualified use of rdf:about has been deprecated.
    etc.

So I changed line 48 of template.xsl from name="about" to name="rdf:about", and did the same at lines 118 and 136. I've committed template.xsl back to CVS with these problems fixed.

The next step was to fix the XML names problem. This was happening because of RDF like this:

    <lom-life:author>
      <lom:Entity rdf:ID="Judson Harward">
        <vc:FN>Judson Harward</vc:FN>
      </lom:Entity>
    </lom-life:author>
    <dc:contributor>
      <lom:Entity rdf:ID="Steven Lerman">
        <vc:FN>Steven Lerman</vc:FN>
      </lom:Entity>
    </dc:contributor>

which is being generated by this code in the stylesheet:

    <xsl:template name="contrib">
      <lom:Entity>
        <xsl:attribute name="rdf:ID">
          <xsl:value-of select="Entity"/>
        </xsl:attribute>
        <vc:FN><xsl:value-of select="Entity"/></vc:FN>
      </lom:Entity>
    </xsl:template>

The value of rdf:ID has to be an XML name, so it cannot contain spaces. In the artstor transform I solved this using replace(), but unfortunately this is an XSLT 2.0 feature, so I don't know if XMLStarlet will support it - e.g.

    <xsl:function name="str:urlencode">
      <xsl:param name="url"/>
      <xsl:value-of select="replace(replace(normalize-space($url),': ','_'),' ','_')"/>
    </xsl:function>

    <xsl:template name="contrib">
      <lom:Entity>
        <xsl:attribute name="rdf:about">&ocw;contributors#<xsl:value-of select="str:urlencode(Entity)"/></xsl:attribute>
        <vc:FN><xsl:value-of select="Entity"/></vc:FN>
      </lom:Entity>
    </xsl:template>

Note that this uses a user-defined function in the str namespace, so you need to declare that namespace at the top of the stylesheet. I've committed this revised stylesheet to CVS as templateSaxon.xsl. It generates names of the form

    <lom-life:author>
      <lom:Entity rdf:about="http://ocw.mit.edu/contributors#Judson_Harward">
        <vc:FN>Judson Harward</vc:FN>
      </lom:Entity>
    </lom-life:author>
    <dc:contributor>
      <lom:Entity rdf:about="http://ocw.mit.edu/contributors#Steven_Lerman">
        <vc:FN>Steven Lerman</vc:FN>
      </lom:Entity>
    </dc:contributor>

So it does generate unique URIs for people. Of course, we still have the problem that these URIs are collection dependent, e.g. we might have a URI

    http://ocw.mit.edu/contributors#Judson_Harward

but also a URI

    http://web.mit.edu/simile/metadata/artstor/person#Harward,_Judson

but I think that is okay for now. Our main aim in converting the XML to RDF is to get rid of duplicates within this collection; removing inter-collection duplicates will come later.

If you look at the N3, you will see that this collapses the names except where data entry has been inconsistent. For example, in N3\21F_027J_VisualizingCultures.n3 some names are encoded surname, forename whereas others are encoded forename surname, e.g.

    ocw:contributors#Miyagawa,_Shigeru
        a lom:Entity ;
        vc:FN "Miyagawa, Shigeru " .

    ocw:contributors#Shigeru_Miyagawa
        a lom:Entity ;
        vc:FN "Shigeru Miyagawa" .

In other names there are differences in capitalization:

    ocw:contributors#Heine,_WIlliam
        a lom:Entity ;
        vc:FN "Heine, WIlliam" .

    ocw:contributors#Heine,_William
        a lom:Entity ;
        vc:FN "Heine, William" .

It should be possible to write some XSLT code to fix these errors, i.e. to further improve the canonicalisation of the names. I'll take a look at it. I'll also take a look at the N3 output to see if I can suggest any further improvements to the transform.

kind regards,

Dr Mark H. Butler
Research Scientist
HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/
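P.S. The name canonicalisation suggested above could be sketched roughly as follows. This is only an illustration of the logic, written in Python rather than XSLT so the steps are easy to follow; it is not the code in templateSaxon.xsl. The function names canonical_name and contributor_fragment are hypothetical, and the rules (treat a comma as "surname, forename", capitalise the first letter of each word) are assumptions drawn from the examples above, so names such as "McDonald" or hyphenated forenames would need extra care.

```python
def canonical_name(raw: str) -> str:
    """Return an assumed canonical 'Forename Surname' form of a contributor name."""
    # Collapse runs of whitespace and trim the ends ("Miyagawa, Shigeru " has a trailing space).
    name = " ".join(raw.split())
    # Assumption: a comma means the name was entered as "surname, forename".
    if "," in name:
        surname, forename = (part.strip() for part in name.split(",", 1))
        name = f"{forename} {surname}"
    # Capitalise each word so that "WIlliam" and "William" produce the same result.
    return " ".join(word.capitalize() for word in name.split())


def contributor_fragment(raw: str) -> str:
    """Build the part of the URI after 'contributors#' from the canonical name."""
    return canonical_name(raw).replace(" ", "_")


if __name__ == "__main__":
    for entry in ["Miyagawa, Shigeru ", "Shigeru Miyagawa", "Heine, WIlliam", "Heine, William"]:
        print(repr(entry), "->", contributor_fragment(entry))
```

With these rules, both spellings of each name in the example above map to a single fragment, so the duplicate lom:Entity nodes would merge.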
Received on Friday, 24 October 2003 06:36:58 UTC