- From: Conal Tuohy <conal.tuohy@versi.edu.au>
- Date: Mon, 12 Mar 2012 17:17:44 +1100
- To: XProc Dev <xproc-dev@w3.org>
I have today written an XProc pipeline to harvest metadata from an OAI-PMH server. The pipeline works fine, but it includes some odd features which I would like to remove, if I could see how. I would appreciate any advice or criticism. I find I'm having to do this kind of workaround too often in XProc, and I wonder if I'm just not approaching the task correctly?

For those who don't know OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting), it's an HTTP-based protocol for bulk transfer of metadata records in any XML format. The protocol includes a number of "verbs" or operations, only one of which ("ListRecords") is used in this code. The initial ListRecords query specifies a metadata format by its prefix (an alias for an XML schema), and the server returns an XML document containing a list of metadata records, each with a block of header data and a payload in the specified format.

The protocol includes a mechanism for paginating results. If the server decides to return a partial result, it will include a <resumptionToken> element which the OAI-PMH client ("harvester") can then use to issue another query which will resume where the first page of results left off. The last page in a multi-page response will include an empty <resumptionToken> element to signal that there are no more records available. The pipeline below will issue 4 such queries and retrieve about 3MB of XML in 58 files.

For example, see:

<http://andsdb-sc18-test.latrobe.edu.au/oai-pmh/oai?verb=ListRecords&metadataPrefix=tdar>

which returns a resumption token which can then be used to make another query, like so:

<http://andsdb-sc18-test.latrobe.edu.au/oai-pmh/oai?verb=ListRecords&resumptionToken=10,1900,3000,tdar>

The use case here is to harvest a selection of the records returned from the OAI-PMH server. Depending on the name of the XML element in each metadata record, I either write the payload to a file, or else ignore it.

The pipeline is included at the end of this message.
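
To make the paging rule concrete outside XProc, here is a minimal Python sketch of the resumption-token logic described above. It is an illustration only and not part of the pipeline. The function names (first_query_url, next_query_url) and the two stub responses are invented for the example. The base URI and the token value are the ones from the example URLs above. It makes no network requests; it only shows how the next query is derived from a response.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace used by OAI-PMH 2.0 responses (the same URI the pipeline binds to "oai").
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Stub responses, trimmed to the one element that matters here (hypothetical data).
PARTIAL_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>10,1900,3000,tdar</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

LAST_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken/>
  </ListRecords>
</OAI-PMH>"""


def first_query_url(base_uri, metadata_prefix):
    """Build the initial ListRecords query, which names a metadata format."""
    query = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_uri + "?" + urlencode(query)


def next_query_url(base_uri, response_xml):
    """Return the URL for the next page, or None when harvesting is finished.

    A missing or empty <resumptionToken> both mean there is nothing more
    to fetch, mirroring the normalize-space() test in the pipeline.
    """
    root = ET.fromstring(response_xml)
    token = root.findtext(
        "oai:ListRecords/oai:resumptionToken", default="", namespaces=OAI_NS
    ).strip()
    if not token:
        return None
    query = {"verb": "ListRecords", "resumptionToken": token}
    # Keep commas unescaped so the URL matches the form shown above.
    return base_uri + "?" + urlencode(query, safe=",")


if __name__ == "__main__":
    base = "http://andsdb-sc18-test.latrobe.edu.au/oai-pmh/oai"
    print(first_query_url(base, "tdar"))
    print(next_query_url(base, PARTIAL_PAGE))
    print(next_query_url(base, LAST_PAGE))
```

A real harvester would fetch each URL, process the records, and repeat until next_query_url returns None, which is the same job the recursive call to harvester:list-records does in the pipeline.
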

Any criticism is welcome, but specifically I wonder if it's possible to avoid:

1) The p:identity within p:choose[@name='whether-to-save-record']/p:otherwise
2) The p:identity[@name='dummy1'] and p:sink[@name='dummy2']

The second issue I find more irritating. Because the p:otherwise can't be empty, I have to put some step into it, but because the p:when above it doesn't produce output, I can't have a step that produces output in the p:otherwise. So I create some output and discard it.

Thanks in advance!

Conal

PS feel free to run the pipeline

<?xml version="1.0"?>
<p:declare-step version="1.0"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:harvester="tag:conal.tuohy@versi.edu.au,2012:oai-pmh"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/">

  <harvester:list-records name="harvest"
      base-uri="http://ahad.edu.au/oai-pmh/oai"
      base-storage-uri="/tmp/harvest/"
      metadata-prefix="tdar"/>

  <p:declare-step type="harvester:list-records" name="list-records">
    <p:option name="base-uri" required="true"/>
    <p:option name="base-storage-uri" required="true"/>
    <p:option name="metadata-prefix"/>
    <p:option name="resumption-token"/>
    <p:variable name="uri" select="concat($base-uri, '?verb=ListRecords&amp;')"/>

    <!-- query the OAI-PMH server -->
    <p:choose name="resumption-token-or-metadata-prefix">
      <p:when test="p:value-available('resumption-token')">
        <p:load name="query-with-resumption-token">
          <p:with-option name="href"
              select="concat($uri, 'resumptionToken=', $resumption-token)"/>
        </p:load>
      </p:when>
      <p:otherwise>
        <p:load name="query-with-metadata-prefix">
          <p:with-option name="href"
              select="concat($uri, 'metadataPrefix=', $metadata-prefix)"/>
        </p:load>
      </p:otherwise>
    </p:choose>

    <p:identity name="query-results"/>

    <p:viewport name="record" match="oai:OAI-PMH/oai:ListRecords/oai:record">
      <!-- double-encode the record identifier so as to produce URI-encoded file name -->
      <p:variable name="file-name"
          select="concat(fn:encode-for-uri(fn:encode-for-uri(/oai:record/oai:header/oai:identifier)), '.xml')"/>
      <!-- check what type of record we've harvested; we only want to save certain types of record -->
      <p:variable name="type" select="local-name(oai:record/oai:metadata/*)"/>
      <p:choose name="whether-to-save-record">
        <p:when test="($type='dataset') or ($type='project') or ($type='person')
                      or ($type='institution') or ($type='image') or ($type='sensoryData')">
          <!-- save the record -->
          <p:store name="save-record" indent="true">
            <p:input port="source" select="oai:record/oai:metadata/*"/>
            <p:with-option name="href" select="concat($base-storage-uri, $file-name)"/>
          </p:store>
          <p:identity>
            <p:input port="source">
              <p:pipe step="save-record" port="result"/>
            </p:input>
          </p:identity>
        </p:when>
        <p:otherwise>
          <!-- ignore the record -->
          <p:identity/>
        </p:otherwise>
      </p:choose>
    </p:viewport>

    <!-- query again (recursively) if a resumptionToken was returned in the query results -->
    <p:choose name="whether-to-resume-query">
      <p:variable name="resumption-token-returned"
          select="/oai:OAI-PMH/oai:ListRecords/oai:resumptionToken"/>
      <p:when test="normalize-space($resumption-token-returned)">
        <harvester:list-records>
          <p:with-option name="base-uri" select="$base-uri"/>
          <p:with-option name="base-storage-uri" select="$base-storage-uri"/>
          <p:with-option name="resumption-token" select="$resumption-token-returned"/>
        </harvester:list-records>
      </p:when>
      <p:otherwise>
        <!-- no resumption token (or empty resumption token) returned; nothing to do -->
        <p:identity name="dummy1"/>
        <p:sink name="dummy2"/>
      </p:otherwise>
    </p:choose>
  </p:declare-step>
</p:declare-step>

--
Conal Tuohy
eResearch Business Analyst
Victorian eResearch Strategic Initiative
+61-466324297

Received on Monday, 12 March 2012 06:18:22 UTC