- From: Conal Tuohy <conal.tuohy@versi.edu.au>
- Date: Mon, 12 Mar 2012 17:17:44 +1100
- To: XProc Dev <xproc-dev@w3.org>
I have today written an XProc pipeline to harvest metadata from an
OAI-PMH server. The pipeline works fine, but it includes some odd
features which I would like to be able to remove, if I could see how. I
would appreciate any advice or criticism. I find I'm having to do this
kind of workaround too often in XProc, and I wonder if I'm just not
approaching the task correctly?
For those who don't know OAI-PMH (Open Archives Initiative Protocol for
Metadata Harvesting), it's an HTTP-based protocol for bulk transfer of
metadata records in (any) XML format. The protocol includes a number of
"verbs" or operations, only one of which ("ListRecords") is used in this
code. The initial ListRecords query specifies a metadata format (an
alias for an XML schema), for which the server returns an XML document
containing a list of metadata records, each with a block of header data
and a payload in the specified format. The protocol includes a mechanism
for paginating results. If the server decides to return a partial
result, it will include a <resumptionToken> element which the OAI-PMH
client ("harvester") can then use to issue another query which will
resume where the first page of results left off. The last page in a
multi-page response will include an empty <resumptionToken> element to
signal that there are no more records available. The pipeline below will
issue 4 such queries and retrieve about 3MB of XML in 58 files.
For example, see:
<http://andsdb-sc18-test.latrobe.edu.au/oai-pmh/oai?verb=ListRecords&metadataPrefix=tdar>
which returns a resumption token which can then be used to make another
query, like so:
<http://andsdb-sc18-test.latrobe.edu.au/oai-pmh/oai?verb=ListRecords&resumptionToken=10,1900,3000,tdar>
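To make the pagination concrete, a partial ListRecords response has roughly the shape sketched below. This is an illustration only, not output captured from the server: the dates, the record identifier, and the payload namespace are made-up placeholders, and the token is simply the one from the example query above.

```xml
<?xml version="1.0"?>
<!-- Illustrative sketch of one page of a ListRecords response.
     All values are placeholders, not real server output. -->
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2012-03-12T06:00:00Z</responseDate>
  <request verb="ListRecords" metadataPrefix="tdar">http://andsdb-sc18-test.latrobe.edu.au/oai-pmh/oai</request>
  <ListRecords>
    <record>
      <!-- block of header data -->
      <header>
        <identifier>oai:example.org:record-1</identifier>
        <datestamp>2012-03-01</datestamp>
      </header>
      <!-- payload in the requested metadata format; the namespace is a placeholder -->
      <metadata>
        <dataset xmlns="http://example.org/ns/placeholder">
          <!-- format-specific content -->
        </dataset>
      </metadata>
    </record>
    <!-- further record elements omitted -->
    <!-- non-empty: more pages remain; an empty element would mark the last page -->
    <resumptionToken>10,1900,3000,tdar</resumptionToken>
  </ListRecords>
</OAI-PMH>
```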
The use case here is to harvest a selection of the records returned from
the OAI-PMH server. Depending on the name of the root element of each
record's metadata payload, I either write the payload to a file or ignore it.
The pipeline follows. Any criticism is welcome, but specifically I
wonder if it's possible to avoid:
1) The p:identity within
p:choose[@name='whether-to-save-record']/p:otherwise
2) The p:identity[@name='dummy1'] and p:sink[@name='dummy2']
The second issue I find more irritating. Because the p:otherwise can't
be empty, I have to put some step into it, but because the p:when above
it doesn't produce output, I can't have a step that produces output in
the p:otherwise. So I create some output and discard it.
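One possible simplification of that second workaround, offered as an untested sketch: p:sink takes its input from the default readable port when no explicit binding is given, so a lone p:sink in the p:otherwise should satisfy the "not empty" requirement while leaving the branch without output, just like the p:when. The standalone pipeline below only illustrates the shape. The inline document, the doc and token element names, and the /tmp/out.xml path are placeholders and are not part of the harvester.

```xml
<?xml version="1.0"?>
<!-- Minimal sketch: a p:choose in which neither branch produces output.
     Element names and the output path are hypothetical. -->
<p:declare-step version="1.0"
    xmlns:p="http://www.w3.org/ns/xproc">
  <p:input port="source">
    <p:inline>
      <doc><token/></doc>
    </p:inline>
  </p:input>
  <p:choose>
    <p:when test="normalize-space(/doc/token)">
      <!-- p:store has no primary output, so this branch has none either -->
      <p:store href="/tmp/out.xml"/>
    </p:when>
    <p:otherwise>
      <!-- discard the document on the default readable port; no output -->
      <p:sink/>
    </p:otherwise>
  </p:choose>
</p:declare-step>
```

With the empty token element in the inline document, the test is false, so the p:otherwise runs and nothing is written. In the harvester this would mean replacing the dummy1/dummy2 pair with the single p:sink.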
Thanks in advance!
Conal
PS: feel free to run the pipeline.
<?xml version="1.0"?>
<p:declare-step
    version="1.0"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:harvester="tag:conal.tuohy@versi.edu.au,2012:oai-pmh"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/">
  <harvester:list-records
      name="harvest"
      base-uri="http://ahad.edu.au/oai-pmh/oai"
      base-storage-uri="/tmp/harvest/"
      metadata-prefix="tdar"/>
  <p:declare-step type="harvester:list-records" name="list-records">
    <p:option name="base-uri" required="true"/>
    <p:option name="base-storage-uri" required="true"/>
    <p:option name="metadata-prefix"/>
    <p:option name="resumption-token"/>
    <p:variable name="uri" select="concat($base-uri,
        '?verb=ListRecords&amp;')"/>
    <!-- query the OAI-PMH server -->
    <p:choose name="resumption-token-or-metadata-prefix">
      <p:when test="p:value-available('resumption-token')">
        <p:load name="query-with-resumption-token">
          <p:with-option name="href" select="concat($uri, 'resumptionToken=',
              $resumption-token)"/>
        </p:load>
      </p:when>
      <p:otherwise>
        <p:load name="query-with-metadata-prefix">
          <p:with-option name="href" select="concat($uri, 'metadataPrefix=',
              $metadata-prefix)"/>
        </p:load>
      </p:otherwise>
    </p:choose>
    <p:identity name="query-results"/>
    <p:viewport name="record" match="oai:OAI-PMH/oai:ListRecords/oai:record">
      <!-- double-encode the record identifier so as to produce a URI-encoded
           file name -->
      <p:variable name="file-name"
          select="concat(fn:encode-for-uri(fn:encode-for-uri(/oai:record/oai:header/oai:identifier)),
              '.xml')"/>
      <!-- check what type of record we've harvested; we only want to save
           certain types of record -->
      <p:variable name="type" select="local-name(oai:record/oai:metadata/*)"/>
      <p:choose name="whether-to-save-record">
        <p:when test="($type='dataset') or ($type='project') or ($type='person')
            or ($type='institution') or ($type='image') or ($type='sensoryData')">
          <!-- save the record -->
          <p:store name="save-record" indent="true">
            <p:input port="source" select="oai:record/oai:metadata/*"/>
            <p:with-option name="href" select="concat($base-storage-uri, $file-name)"/>
          </p:store>
          <p:identity>
            <p:input port="source">
              <p:pipe step="save-record" port="result"/>
            </p:input>
          </p:identity>
        </p:when>
        <p:otherwise>
          <!-- ignore the record -->
          <p:identity/>
        </p:otherwise>
      </p:choose>
    </p:viewport>
    <!-- query again (recursively) if a resumptionToken was returned in the
         query results -->
    <p:choose name="whether-to-resume-query">
      <p:variable name="resumption-token-returned"
          select="/oai:OAI-PMH/oai:ListRecords/oai:resumptionToken"/>
      <p:when test="normalize-space($resumption-token-returned)">
        <harvester:list-records>
          <p:with-option name="base-uri" select="$base-uri"/>
          <p:with-option name="base-storage-uri" select="$base-storage-uri"/>
          <p:with-option name="resumption-token" select="$resumption-token-returned"/>
        </harvester:list-records>
      </p:when>
      <p:otherwise>
        <!-- no resumption token (or empty resumption token) returned; nothing
             to do -->
        <p:identity name="dummy1"/>
        <p:sink name="dummy2"/>
      </p:otherwise>
    </p:choose>
  </p:declare-step>
</p:declare-step>
--
Conal Tuohy
eResearch Business Analyst
Victorian eResearch Strategic Initiative
+61-466324297
Received on Monday, 12 March 2012 06:18:22 UTC