
inelegant but working pipeline seeks elegance

From: Conal Tuohy <conal.tuohy@versi.edu.au>
Date: Mon, 12 Mar 2012 17:17:44 +1100
Message-ID: <4F5D9508.6050502@versi.edu.au>
To: XProc Dev <xproc-dev@w3.org>
I have today written an XProc pipeline to harvest metadata from an 
OAI-PMH server. The pipeline works fine, but it includes some odd 
features which I would like to be able to remove, if I could see how. I 
would appreciate any advice or criticism. I find I'm having to do this 
kind of workaround too often in XProc, and I wonder if I'm just not 
approaching the task correctly?

For those who don't know OAI-PMH (Open Archives Initiative Protocol for 
Metadata Harvesting), it's an HTTP-based protocol for bulk transfer of 
metadata records in (any) XML format. The protocol includes a number of 
"verbs" or operations, only one of which ("ListRecords") is used in this 
code. The initial ListRecords query specifies a metadata format (an 
alias for an XML schema), and the server responds with an XML document 
containing a list of metadata records, each with a block of header data 
and a payload in the specified format. The protocol includes a mechanism 
for paginating results. If the server decides to return a partial 
result, it will include a <resumptionToken> element which the OAI-PMH 
client ("harvester") can then use to issue another query which will 
resume where the first page of results left off. The last page in a 
multi-page response will include an empty <resumptionToken> element to 
signal that there are no more records available. The pipeline below will 
issue 4 such queries and retrieve about 3MB of XML in 58 files.
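
The pagination described above can be sketched outside XProc as a plain loop. This is only an illustrative Python sketch, not the pipeline itself: the function names (`build_query`, `fetch_xml`, `harvest_pages`) and the injectable `fetch` parameter are assumptions made for the example, and it assumes the server accepts GET requests with the `verb`, `metadataPrefix` and `resumptionToken` parameters.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Namespace of the OAI-PMH response elements, in ElementTree's {uri} notation.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def build_query(base_uri, metadata_prefix=None, resumption_token=None):
    """Build a ListRecords URL; a resumption token replaces metadataPrefix."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_uri + "?" + urlencode(params)


def fetch_xml(url):
    """Fetch one page of results over HTTP and return its root element."""
    with urlopen(url) as response:
        return ET.parse(response).getroot()


def harvest_pages(base_uri, metadata_prefix, fetch=fetch_xml):
    """Yield each page of results until the token is absent or empty."""
    token = None
    while True:
        page = fetch(build_query(base_uri, metadata_prefix, token))
        yield page
        element = page.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
        # A missing or empty <resumptionToken> marks the last page.
        token = (element.text or "").strip() if element is not None else ""
        if not token:
            break
```

The loop does iteratively what the pipeline does by calling itself recursively: each page is requested with the token returned by the previous one.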

For example, see: 
which returns a resumption token which can then be used to make another 
query, like so: 

The use case here is to harvest a selection of the records returned from 
the OAI-PMH server. Depending on the name of the XML element in each 
metadata record, I either write the payload to a file, or else ignore it.
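
As a rough illustration of that selection step, here is a Python sketch (again, not XProc). The helper names `local_name`, `select_records` and `WANTED_TYPES` are hypothetical; the six element names are the ones tested in the pipeline's p:when, and the record identifier is assumed to sit in oai:header/oai:identifier as OAI-PMH specifies. The double encoding follows my reading of the pipeline's comment about producing a URI-encoded file name.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
# Payload element names worth saving (taken from the pipeline's p:when test).
WANTED_TYPES = {"dataset", "project", "person", "institution", "image", "sensoryData"}


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return element.tag.rsplit("}", 1)[-1]


def select_records(page):
    """Yield (file_name, payload) for each record whose payload type is wanted."""
    for record in page.iter(f"{OAI_NS}record"):
        metadata = record.find(f"{OAI_NS}metadata")
        if metadata is None or len(metadata) == 0:
            continue  # e.g. a deleted record: header only, no payload
        payload = metadata[0]
        if local_name(payload) not in WANTED_TYPES:
            continue  # ignore the record
        identifier = record.findtext(f"{OAI_NS}header/{OAI_NS}identifier", default="")
        # Encode twice, so that one round of URI decoding still leaves
        # a URI-encoded file name.
        file_name = quote(quote(identifier, safe=""), safe="")
        yield file_name, payload
```

A caller would then write each payload to the storage location under the returned file name, which is the job p:store does in the pipeline.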

The pipeline follows. Any criticism is welcome, but specifically I 
wonder if it's possible to avoid:

1) The p:identity within 
2) The p:identity[@name='dummy1'] and p:sink[@name='dummy2']

The second issue I find more irritating. Because the p:otherwise can't 
be empty, I have to put some step into it, but because the p:when above 
it doesn't produce output, I can't have a step that produces output in 
the p:otherwise. So I create some output and discard it.

Thanks in advance!


PS feel free to run the pipeline

<?xml version="1.0"?>


<p:declare-step type="harvester:list-records" name="list-records">
  <p:option name="base-uri" required="true"/>
  <p:option name="base-storage-uri" required="true"/>
  <p:option name="metadata-prefix"/>
  <p:option name="resumption-token"/>
  <p:variable name="uri" select="concat($base-uri, 
  <!-- query the OAI-PMH server -->
  <p:choose name="resumption-token-or-metadata-prefix">
    <p:when test="p:value-available('resumption-token')">
      <p:load name="query-with-resumption-token">
        <p:with-option name="href" select="concat($uri, 'resumptionToken=', 
      </p:load>
    </p:when>
    <p:otherwise>
      <p:load name="query-with-metadata-prefix">
        <p:with-option name="href" select="concat($uri, 'metadataPrefix=', 
      </p:load>
    </p:otherwise>
  </p:choose>
  <p:identity name="query-results"/>
  <p:viewport name="record" match="oai:OAI-PMH/oai:ListRecords/oai:record">
    <!-- double-encode the record identifier so as to produce URI-encoded 
    file name -->
    <p:variable name="file-name" 
    <!-- check what type of record we've harvested; we only want to save 
    certain types of record -->
    <p:variable name="type" select="local-name(oai:record/oai:metadata/*)"/>
    <p:choose name="whether-to-save-record">
      <p:when test="($type='dataset') or ($type='project') or ($type='person') 
      or ($type='institution') or ($type='image') or ($type='sensoryData')">
        <!-- save the record -->
        <p:store name="save-record" indent="true">
          <p:input port="source" select="oai:record/oai:metadata/*"/>
          <p:with-option name="href" select="concat($base-storage-uri, $file-name)"/>
        </p:store>
        <p:identity>
          <p:input port="source">
            <p:pipe step="save-record" port="result"/>
          </p:input>
        </p:identity>
      </p:when>
      <p:otherwise>
        <!-- ignore the record -->
      </p:otherwise>
    </p:choose>
  </p:viewport>

  <!-- query again (recursively) if a resumptionToken was returned in the 
  query results -->
  <p:choose name="whether-to-resume-query">
    <p:variable name="resumption-token-returned" 
    <p:when test="normalize-space($resumption-token-returned)">
      <harvester:list-records>
        <p:with-option name="base-uri" select="$base-uri"/>
        <p:with-option name="base-storage-uri" select="$base-storage-uri"/>
        <p:with-option name="resumption-token" select="$resumption-token-returned"/>
      </harvester:list-records>
    </p:when>
    <p:otherwise>
      <!-- no resumption token (or empty resumption token) returned; nothing 
      to do -->
      <p:identity name="dummy1"/>
      <p:sink name="dummy2"/>
    </p:otherwise>
  </p:choose>
</p:declare-step>



Conal Tuohy
eResearch Business Analyst
Victorian eResearch Strategic Initiative
Received on Monday, 12 March 2012 06:18:22 UTC
