
inelegant but working pipeline seeks elegance

From: Conal Tuohy <conal.tuohy@versi.edu.au>
Date: Mon, 12 Mar 2012 17:17:44 +1100
Message-ID: <4F5D9508.6050502@versi.edu.au>
To: XProc Dev <xproc-dev@w3.org>
I have today written an XProc pipeline to harvest metadata from an 
OAI-PMH server. The pipeline works fine, but it includes some odd 
features which I would like to be able to remove, if I could see how. I 
would appreciate any advice or criticism. I find I'm having to do this 
kind of workaround too often in XProc, and I wonder if I'm just not 
approaching the task correctly?

For those who don't know OAI-PMH (Open Archives Initiative Protocol for 
Metadata Harvesting), it's an HTTP-based protocol for bulk transfer of 
metadata records in (any) XML format. The protocol includes a number of 
"verbs" or operations, only one of which ("ListRecords") is used in this 
code. The initial ListRecords query specifies a metadata format (an 
alias for an XML schema); the server responds with an XML document 
containing a list of metadata records, each with a block of header data 
and a payload in the specified format. The protocol includes a mechanism 
for paginating results. If the server decides to return a partial 
result, it will include a <resumptionToken> element which the OAI-PMH 
client ("harvester") can then use to issue another query which will 
resume where the first page of results left off. The last page in a 
multi-page response will include an empty <resumptionToken> element to 
signal that there are no more records available. The pipeline below will 
issue 4 such queries and retrieve about 3MB of XML in 58 files.

For example, see: 
<http://andsdb-sc18-test.latrobe.edu.au/oai-pmh/oai?verb=ListRecords&metadataPrefix=tdar> 
which returns a resumption token which can then be used to make another 
query, like so: 
<http://andsdb-sc18-test.latrobe.edu.au/oai-pmh/oai?verb=ListRecords&resumptionToken=10,1900,3000,tdar>
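
For readers who have not seen one, a ListRecords response has roughly the 
following shape. This is a hand-written sketch, not actual server output: 
the element names in the OAI-PMH namespace come from the protocol, but the 
identifier, the datestamp and the payload namespace are invented 
placeholders.

```xml
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <!-- responseDate and request elements omitted from this sketch -->
  <ListRecords>
    <record>
      <header>
        <!-- placeholder identifier and datestamp, not real data -->
        <identifier>oai:example.org:record-1</identifier>
        <datestamp>2012-03-12</datestamp>
      </header>
      <metadata>
        <!-- payload in the requested format; this namespace is hypothetical -->
        <dataset xmlns="http://example.org/hypothetical-tdar"/>
      </metadata>
    </record>
    <!-- ... more records ... -->
    <!-- non-empty: more pages remain; empty: this was the last page -->
    <resumptionToken>10,1900,3000,tdar</resumptionToken>
  </ListRecords>
</OAI-PMH>
```

The harvester only needs two things from each page: the child element of 
each <metadata>, and the text of <resumptionToken>, if any.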

The use case here is to harvest a selection of the records returned from 
the OAI-PMH server. Depending on the name of the XML element in each 
metadata record, I either write the payload to a file, or else ignore it.

The pipeline follows. Any criticism is welcome, but specifically I 
wonder if it's possible to avoid:

1) The p:identity within 
p:choose[@name='whether-to-save-record']/p:otherwise
2) The p:identity[@name='dummy1'] and p:sink[@name='dummy2']

The second issue I find more irritating. Because the p:otherwise can't 
be empty, I have to put some step into it, but because the p:when above 
it doesn't produce output, I can't have a step that produces output in 
the p:otherwise. So I create some output and discard it.
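
One possible partial simplification, offered as an untested sketch rather 
than a known fix: p:sink declares no output ports and, as far as I 
understand the spec, reads from the default readable port when its input 
is left unconnected. If so, the p:otherwise might hold the p:sink alone, 
without the preceding p:identity. The fragment below is the final 
p:otherwise of the pipeline rewritten on that assumption.

```xml
<p:otherwise>
  <!-- no (or empty) resumption token: discard the default input
       and produce no output, matching the p:when branch -->
  <p:sink name="dummy2"/>
</p:otherwise>
```

That would still leave a do-nothing step in the branch, so it does not 
fully answer the question.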
Thanks in advance!

Conal

PS: feel free to run the pipeline.


<?xml version="1.0"?>
<p:declare-step
     version="1.0"
     xmlns:p="http://www.w3.org/ns/xproc"
     xmlns:fn="http://www.w3.org/2005/xpath-functions"
     xmlns:harvester="tag:conal.tuohy@versi.edu.au,2012:oai-pmh"
     xmlns:oai="http://www.openarchives.org/OAI/2.0/">

<harvester:list-records
         name="harvest"
         base-uri="http://ahad.edu.au/oai-pmh/oai"
         base-storage-uri="/tmp/harvest/"
         metadata-prefix="tdar"/>

<p:declare-step type="harvester:list-records" name="list-records">
<p:option name="base-uri" required="true"/>
<p:option name="base-storage-uri" required="true"/>
<p:option name="metadata-prefix"/>
<p:option name="resumption-token"/>
<p:variable name="uri" select="concat($base-uri, '?verb=ListRecords&amp;')"/>
<!-- query the OAI-PMH server -->
<p:choose name="resumption-token-or-metadata-prefix">
<p:when test="p:value-available('resumption-token')">
<p:load name="query-with-resumption-token">
<p:with-option name="href" select="concat($uri, 'resumptionToken=', $resumption-token)"/>
</p:load>
</p:when>
<p:otherwise>
<p:load name="query-with-metadata-prefix">
<p:with-option name="href" select="concat($uri, 'metadataPrefix=', $metadata-prefix)"/>
</p:load>
</p:otherwise>
</p:choose>
<p:identity name="query-results"/>
<p:viewport name="record" match="oai:OAI-PMH/oai:ListRecords/oai:record">
<!-- double-encode the record identifier to produce a URI-encoded file name -->
<p:variable name="file-name" select="concat(fn:encode-for-uri(fn:encode-for-uri(/oai:record/oai:header/oai:identifier)), '.xml')"/>
<!-- check what type of record we've harvested; we only want to save certain types of record -->
<p:variable name="type" select="local-name(oai:record/oai:metadata/*)"/>
<p:choose name="whether-to-save-record">
<p:when test="($type='dataset') or ($type='project') or ($type='person') or ($type='institution') or ($type='image') or ($type='sensoryData')">
<!-- save the record -->
<p:store name="save-record" indent="true">
<p:input port="source" select="oai:record/oai:metadata/*"/>
<p:with-option name="href" select="concat($base-storage-uri, $file-name)"/>
</p:store>
<p:identity>
<p:input port="source">
<p:pipe step="save-record" port="result"/>
</p:input>
</p:identity>
</p:when>
<p:otherwise>
<!-- ignore the record -->
<p:identity/>
</p:otherwise>
</p:choose>
</p:viewport>

<!-- query again (recursively) if a resumptionToken was returned in the query results -->
<p:choose name="whether-to-resume-query">
<p:variable name="resumption-token-returned" select="/oai:OAI-PMH/oai:ListRecords/oai:resumptionToken"/>
<p:when test="normalize-space($resumption-token-returned)">
<harvester:list-records>
<p:with-option name="base-uri" select="$base-uri"/>
<p:with-option name="base-storage-uri" select="$base-storage-uri"/>
<p:with-option name="resumption-token" select="$resumption-token-returned"/>
</harvester:list-records>
</p:when>
<p:otherwise>
<!-- no resumption token (or empty resumption token) returned; nothing to do -->
<p:identity name="dummy1"/>
<p:sink name="dummy2"/>
</p:otherwise>
</p:choose>

</p:declare-step>

</p:declare-step>

-- 
Conal Tuohy
eResearch Business Analyst
Victorian eResearch Strategic Initiative
+61-466324297
Received on Monday, 12 March 2012 06:18:22 GMT
