- From: Henry S. Thompson <ht@inf.ed.ac.uk>
- Date: Fri, 01 May 2009 13:15:39 +0100
- To: "Philip Fennell" <Philip.Fennell@bbc.co.uk>
- Cc: "XProc Dev" <xproc-dev@w3.org>
Philip Fennell writes:
> Thanks Norm, but I don't think that helps.
>
>> If you're reading a document flowing through a pipeline, then it is
>> XML.
>
> That's exactly what I'm not trying to do. I'm wanting to invoke Tidy on
> an HTML document that is not well-formed XML so that I can do further
> processing on it. Therefore I need to use p:data to get hold of a
> non-XML document.
An alternative for this and, as you point out, for other similar
up-translation/input-conversion pipelines is to define a script which
calls wget/curl/your-choice and pipes the result to tidy, along the
lines of
fetch-and-tidy.sh:
#!/bin/sh
uri=$1
shift
wget --output-document - "$uri" 2>/dev/null | tidy "$@"
fetch-and-tidy.bat:
@echo off
set file=%1
shift
wget --output-document - %file% 2>NUL: | tidy %1 %2 %3 %4 %5 %6 %7 %8 %9
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc"
            xmlns:c="http://www.w3.org/ns/xproc-step"
            xmlns:my="http://www.ltg.ed.ac.uk/~ht/">
  <p:declare-step name="fetch-and-tidy" type="my:tidy">
    <p:option name="href"/>
    <p:output port="result" primary="true"/>
    <p:exec command="fetch-and-tidy.bat" source-is-xml="false"
            result-is-xml="true" wrap-result-lines="false" name="ft">
      <p:with-option name="args"
                     select="concat('&quot;',$href,'&quot; -asxml --quiet yes --show-warnings no --doctype omit --numeric-entities yes --output-xml yes')">
        <p:empty/>
      </p:with-option>
      <p:input port="source">
        <p:empty/>
      </p:input>
    </p:exec>
    <p:unwrap match="c:result"/>
  </p:declare-step>
  <my:tidy href="http://www.ltg.ed.ac.uk/~ht/xx.html"/>
</p:pipeline>
The above works in Calabash 0.9.9.
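Before wiring a wrapper like this into p:exec, it may help to check the fetch-and-pipe logic on its own. The following is only a sketch of the same idea as fetch-and-tidy.sh, recast as a shell function; the FETCH and TIDY override variables are hypothetical names introduced here for illustration and are not part of the script above. The `--quiet` flag stands in for the `2>/dev/null` redirection.

```shell
#!/bin/sh
# Sketch of the fetch-and-tidy.sh logic as a function, so it can be tried
# without network access. FETCH and TIDY are hypothetical overrides added
# for illustration; the original script calls wget and tidy directly.
fetch_and_tidy() {
    uri=$1
    shift
    # ${FETCH:-...} and ${TIDY:-...} are left unquoted on purpose so that a
    # multi-word command is split into words. "$@" forwards the remaining
    # arguments (the tidy options) unchanged, one word per option.
    ${FETCH:-wget --quiet --output-document -} "$uri" | ${TIDY:-tidy} "$@"
}
```

For example, `FETCH=cat fetch_and_tidy local.html -asxml --quiet yes` would run tidy over a local file instead of a fetched one, which makes it easy to try out the option list before it goes into the `args` option of p:exec.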
Note that in any case you need to tweak your pipeline a bit from where
you and Norm left it, to get the xmlness of things accurately
reflected. The following works in Calabash 0.9.9:
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc"
            xmlns:c="http://www.w3.org/ns/xproc-step"
            xmlns:my="http://www.ltg.ed.ac.uk/~ht/">
  <p:declare-step type="my:tidy">
    <p:input port="source"/>
    <p:output port="result"/>
    <p:exec command="tidy"
            source-is-xml="false"
            result-is-xml="true"
            wrap-result-lines="false">
      <p:with-option name="args"
                     select="'-asxml --quiet yes --show-warnings no --doctype omit --numeric-entities yes --output-xml yes'"/>
    </p:exec>
    <p:unwrap match="c:result"/>
  </p:declare-step>
  <my:tidy>
    <p:input port="source">
      <p:data href="http://www.ltg.ed.ac.uk/~ht/xx.html"/>
    </p:input>
  </my:tidy>
</p:pipeline>
ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
Half-time member of W3C Team
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 651-1426, e-mail: ht@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]
Received on Friday, 1 May 2009 12:17:19 UTC