- From: Norman Walsh <ndw@nwalsh.com>
- Date: Fri, 01 May 2009 09:05:34 -0400
- To: XProc Dev <xproc-dev@w3.org>
- Message-ID: <m28wlhdlcx.fsf@nwalsh.com>
Philip Fennell <Philip.Fennell@bbc.co.uk> writes:
> That's exactly what I'm not trying to do. I'm wanting to invoke Tidy on
> an HTML document that is not well-formed XML so that I can do further
> processing on it.
I understand that use case, but I don't understand what you plan to
pass *to* your tidy:html step. The things that appear on p:input ports
*must* be well-formed XML.
> Therefore I need to use p:data to get hold of a
> non-XML document. My problem is that p:data, and p:document for that
> matter, do not allow you to use p:with-option so that you can use an
> expression (XPath) instead of a string literal (URI).
>
> I can get around the p:document problem by using p:load, but I cannot
> see an equivalent for this particular use-case; and I imagine it will be
> a popular use-case if people want to use XProc to build legacy-content
> conversion pipelines where they may have large amounts of pages in
> HTML/SGML or do screen-scraping of existing web sites.
The workaround is p:http-request, even though that's inelegant in some
ways. From section 2.2.2, Non-XML Documents:

  It is not a goal of XProc that it should be a general-purpose
  pipeline language for manipulating arbitrary, non-XML resources.

  There are two standard ways that a non-XML document may enter a
  pipeline: directly through p:data or as the result of performing a
  p:http-request step. Loading non-XML data with a computed URI
  requires the p:http-request step. Implementors are encouraged to
  support the file: URI scheme so that users can load local data from
  computed URIs.
So, if you have the computed URI of a document in $uri, you can load
it with p:http-request:
  <p:http-request method="get">
    <p:with-option name="href" select="$uri"/>
  </p:http-request>
Implementors are encouraged to make that work for file: URIs as well
as http(s): URIs. XML Calabash supports it.
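
For readers working from the XProc 1.0 Recommendation, here is a slightly
fuller sketch of the same idea. It assumes the final p:http-request
signature, in which the request is described by a c:request document on
the source port rather than by attributes or options on the step itself.
The option name "uri" and the use of p:add-attribute to fill in href are
illustrative choices, not something taken from the message above, and
whether file: URIs are accepted still depends on the implementation.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                version="1.0">

  <!-- Hypothetical option carrying the computed URI of the non-XML document -->
  <p:option name="uri" required="true"/>

  <!-- Bound implicitly to the primary output of the last step -->
  <p:output port="result"/>

  <!-- Start from a skeleton request and set its href from the option value -->
  <p:add-attribute match="/c:request" attribute-name="href">
    <p:input port="source">
      <p:inline>
        <c:request method="get"/>
      </p:inline>
    </p:input>
    <p:with-option name="attribute-value" select="$uri">
      <!-- The expression only reads $uri, so no context document is needed -->
      <p:empty/>
    </p:with-option>
  </p:add-attribute>

  <!-- Reads the completed c:request from the step above -->
  <p:http-request/>

</p:declare-step>
```

For a non-XML response, the result should be a c:body element that wraps
the content as text or base64, which a later step (such as a Tidy
extension step) can then unwrap.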
Be seeing you,
norm
P.S. What, you may ask, possessed the WG to use p:*HTTP*-request to
load documents from file: URIs? On the one hand, adding another step to
load from file: URIs would have largely reproduced the p:http-request
step, and on the other, there wasn't any obviously better name for
p:http-request.
--
Norman Walsh <ndw@nwalsh.com> | One stops being a child when one
http://nwalsh.com/ | realizes that telling one's trouble
| does not make it better.--Cesare Pavese
Received on Friday, 1 May 2009 13:06:18 UTC