Re: How do you pass step options to p:data/@href???

Philip Fennell <Philip.Fennell@bbc.co.uk> writes:
> That's exactly what I'm not trying to do. I'm wanting to invoke Tidy on
> an HTML document that is not well-formed XML so that I can do further
> processing on it.

I understand that use case, but I don't understand what you plan to
pass *to* your tidy:html step. The things that appear on p:input ports
*must* be well-formed XML.

> Therefore I need to use p:data to get hold of a
> non-XML document. My problem is that p:data, and p:document for that
> matter, do not allow you to use p:with-option so that you can use an
> expression (XPath) instead of a string literal (URI).
>
> I can get around the p:document problem by using p:load, but I cannot
> see an equivalent for this particular use-case; and I imagine it will be
> a popular use-case if people want to use XProc to build legacy-content
> conversion pipelines where they may have large amounts of pages in
> HTML/SGML or do screen-scraping of existing web sites.

The workaround is p:http-request, even though that's inelegant in some
ways. From section 2.2.2, Non-XML Documents:

  It is not a goal of XProc that it should be a general-purpose
  pipeline language for manipulating arbitrary, non-XML resources.

  There are two standard ways that a non-XML document may enter a
  pipeline: directly through p:data or as the result of performing a
  p:http-request step. Loading non-XML data with a computed URI
  requires the p:http-request step. Implementors are encouraged to
  support the file: URI scheme so that users can load local data from
  computed URIs.
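
For comparison, the p:data route looks something like the sketch
below. The path is made up for illustration; the point is only that
href has to be a literal string there, and (as I read the spec) the
content arrives wrapped in an element, c:data by default.

  <!-- "legacy/page.html" is a hypothetical path, fixed at authoring time -->
  <p:identity>
    <p:input port="source">
      <p:data href="legacy/page.html"/>
    </p:input>
  </p:identity>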

So, if you have the computed URI of a document in $uri, you can load
it with p:http-request:

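  <!-- p:http-request reads method and href from a c:request document -->
  <p:add-attribute match="/c:request" attribute-name="href">
    <p:input port="source"><p:inline><c:request method="get"/></p:inline></p:input>
    <p:with-option name="attribute-value" select="$uri"/>
  </p:add-attribute>
  <p:http-request/>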

Implementors are encouraged to make that work for file: URIs as well
as http(s): URIs. XML Calabash supports it.
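
As a complete pipeline this might look like the following sketch. It
assumes p:http-request takes its method and href from a c:request
document on its source port, so the computed URI is written into that
document with p:add-attribute first. The option name "uri" is just an
illustration, and whether file: URIs work is up to the implementation.

  <p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                  xmlns:c="http://www.w3.org/ns/xproc-step"
                  version="1.0">
    <!-- "uri" is an assumed option name; pass the document URI here -->
    <p:option name="uri" required="true"/>
    <p:output port="result"/>

    <!-- Build the request document, filling in href from $uri -->
    <p:add-attribute match="/c:request" attribute-name="href">
      <p:input port="source">
        <p:inline><c:request method="get"/></p:inline>
      </p:input>
      <p:with-option name="attribute-value" select="$uri"/>
    </p:add-attribute>

    <!-- Fetch it; non-XML content should come back inside c:body -->
    <p:http-request/>
  </p:declare-step>

What comes out is still an XML document: the HTML is character data
inside the wrapper, so a later step has to turn it into markup.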

                                        Be seeing you,
                                          norm

P.S. What, you may ask, possessed the WG to use p:*HTTP*-request to
load documents from file: URIs? On the one hand, adding another step
to load from file: URIs would have largely duplicated the
p:http-request step, and on the other, there wasn't any obviously
better name for p:http-request.

-- 
Norman Walsh <ndw@nwalsh.com> | One stops being a child when one
http://nwalsh.com/            | realizes that telling one's trouble
                              | does not make it better.--Cesare Pavese

Received on Friday, 1 May 2009 13:06:18 UTC